RavenDB at Scale: A Bitter-Sweet Symphony

In January 2014 we implemented RavenDB as the data store for a new project; for us the goal was to have a system that encrypts at rest whilst providing an easy way to perform full text searching across millions of rows.

As we use Windows Azure, our options at the time were either to host a full SQL Server instance and use TDE alongside the built-in full text search functionality, or to look into tools like RavenDB. Windows Azure SQL Database did not support encryption or full text indexing when the decision was made, and Azure DocumentDB was still a while away.

After a series of spikes we decided to go for RavenDB; it includes “bundles” for encryption, alongside a very easy replication setup (on Azure you must have 2 VM instances to maintain the SLA) and versioning of documents. It’s also built on top of Lucene.Net and consequently supports full text indexing across every document in the database.

Now we’ve been using RavenDB for over a year, 8 months or so of that in full production, and we store more than 6 million documents in a single data store used day in, day out by our clients. It’s been a roller-coaster of a ride and I wanted to share some of the experiences we’ve had, in the hope that others can learn from some of our mistakes.

Before We Start

The absolute best advice I can provide before using RavenDB is: understand CQRS and eventual consistency. I mean really understand it; don’t just read up on it but really do your homework and write spike projects showing off its advantages and pitfalls. If you expect RavenDB to work like a traditional ACID relational database you will fall flat on your face; it truly is a different mindset.

The best way to demonstrate this is to build a test project in the following form:

  • Single RavenDB instance
  • 1 collection of documents; let’s say “Posts”
  • A unit test (RavenDB has some great support for this using its embedded instance) that does the following:
    - Stores 10,000 new “Post” documents in RavenDB as fast as possible
    - Then queries the last “Post” document inserted using a standard query (e.g. .Query<Post>().Where(x => x.Title == "Newsletter 10000"))
Something like this:

[Test]
public void EventualConsistencyDemonstration()
{
    using (var store = NewDocumentStore())
    {
        for (var i = 1; i <= 10000; i++)
        {
            using (var session = store.OpenSession())
            {
                session.Store(new Post { Title = "Newsletter " + i });
                session.SaveChanges();
            }
        }

        using (var session = store.OpenSession())
        {
            var results = session.Query<Post>().Where(p => p.Title == "Newsletter 10000");
            Assert.IsTrue(results.Any());
        }
    }
}

One thing should be apparent immediately; your query operation will return no results and fail on the Assert. This brings up the fundamental design decision taken by Hibernating Rhinos when building RavenDB.

When you store new documents in RavenDB, they are written with ACID consistency, just like SQL Server. Unlike a traditional relational database, however, the indexes and all query operations are only BASE consistent.

Let’s say we have a table in SQL Server with 3 indexes present and we insert a record into it. Upon writing the record, the associated index entries are written at the same time, in the same transaction; so when the user comes to query that same table the indexes are already ready to serve the request. The advantage is that the data is always consistent, assuming you are reading committed transactions. The disadvantage is that as you add more indexes you slow down the writes to the database.

RavenDB works differently; writing a document to the database does not immediately update the indexes. Instead, they are updated asynchronously in a background thread. The advantage is that writes are lightning fast (whilst still using ACID consistency) and it allows us to write pretty awesome, “can’t believe this works” style map/reduce indexes. The disadvantage, and it’s a big one to consider, is that you can never assume your reads are up to date. If your indexes are still rebuilding in the background you will only get partial or inaccurate results when querying the data store.
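If you need to detect this at query time, the client API can report whether the results came from a stale index. A minimal sketch, assuming a Post class and an open session as in the other examples in this article:

```csharp
using System.Linq;
using Raven.Client;
using Raven.Client.Linq;

// Ask RavenDB to report index staleness alongside the results
RavenQueryStatistics stats;
var posts = session.Query<Post>()
    .Statistics(out stats)
    .Where(p => p.Title == "Newsletter 10000")
    .ToList();

if (stats.IsStale)
{
    // The index had not caught up with recent writes when this query ran;
    // the results may be missing recently stored documents.
}
```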

Another important point to note is the idea of static vs. dynamic indexes. Static indexes are ones we define and tell RavenDB about beforehand; they are written as C# classes (nice and strongly typed too) and are deployed to RavenDB at runtime using IndexCreation.CreateIndexes(typeof(MyIndexClass).Assembly, documentStore); (normally in Application_Start; we’ll come to this in more detail later).

Dynamic indexes, on the other hand, are calculated on the fly by RavenDB. In the above test we used session.Query<Post>().Where(p => p.Title == "Newsletter 10000") without telling it to use an existing index. Under the hood RavenDB will try to find an index that maps the fields requested and, if one doesn’t exist, will proceed to create a new index. However, as the index has only just been created, we have to wait for it to build before it returns non-stale data. Again, we’ll address this later.
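Once a static index has been deployed, you can tell the session to use it explicitly via the second type parameter of Query, which avoids the dynamic index creation described above. A sketch, assuming the Posts_ByTitle index defined later in this article:

```csharp
using System.Linq;
using Raven.Client.Linq;

// Query against a known static index rather than letting RavenDB
// create a dynamic one on the fly
var results = session.Query<Post, Posts_ByTitle>()
    .Where(p => p.Title == "Newsletter 10000")
    .ToList();
```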

The Good

Firstly, RavenDB can do some pretty awesome things. Multi-map indexes, built-in replication and failover, unit test helpers and more; the list goes on. There was never a time using RavenDB where there wasn’t a way to create a well performing solution to a problem, and the tooling it gives you for hosting across multiple servers is excellent.

Secondly, at scale for day-to-day operations it is still performing nicely. The system we have boasts an API for clients to bulk upsert data into the server, which receives a fair amount of traffic, with a second API for read operations that gets continually hammered throughout the day. At 6 million records plus, both the writes and searches are still lovely and fast; I am confident any remaining inefficiencies are due to business logic and the transport layer rather than the raw RavenDB queries.

Then there’s creating documents and indexes in code (at least initially; we’ll get to that later). It’s like EF code-first but so much less hassle, storing an item for example is:

session.Store(new Post { Title = "Newsletter" }); 
session.SaveChanges();

I haven’t had to tell RavenDB what a Post class is (no database migration scripts); it will generate a JSON document out of it and that’s the end of it. We’ve had no real de-serialization issues, and the ability to store whole nested collections makes modelling data a whole lot more epic. Indexes are the same, nicely defined through strongly-typed code using LINQ like so:

public class Posts_ByTitle : AbstractIndexCreationTask<Post>
{
    public Posts_ByTitle()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title
                       };
    }
}

Getting started in RavenDB, getting up and running quickly, is hassle free and almost joyous to begin with.

The Bad

However, we are talking about RavenDB at scale; and no matter how lovely and rapid it is to get started, there are some serious caveats to consider.

Revisiting eventual consistency; we experienced an issue early on that came out of not understanding CQRS. The dreaded WaitForNonStaleResults problem…

Let’s say we have a page that allows users to submit new “Post” entries using a form. Once the post is created in the database, the user is redirected to a list of all their posts in the system. Now, it makes a lot of sense that the post they have just created is there at the top of the list on page refresh. However, since migrating to RavenDB, even though the write and read operations are much faster, you find that the user is often not shown their new post.

Again, this is because RavenDB is updating the indexes in the background; the new Post, although it exists in the database, has not necessarily been indexed yet, so when we run a Query operation it doesn’t appear in the results.

var results = session.Query<Post>()
    .Where(p => p.CreatedBy == "Bob")
    .OrderByDescending(p => p.CreatedDate)
    .Take(10);

For users viewing posts, having a short delay on the information being available is totally cool; they don’t know a new post has been submitted, and if they see it in the list 10 minutes later it’s no biggie, right? However, the user who created the post is now unsure whether their information has actually been stored.

So how do we get around this? Well, we started by making a huge mistake. You see, RavenDB does provide a way around this in the form of WaitForNonStaleResults. What this does is tell the database that you want the up-to-date results from an index as of a given timestamp, like so:

var results = session.Query<Post>()
    .Where(p => p.CreatedBy == "Bob")
    .OrderByDescending(p => p.CreatedDate)
    .Customize(x => x.WaitForNonStaleResultsAsOfNow())
    .Take(10);

Problem solved, right? Now the user is guaranteed to receive an up-to-date list of their results. Except…

Except you’ve just created a massive scalability problem in your system. RavenDB will, by default, wait up to 15 seconds before throwing an error when the WaitForNonStaleResultsAsOfNow option is used. In day-to-day operations this is fine; however, what if your system receives a large quantity of writes in a short window? Then the user is waiting a long time, with the potential for an error at the end, for their data to be returned.
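If you do still use the wait, the customisation has an overload that takes an explicit timeout, so you can at least fail fast rather than blocking for the full 15 second default. A sketch:

```csharp
using System;
using System.Linq;
using Raven.Client.Linq;

// Cap the wait at 2 seconds; if the index is still stale after that,
// the query throws rather than keeping the user hanging even longer
var results = session.Query<Post>()
    .Customize(x => x.WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(2)))
    .Where(p => p.CreatedBy == "Bob")
    .OrderByDescending(p => p.CreatedDate)
    .Take(10)
    .ToList();
```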

Initially we had this on most read queries in the system and all it would take would be the following to start throwing errors all over the place and destroying the user experience:

  • Heavy write load in a short period of time
  • An index / indexes are rebuilding from scratch, reducing the time RavenDB can spend on the other indexes

My advice is that if you ever use WaitForNonStaleResultsAsOfNow then you need to reassess your approach. Instead; consider the following:

  • The difference between .Load<>() and .Query<>() in RavenDB:
    - .Load<>() uses the unique document ID and does not use an index. This will always return the latest data.
    - .Query<>() always uses an index. The data must always be assumed stale.
  • If you absolutely, positively must return non-stale results, consider designing the appropriate data so you can always use the document ID.
  • Employ caching and mocking of content, rather than depending on database consistency:
    - In this scenario we know the user has submitted an item.
    - We can return a mocked version of that data until the database is up to date.
    - This does, however, complicate the business logic when serving a request.
We changed a lot of our upsert logic to use .Load<>() operations rather than depending on indexes, meaning we can be confident we are checking the existence of a document before attempting to insert it. For the read side we have removed the waiting completely; in 99% of scenarios the data being a few seconds out of date is completely fine. It’s the 1% that can cause funny user journeys.
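The upsert pattern above can be sketched like this; because .Load<>() goes straight to the document store by ID and bypasses the indexes, the existence check is always consistent with the latest writes. (The Id property and ID convention here are illustrative, not from our actual model.)

```csharp
using Raven.Client;

// Upsert by document ID: Load<>() never reads from an index,
// so this check cannot be fooled by a stale index
public void UpsertPost(IDocumentSession session, string postId, string title)
{
    var existing = session.Load<Post>(postId); // e.g. "posts/123"
    if (existing == null)
    {
        session.Store(new Post { Id = postId, Title = title });
    }
    else
    {
        existing.Title = title; // update the tracked document in place
    }
    session.SaveChanges();
}
```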

Next up: the static vs. dynamic indexes issue.

At first glance, the way RavenDB handles dynamic indexes is actually really cool. As stated earlier, if you write a new query against a collection, it will automagically create a matching index if one doesn’t exist and add it to your database. If this index is queried enough by your application, it is promoted to a persisted index.

However in production this is one of the most dangerous features you could use. Let’s say you’ve got the aforementioned 6 million documents and a developer writes a new piece of code to query the collection using this logic.

var results = session.Query<Post>().Where(p => p.Category == category);

Now this is the first time the Category has ever been queried; so here’s the flow of events when you deploy this:

  1. You deploy the new application
  2. A user calls your “SearchByCategory” endpoint for the first time
  3. The above code runs to get a list of matching Post entries
  4. RavenDB sees the request and says “Cool, there’s no Category index, let’s build one”
  5. The results returned are 0; because the index has just started building
  6. The user scratches their head
  7. They retry the request
  8. They continue getting strange stale results

Wait, what?! Well, because RavenDB has only just created the index, it’s not up to date. It’s now going to rebuild on the server, which for a very large collection can genuinely take hours. Because the index was created dynamically at runtime, users hitting that endpoint won’t receive up-to-date results until the rebuild completes.

So we have an issue where “hidden” new queries in the code can kick off expensive rebuild operations and start reducing the performance of your database, as a thread is taken up building the new index.

Solution? Never…use…dynamic…indexes. Always define them as a static index in code; always deploy them using IndexCreation.CreateIndexes(typeof(MyIndexClass).Assembly, documentStore);. That way you know when the indexes your application depends on to run have come online and finished rebuilding.
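One way to know when your deployed indexes have come online is to poll the database statistics after running IndexCreation; the StaleIndexes collection reports which indexes are still catching up. A rough sketch (the polling loop and delay are our own convention, not a RavenDB feature):

```csharp
using System.Linq;
using System.Threading;
using Raven.Client.Indexes;

// Deploy every static index defined in the assembly...
IndexCreation.CreateIndexes(typeof(Posts_ByTitle).Assembly, documentStore);

// ...then block until RavenDB reports no stale indexes remain
while (documentStore.DatabaseCommands.GetStatistics().StaleIndexes.Any())
{
    Thread.Sleep(1000); // still rebuilding; check again in a second
}
```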

The Ugly

Here’s the clincher for me and one of the issues that was detrimental to the RavenDB “image” in our organisation. Index deployment. The below piece is focused on RavenDB 2.5; we’ll revisit how RavenDB 3.0 improves on this at the end.

OK so you’ve been running RavenDB in production for 4 months now. Your collections are approaching 3-4 million documents and your main one has 5 indexes on it. For the purposes of this article let’s assume we have a single instance of RavenDB in production (In reality we had 2 load balanced instances which made this much easier as they could be isolated and updated in parallel without causing downtime).

A new requirement has come in for our “Post” collection that means we now need to filter by category as well as title. No biggie, let’s just update the static index we already have from:

public class Posts_Search : AbstractIndexCreationTask<Post>
{
    public Posts_Search()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title
                       };
    }
}

To:

public class Posts_Search : AbstractIndexCreationTask<Post>
{
    public Posts_Search()
    {
        Map = posts => from post in posts
                       select new
                       {
                           Title = post.Title,
                           Category = post.Category
                       };
    }
}

Then, as we deploy our indexes in the Global.asax -> Application_Start event using IndexCreation.CreateIndexes(typeof(MyIndexClass).Assembly, documentStore); we know this will be created when we first initialise the application.

So we get to deployment day; I deploy the project and the IIS application initialise process starts the app. RavenDB updates the index and boom; I have crippled the entire database. I do not joke.

You see in RavenDB 2.5 the following happens:

  1. The IndexCreation helper overwrites the “Posts_Search” index with the new definition
  2. This means the whole index has to rebuild from scratch (over millions of records this can take hours)
  3. Now whilst the index is rebuilding; any application code dependent on that index will return either no results or very stale results
  4. Also, and this is specific to RavenDB 2.5, none of your other indexes will update

That’s right: when 1 index is rebuilding from scratch, RavenDB 2.5 does not update your other indexes until that one reports it is up to date. So not only is any functionality dependent on the updated index now returning very stale data, all other parts of the system are affected and won’t show newly inserted information.

If you are running a single instance of RavenDB 2.5; there is no way around this. Sure you could write an index versioning system (we’ll get to that) however because any new index stops others from rebuilding, deploying a new one will degrade your user’s experience.

Here is where WaitForNonStaleResultsAsOfNow becomes truly dangerous; because in the above scenario you’re not just going to get stale data; if you use this query customisation your application will start to error all over the shop.

We were fortunate; we had 2 instances, so we wrote a large isolation procedure to take one box out of the load balanced set and update all of its indexes ready for the new version of the application, before failing over and updating the second box. This was, however, a relatively stressful process due to the time it took the indexes to rebuild and the manual nature of the isolation.

Now, I mentioned that RavenDB 3.0 improves on this, and it does, dramatically, to the point where the above point is almost moot. Hibernating Rhinos have done 2 things to change this process completely.

  • Indexes now rebuild in a round-robin fashion, so deploying a new index will no longer stop the other indexes from updating. This allows us to write a versioned index deployment system and push our changes prior to a release without having to isolate the VMs.
  • Indexes can now be updated using the “side by side” approach outlined here. This effectively provides the aforementioned versioning system for us, allowing us to keep the old index online whilst still building the new one in the background.
  • In fact it addresses this issue directly, so I’m happy to see such a relevant change to the process from Ayende and the team.

We started looking into an index versioning system with the aim that developers won’t have to configure it for each index (based on this great blog article here). The idea was to take index deployment out of the Application_Start event and instead use a separate program to deploy the indexes prior to the actual release. A hash was used to see if the index definition had changed and the hash was then applied to the index name for identification. With RavenDB 3.0 this is now a feasible approach with the round robin rebuilds; however the side-by-side rebuild takes care of the actual versioning so even less custom-rolled code is needed on our part.
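A minimal sketch of the hashing idea follows; this is our hypothetical deployment-tool logic, not a RavenDB feature, and the class and naming convention are illustrative:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using Raven.Client;
using Raven.Client.Indexes;

public static class VersionedIndexDeployer
{
    // Deploys the index under a name that embeds a hash of its definition,
    // so a changed definition becomes a brand new index rather than an
    // in-place overwrite that forces a full rebuild of the existing name.
    public static void Deploy(IDocumentStore store, AbstractIndexCreationTask task)
    {
        task.Conventions = store.Conventions;
        var definition = task.CreateIndexDefinition();

        // Hash the map so any change to the index body changes the name
        var hashBytes = SHA1.Create().ComputeHash(Encoding.UTF8.GetBytes(definition.Map));
        var hash = BitConverter.ToString(hashBytes).Replace("-", "").Substring(0, 8);
        var versionedName = task.IndexName + "/" + hash;

        // Only create the index if this exact version isn't deployed yet
        if (store.DatabaseCommands.GetIndex(versionedName) == null)
        {
            store.DatabaseCommands.PutIndex(versionedName, definition);
        }
    }
}
```

Old versions of each index would then need sweeping up separately once nothing queries them; with 3.0’s side-by-side updates, most of this machinery is no longer necessary.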

The End

Ultimately what a lot of this comes down to is really understanding your tool and the approach it uses. Most of us come from a SQL Server background so it is a massive change in mindset to use a CQRS system and the majority of the issues we came across could be solved by changing the behaviour of the product.

The only “unsolvable” issue was that of index deployment which RavenDB have thankfully fixed in 3.0.

Would I use RavenDB again? If I was writing a home project, absolutely; it’s got a good developer experience and gets out of your way quickly. If I was to recommend it for a large scale production system? I’d want the whole team to fully understand the caveats first and identify how they affect the planned functionality of the product.

Anything you disagree with? Maybe you want more details? Feel free to comment below.