Friday, August 3, 2012

CouchDB, Writing views in Erlang, Part 1

How it started: benchmarking Erlang against JavaScript

It all started when we needed to write some views against CouchDB and were told that reindex jobs had been taking down our production sites.  I had always heard that writing those views in Erlang would result in faster re-index times, but had never seen any numbers on just how much faster they would be.  So I decided to test this theory.  I wrote the same view in two different design documents, one in JavaScript and one in Erlang.  The view was simple: it just output all documents of type "article" as key="publishDate", value=null.

The results were impressive.  I called each view right after an update (forcing a re-index) against around 12,000 documents, of which about 6,000 were returned, and recorded the total time from making the call to getting the results back.  The JavaScript version took approximately 45 seconds each time I ran it with a re-index.  The Erlang version took only 10 seconds, roughly 4.5 times faster.  This is mighty compelling.

For the next few posts in this series, I will walk through a few examples of how to write CouchDB views in Erlang.  It is not easy to find information on how to do this beyond the easy use cases (by easy, I mean cases where the JSON document is relatively flat, with no recursion or iteration).  In future posts I will explore more complicated examples.

The how:

The first step is to enable Erlang in your CouchDB settings.  Here is how to Enable Erlang Views in CouchDB.  You'll need to add this to your local.ini and restart CouchDB:
[native_query_servers]
erlang = {couch_native_process, start_link, []}
Once that is added, the drop-down in your temporary view creator will include Erlang as an option.

Writing a View

First let me thank some other blog posts for getting me started.  I found great information in the Echo Libre Blog on writing a view in Erlang.  I like the compacted version better, and with time something like that becomes easier to read and understand.  For an idea of how JSON documents are stored in CouchDB, check out this Stack Overflow question on tuples in CouchDB, especially the answer from Jan Lehnardt.  Starting CouchDB in interactive mode is gold.  Pure gold.  Here's another great one: the Stack Overflow question on translating views from JavaScript to Erlang.  The answer by Dustin is a great example of how to write a view in Erlang.

Also, don't forget to use your couchdb log.  I usually have a tail on that log while I'm developing these views.  The log is located under
var/log/couchdb
I'm running Couchbase Single Server on a Mac, where the log is buried under
/Applications/Couchbase Single Server.app/Contents/Resources/couchdbx-core/var/log/couchdb
Let's write that view I mentioned before, where we want to just output publishDate as the key.  Here's the json:
{
  "_id" : "d7accdc6-cefb-420b-aa30-96d3597d3b91",
  "publishDate" : "2011-06-28T17:04:16Z",
  "status" : "Published"
}
and here's the code:
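fun({Doc}) ->
  %% Look up publishDate; fall back to <<>> when the key is missing
  case proplists:get_value(<<"publishDate">>, Doc, <<>>) of
    %% Missing or empty publishDate: emit nothing
    <<>> -> ok;
    %% Anything else: emit it as the key with a null value
    Pubdate ->
      Emit(Pubdate, null);
    %% Never reached here, since Pubdate above matches any value
    _ -> ok
  end
end.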
Walking through this easy example: we pull the publishDate value out of the incoming document and feed it into a case expression.  The last argument to proplists:get_value, "<<>>", is an empty binary (CouchDB hands JSON strings to Erlang as binaries, so this is effectively an empty string), and it is the default that gets returned if there is no key named "publishDate".  I like to do this so that I can test for both an empty value and a non-existent key the same way.  Otherwise I would need to handle both cases in my matches (both cases being undefined and empty string).
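To see the default argument at work, here is a small sketch you can paste into a fresh Erlang shell (erl).  It assumes the sample document above arrives the way a map function receives it, as a one-element tuple wrapping a list of {Key, Value} pairs.  The "author" key is made up for illustration and is not part of the sample document.

```erlang
%% The sample document as the map function's {Doc} pattern sees it:
%% keys and string values are binaries.
{Doc} = {[{<<"_id">>, <<"d7accdc6-cefb-420b-aa30-96d3597d3b91">>},
          {<<"publishDate">>, <<"2011-06-28T17:04:16Z">>},
          {<<"status">>, <<"Published">>}]}.

%% Key present: the stored value comes back.
<<"2011-06-28T17:04:16Z">> = proplists:get_value(<<"publishDate">>, Doc, <<>>).

%% Key absent and no default given: the atom undefined comes back.
undefined = proplists:get_value(<<"author">>, Doc).

%% Key absent with a default of <<>>: indistinguishable from a key that
%% is present but empty, so one case clause covers both.
<<>> = proplists:get_value(<<"author">>, Doc, <<>>).
<<>> = proplists:get_value(<<"author">>, [{<<"author">>, <<>>}], <<>>).
```

Each line is a pattern match, so the shell will complain with a badmatch if any of these assumptions is wrong.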

The case clauses are evaluated in order, and the first one to match is the one that gets used.  In this example I match on an empty string first and do nothing if matched.  In the second clause, I bind the value to a variable (Pubdate) because I want to use it later.  In this match, I call Emit on the value, which returns a row.

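The last match is just a catch-all.  Strictly speaking, it is never reached in this example, because the unbound variable Pubdate in the second clause already matches any value; it starts to matter once the patterns are more specific, as in the next example.  Erlang tends to get persnickety about pattern matches (a case with no matching clause raises a case_clause error), and I don't want to blow up the whole view on something unexpected.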
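To make the persnickety part concrete, here is a minimal sketch for a fresh Erlang shell.  Strict and Lenient are hypothetical helper funs, not part of any view, and the atom emit_row is just a placeholder for "a row would be emitted here".

```erlang
%% No catch-all: only one status is handled.
Strict = fun(Status) ->
  case Status of
    <<"Published">> -> emit_row
  end
end.

emit_row = Strict(<<"Published">>).

%% Any other value raises a case_clause error carrying that value.
{'EXIT', {{case_clause, <<"Draft">>}, _}} = (catch Strict(<<"Draft">>)).

%% With a catch-all, the unexpected value is quietly skipped instead.
Lenient = fun(Status) ->
  case Status of
    <<"Published">> -> emit_row;
    _ -> ok
  end
end.

ok = Lenient(<<"Draft">>).
```

In a view, that trailing clause means an unexpected document simply produces no row rather than an error.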

Let's get a little more complicated.  Say we want to check the "status" key in our json for a value of "Published" before we output the row.
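fun({Doc}) ->
  %% Pair up status and publishDate; each falls back to <<>> if missing
  case {
    proplists:get_value(<<"status">>, Doc, <<>>),
    proplists:get_value(<<"publishDate">>, Doc, <<>>)
  } of
    %% Any status, but no publishDate: emit nothing
    {_, <<>>} -> ok;
    %% Published with a publishDate: emit the date as the key
    {<<"Published">>, Pubdate} ->
      Emit(Pubdate, null);
    %% Everything else, including any other status: emit nothing
    _ -> ok
  end
end.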
A couple of changes here.  The first is that our case now works on a tuple instead of a single value.  Keep in mind, a tuple is a fixed-size group of values.  Erlang uses them a lot to represent key/value pairs (more on that in a later post).  The second is that, because pattern matching in Erlang works against compound types like this, each of our case clauses now checks the values inside the tuple.

The first match checks for a status of anything and an empty publishDate.  Remember, we still want to throw those away.  The second match catches any document that has a status == "Published" and captures the publishDate.  The last case again just throws out anything that doesn't match the first two, including rows that have a status of something other than "Published".
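Here is a rough way to try those three clauses outside CouchDB, in a fresh Erlang shell.  The Emit below is a hypothetical stand-in that just returns the row it is given (CouchDB supplies the real one when it runs a map function), and the three documents are made up for the test.

```erlang
%% Stand-in for CouchDB's Emit: hand the row back so we can inspect it.
Emit = fun(Key, Value) -> {row, Key, Value} end.

%% The view from above, bound to a name so we can call it.
Map = fun({Doc}) ->
  case {
    proplists:get_value(<<"status">>, Doc, <<>>),
    proplists:get_value(<<"publishDate">>, Doc, <<>>)
  } of
    {_, <<>>} -> ok;
    {<<"Published">>, Pubdate} -> Emit(Pubdate, null);
    _ -> ok
  end
end.

%% Published with a date, draft with a date, published without a date.
Docs = [
  {[{<<"status">>, <<"Published">>},
    {<<"publishDate">>, <<"2011-06-28T17:04:16Z">>}]},
  {[{<<"status">>, <<"Draft">>},
    {<<"publishDate">>, <<"2011-07-01T09:00:00Z">>}]},
  {[{<<"status">>, <<"Published">>}]}
].

%% Only the first document produces a row; the other two fall through to ok.
Results = [Map(D) || D <- Docs].
[{row, <<"2011-06-28T17:04:16Z">>, null}, ok, ok] = Results.
```

The draft is rejected by the catch-all, and the published document without a date is rejected by the first clause, which is why the order of the clauses matters.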

That's it for part 1 of writing views in Erlang.  Check back for Part 2 where I'll explore more complicated cases such as getting values deeper than one level in the json document, using regular expressions to extract values from a string, and emitting more than one row for each record.  The last one comes in handy when trying to extract multiple tags in a document into distinct rows.
