Tuesday, October 16, 2012

CouchDB, Part 2 - Data Structure and JSON

CouchDB stores documents in a native Erlang format.  When it runs a view function, it first JSON-encodes the native Erlang object (using the encoder from the mochiweb framework) so that the JavaScript interpreter can parse it.  It runs the function on that document, then takes the output of emit and turns it back into a native Erlang object.
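To make the JavaScript half of that round trip concrete, here is a minimal sketch of a map function.  The emit stub and the emitted array are stand-ins I've added so the snippet runs outside CouchDB; in a real view, the view server supplies emit.

```javascript
// Stand-in for CouchDB's emit(); in a real view the view server provides it.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// A map function as it might appear in a design document.
function map(doc) {
  if (doc.type === "article") {
    emit(doc.uri, doc.title);
  }
}

// The document as the JavaScript interpreter sees it after the JSON encode step.
map({
  _id: "article_A9CC7889-B880-1CEE-2493-1B7605619241",
  type: "article",
  title: "Phelps Wins!",
  uri: "/news/phelps-wins"
});

console.log(JSON.stringify(emitted));
```

Whatever is passed to emit is what gets converted back into a native Erlang object.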

In this post, I will do my best to explain how a native Erlang object translates to JSON, and JSON to native Erlang.  This matters because you need to know the structure before you can get values out of non-trivial JSON objects.  Most examples only show how to get one level deep (proplists:get_value).

First, a short explanation of Erlang objects.  There are three things you need to know about how Erlang structures data: the tuple, the list, and the bit string.  See this reference manual for Erlang data types.

The Tuple
Tuples are defined as "Compound data type with a fixed number of terms."  You'll see later that they are very similar to Lists, except the number of terms is fixed.  As far as I can tell, when storing JSON objects in Erlang, Tuples are mostly used to represent a name/value pair.  The other use is to contain lists.

The List
Lists are defined as "Compound data type with a variable number of terms."  Lists are obviously very handy for representing arrays.  They are also used to hold all of the name/value pairs stored in Tuples.

Bit String
"A bit string is used to store an area of untyped memory."  For our purposes, what matters is that CouchDB stores all JSON strings as bit strings.  Bit strings look like this:  <<"foobar">>.  Empty strings look like either <<>> or <<"">>.  Bit strings actually have quite a few more uses, but for this post we'll just stick to using them for strings.

Putting it together
So how does a JSON object look in Erlang?  Let's start with a simple example.

In JavaScript:


{
_id: "article_A9CC7889-B880-1CEE-2493-1B7605619241",
_rev: "21-500f57210e2856f2bc24368555f67209",
type: "article",
title: "Phelps Wins!",
uri: "/news/phelps-wins"
}

This would look like this in Erlang:


{
[
{<<"_id">>, <<"article_A9CC7889-B880-1CEE-2493-1B7605619241">>},
{<<"_rev">>, <<"21-500f57210e2856f2bc24368555f67209">>},
{<<"type">>, <<"article">>},
{<<"title">>, <<"Phelps Wins!">>},
{<<"uri">>, <<"/news/phelps-wins">>}
]
}

Pretty simple so far.  What about something a bit more complicated, like an article with a list of tags:

JavaScript:


{
_id: "article_A9CC7889-B880-1CEE-2493-1B7605619241",
_rev: "21-500f57210e2856f2bc24368555f67209",
type: "article",
title: "Phelps Wins!",
uri: "/news/phelps-wins",
tags: [ "news", "olympics", "swim" ]
}

Erlang:
{[
{<<"_id">>, <<"article_A9CC7889-B880-1CEE-2493-1B7605619241">>},
{<<"_rev">>, <<"21-500f57210e2856f2bc24368555f67209">>},
{<<"type">>, <<"article">>},
{<<"title">>, <<"Phelps Wins!">>},
{<<"uri">>, <<"/news/phelps-wins">>},
{<<"tags">>, [<<"news">>, <<"olympics">>, <<"swim">>]}
]}


And with a property that has another object as a value:

JavaScript:

{
_id: "article_A9CC7889-B880-1CEE-2493-1B7605619241",
_rev: "21-500f57210e2856f2bc24368555f67209",
type: "article",
title: "Phelps Wins!",
uri: "/news/phelps-wins",
tags: [ "news", "olympics", "swim" ],
metas: {
pubDate: "2012-10-02T15:11:34Z",
author: "asmith",
description: "Phelps wins again in the Olympics"
}
}

Erlang:
{
[
{<<"_id">>, <<"article_A9CC7889-B880-1CEE-2493-1B7605619241">>},
{<<"_rev">>, <<"21-500f57210e2856f2bc24368555f67209">>},
{<<"type">>, <<"article">>},
{<<"title">>, <<"Phelps Wins!">>},
{<<"uri">>, <<"/news/phelps-wins">>},
{<<"tags">>, [<<"news">>, <<"olympics">>, <<"swim">>]},
{<<"metas">>, {[
{<<"pubDate">>, <<"2012-10-02T15:11:34Z">> },
{<<"author">>, <<"asmith">> },
{<<"description">>, <<"Phelps wins again in the Olympics">> }
]}}
]
}
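For contrast, here is a rough sketch of how a JavaScript view would reach into a document shaped like this one, both into the array and into the nested object.  The mapByTag name, the rows array, and the emit stub are my own placeholders for illustration; only emit(key, value) itself comes from CouchDB.

```javascript
// Stand-in for CouchDB's emit(); the view server provides the real one.
const rows = [];
function emit(key, value) {
  rows.push([key, value]);
}

// Hypothetical map function: one row per tag, carrying the nested publication date.
function mapByTag(doc) {
  // Guard against documents that are not articles or have no tags array.
  if (doc.type !== "article" || !Array.isArray(doc.tags)) {
    return;
  }
  // metas is a nested object, so check that it exists before reading from it.
  const pubDate = doc.metas ? doc.metas.pubDate : null;
  doc.tags.forEach(function (tag) {
    emit(tag, pubDate);
  });
}

mapByTag({
  type: "article",
  tags: ["news", "olympics", "swim"],
  metas: { pubDate: "2012-10-02T15:11:34Z", author: "asmith" }
});

console.log(JSON.stringify(rows));
```

In JavaScript the nesting is just property access.  In Erlang you have to walk the tuples and lists yourself, which is why the structure above is worth understanding.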

There are a few simple rules when going from JSON to an Erlang object.  First, the entire structure is a tuple containing a single list.  Each item in that list is a tuple with 2 items.  The first is always a bit string (the property name), and the second (the value) is, in the examples here, a bit string, a list, or a tuple.  If it is a list, it represents an array, such as the list of bit strings in the tags example.  If it is a tuple, it represents a nested object: the tuple contains a list, and each item in that list is again a tuple with 2 items.  Wash, rinse, repeat.
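Those rules are mechanical enough to write down as code.  Below is a small sketch in JavaScript that prints the Erlang form of a JSON value.  toErlangTerm is a hypothetical helper of my own, not something CouchDB provides, and it only handles the three value types used in this post (strings, arrays, and objects).

```javascript
// Hypothetical helper (not part of CouchDB): renders a JSON value as the
// Erlang term text described by the rules above.
function toErlangTerm(value) {
  if (typeof value === "string") {
    // JSON.stringify adds the quotes and escapes; good enough for plain strings.
    return "<<" + JSON.stringify(value) + ">>";
  }
  if (Array.isArray(value)) {
    // An array becomes a list of converted values.
    return "[" + value.map(toErlangTerm).join(", ") + "]";
  }
  if (value !== null && typeof value === "object") {
    // An object becomes a tuple holding a list of {Name, Value} tuples.
    const pairs = Object.keys(value).map(function (name) {
      return "{" + toErlangTerm(name) + ", " + toErlangTerm(value[name]) + "}";
    });
    return "{[" + pairs.join(", ") + "]}";
  }
  // Anything else is outside the scope of this sketch.
  throw new TypeError("only strings, arrays, and objects are handled here");
}

const article = {
  type: "article",
  tags: ["news", "olympics"],
  metas: { author: "asmith" }
};

console.log(toErlangTerm(article));
```

The recursion in the object branch is the "wash, rinse, repeat" step: every nested object goes through the same tuple-containing-a-list wrapping.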


One more tip: read your CouchDB log.  It is extremely verbose and will dump the object that the view failed on.  This helps greatly when trying to figure out how the object is structured.

In my next post, I'll show you some patterns I've used in Erlang to get data out of non-trivial objects, as well as some ways to protect your view against bad data.

