What is a Graph?
It is merely a way of structuring data using vertices (or nodes) and edges. Data is stored both at the vertices and on the edges themselves. Edges can be bi or uni directional. The flexibility comes from the fact that any edge and point to any node and many times, the information that we seek is related to the configuration of the graph, not just the data in the edges and nodes. In the graph above, we can see that Alice and Bob are both members of the Chess group and that they both know each other.
In the example, we use nouns as nodes and the verbs: knows, and is member as edges. This is a very common way of modeling natural language statements as graphs. Let's take a look at some fake tweets that I have contrived for an example:
@NikeFan - I love my Nike's. They are the best shoes.
@ShoeShopper - I just bought some Nikes, some Reeboks, and a Swiss Air.
@ReebokFan - I just hate Nike, their new lineup is horrible.
After some text analysis we could derive this graph:
As you can see, the edge "is" can be applied to either Shoe or Twitter users. This makes the graph easily extended as our knowledge increases. You can also see that if I want to find twitter users who have bought shoes, I would look for any node that "is" a shoe and see what nodes "bought" it while ensuring that node "is" also a Twitter User. This is where the graph database comes in. These types of databases are optimized for searching and traversing over relationships to resolve your queries. A traditional RDBMS could represent a graph as just a few tables, however resolving a graph query would take ages as each step in the search would involve scanning the same set of tables over and over. Then as your graph grows, your query times grow linearly or worse. With a graphing DB, there are many opportunities to limit where we search and enable the search to be done in parallel.
The Holy Grail - Graphing Databases on Hadoop
You would think that with all the power Hadoop provides, there should be at least one graph db on it, but at this time there are none. I think there are valid technical reasons for it. For one, Hadoop has a high latency. Even a one second response time would be unreasonable for simple queries and I doubt any hadoop job could be resolved in less than one sec. Part of the reason is that the inputs and outputs of a map reduce job are files and the underlying hadoop file system (HDFS) will log, replicate, and checkpoint the data to hard disk as the job runs. If we had a sort of Hadoop Lite system where we keep our partial data results only in memory, then we could reduce some of the latency. If a node fails, then we simply restart the computation from the point where we have data on disk, maybe even distributing that across the nodes that have already finished. The second issue is hadoop doesn't like the data elements that you process to be dependent on each other. With a graph query, that is very hard to do. Even if you could partition each node with disconnected graphs, they can still be easily connected in the future necessitating a costly reshuffle of the data. Worse yet, is that in time, graphs tend to be more connected, so eventually you'll have only one partition, not a very scalable option. The only solution I can come up is to allow the graphs to traverse nodes and when a query hits a border between machines, issue a new query that is fed back into another map reduce round. This continues until a solution is found. This means that if you have a solution that traverses through multiple machines, you would have to run multiple rounds of map reduce making the query super slow.
In the end, I expect we'll have to settle for a different platform other than hadoop to support graphing databases. We could still use hadoop to do inserts or query into a graph database, but as far as running the db on hadoop, that's not gonna happen in the foreseeable future.