It’s official: graph databases are a thing. That’s the consensus here on Big on Data among fellow contributors Andrew Brust and Tony Baer. When AWS enters a domain, it officially signals the upward slope of the hype cycle. It’s a bit like newfound land – first it’s largely unknown and inhabited by natives, then the pioneers show there are opportunities, then the heavyweights will try to colonize it.
The recent unveiling of AWS Neptune seems to have convinced even once self-proclaimed graph skeptics such as Brust and Baer. Why now, you ask? Much like Machine Learning for example, it’s not so much that there is a major breakthrough in the technology, rather it’s mostly a matter of maturation.
Several developments have contributed to the perfect graph storm: cheap storage and processing capacity in the cloud and on premises, an understanding of the challenges in distributed indexing and querying of graphs, and the realization that datasets are now big and connected enough.
As always when some hitherto niche technology goes mainstream, there is the risk of misunderstanding or overhyping it to the point where it becomes a meaningless buzzword, a hammer applied to every problem regardless of whether it’s a nail. So, fair warning: if something does not feel like a graph then don’t try to force it to be.
Your videos are probably doing quite nicely living in the object store where you currently have them. A sales ledger system built using a relational database is probably doing just fine where it is and likewise a document store is quite possibly just the right place to be storing your documents. So “use the right tool for the job” remains as valid a phrase here as elsewhere.
That said, part of the reason behind graph’s appeal is that in many cases, it’s a natural way to model the world. More natural than the good old relational model? For certain domains and use cases, when the data you are storing is intrinsically linked by its nature, yes. For one thing, querying a graph database certainly feels easier, and performs better, than querying a relational one for use cases involving many hops.
Having to go through a series of joins in relational algebra to do things such as finding friends of friends of friends is cumbersome to write and maintain and degrades performance. A graph model and query language can be more natural and efficient — but the key word in there is “can”. Not everything that looks like a graph is in fact a graph, and not all graphs come with the same querying facilities.
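To make the friends-of-friends-of-friends case concrete, here is a minimal sketch in plain Python. It is not tied to any graph database: the names, the sample data and the helper functions are all made up for illustration. In a relational schema, each hop below would typically be another self-join on a friendships table; here each hop is a lookup in an adjacency index.

```python
from collections import defaultdict

# Hypothetical friendship data; the names are illustrative only.
FRIENDSHIPS = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
    ("erin", "carol"),
]


def build_adjacency(edges):
    """Index undirected edges so each person's friends are one lookup away."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def people_at_hops(adjacency, start, hops):
    """Return the people whose shortest path from `start` is exactly `hops` edges."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        next_frontier = set()
        for person in frontier:
            # Follow each friendship edge, skipping people already reached.
            next_frontier |= adjacency.get(person, set()) - seen
        seen |= next_frontier
        frontier = next_frontier
    return frontier


adjacency = build_adjacency(FRIENDSHIPS)
print(sorted(people_at_hops(adjacency, "alice", 3)))  # ['dave']
```

Going one hop further means changing a number, not adding another join. That is the convenience a graph model can offer, provided the data really is shaped like this.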
To quote Tony Baer: “I always felt graph was better suited being embedded under the hood because it was a strange new database without standards de facto or otherwise. But I’m starting to change my tune — every major data platform provider now has either a graph database or API/engine”. This highlights two important points: the difference between a native graph and a graph API, and the lack of standards.
Different people will use different definitions of engines and APIs, but in the end it’s all about data structures. If your database relies on data structures that are not a natural fit for a graph, and does not have all the right indexing in place, then although your queries may be easier to write using a graph API on top of it, their performance can only be as good as your database.
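A toy sketch can illustrate the point about data structures. Both backends below are hypothetical and exist only for this example: they expose the same neighbors call, but one keeps edges as unindexed rows while the other groups them by source node. The counters are a rough stand-in for work done, not a benchmark of any real product.

```python
class EdgeTableBackend:
    """Edges kept as unindexed rows; every lookup scans the whole table."""

    def __init__(self, edges):
        self.rows = list(edges)
        self.rows_examined = 0

    def neighbors(self, node):
        result = []
        for source, target in self.rows:
            self.rows_examined += 1
            if source == node:
                result.append(target)
        return result


class AdjacencyBackend:
    """Edges grouped by source node; a lookup touches only that node's edges."""

    def __init__(self, edges):
        self.index = {}
        for source, target in edges:
            self.index.setdefault(source, []).append(target)
        self.rows_examined = 0

    def neighbors(self, node):
        targets = self.index.get(node, [])
        self.rows_examined += len(targets)
        return list(targets)


def two_hops(backend, start):
    """The same 'graph API' query, whatever the storage underneath."""
    found = set()
    for middle in backend.neighbors(start):
        found.update(backend.neighbors(middle))
    return found


# Hypothetical directed edges, for illustration only.
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("c", "e"), ("d", "f")]

table = EdgeTableBackend(EDGES)
adjacency = AdjacencyBackend(EDGES)
print(sorted(two_hops(table, "a")), table.rows_examined)          # ['d', 'e'] 18
print(sorted(two_hops(adjacency, "a")), adjacency.rows_examined)  # ['d', 'e'] 5
```

The query reads the same in both cases and returns the same answer; only the amount of data touched differs, and that gap widens as the edge table grows.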
To give an example from the Microsoft world, quoting Andrew Brust: “The graph processing capabilities in SQL Server 2017 are clearly an abstraction layer and not native. Although node and edge table types are a real thing. But what about Cosmos DB? Graph is just one mode of operation, but I would still consider it native”.
This criterion is important, but it is not the only one, and making sense of a nascent market while factoring everything in is not something that can be done in the context of a single article. You can expect more extensive work on this from us in the near future, but if you still want a 10-minute version of the Graph Database Landscape, you can see the one written by Yu Hu, TigerGraph’s CEO, in addition to our past coverage.
TigerGraph is among the graph databases we covered in 2017, alongside AllegroGraph, GraphDB and Neo4j. GraphDB and Neo4j are also among the graph databases officially listed on offer on AWS – although that does not mean that other graph databases cannot be deployed on AWS.
That’s the usual co-opetition scenario cloud providers and vendors have learned to live with, although in this case things may be more complicated than usual.
The other two graph databases given the official nod of support by AWS are JanusGraph and OrientDB. JanusGraph used to be Titan: after Titan’s parent company Aurelius was acquired by DataStax, Titan was forked as JanusGraph, which is now backed by IBM, also a cloud provider. OrientDB was recently acquired too, by enterprise software vendor CallidusCloud.
Unboxing AWS Neptune
As for AWS Neptune (still in private beta), although we don’t expect to see an awful lot of disclosure in terms of internal workings, we can note a couple of things.
As Tony Baer recently wrote, cloud storage becomes the de facto storage lake. On AWS, people use S3, and by now even have SQL querying facilities for it. Could AWS have built Neptune directly on S3, and would that make sense?
We don’t really know, but probably not. AWS speaks of the ability for continuous backup from Neptune to S3, which is quite telling. If S3 were the storage used for Neptune, S3 backups would be pointless, as the data would already be on S3 and all it would take would be to enable replication. But there is another hint there.
AWS is selling the option to use JanusGraph with Amazon DynamoDB as its storage backend. DynamoDB is a key-value database, and the key-value metaphor and structure lend themselves well to graph. Key-value stores are in fact what Titan and now JanusGraph use as back-end stores for their graphs, so it would make sense for AWS to have built Neptune on DynamoDB.
To return to the Big on Data contributor graph showdown and quote Andrew Brust, “in the database world, everything comes down to key-value pairs. So if you have a database with that as the core construct, you have the potential to do almost anything you want. Although, out of the box, you may not be able to do much”.
So could it be that AWS Neptune really is an elaborate layer over DynamoDB that adds a graph metaphor and API to an underlying key-value store? That may sound like oversimplifying, but it seems plausible.
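To show what a graph metaphor over a key-value store could look like in principle, here is a deliberately naive sketch. Everything in it is an assumption made for illustration: the KeyValueStore class is an in-memory stand-in rather than DynamoDB, and the key layout is invented. It says nothing about how Neptune, Titan or JanusGraph actually lay out their data.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value database: only get and put by key."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class GraphLayer:
    """Hypothetical graph API that stores vertices and edges as key-value pairs."""

    def __init__(self, store):
        self.store = store

    def add_vertex(self, vertex_id, **properties):
        # One key per vertex holds its properties.
        self.store.put(("v", vertex_id), dict(properties))

    def add_edge(self, source, label, target):
        # One key per vertex holds its outgoing edges (an adjacency list).
        edges = list(self.store.get(("out", source), []))
        edges.append((label, target))
        self.store.put(("out", source), edges)

    def out(self, vertex_id, label=None):
        """Follow outgoing edges from a vertex, optionally filtered by label."""
        edges = self.store.get(("out", vertex_id), [])
        return [target for (edge_label, target) in edges
                if label is None or edge_label == label]


graph = GraphLayer(KeyValueStore())
graph.add_vertex("andrew", role="contributor")
graph.add_vertex("tony", role="contributor")
graph.add_vertex("big_on_data", kind="column")
graph.add_edge("andrew", "writes_for", "big_on_data")
graph.add_edge("tony", "writes_for", "big_on_data")
graph.add_edge("andrew", "knows", "tony")

print(graph.out("andrew"))           # ['big_on_data', 'tony']
print(graph.out("andrew", "knows"))  # ['tony']
```

The hard parts of a real system (distribution, indexing, transactions, query planning) are exactly what this sketch leaves out, which is where the efficiency question comes in.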
One could argue that Titan and its offspring, JanusGraph and DSE Graph, are similar in nature, and AWS makes a point of emphasizing how Titan’s pluggable architecture makes it easy to start using DynamoDB without changing applications. But how efficient is that?
We don’t really have indicators of AWS Neptune’s performance at this point, although as one would expect AWS waxes lyrical about it and other vendors are quick to point to all the ins and outs of making distributed graphs work that AWS could possibly get wrong.
The fact however is that AWS is not really in the business of getting it wrong, and its sheer gravity makes it a force to be reckoned with. This is what the CEO of Neo4j, the leading graph database in market adoption currently, is saying. Other graph vendors are also acknowledging the fact that their market seems set to grow significantly, and preparing to fight in the face of increased competition.
Standards, too many or none
What we do know about AWS Neptune brings us to the second important point, standards: Neptune supports the popular graph query languages Apache TinkerPop Gremlin and W3C’s SPARQL, allowing users to easily build queries that efficiently navigate highly connected datasets.
In a world that seems to be lacking the equivalent of what SQL is in the relational world — a de facto standard for querying — this is pretty important. It means that Neptune offers maximum flexibility for its users, and it’s a move that is both smart and pragmatic from AWS.
In graph, there are competing models and query languages, and offering the ability to query Neptune using two of the most popular ones widens Neptune’s potential user base and use cases. AWS is not alone in this, but being vocal about it and making it easy to use could make a difference.
We have already covered SPARQL and some of the things it can be used for. In the next part of this mini-series on graph, we are going to focus on Apache TinkerPop, its querying language called Gremlin, its features and the role it can play in the graph database world and beyond.