AWS News Blog
Amazon Neptune – A Fully Managed Graph Database Service
Of all the data structures and algorithms we use to enable our modern lives, graphs are changing the world every day. Businesses continuously create and ingest rich data with complex relationships. Yet developers are still forced to model these complex relationships in traditional databases. This leads to frustratingly complex queries with high costs and increasingly poor performance as you add relationships. We want to make it easy for you to deal with these modern and increasingly complex datasets, relationships, and patterns.
Hello, Amazon Neptune
Today we’re launching a limited preview of Amazon Neptune, a fast and reliable graph database service that makes it easy to gain insights from relationships among your highly connected datasets. The core of Amazon Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds of latency. Delivered as a fully managed database, Amazon Neptune frees customers to focus on their applications rather than tedious undifferentiated operations like maintenance, patching, backups, and restores. The service supports fast failover, point-in-time recovery, and Multi-AZ deployments for high availability. With support for up to 15 read replicas, you can scale query throughput to hundreds of thousands of queries per second. Amazon Neptune runs within your Amazon Virtual Private Cloud and allows you to encrypt your data at rest, giving you complete control over your data integrity in transit and at rest.
There are a lot of interesting features in this service, but graph databases may be an unfamiliar topic for many of you, so let’s make sure we’re using the same vocabulary.
Graph Databases
A graph database is a store of vertices (nodes) and edges (relationships or connections), both of which can have properties stored as key-value pairs. Graph databases are useful for connected, contextual, relationship-driven data. Some example applications are social media networks, recommendation engines, driving directions, logistics, diagnostics, fraud detection, and genomic sequencing.
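To make that vocabulary concrete, here is a minimal in-memory sketch of a property graph in Python. The `Vertex`, `Edge`, and `PropertyGraph` names and the sample data are invented for illustration; this says nothing about how Neptune itself stores data.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Edge:
    id: str
    label: str
    from_id: str
    to_id: str
    properties: dict = field(default_factory=dict)  # edges can carry properties too


class PropertyGraph:
    """Toy container for illustration only; not how Neptune stores data."""

    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, vertex):
        self.vertices[vertex.id] = vertex

    def add_edge(self, edge):
        self.edges[edge.id] = edge

    def out_neighbors(self, vertex_id, label):
        """Ids of vertices reached by following outgoing edges with this label."""
        return [e.to_id for e in self.edges.values()
                if e.from_id == vertex_id and e.label == label]


# Invented sample data: two people and one relationship between them.
graph = PropertyGraph()
graph.add_vertex(Vertex("alice", "Person", {"name": "Alice"}))
graph.add_vertex(Vertex("bob", "Person", {"name": "Bob"}))
graph.add_edge(Edge("e1", "knows", "alice", "bob", {"since": 2015}))
print(graph.out_neighbors("alice", "knows"))
```

Both the vertices and the edge carry their own key-value properties here, which is the feature that separates a property graph from a plain adjacency list.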
Amazon Neptune supports two open standards for describing and querying your graphs:
- Apache TinkerPop3 style Property Graphs queried with Gremlin. Gremlin is a graph traversal language where a query is a traversal made up of discrete steps following an edge to a node. Your existing tools and clients that are designed to work with TinkerPop allow you to quickly get started with Neptune.
- Resource Description Framework (RDF) queried with SPARQL. SPARQL is a declarative language based on Semantic Web standards from W3C. It follows a subject->predicate->object model. Specifically, Neptune supports the following standards: RDF 1.1, SPARQL Query 1.1, SPARQL Update 1.1, and the SPARQL Protocol 1.1.
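The subject->predicate->object model can be sketched in a few lines of Python. This is only a toy illustration of matching a triple pattern, not a SPARQL engine; the `ex:` names and the `match` helper are made up for the example.

```python
# Invented triples; the "ex:" names are placeholders, not a real vocabulary.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:worksAt", "ex:acme"),
    ("ex:bob", "ex:knows", "ex:carol"),
]


def match(pattern, triples=TRIPLES):
    """Return triples that fit the pattern; None plays the role of a variable."""
    return [
        triple for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# Roughly the shape of: select ?o where { ex:alice ex:knows ?o }
print([o for (_, _, o) in match(("ex:alice", "ex:knows", None))])
```

A pattern made entirely of `None` values returns every triple, which is the same idea as the `?s ?p ?o` query used later in this post.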
If you have existing applications that work with SPARQL or TinkerPop you should be able to start using Neptune by simply updating the endpoint your applications connect to.
Let’s walk through launching Amazon Neptune.
Launching Amazon Neptune
Start by navigating to the Neptune console, then click “Launch Neptune” to start the launch wizard.
On this first screen you simply name your instance and select an instance type. Next we configure the advanced options. Many of these may look familiar to you if you’ve launched an instance-based AWS database service before, like Amazon Relational Database Service (RDS) or Amazon ElastiCache.
Amazon Neptune runs securely in your VPC and can create its own security group that you can add your EC2 instances to for easy access.
Next, we are able to configure some additional options like the parameter group, port, and a cluster name.
On this next screen we can enable KMS-based encryption at rest and set the failover priority and a backup retention time.
As with RDS, maintenance of the database can be handled by the service.
Once the instances are done provisioning you can find your connection endpoint on the Details page of the cluster. In my case it’s triton.cae1ofmxxhy7.us-east-1.rds.amazonaws.com.
Using Amazon Neptune
As stated above there are two different query engines that you can use with Amazon Neptune.
To connect to the Gremlin endpoint, append /gremlin to your cluster endpoint and do something like:
curl -X POST -d '{"gremlin":"g.V()"}' https://your-neptune-endpoint:8182/gremlin
You can similarly connect to the SPARQL endpoint by appending /sparql:
curl -G https://your-neptune-endpoint:8182/sparql --data-urlencode 'query=select ?s ?p ?o where {?s ?p ?o}'
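If you would rather call these endpoints from code, the two curl commands translate fairly directly to Python's standard library. The sketch below only builds the requests and does not send them; the endpoint is the same placeholder used in the curl examples, and the `gremlin_request` and `sparql_request` helper names are made up here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://your-neptune-endpoint:8182"  # placeholder from the curl examples


def gremlin_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying a Gremlin query as JSON."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(endpoint + "/gremlin", data=body, method="POST")


def sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request with the SPARQL query URL-encoded."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(endpoint + "/sparql?" + params, method="GET")


gremlin = gremlin_request("g.V()")
sparql = sparql_request("select ?s ?p ?o where {?s ?p ?o}")
print(gremlin.get_method(), gremlin.full_url)
print(sparql.get_method(), sparql.full_url)
```

Passing either request to `urllib.request.urlopen` from a machine that can reach the cluster inside your VPC would send it.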
Before we can query data we need to populate our database. Let’s imagine we’re modeling AWS re:Invent and use the bulk loading API to insert some data.
For Property Graph, Neptune supports CSVs stored in Amazon Simple Storage Service (S3) for loading nodes, node properties, edges, and edge properties.
A typical CSV for vertices looks like this:
~label,name,email,title,~id
Attendee,George Harrison,george@thebeatles.com,Lead Guitarist,1
Attendee,John Lennon,john@thebeatles.com,Guitarist,2
Attendee,Paul McCartney,paul@thebeatles.com,Lead Vocalist,3
The edges CSV looks something like this:
~label,~from,~to,~id
attends,2,ARC307,attends22
attends,3,SRV422,attends27
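Files in this shape are easy to generate with Python's `csv` module. The following is a small sketch that reproduces the header and a few of the rows shown above; the `to_csv` helper is a hypothetical name, and uploading the result to S3 is left out.

```python
import csv
import io

VERTEX_COLUMNS = ["~label", "name", "email", "title", "~id"]
EDGE_COLUMNS = ["~label", "~from", "~to", "~id"]

attendees = [
    {"~label": "Attendee", "name": "George Harrison",
     "email": "george@thebeatles.com", "title": "Lead Guitarist", "~id": "1"},
    {"~label": "Attendee", "name": "John Lennon",
     "email": "john@thebeatles.com", "title": "Guitarist", "~id": "2"},
]
attendance = [
    {"~label": "attends", "~from": "2", "~to": "ARC307", "~id": "attends22"},
]


def to_csv(rows, columns):
    """Render dict rows as CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(attendees, VERTEX_COLUMNS))
print(to_csv(attendance, EDGE_COLUMNS))
```

Using `csv.DictWriter` rather than joining strings by hand means values containing commas or quotes get escaped correctly.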
Now to load a similarly structured CSV into Neptune we run something like this:
curl -H 'Content-Type: application/json' \
https://neptune-endpoint:8182/loader -d '
{
"source": "s3://super-secret-reinvent-data/vertex.csv",
"format": "csv",
"region": "us-east-1",
"accessKey": "AKIATHESEARENOTREAL",
"secretKey": "ThEseARE+AlsoNotRea1K3YSl0l1234coVFefE12"
}'
Which would return:
{
"status" : "200 OK",
"payload" : {
"loadId" : "2cafaa88-5cce-43c9-89cd-c1e68f4d0f53"
}
}
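In a script, the request body and the returned load ID could be handled along these lines. This is a sketch that assumes the request and response keep the shape shown above; `loader_request_body` and `extract_load_id` are made-up helper names, and no network call is made.

```python
import json


def loader_request_body(source, region, access_key, secret_key, data_format="csv"):
    """Serialize the JSON body with the same fields as the curl example."""
    return json.dumps({
        "source": source,
        "format": data_format,
        "region": region,
        "accessKey": access_key,
        "secretKey": secret_key,
    })


def extract_load_id(response_text):
    """Pull payload.loadId out of a loader response shaped like the one above."""
    return json.loads(response_text)["payload"]["loadId"]


# The response from the example above, pasted in as a string.
sample_response = '''
{
  "status" : "200 OK",
  "payload" : { "loadId" : "2cafaa88-5cce-43c9-89cd-c1e68f4d0f53" }
}
'''
print(extract_load_id(sample_response))
```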
I could take that result and query the loading status:
curl https://neptune-endpoint:8182/loader/2cafaa88-5cce-43c9-89cd-c1e68f4d0f53
{
"status" : "200 OK",
"payload" : {
"feedCount" : [{"LOAD_COMPLETED" : 1}],
"overallStatus" : {
"fullUri" : "s3://super-secret-reinvent-data/stuff.csv",
"runNumber" : 1,
"retryNumber" : 0,
"status" : "LOAD_COMPLETED",
"totalTimeSpent" : 1,
"totalRecords" : 987,
"totalDuplicates" : 0,
"parsingErrors" : 0,
"datatypeMismatchErrors" : 0,
"insertErrors" : 0
}
}
}
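Since a load takes some time, a script would typically poll this status until it reports completion. Below is a rough sketch of that loop, assuming the response keeps the shape shown above. The `fetch_status` callable is a stand-in you would supply (for example, a function that performs the HTTP GET), and only the `LOAD_COMPLETED` value seen in this post is checked.

```python
import json
import time

# Abbreviated copy of the status response shown above.
SAMPLE_STATUS = '''
{
  "status" : "200 OK",
  "payload" : {
    "overallStatus" : {
      "status" : "LOAD_COMPLETED",
      "totalRecords" : 987,
      "parsingErrors" : 0,
      "datatypeMismatchErrors" : 0,
      "insertErrors" : 0
    }
  }
}
'''


def load_finished(status_text):
    """True when overallStatus.status is LOAD_COMPLETED."""
    overall = json.loads(status_text)["payload"]["overallStatus"]
    return overall["status"] == "LOAD_COMPLETED"


def error_count(status_text):
    """Sum the three error counters reported in overallStatus."""
    overall = json.loads(status_text)["payload"]["overallStatus"]
    return sum(overall[key] for key in
               ("parsingErrors", "datatypeMismatchErrors", "insertErrors"))


def wait_for_load(fetch_status, attempts=10, delay_seconds=1.0):
    """Call fetch_status() until the load completes or attempts run out."""
    for _ in range(attempts):
        if load_finished(fetch_status()):
            return True
        time.sleep(delay_seconds)
    return False


print(wait_for_load(lambda: SAMPLE_STATUS, attempts=1, delay_seconds=0))
print(error_count(SAMPLE_STATUS))
```

Note that this sketch treats every other status as "not done yet", so a failed load would simply use up the remaining attempts; a real script would want to handle failure statuses explicitly.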
For this particular data serialization format I’d repeat this loading process for my edges as well.
For RDF, Neptune supports four serializations: Turtle, N-Triples, N-Quads, and RDF/XML. I could load all of these through the same loading API.
Now that I have my data in my database I can run some queries. In Gremlin, we write our queries as Graph Traversals. I’m a big Paul McCartney fan so I want to find all of the sessions he’s attending:
g.V().has("name","Paul McCartney").out("attends").id()
This defines a graph traversal that finds all of the nodes that have the property “name” with the value “Paul McCartney” (there’s only one!). Next it follows all of the edges from that node that are of the type “attends” and gets the ids of the resulting nodes.
==>ENT332
==>SRV422
==>DVC201
==>GPSBUS216
==>ENT323
Paul looks like a busy guy.
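To see what each step of that traversal does, here is a plain-Python imitation over the handful of sample rows from the CSVs earlier in this post. The `has` and `out` functions are toy stand-ins written for this sketch, not the TinkerPop API, and because only two sample edges are included the result is shorter than the real output above.

```python
# Vertices and edges taken from the sample CSV rows earlier in the post.
VERTICES = {
    "1": {"name": "George Harrison"},
    "2": {"name": "John Lennon"},
    "3": {"name": "Paul McCartney"},
}
EDGES = [  # (label, from, to)
    ("attends", "2", "ARC307"),
    ("attends", "3", "SRV422"),
]


def has(vertex_ids, key, value):
    """Keep only the vertices whose property `key` equals `value`."""
    return [v for v in vertex_ids if VERTICES[v].get(key) == value]


def out(vertex_ids, label):
    """Follow outgoing edges with the given label and return the target ids."""
    return [to for (edge_label, source, to) in EDGES
            if edge_label == label and source in vertex_ids]


start = list(VERTICES)                         # g.V()
pauls = has(start, "name", "Paul McCartney")   # .has("name", "Paul McCartney")
sessions = out(pauls, "attends")               # .out("attends").id()
print(sessions)
```

Each step takes the set of vertices produced by the previous one and narrows or moves it, which is the core idea behind a Gremlin traversal.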
Hopefully this gives you a brief overview of the capabilities of graph databases. Graph databases open up a new set of possibilities for a lot of customers and Amazon Neptune makes it easy to store and query your data at scale. I’m excited to see what amazing new products our customers build.
– Randall
P.S. Major thanks to Brad Bebee and Divij Vaidya for helping to create this post!