Ingest data in bulk in the Azure Cosmos DB for Gremlin by using a bulk executor library

APPLIES TO: Gremlin

Graph databases often need to ingest data in bulk to refresh an entire graph or update a portion of it. Azure Cosmos DB, the distributed database that backs the Azure Cosmos DB for Gremlin, performs best when the load is well distributed. The bulk executor libraries are designed to exploit this capability and provide optimal performance. For more information, see Introducing bulk support in the .NET SDK.

In this tutorial, you learn how to use the Azure Cosmos DB bulk executor library to import and update graph objects into an Azure Cosmos DB for Gremlin container. During this process, you use the library to create vertex and edge objects programmatically and then insert multiple objects per network request.

Instead of sending Gremlin queries to a database, where the commands are evaluated and then executed one at a time, you use the bulk executor library to create and validate the objects locally. After the library initializes the graph objects, it allows you to send them to the database service sequentially.

By using this method, you can increase data ingestion speeds as much as a hundredfold, which makes it an ideal way to perform initial data migrations or periodic data movement operations.

The bulk executor library now comes in the following varieties.

.NET

Prerequisites

Before you begin, make sure that you have the following:

Clone

To use this sample, run the following command:

git clone https://github.com/Azure-Samples/azure-cosmos-graph-bulk-executor.git

To get the sample, go to .\azure-cosmos-graph-bulk-executor\dotnet\src\.

Sample


// Initialize the bulk executor with the account connection string and the target database and container.
IGraphBulkExecutor graphBulkExecutor = new GraphBulkExecutor("MyConnectionString", "myDatabase", "myContainer");

// Build the vertex and edge objects locally, then import them with a single BulkImportAsync call.
// enableUpsert: true updates elements that already exist instead of failing on duplicates.
List<IGremlinElement> gremlinElements = new List<IGremlinElement>();
gremlinElements.AddRange(Program.GenerateVertices(Program.documentsToInsert));
gremlinElements.AddRange(Program.GenerateEdges(Program.documentsToInsert));
BulkOperationResponse bulkOperationResponse = await graphBulkExecutor.BulkImportAsync(
    gremlinElements: gremlinElements,
    enableUpsert: true);

Execute

Modify the parameters, as described in the following table:

Parameter Description
ConnectionString Your service connection string, which you'll find in the Keys section of your Azure Cosmos DB for Gremlin account. It's formatted as AccountEndpoint=https://<account-name>.documents.azure.com:443/;AccountKey=<account-key>;.
DatabaseName, ContainerName The names of the target database and container.
DocumentsToInsert The number of documents to be generated (relevant only to synthetic data).
PartitionKey Ensures that a partition key is specified with each document during data ingestion.
NumberOfRUs Relevant only if a container doesn't already exist and needs to be created during execution.

Download the full sample application in .NET.

Java

Sample usage

The following sample application illustrates how to use the GraphBulkExecutor package. The samples use either domain object annotations or plain old Java objects (POJOs) directly. We recommend that you try both approaches to determine which one better meets your implementation and performance demands.

Clone

To use the sample, run the following command:

git clone https://github.com/Azure-Samples/azure-cosmos-graph-bulk-executor.git

To get the sample, go to .\azure-cosmos-graph-bulk-executor\java\.

Prerequisites

To run this sample, you need to have the following software:

  • OpenJDK 11
  • Maven
  • An Azure Cosmos DB account that's configured to use the Gremlin API

Sample

private static void executeWithPOJO(Stream<GremlinVertex> vertices,
                                    Stream<GremlinEdge> edges,
                                    boolean createDocs) {
    // Record the "Configure Database" state while the loader sets up the database and container.
    results.transitionState("Configure Database");
    UploadWithBulkLoader loader = new UploadWithBulkLoader();
    // Record the "Write Documents" state while the documents are written in bulk.
    results.transitionState("Write Documents");
    loader.uploadDocuments(vertices, edges, createDocs);
}

Configuration

To run the sample, refer to the following configuration and modify it as needed.

The /resources/application.properties file defines the data that's required to configure Azure Cosmos DB. The required values are described in the following table:

Property Description
sample.sql.host The value that's provided by Azure Cosmos DB. Ensure that you're using the .NET SDK URI, which you'll find in the Overview section of the Azure Cosmos DB account.
sample.sql.key You can get the primary or secondary key from the Keys section of the Azure Cosmos DB account.
sample.sql.database.name The name of the database within the Azure Cosmos DB account to run the sample against. If the database isn't found, the sample code creates it.
sample.sql.container.name The name of the container within the database to run the sample against. If the container isn't found, the sample code creates it.
sample.sql.partition.path If you need to create the container, use this value to define the partitionKey path.
sample.sql.allow.throughput The container will be updated to use the throughput value that's defined here. If you're exploring various throughput options to meet your performance demands, be sure to reset the throughput on the container when you're done with your exploration. There are costs associated with leaving the container provisioned with a higher throughput.
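
For reference, a minimal application.properties might look like the following. The values shown are placeholders; the partition path and throughput values in particular are only examples, not recommendations.

sample.sql.host=https://<account-name>.documents.azure.com:443/
sample.sql.key=<primary-or-secondary-key>
sample.sql.database.name=SampleDatabase
sample.sql.container.name=SampleContainer
sample.sql.partition.path=/pk
sample.sql.allow.throughput=10000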

Execute

After you've modified the configuration according to your environment, run the following command:

mvn clean package 

For added safety, you can also run the integration tests by changing the skipIntegrationTests value in the pom.xml file to false.

After you've run the unit tests successfully, you can run the sample code:

java -jar target/azure-cosmos-graph-bulk-executor-1.0-jar-with-dependencies.jar -v 1000 -e 10 -d

Running the preceding command executes the sample with a small batch (1,000 vertices and roughly 5,000 edges). Use the command-line arguments described in the following section to tweak the volume that's generated and which sample version to run.

Command-line arguments

Several command-line arguments are available while you're running this sample, as described in the following table:

Argument Description
--vertexCount (-v) Tells the application how many person vertices to generate.
--edgeMax (-e) Tells the application the maximum number of edges to generate for each vertex. The generator randomly selects a number from 1 to the value you provide.
--domainSample (-d) Tells the application to run the sample by using the person and relationship domain structures instead of the GraphBulkExecutors, GremlinVertex, and GremlinEdge POJOs.
--createDocuments (-c) Tells the application to use create operations. If the argument isn't present, the application defaults to using upsert operations.
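
For example, the following command generates 5,000 person vertices with up to 20 edges each, runs the domain sample, and uses create operations instead of upserts:

java -jar target/azure-cosmos-graph-bulk-executor-1.0-jar-with-dependencies.jar -v 5000 -e 20 -d -c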

Detailed sample information

Person vertex

The person class is a simple domain object that's decorated with several annotations to support its transformation into the GremlinVertex class. The annotations are described in the following table, and a sketch of the annotated class follows it:

Class annotation Description
GremlinVertex Uses the optional label parameter to define the label of all vertices that you create by using this class.
GremlinId Used to define which field is used as the ID value. The field on the person class is named id, but that name isn't required.
GremlinProperty Used on the email field to change the name of the property when it's stored in the database.
GremlinPartitionKey Used to define which field on the class contains the partition key. The field name you provide should match the value that's defined by the partition path on the container.
GremlinIgnore Used to exclude the isSpecial field from the properties that are written to the database.
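
To show how these annotations fit together, here's a minimal sketch of an annotated person class. The import path, the GremlinProperty parameter name, and the field names other than id, email, and isSpecial are assumptions; see the person class in the sample repository for the actual implementation.

// Sketch only: the import path, annotation parameters, and some field names are assumptions.
import com.azure.graph.bulk.impl.annotations.*; // assumed package path; check the sample repository

@GremlinVertex(label = "PERSON")            // every vertex created from this class gets this label
public class Person {
    @GremlinId
    private String id;                      // used as the vertex ID; the field doesn't have to be named id

    @GremlinProperty(name = "emailAddress") // parameter name assumed; stores the field under a different property name
    private String email;

    @GremlinPartitionKey
    private String country;                 // assumed field; must match the container's partition key path

    @GremlinIgnore
    private boolean isSpecial;              // excluded from the properties written to the database

    private String firstName;               // illustrative extra field
}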

The RelationshipEdge class

The RelationshipEdge class is a versatile domain object. By using the field-level label annotation, you can create a dynamic collection of edge types, as shown in the following table and in the sketch after it:

Class annotation Description
GremlinEdge The GremlinEdge decoration on the class defines the name of the field for the specified partition key. When you create an edge document, the assigned value comes from the source vertex information.
GremlinEdgeVertex Two instances of GremlinEdgeVertex are defined, one for each side of the edge (source and destination). In our sample, the field's data type is GremlinEdgeVertexInfo. The information provided by the GremlinEdgeVertex class is required for the edge to be created correctly in the database. Another option is to make the data type of the vertices a class that's decorated with the GremlinVertex annotations.
GremlinLabel The sample edge uses a field to define the label value, which allows various labels to be defined while reusing the same base domain class.
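
Here's a similar sketch of an edge class along the lines of RelationshipEdge. Again, the import paths, annotation parameter names, and field names are assumptions; refer to the RelationshipEdge class in the sample repository for the real definition, including how the two GremlinEdgeVertex fields are distinguished.

// Sketch only: the import paths, annotation parameters, and field names are assumptions.
import com.azure.graph.bulk.impl.annotations.*;             // assumed package path
import com.azure.graph.bulk.impl.model.GremlinEdgeVertexInfo; // assumed package path

@GremlinEdge(partitionKeyFieldName = "pk")   // parameter name assumed; names the field that holds the edge's partition key
public class RelationshipEdge {
    @GremlinEdgeVertex                       // one side of the edge; see the sample for how the
    private GremlinEdgeVertexInfo source;    // annotation distinguishes source from destination

    @GremlinEdgeVertex
    private GremlinEdgeVertexInfo destination;

    @GremlinLabel
    private String relationshipType;         // the value of this field becomes the edge label
}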

Output explained

The console finishes its run with a JSON string that describes the run times of the sample. The JSON string contains the following information:

JSON string Description
startTime The System.nanoTime() when the process started.
endTime The System.nanoTime() when the process finished.
durationInNanoSeconds The difference between the endTime and startTime values.
durationInMinutes The durationInNanoSeconds value, converted into minutes. The durationInMinutes value is represented as a float number, not a time value. For example, a value of 2.5 represents 2 minutes and 30 seconds.
vertexCount The volume of generated vertices, which should match the value that's passed into the command-line execution.
edgeCount The volume of generated edges, which isn't static and is built with an element of randomness.
exception Populated only if an exception is thrown during the run.

States array

The states array gives insight into how long each step within the execution takes. The steps are described in the following table:

Execution step Description
Build sample vertices The amount of time it takes to fabricate the requested volume of person objects.
Build sample edges The amount of time it takes to fabricate the relationship objects.
Configure database The amount of time it takes to configure the database, based on the values that are provided in application.properties.
Write documents The amount of time it takes to write the documents to the database.

Each state contains the following values:

State value Description
stateName The name of the state that's being reported.
startTime The System.nanoTime() value when the state started.
endTime The System.nanoTime() value when the state finished.
durationInNanoSeconds The difference between the endTime and startTime values.
durationInMinutes The durationInNanoSeconds value, converted into minutes. The durationInMinutes value is represented as a float number, not a time value. For example, a value of 2.5 represents 2 minutes and 30 seconds.
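
To give a sense of the overall shape, here's an abbreviated example of the output with made-up values and a single state entry. The exact field names (such as the name of the states array) and their order can differ from what your run produces.

{
  "startTime": 83687663341974,
  "endTime": 83767663341974,
  "durationInNanoSeconds": 80000000000,
  "durationInMinutes": 1.3333334,
  "vertexCount": 1000,
  "edgeCount": 5022,
  "exception": null,
  "states": [
    {
      "stateName": "Build sample vertices",
      "startTime": 83687663341974,
      "endTime": 83687763341974,
      "durationInNanoSeconds": 100000000,
      "durationInMinutes": 0.0016667
    }
  ]
}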

Next steps