Best practices: indexing, AGE EXPLAIN, and data load benchmarks

2025-05-19

Apache AGE supported by Azure Database for PostgreSQL, provides support for advanced graph processing and querying. However, achieving optimal query performance requires a thoughtful strategy for indexing and data loading. This guide outlines some best practices based on recent benchmarking results and technical insights.

Indexing in Apache AGE

Indexing is pivotal for improving query performance, especially in graph databases.

Default behavior

By default, Apache AGE doesn't create indexes for newly created graphs. This necessitates explicit creation of indexes based on the nature of your queries and dataset.

WHERE clause

In Apache AGE, the following queries are evaluated differently:

SELECT * FROM cypher('graph_name',
 $$
 MATCH (n:Customer {Name:'Alice'}) RETURN n
 $$)
AS (n agtype);

SELECT * FROM cypher('graph_name',
 $$
 MATCH (n:Customer) WHERE n.Name='Alice' RETURN n
 $$)
AS (n agtype);

To take full advantage of indexing, you must understand which types of indexes are utilized by queries with and without a WHERE clause.

EXPLAIN in Apache AGE

Unlike standard SQL, the EXPLAIN keyword in Cipher queries requires a different query format.

SELECT * FROM cypher('graph_name',
 $$
 EXPLAIN
 MATCH (n:Customer)
 WHERE n.Name='Alice'
 RETURN n
 $$)
AS (plan text);

QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on "Customer" n (cost=0.00..418.51 rows=43 width=32)
Filter: (agtype_access_operator(VARIADIC ARRAY[properties, '"Name"'::agtype]) = '"Alice"'::agtype)

To see the differences of query plans without WHERE clause:

SELECT * FROM cypher('graph_name',
 $$
 MATCH (n:Customer {Name:'Alice'}) RETURN n
 $$)
AS (n agtype);

QUERY PLAN
---------------------------------------------------------------
Seq Scan on "Customer" n (cost=0.00..396.56 rows=9 width=32)
Filter: (properties @> '{"Name": "Alice"}'::agtype)

Common index types

BTREE Index: Effective for exact matches and range queries. Recommended for use with columns like ID, start_id, and end_id in edge and vertex tables.
GIN Index: Useful for JSON fields, enabling efficient searches for key-value pairs in the properties column.

Use the following commands to create indexes for vertex and edge tables:

Vertex table:

CREATE INDEX ON graph_name."VLABEL" USING BTREE (id);
CREATE INDEX ON graph_name."VLABEL" USING GIN (properties);

Microsoft Edge table:

CREATE INDEX ON graph_name."ELABEL" USING BTREE (id);
CREATE INDEX ON graph_name."ELABEL" USING GIN (properties);

CREATE INDEX ON graph_name."ELABEL" USING BTREE (start_id);
CREATE INDEX ON graph_name."ELABEL" USING BTREE (end_id);

Indexing a specific key-value in properties

For targeted queries, a smaller, more efficient BTREE index can be created for specific keys within the properties column:

CREATE INDEX ON graph_name.label_name USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, '"KeyName"'::agtype]));

This approach avoids indexing unnecessary data, improving efficiency.

Query plan insights with EXPLAIN

The EXPLAIN keyword reveals how queries utilize indexes. Not all queries automatically use indexes, particularly those issued without a WHERE clause. Use EXPLAIN to verify index usage and optimize queries accordingly.

Benchmark observations

Recent tests highlight the impact of indexing on query performance.

Indexed vs. nonindexed queries

This section discusses the performance differences between indexed and nonindexed queries.

Sequential scans outperform index scans for queries retrieving entire tables.
Indexing significantly improves performance for join queries (for example, relationship counts).

Data loading best practices

Efficient data loading is crucial for large datasets.

The AGEFreighter library offers a streamlined process for data ingestion.

Data loading with AGEFreighter

AGEFreighter is a Python library designed to facilitate the loading of data into Apache AGE. It supports various data formats, including CSV, Avro, and Parquet, and provides a simple interface for loading data into AGE graphs.

Environment setup

Created an Azure Database for PostgreSQL flexible server with AGE enabled.
Python dependency management tool such as Poetry is recommended. Python 3.9 or later must be installed.
The AGEFreighter library (AGEFreighter PyPi) must be installed as a dependency:

poetry add agefreighter

Use the CSV data format for benchmarks

For the benchmarks, I used a CSV file to load data into AGE. The CSV format is widely supported and easy to work with, making it a good choice for data loading tasks.

Since the dataset used consists of legal cases and relationships between them, I structured my input CSV file as follows:

id,CaseID,start_vertex_type,end_CaseID,end_vertex_type
1,1005631,Case,5030916,Case
2,1005631,Case,5028652,Case
3,1005631,Case,996512,Case
4,1005631,Case,3413065,Case
5,1005631,Case,4912975,Case

Each row represents a relationship between two cases. CaseID refers to the starting case node, while end_CaseID refers to the connected case.

Use the data loading script for benchmarks

The following Python script was used to load my dataset into AGE. You might also refer to "Usage of CSVFreighter" section in AGEFreighter PyPi for a different example.

await instance.load(
 graph_name="CaseGraphFull",
 start_v_label="Case",
 start_id="CaseID",
 start_props=[],
 edge_type="REF",
 edge_props=[],
 end_v_label="Case",
 end_id="end_CaseID",
 end_props=[],
 csv_path="./cases.csv",
 use_copy=True,
 drop_graph=True,
 create_graph=True,
 progress=True,
)

As you can see, the graph_name and fields in the provided csv file is defined here. The use_copy=True parameter ensures efficient data loading. The drop_graph=True and create_graph=True parameters ensure a fresh start before loading new data.

Other data sources

AGEFreighter supports other formats such as MultiCSV, Avro, Parquet, Azure Storage, etc. which can be adapted based on data format requirements. You can get more information here: 'AGEFreighter PyPi'

Data loading performance benchmarks

Dataset size: 725K cases, 2.8M relationships.
Loading time: 83 seconds.

Efficient data loading is essential for handling large datasets effectively.

Note

While suitable for large files, this process might be less effective for smaller datasets due to preparation time.

Share via