Gremlin count value is delayed from actual results when setting TTL

Question

gmNetwrix 0

Hi,

I'm running into an issue where the TTL property in Cosmos DB seems to be ignored by count queries.

I uploaded a collection of vertices and edges (a graph) and am using Gremlin for traversals.

The vertices and edges have specific filters that allow hundreds to thousands of them to be deleted easily by setting ttl on them.

e.g. g.V().has(filter).property("ttl",1) and g.E().has(filter).property("ttl",1)

When running those queries the vertices and edges are deleted.

However, when running a count against them the count is delayed by a significant period of time.

e.g. g.V().has(filter).count()

This also shows in Azure Data Studio when connecting to those same rows via a nosql query.

e.g. select count(1) from c where c.Label = "edgeLabel" and c.Filter = filter

When testing with a larger ttl, e.g. 1000, the values are definitely set, but the count stays wrong for minutes after the removal occurs (longer for larger deletions).

It looks like counts are not taking the ttl into account, despite the documentation stating: "An item will no longer appear in query responses immediately after the TTL expires, even if it hasn't yet been permanently deleted from the container." - https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/time-to-live

The ttl is set following the Gremlin guidance here: https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin/access-system-properties

Thanks

Oury Ba-MSFT 20,911 Reputation points Microsoft Employee Moderator

2024-09-09T16:19:42.4433333+00:00

@gmNetwrix Thank you for reaching out and sorry about the issue you are facing.

If the responses below don't help resolve the issue, I would suggest raising a support ticket, as this might need debugging. Please do let us know if you don't have a support plan.

Regards,

Oury

gmNetwrix 0 Reputation points

2024-09-12T09:25:07.49+00:00

@Oury Ba-MSFT
I mentioned this further down, but thought it would be worth replying directly to you.

Interesting solution I've run into since writing the question:

Based on the concept Amira raised about querying for non-expired items, what does work for me (though it would only work with a small ttl, as when using ttl for deletion, e.g. "ttl" = 1) is: g.V().has(filter).hasNot('ttl').count(). I'm currently unsure whether this works because count() doesn't filter on "ttl" or because hasNot('ttl') changes the index used.

The difference in time between g.V().has(filter).count() and g.V().has(filter).hasNot('ttl').count() returning the correct value is around 3-5 minutes, pretty consistently, after an update. This was tested against 100 Vertex/Edge updates and 10000 Vertex/Edge updates.
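
To make clear what the hasNot('ttl') workaround actually counts, here is a minimal in-memory sketch in Python. It does not talk to Cosmos DB. The sample vertices, the "url" property and both function names are hypothetical stand-ins for illustration only.

```python
# Hypothetical in-memory stand-ins for vertices; this is not a Cosmos DB client.
vertices = [
    {"id": "v1", "url": "www.microsoft.com/a"},               # no ttl set
    {"id": "v2", "url": "www.microsoft.com/b", "ttl": 1},     # marked for deletion
    {"id": "v3", "url": "www.microsoft.com/c", "ttl": 1000},  # long ttl, not yet expired
]


def count_matching(items, prefix):
    """Rough analogue of g.V().has(filter).count()."""
    return sum(1 for v in items if v["url"].startswith(prefix))


def count_matching_without_ttl(items, prefix):
    """Rough analogue of g.V().has(filter).hasNot('ttl').count()."""
    return sum(
        1 for v in items
        if v["url"].startswith(prefix) and "ttl" not in v
    )


print(count_matching(vertices, "www.microsoft"))              # counts every match
print(count_matching_without_ttl(vertices, "www.microsoft"))  # skips anything with a ttl
```

Note that v3 is excluded by the second count even though its ttl has not elapsed, which is why this approach only suits cases where ttl is used purely as a short-lived deletion marker.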

Answer 1

Possible Reasons for Delayed Count Results

Background Cleanup Delay: While the item may no longer be accessible in queries after TTL expires, the underlying deletion from the storage engine happens asynchronously. This could lead to a situation where the Gremlin traversal or SQL query still reflects the expired items for a brief period.
Indexing Lag: Even though the item is no longer available for standard queries after TTL expiration, there may be a delay in updating the indexing system, which could lead to stale count results. This is a common cause for count queries to be delayed.
Gremlin Query Optimization: Gremlin traversals, like g.V().has(filter).count(), may not always behave exactly like direct Cosmos DB SQL queries. This could also explain the discrepancy between the behavior in Gremlin and NoSQL queries.
TTL in Cosmos DB SQL Queries: Similarly, the SQL query SELECT COUNT(1) FROM c WHERE c.Label = "edgeLabel" might not immediately reflect TTL-expired records due to the lag between logical and physical deletion of the items in Cosmos DB. This is particularly true when the TTL is small, and you're performing large-scale deletions.

Suggestions to Resolve the Issue

Wait for Propagation: Given that Cosmos DB handles TTL expiration asynchronously, you may need to introduce a small wait time between setting TTL properties and executing count queries. This would give Cosmos DB enough time to process the TTL expiration across all nodes and update the count properly.
Manual Cleanup Option: If the delayed count poses a significant issue, you could manually remove the expired items instead of relying on TTL by using a g.V().has(filter).drop() or similar commands to ensure immediate removal and accurate count results.
Query for Non-Expired Items: Use a query that explicitly checks for non-expired items, such as ensuring a TTL value greater than the current time if available. For example, g.V().has(filter).has('ttl', gt(currentTime)).count().
Azure Support Contact: If the problem persists or significantly impacts your operations, it may be useful to raise a support ticket with Azure Cosmos DB, as there could be internal delays or optimizations specific to your cluster's settings.
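
The "Wait for Propagation" suggestion could be sketched as a simple polling loop, shown here in Python. This is only a sketch under assumptions: run_count is a hypothetical callable that you would wire to your own Gremlin client (for example, one that submits g.V().has(filter).count()), and the timeout and interval defaults are arbitrary. It also only helps when the expected count is known in advance.

```python
import time
from typing import Callable


def wait_for_count(
    run_count: Callable[[], int],
    expected: int,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll run_count() until it returns `expected` or the timeout elapses.

    run_count is a hypothetical hook; it is not part of any Cosmos DB SDK.
    Returns True if the expected count was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if run_count() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.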

Let me know if you'd like more information or examples for any of these suggestions!

gmNetwrix 0 Reputation points

2024-09-09T10:04:07.4566667+00:00

Hi,
My assumption was that it is either an indexing issue or that count is not applying a filter on ttl.

For the solutions,

Wait for Propagation: I would be running a count pretty soon after setting the ttl, and the risk of not knowing whether the count is up to date is concerning. I have seen stale counts last for over 15 minutes when the index should be up to date. For example, running these queries should have forced the index to be updated, but the count still includes some historical values:
g.V().has(url, startingWith("www.microsoft")).count() returns 150
g.V().has(url, startingWith("www.microsoft.com")).count() returns 50
g.V().has(url, startingWith("www.microsoft.com")).property("ttl", 30)
g.V().has(url, startingWith("www.microsoft")).property("ttl", 30)
g.V().has(url, startingWith("www.microsoft")).count() returns 50

Manual Cleanup Option: The drop has its own performance issues for large datasets, but that is expected.

Query for Non-Expired Items: I'm unsure whether this filter can be run, as ttl is an int32 measured from the point when it was set, so comparing it against a datetime wouldn't be possible. Microsoft's documentation also states that queries should already be doing this: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/time-to-live

Azure Support Contact: Might need to do this at some point for confirmation.

Thanks!

gmNetwrix 0 Reputation points

2024-09-12T09:20:09.99+00:00

Interesting solution I've run into since writing the question:

Based on the concept Amira raised about querying for non-expired items, I did some testing. g.V().has(filter).has('ttl', gt(currentTime)).count() cannot work because ttl is an int32 measured from the last update, not a datetime, so it was throwing exceptions when testing.

What does work for me, however (and it would only work with a small ttl, as when using ttl for deletion, e.g. "ttl" = 1), is: g.V().has(filter).hasNot('ttl').count().
I'm currently unsure whether this works because count() doesn't filter on "ttl" or because hasNot('ttl') changes the index used.
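
To illustrate why comparing ttl against the current time cannot work: ttl holds a duration in seconds, counted from the item's last-modified timestamp (the _ts system property), not an absolute expiry time. A minimal Python sketch of that relationship follows. The function name is hypothetical, the handling of a missing ttl ignores any container-level default, and the exact boundary behaviour (>= versus >) is an assumption.

```python
from typing import Optional


def is_expired(last_modified_ts: int, ttl: Optional[int], now_ts: int) -> bool:
    """Return True if an item should be treated as expired.

    last_modified_ts: the item's _ts value (epoch seconds).
    ttl: seconds to live after the last modification; -1 means never expire.
         None is treated as "no item-level ttl" here, ignoring any
         container default (an assumption made for this sketch).
    now_ts: current time in epoch seconds.
    """
    if ttl is None or ttl == -1:
        return False
    # Expiry is relative to the last modification, so compare against _ts + ttl.
    return now_ts >= last_modified_ts + ttl
```

Under this model, a filter on non-expired items would need both _ts and ttl, rather than ttl compared directly with a timestamp.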
