When running a streaming pipeline in Synapse, how can I cache Cosmos DB data and keep it updated?

Jiacheng Zhang 20 Reputation points
2024-01-09T20:01:05.87+00:00

Hi Team,

Good afternoon! I currently use a Synapse Analytics Spark notebook to connect to EventHub and process streaming data. As part of this process, we need to join the EventHub data with Cosmos DB data to get additional columns. At the moment we use spark.read.format('cosmos.oltp') to establish the connection. However, querying Cosmos DB every time we need the data isn't very efficient. I'm considering caching the Cosmos DB data and keeping the cache updated whenever new data arrives in Cosmos DB. I've heard that the Cosmos DB analytical store might make this possible. Could you confirm whether it does? If so, how can I connect to it, and what changes should I make to the current approach, which connects directly to Cosmos DB? If not, could you advise whether this is achievable some other way? Thank you!

Azure Synapse Analytics
Azure Cosmos DB

Accepted answer
  1. Pinaki Ghatak 5,600 Reputation points Microsoft Employee Volunteer Moderator
    2024-01-10T07:20:20.2+00:00

1 additional answer

  1. Sajeetharan 2,261 Reputation points Microsoft Employee
    2024-01-10T17:10:43.22+00:00

    The analytical store will be more efficient than reading from the transactional store via cosmos.oltp in this use case. However, we currently do not expose Change Data Capture (CDC) via Spark notebooks. If you want to pick up incremental data as new data arrives in Cosmos DB, you currently have to use either an ADF data flow or Synapse pipelines. To read from the analytical store, use "cosmos.olap" instead of "cosmos.oltp".
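For illustration, here is a minimal sketch of what the switch to "cosmos.olap" could look like in a Synapse Spark notebook. It assumes Synapse Link is enabled on the container and that a linked service to the Cosmos DB account exists in the workspace. The linked service and container names are hypothetical placeholders. Because CDC is not available from the notebook, the sketch also includes a simple time-based reload helper (`RefreshingLookup`, a name invented here) that re-reads the whole lookup table after a chosen interval. That is a periodic full refresh, not an incremental feed, and it is one possible workaround rather than a recommended pattern.

```python
import time


def cosmos_olap_options(linked_service, container):
    """Build reader options for the Cosmos DB analytical store connector."""
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }


def load_lookup(spark, linked_service, container):
    """Read the container from the analytical store and mark it for caching."""
    df = (
        spark.read.format("cosmos.olap")  # was "cosmos.oltp"
        .options(**cosmos_olap_options(linked_service, container))
        .load()
    )
    return df.cache()


class RefreshingLookup:
    """Hypothetical helper: reloads a cached value once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Release the previous cached DataFrame, if there is one.
            if self._value is not None and hasattr(self._value, "unpersist"):
                self._value.unpersist()
            self._value = self._loader()
            self._loaded_at = now
        return self._value


# Usage inside a notebook (placeholder names, adjust to your workspace):
# lookup = RefreshingLookup(
#     lambda: load_lookup(spark, "CosmosDbLinkedService", "lookup-container"),
#     ttl_seconds=300,
# )
# def process_batch(batch_df, batch_id):
#     joined = batch_df.join(lookup.get(), on="id", how="left")
#     ...
# stream_df.writeStream.foreachBatch(process_batch).start()
```

The join key `id` and the five-minute interval are assumptions for the example. Note that the analytical store is synced from the transactional store with some delay, so even a freshly reloaded lookup can lag slightly behind the latest writes.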

