When running a streaming pipeline in Synapse, how can I cache Cosmos DB data and keep the cache updated?

Question

Jiacheng Zhang 20

Hi Team,

Good afternoon! I currently use a Synapse Analytics Spark Notebook to connect to EventHub and process streaming data. As part of this process, we need to join the EventHub data with data from Cosmos DB to get additional columns. We currently use spark.read.format('cosmos.oltp') to establish the connection. However, querying Cosmos DB every time we need the data isn't very efficient. I'm considering caching the Cosmos DB data and keeping the cache updated whenever new data arrives in Cosmos DB. I've heard that Cosmos DB Analytical Store might help with this. Could you confirm whether it does? If so, how can I connect to it, and what should I change in our current approach, which connects directly to Cosmos DB? If not, could you advise whether this is possible some other way? Thank you!

Answer 1

Greetings @Jiacheng Zhang

You’re correct in considering the use of Azure Cosmos DB’s Analytical Store for your scenario. The Analytical Store is a fully isolated column store that enables large-scale analytics against operational data in your Azure Cosmos DB, without any impact on your transactional workloads. It’s designed to address the complexity and latency challenges that occur with traditional ETL pipelines.

The Analytical Store can automatically sync your operational data into a separate column store, which is suitable for large-scale analytical queries. This means you can run near real-time large-scale analytics on your operational data.

To connect to the Analytical Store, you can use Azure Synapse Link. This allows you to directly link to the Analytical Store from Azure Synapse Analytics.

In terms of changes to your current approach, instead of using spark.read.format('cosmos.oltp'), you could query the analytical store through Azure Synapse Link from a serverless SQL pool in Azure Synapse Analytics. This lets you analyze data in your Azure Cosmos DB containers that are enabled for Azure Synapse Link in near real time, without affecting the performance of your transactional workloads. The full SELECT surface area is supported through the OPENROWSET function.

Please note that you need to enable the Analytical Store on your Azure Cosmos DB containers. Also, ensure that your Azure Cosmos DB analytical storage is in the same region as the serverless SQL pool.

I hope this helps!

Answer 2

Sajeetharan 2,261 Microsoft Employee

The analytical store will be more efficient than reading from the transactional store via cosmos.oltp in this use case. However, we currently do not expose Change Data Capture (CDC) via Spark notebooks. If you want to pick up incremental data as new data arrives in Cosmos DB, you currently have to use either an ADF data flow or Synapse pipelines. To read from the analytical store, you can use "cosmos.olap" instead of "cosmos.oltp".
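As a rough illustration of the "cosmos.olap" read, here is a minimal PySpark sketch. It assumes the Cosmos DB account is attached to the workspace as a linked service; the linked service name, container name, join key, and stream_df are placeholders, not values from this thread. Since CDC is not available from the notebook, the sketch falls back on a plain timed reload of the whole lookup table, which is a substitute technique rather than true incremental sync. RefreshingLookup is a hypothetical helper, not part of any library.

```python
import time
from typing import Any, Callable, Dict


def analytical_store_options(linked_service: str, container: str) -> Dict[str, str]:
    """Build reader options for the 'cosmos.olap' format (values are placeholders)."""
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }


def load_lookup(spark: Any, linked_service: str, container: str) -> Any:
    """Read the analytical store into a DataFrame and mark it for caching."""
    options = analytical_store_options(linked_service, container)
    return spark.read.format("cosmos.olap").options(**options).load().cache()


class RefreshingLookup:
    """Hypothetical helper: returns a cached value, reloading it after ttl_seconds."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded_at = 0.0
        self._has_value = False

    def get(self) -> Any:
        now = self._clock()
        if not self._has_value or now - self._loaded_at >= self._ttl:
            stale = self._value
            self._value = self._loader()
            self._loaded_at = now
            self._has_value = True
            # Release the previous cached DataFrame, if there was one.
            if hasattr(stale, "unpersist"):
                stale.unpersist()
        return self._value


# Usage inside a Synapse notebook (assumed names: stream_df, "CosmosDbLink",
# "lookupContainer", join key "id"); left commented out because it needs Spark.
# lookup = RefreshingLookup(
#     lambda: load_lookup(spark, "CosmosDbLink", "lookupContainer"), ttl_seconds=300)
#
# def enrich(batch_df, batch_id):
#     batch_df.join(lookup.get(), on="id", how="left").show()
#
# stream_df.writeStream.foreachBatch(enrich).start()
```

With this shape, each micro-batch joins against the cached copy, and the copy is re-read from the analytical store at most once per TTL window. The cached data can still lag the transactional store by the analytical store's auto-sync delay plus the TTL, so pick the TTL with that in mind.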

Jiacheng Zhang 20 Reputation points

2024-01-10T22:08:24.78+00:00

Hi Sajeetharan and Pinaki, thanks so much for the reply! May I check my understanding? If we follow this link: https://learn.microsoft.com/en-us/azure/cosmos-db/configure-synapse-link (screenshot is attached), we would go to our current Cosmos DB account, open the Azure Synapse Link tab, select the container that we query and join with the EventHub data in our Synapse PySpark Notebook, enable Synapse Link for it, and then enable Analytical Store Time to Live (as mentioned in the document, this option appears after Synapse Link is enabled). After these steps, the existing Cosmos DB container would have the Analytical Store enabled. We would then go back to our Synapse Spark Notebook and change the connection code for this container from 'xxxxxx cosmos.oltp xxxxx' to 'xxxxxx cosmos.olap xxxxx'. Would that be the correct sequence of steps to achieve the desired outcome? Thanks so much for the help!

Pinaki Ghatak 5,600 Reputation points Microsoft Employee Volunteer Moderator

2024-01-11T08:16:31.2533333+00:00

Greetings, @Jiacheng Zhang I'm having trouble understanding your query. Did you follow the instructions and suggestions that we provided previously? We would appreciate it if you could clarify your issue and provide more details about what you are trying to achieve. This will help us to assist you better and resolve your problem faster.

Harishga 6,000 Reputation points Microsoft External Staff

2024-01-12T08:35:49.05+00:00

Hi @Jiacheng Zhang
Just checking in to see if the above answer provided by @Pinaki Ghatak helped. If it answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further questions, do let us know.

Harishga 6,000 Reputation points Microsoft External Staff

2024-01-15T04:53:17.8233333+00:00

Hi @Jiacheng Zhang
Just checking in to see if the above answer provided by @Pinaki Ghatak helped. If it answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further questions, do let us know.

Jiacheng Zhang 20 Reputation points

2024-01-19T20:55:32.05+00:00

Hi Sajeetharan and Pinaki, thanks so much for your answer! I have accepted it, thanks again! I do have a follow-up question, though. Sorry I wasn't clear about our query at the beginning. Here is the Cosmos DB query and the process in our Synapse Spark notebook: we read streaming data from EventHub and also read data from Cosmos DB by:

Jia Zhang 60 Reputation points

2024-01-19T23:35:16.16+00:00

Sorry team, I just found out that my question was not fully posted. I don't know why I can't enter the full question as text, so I have saved it as a picture. Thanks!
