question

alxp asked PRADEEPCHEEKATLA-MSFT commented

IoT source, streaming, bronze table correct setup?

I have a general question about whether my setup is suitable for the scenario below:

  • 1 streaming IoT source feeding into Azure Event Hubs

  • The raw format is byte-encoded and must be decoded.

  • Need to build a scalable and reusable data structure

  • Output to 2 Azure Cosmos DB collections: 1 fed by streaming and 1 by batch, each with a different set of data.

I have the following in mind (a rough sketch of the first job follows the list):

  • 1 streaming job 24/7 - Data from Azure Event Hubs --> append to Bronze Delta table

  • 1 streaming job 24/7 - Select from Bronze Delta table, decode data and merge into --> Silver Delta table

  • 1 streaming job 24/7 - Select specifics from Silver Delta table, transform, aggregate and upsert into --> Azure Cosmos DB collection 1

  • 1 batch job once per day - Select specifics from Silver Delta table, transform, aggregate and upsert into --> Azure Cosmos DB collection 2
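
For reference, a rough PySpark sketch of the first job as I picture it (the connection string, paths and table locations below are placeholders, not my actual setup):

    # Job 1 (sketch): read the raw stream from Azure Event Hubs and append it,
    # still encoded, to the Bronze Delta table. Names and paths are placeholders.
    conn_str = "<event-hub-connection-string>"  # recent connector versions expect this
                                                # to be encrypted via EventHubsUtils.encrypt

    raw_stream = (
        spark.readStream
             .format("eventhubs")
             .option("eventhubs.connectionString", conn_str)
             .load()
    )

    (raw_stream.writeStream
               .format("delta")
               .outputMode("append")
               .option("checkpointLocation", "/mnt/datalake/checkpoints/bronze_iot")
               .start("/mnt/datalake/bronze/iot_raw"))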



Is this the correct way of doing it?

Should I save the raw, still-encoded data in Bronze, or should I decode it first and store the decoded data as the raw layer?

What about schema and decode-logic changes from the IoT source?

Can/should you combine the first 2 streaming jobs into 1 to save cost? Is the merge between Bronze and Silver too heavy to have in 1 job?



azure-databricks, azure-data-lake-storage, azure-event-hubs, azure-iot

Hello @alxp,

Welcome to the Microsoft Q&A platform.

We are reaching out to the internal team to get more help on this; I will update you once we hear back from them.


1 Answer

PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT commented

Hello @alxp,

• Yes, this is a good and correct way of doing it.
• You can either store the raw payload in Bronze or decode the stream from IoT first and then write the decoded data to Bronze. The real question is whether you want to be able to query the data directly in Bronze, or whether you are OK with decoding the payload at query time. If the data is large, I would not decode at query time, as it is likely to have a significant impact on query performance: you would have to call your decode function (for example, casting the body to a string and applying from_json()) for each record. It is better to take that hit during the stream read, since the data flow is regulated by the rate of data coming from Event Hubs.
• I would define the schema explicitly and apply it as part of the spark.readStream pipeline (a sketch follows this list).
• I don't think so; you will probably want to select F-series VMs or other compute-optimized instances for the cluster.
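
To illustrate the decoding and schema points, here is a minimal sketch, assuming a JSON payload; the schema fields, column names and paths are only examples and should be adjusted to your actual data:

    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    # Example schema -- replace these fields with whatever your IoT payload actually contains.
    payload_schema = StructType([
        StructField("deviceId", StringType(), True),
        StructField("temperature", DoubleType(), True),
        StructField("eventTime", TimestampType(), True),
    ])

    # Decode once during the stream read (Bronze -> Silver) instead of at query time.
    decoded = (
        spark.readStream
             .format("delta")
             .load("/mnt/datalake/bronze/iot_raw")            # Bronze table from the first job
             .withColumn("body", col("body").cast("string"))  # Event Hubs body arrives as binary
             .withColumn("payload", from_json(col("body"), payload_schema))
             .select("payload.*")
    )

Decoding here means each record is decoded exactly once, and everything downstream of Silver works against typed columns.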

Hope this helps. Do let us know if you have any further queries.


Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, as this can be beneficial to other community members.


Hello @alxp,

Just checking in to see if the above answer helped. If this answers your query, do click Accept Answer and Up-Vote for the same. And, if you have any further queries, do let us know.


Hello @alxp,

Following up to see if the above suggestion was helpful. And, if you have any further queries, do let us know.
Take care & stay safe!
