IoT source, streaming, bronze table correct setup?

Alex 1 Reputation point
2021-03-25T07:29:40.983+00:00

I have a general question on whether my setup is a good fit for the scenario below:

  • 1 streaming IoT source into Azure Event Hubs
  • The raw format is byte-encoded and needs to be decoded.
  • I need to build a scalable and reusable data structure.
  • Output goes to 2 Azure Cosmos DB collections: 1 streaming and 1 batch, each with a different set of data.

I have the following in mind

  • 1 streaming job 24/7 - data from Azure Event Hubs --> append to Bronze Delta table (a rough sketch of this job follows the list)
  • 1 streaming job 24/7 - select from the Bronze Delta table, decode the data and merge into --> Silver Delta table
  • 1 streaming job 24/7 - select specifics from the Silver Delta table, transform, aggregate and upsert into --> Azure Cosmos DB collection 1
  • 1 batch job once per day - select specifics from the Silver Delta table, transform, aggregate and upsert into --> Azure Cosmos DB collection 2
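
For the first job, something along these lines is what I have in mind (a minimal sketch assuming the azure-event-hubs-spark connector; the secret scope, key and storage paths are placeholders, not my real setup):

```python
# Job 1 sketch: Azure Event Hubs --> append-only Bronze Delta table.
# Secret scope/key and /mnt/lake paths are placeholders.
conn_str = dbutils.secrets.get(scope="iot", key="eventhub-connection-string")

eh_conf = {
    # The azure-event-hubs-spark connector expects the connection string encrypted.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

raw_stream = (
    spark.readStream
         .format("eventhubs")
         .options(**eh_conf)
         .load()  # the 'body' column holds the raw byte-encoded payload
)

(raw_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/lake/checkpoints/bronze_iot")
    .start("/mnt/lake/delta/bronze_iot"))
```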

Is this the correct way of doing it?

Should I save the raw, still-encoded data in Bronze, or should I decode it first and store the decoded data as my raw layer?

What about schema and decode-logic changes from the IoT source?

Can/should you co-host the first 2 streaming jobs in 1 job to save cost? Or is the merge between Bronze and Silver too heavy to have in 1 job?


1 answer

  1. PRADEEPCHEEKATLA-MSFT 77,081 Reputation points Microsoft Employee
    2021-03-26T10:55:43.393+00:00

    Hello @Alex ,

    • Yes, this is a good and correct way of doing it.
    • You can either decode the stream from the IoT source and then write to Bronze, or write the raw payload as-is. The real question is whether you want to be able to query the data directly in Bronze, or whether you are OK with decoding the payload at query time. If the data is large I wouldn't decode at query time, as it is likely to have a significant impact on query performance: you would have to call your decode function (for example casting the binary body to string and applying from_json()) for each record. Better to take that hit during the stream read, since the data flow is regulated by the rate of data coming from the Event Hub.
    • I would define the schema up front and pass it as part of the spark.readStream() / decoding step rather than relying on inference, so schema and decode-logic changes from the IoT source are handled explicitly in code. A rough sketch of both points follows below.
    • I don't think so - you will probably want to select F-series or other compute-optimized VMs for the cluster.
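
    As an illustration of the second and third points, a minimal sketch of the Bronze-to-Silver job could look like the following. This assumes the payload is JSON; the schema fields, table paths and merge key are placeholders, not a definitive implementation:

    ```python
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
    from delta.tables import DeltaTable

    # Explicit schema for the decoded payload; evolve it in code when the
    # IoT source changes instead of relying on schema inference.
    payload_schema = StructType([
        StructField("deviceId", StringType()),
        StructField("eventTime", TimestampType()),
        StructField("temperature", DoubleType()),
    ])

    bronze = spark.readStream.format("delta").load("/mnt/lake/delta/bronze_iot")

    decoded = (
        bronze
        .withColumn("json", col("body").cast("string"))            # decode the byte payload
        .withColumn("data", from_json(col("json"), payload_schema))
        .select("data.*")
    )

    def upsert_to_silver(microbatch_df, batch_id):
        # De-duplicate the micro-batch on the merge key, then MERGE it into
        # the (pre-created) Silver Delta table.
        batch = microbatch_df.dropDuplicates(["deviceId", "eventTime"])
        silver = DeltaTable.forPath(spark, "/mnt/lake/delta/silver_iot")
        (silver.alias("t")
            .merge(batch.alias("s"),
                   "t.deviceId = s.deviceId AND t.eventTime = s.eventTime")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    (decoded.writeStream
        .foreachBatch(upsert_to_silver)
        .option("checkpointLocation", "/mnt/lake/checkpoints/silver_iot")
        .start())
    ```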

    Hope this helps. Do let us know if you have any further queries.

    ------------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.