Exposing Azure Real Time data in data lake to Other Cloud Vendors

V S Deepak 1 Reputation point
2022-08-23T12:26:01.22+00:00

Hi All,

We have a requirement to expose real-time data coming from multiple sensors with a latency of seconds. At present the data resides in Azure Data Lake. We need to expose this data to customer clouds, which might be AWS, Azure, GCP, or any other cloud. The requirement is to build a generic solution so that we don't need to build a separate solution for each cloud vendor.

As of now we have two options: push vs. pull.

  1. Push mechanism - Build ADF pipelines that push data to the customer's SFTP server on a daily basis, aggregating all the files for the day; since the data is real time, we will receive multiple files depending on the threshold. This is more secure, as we don't allow customers to talk to our data lake. The pipelines will be data driven, based on config tables. Apart from egress charges per TB, we also need to pay for the Data Factory instance. The downsides are the cost and the need to maintain a log, for all vendors, of when the data was pushed, along with a retry mechanism in case of failure.
  2. Pull mechanism - Expose the Azure Data Lake data via Azure API Management. Similarly, the data would be pulled by the client on a daily basis as a consolidated file. Along with egress, we have the API cost here, which I suppose will be much lower than the ADF cost.
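
As a rough illustration of the logging and retry part of option 1, here is a minimal Python sketch. It is not ADF code; `push_with_retry`, the `deliver` callable, and the log field names are all hypothetical stand-ins for whatever the pipeline and the SFTP upload would actually do.

```python
import time
from datetime import datetime, timezone


def push_with_retry(vendor, file_name, deliver, max_attempts=3,
                    base_delay=1.0, sleep=time.sleep):
    """Try to deliver one file to one vendor and return a log of every attempt.

    `deliver(vendor, file_name)` is a hypothetical callable that performs the
    actual upload and raises an exception on failure.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            deliver(vendor, file_name)
            status, error = "succeeded", None
        except Exception as exc:  # record any failure and retry
            status, error = "failed", str(exc)
        log.append({
            "vendor": vendor,
            "file": file_name,
            "attempt": attempt,
            "status": status,
            "error": error,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        if status == "succeeded":
            break
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return log
```

The returned list could be written to the same kind of config or audit table that drives the pipelines, one row per attempt.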

Please let me know which approach is more scalable as the requirement moves toward exposing data to other vendors in near real time.

Many Thanks,
Deepak

Azure Data Lake Storage

An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


1 answer

  1. Sander van de Velde | MVP 37,066 Reputation points MVP
    2022-08-24T06:16:25.82+00:00

    Hello @V S Deepak ,

    There are several approaches to tackle these real-time distribution requirements.

    If you want to keep the data lake approach, an IoT Hub can route (sets of) telemetry to an Azure Storage blob container, in either JSON or Avro format. You can tune the size of the blobs. If you trigger an Azure Function whenever a new file is added, you can publish, e.g., a message to a (public) event hub, call a webhook, or publish a generic cloud event using Event Grid.
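
    A minimal sketch of the "new file, then notify" idea in Python, assuming the notification is sent as a CloudEvents 1.0 JSON envelope to a customer webhook. The function names, the `source` and `type` values, and the `data` fields are made up for illustration; only the envelope attribute names come from the CloudEvents spec. The trigger wiring of the Azure Function itself is left out.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request


def build_blob_cloud_event(container: str, blob_name: str, size_bytes: int) -> dict:
    """Wrap a 'new blob' notification in a CloudEvents 1.0 envelope."""
    return {
        # Required CloudEvents attributes
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": f"/datalake/{container}",             # hypothetical source URI
        "type": "com.example.telemetry.blob.created",   # hypothetical event type
        # Optional CloudEvents attributes
        "subject": blob_name,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        # Payload: illustrative fields only
        "data": {"container": container, "blob": blob_name, "sizeBytes": size_bytes},
    }


def build_webhook_request(url: str, event: dict) -> request.Request:
    """Prepare (but do not send) an HTTP POST carrying the event in structured mode."""
    body = json.dumps(event).encode("utf-8")
    return request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/cloudevents+json; charset=utf-8"},
    )
```

    Sending is then a matter of passing the prepared request to `request.urlopen(req, timeout=10)`. Because the envelope is vendor neutral, the same payload can be consumed from AWS, GCP, or any other cloud.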

    You can also publish a cloud event for each incoming telemetry message, which gives you real-time behavior. Check out the Event Grid support for IoT Hub.
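
    On the receiving side, a customer endpoint in any cloud could handle these events roughly as sketched below. This assumes the Event Grid event schema delivered to a webhook, including the subscription validation handshake, and assumes the telemetry body may arrive base64 encoded; `handle_event_grid_batch` and `decode_body` are hypothetical names, and the exact payload shape should be checked against the Event Grid documentation for IoT Hub.

```python
import base64
import json

VALIDATION_EVENT = "Microsoft.EventGrid.SubscriptionValidationEvent"
TELEMETRY_EVENT = "Microsoft.Devices.DeviceTelemetry"


def decode_body(body):
    """Return the telemetry body as parsed JSON when it is base64-encoded JSON.

    Anything that cannot be decoded that way is returned unchanged.
    """
    if isinstance(body, str):
        try:
            return json.loads(base64.b64decode(body, validate=True))
        except ValueError:  # not base64, not UTF-8, or not JSON
            return body
    return body


def handle_event_grid_batch(events: list) -> dict:
    """Process one webhook delivery (a list of Event Grid events).

    Returns the JSON body the webhook should answer with for a validation
    handshake, or the decoded telemetry bodies otherwise.
    """
    telemetry = []
    for event in events:
        event_type = event.get("eventType")
        data = event.get("data", {})
        if event_type == VALIDATION_EVENT:
            # Echo the validation code to prove ownership of the endpoint.
            return {"validationResponse": data["validationCode"]}
        if event_type == TELEMETRY_EVENT:
            telemetry.append(decode_body(data.get("body")))
    return {"telemetry": telemetry}
```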

    If you want to transform the telemetry first, you will need to add other Azure logic between the IoT Hub and your point of exposure.

    I do not recommend exposing the 'Event Hub-compatible endpoint' of the IoT Hub directly. Yes, it looks like an event hub, but it does not offer the configuration flexibility and control over scalability that the other services mentioned above do.
