Capturing logs from Azure Data Factory and inserting them into a Delta Table in Databricks

Hanna 220 Reputation points
2024-07-17T15:17:27.44+00:00

Good morning,

I need assistance in creating a project that captures logs from Azure Data Factory and inserts them into a Delta Table in Databricks. The key requirements for this project are as follows:

  1. No Duplicate Logs: Ensuring that the logs are not duplicated.
  2. Pre-Established Schema: The logs must follow a pre-established schema for the Delta Table.
  3. Trigger-Based Ingestion: The process should be triggered automatically whenever a pipeline run occurs in Data Factory, so that each run's log is generated and immediately sent to Databricks for monitoring.
  4. Mandatory Storage in Databricks: The logs must be stored in Databricks for compliance and monitoring.

Could you please suggest a cost-effective, efficient, and scalable solution to achieve this?

Thank you!

1 answer

  1. Amira Bedhiafi 20,176 Reputation points
    2024-07-17T22:35:15.4666667+00:00

    To prevent duplicate logs, assign each log entry a unique identifier built from attributes such as the pipeline run ID, timestamp, and any other fields that uniquely identify it. When writing to the Delta Table in Databricks, use the MERGE functionality to upsert records on that identifier, so no duplicates are stored.
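
    A minimal PySpark sketch of that MERGE, assuming the pipeline run logs have already been landed as JSON files in a storage container (as described later in this answer); the path, schema, and table name are placeholders:

    ```python
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the newly landed log files; the path and schema are illustrative.
    new_logs = (spark.read
                .schema("runId STRING, pipelineName STRING, activityName STRING, "
                        "status STRING, startTime TIMESTAMP, endTime TIMESTAMP, "
                        "durationInMs BIGINT, errorMessage STRING")
                .json("abfss://adf-logs@<storage-account>.dfs.core.windows.net/"))

    target = DeltaTable.forName(spark, "monitoring.adf_pipeline_logs")

    # Upsert keyed on runId: a re-delivered run updates its existing row instead
    # of creating a duplicate. Extend the join condition (e.g. with activityName)
    # if you log at activity level.
    (target.alias("t")
           .merge(new_logs.alias("s"), "t.runId = s.runId")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
    ```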

    What pre-established schema should be used?

    The schema for the Delta Table should be defined based on the necessary log attributes from Azure Data Factory. This schema could include fields like runId, pipelineName, activityName, status, startTime, endTime, duration, and any error details. Establish this schema clearly to ensure consistency across all log entries.
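
    For illustration, such a schema could be declared once and used to create the Delta table up front; the table name, field names, and types below are assumptions to adapt:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, LongType)

    spark = SparkSession.builder.getOrCreate()

    # Illustrative log schema; align field names and types with your standard.
    log_schema = StructType([
        StructField("runId", StringType(), nullable=False),
        StructField("pipelineName", StringType(), nullable=False),
        StructField("activityName", StringType(), nullable=True),
        StructField("status", StringType(), nullable=False),
        StructField("startTime", TimestampType(), nullable=True),
        StructField("endTime", TimestampType(), nullable=True),
        StructField("durationInMs", LongType(), nullable=True),
        StructField("errorMessage", StringType(), nullable=True),
    ])

    # Create an empty Delta table with this schema if it does not already exist.
    (spark.createDataFrame([], log_schema)
          .write.format("delta")
          .mode("ignore")
          .saveAsTable("monitoring.adf_pipeline_logs"))
    ```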

    How can we trigger the ingestion process automatically?

    To automate the ingestion process, Azure Event Grid can be used to listen for events related to pipeline runs in Azure Data Factory. When a pipeline run completes, Event Grid can trigger an Azure Function or Logic App, which fetches the log data and sends it to Databricks, so each run's log is captured immediately after the run finishes.
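
    As a rough sketch of that Function (the event payload fields, app setting name, and container name are assumptions rather than a confirmed contract), it could land each run's details as a JSON file that Databricks then ingests:

    ```python
    import json
    import os

    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    # Requires an Event Grid trigger binding in function.json (Python v1 model).
    def main(event: func.EventGridEvent) -> None:
        # Assumption: the event's data payload carries the pipeline run details.
        run = event.get_json()

        record = {
            "runId": run.get("runId"),
            "pipelineName": run.get("pipelineName"),
            "status": run.get("status"),
            "startTime": run.get("runStart"),
            "endTime": run.get("runEnd"),
        }

        # Land one JSON file per run in the container that Databricks reads from.
        storage = BlobServiceClient.from_connection_string(os.environ["LOG_STORAGE_CONNECTION"])
        blob = storage.get_blob_client(container="adf-logs", blob=f"{record['runId']}.json")
        blob.upload_blob(json.dumps(record), overwrite=True)
    ```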

    How should the logs be stored in Databricks for compliance and monitoring?

    Logs should be written to a Delta Table in Databricks. Azure Functions or Logic Apps can process and transform the log data into the predefined schema before inserting it into the Delta Table. Delta Lake's ACID transactions and data versioning capabilities ensure reliable and compliant storage of the log data.
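
    A continuous alternative on the Databricks side is Auto Loader: a job picks up newly landed files incrementally and applies the same runId-keyed MERGE to each micro-batch. The path, checkpoint location, schema, and table name in this sketch are placeholders:

    ```python
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def upsert_batch(batch_df, batch_id):
        # Apply the runId-keyed MERGE shown earlier to every micro-batch.
        target = DeltaTable.forName(spark, "monitoring.adf_pipeline_logs")
        (target.alias("t")
               .merge(batch_df.alias("s"), "t.runId = s.runId")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    (spark.readStream
          .format("cloudFiles")                        # Databricks Auto Loader
          .option("cloudFiles.format", "json")
          .schema("runId STRING, pipelineName STRING, activityName STRING, "
                  "status STRING, startTime TIMESTAMP, endTime TIMESTAMP, "
                  "durationInMs BIGINT, errorMessage STRING")
          .load("abfss://adf-logs@<storage-account>.dfs.core.windows.net/")
          .writeStream
          .foreachBatch(upsert_batch)
          .option("checkpointLocation", "/mnt/checkpoints/adf_logs")
          .trigger(availableNow=True)                  # process new files, then stop
          .start())
    ```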

    What is a cost-effective, efficient, and scalable solution for this?

    • Use Event Grid for event-driven architecture to capture pipeline run completions in real-time.
    • Deploy serverless Azure Functions to process log data. These functions can be triggered by Event Grid events and can scale automatically based on the load, making it cost-effective.
    • Set up Databricks notebooks or jobs to receive the processed log data from Azure Functions and insert it into the Delta Table (see the sketch after this list). Databricks clusters can be configured to auto-scale, ensuring efficiency and cost management.
    • Utilize Delta Lake for robust data storage, providing features such as ACID transactions, scalable metadata handling, and time travel for data versioning.
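
    As a sketch of the hand-off in the third bullet (the host, token, job ID, and parameter name are placeholders), the Azure Function could trigger a pre-created Databricks job through the Jobs API run-now endpoint after landing the log file:

    ```python
    import os
    import requests

    def trigger_ingestion_job(run_id: str) -> None:
        host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
        token = os.environ["DATABRICKS_TOKEN"]  # PAT or AAD token for the workspace

        # Run a pre-created ingestion job, passing the pipeline run ID to its notebook.
        response = requests.post(
            f"{host}/api/2.1/jobs/run-now",
            headers={"Authorization": f"Bearer {token}"},
            json={
                "job_id": int(os.environ["INGESTION_JOB_ID"]),
                "notebook_params": {"run_id": run_id},
            },
            timeout=30,
        )
        response.raise_for_status()
    ```

    Alternatively, the Function can only land the file in storage and let a scheduled or file-arrival-triggered Databricks job handle the ingestion, which keeps the Function minimal and the compute cost on the Databricks side.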