Low-Level Design for Error Logging for a Data Pipeline

Relay 320 Reputation points
2025-07-30T11:58:56.73+00:00

Hello,

Could someone please help me design error logging for the data pipeline shown below?

[Image: data pipeline diagram]

I have to design logging in Databricks CI Satellite EDLAP.

Can I do it in the ADLS Gen2 Silver layer, or do I need any other component?

Could someone please advise how to set up the folder structure, and whether we can use a Delta table for logging? Also, which parameters should we log, and how can their values be captured?

I understand:

I can create a separate Delta table such as error_logs where I can capture useful details: timestamp, table name, pipeline step, error message, source file, and perhaps a JSON column to store the problematic row. I could use try-except blocks in PySpark and append errors to this log table.

Any implementation link would be very helpful; kindly share.

Thanks a lot

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Answer accepted by question author
  1. Venkat Reddy Navari 5,830 Reputation points Microsoft External Staff Moderator
    2025-07-30T17:14:23.09+00:00

    Hi Relay, thanks for the clarification. Since you’ve already created the error_logs Delta table, here are some practical recommendations for folder structure and implementation within your Databricks CI Satellite EDLAP:

    Folder Structure (in ADLS Gen2 Silver Layer)

    Recommended structure for organizing logs:

    
    /mnt/silver/logs/error_logs/            --> Main Delta table
    /mnt/silver/logs/error_logs/archive/    --> Optional: Archive old logs
    /mnt/silver/logs/error_logs/temp/       --> Optional: Temp/staging
    

    You can register the main path as a Delta table and manage archiving via time-based filters (e.g., partitioning by date).
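
    As a rough sketch, you could register the path in the metastore and handle archiving with a date filter along these lines (the silver schema name and the date column are assumptions here, adjust to your environment):

    from pyspark.sql import functions as F

    # Register the existing Delta path as a metastore table so it can be queried by name.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS silver.error_logs
        USING DELTA
        LOCATION '/mnt/silver/logs/error_logs/'
    """)

    # Example time-based archiving: copy logs older than 90 days to the archive path,
    # then remove them from the main table.
    old_logs = (spark.table("silver.error_logs")
                     .where(F.col("date") < F.date_sub(F.current_date(), 90)))
    old_logs.write.mode("append").format("delta").save("/mnt/silver/logs/error_logs/archive/")
    spark.sql("DELETE FROM silver.error_logs WHERE date < date_sub(current_date(), 90)")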

    What to Log (Parameters)

    You’re on the right track. Suggested fields:

    • timestamp
    • pipeline_name, step_name
    • error_message, error_type
    • source_file, table_name
    • row_data (JSON string of failed row)
    • run_id or job_id

    Use partitionBy("date") if you're expecting a large log volume; a sketch of the table setup follows below.
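
    For reference, a minimal sketch of creating the partitioned log table up front with an explicit schema (field names follow the list above; the extra date column is an assumption used only for partitioning):

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DateType

    # Illustrative schema for error_logs; adjust names and types to your standards.
    error_log_schema = StructType([
        StructField("timestamp",     TimestampType(), False),
        StructField("date",          DateType(),      False),  # partition column
        StructField("pipeline_name", StringType(),    True),
        StructField("step_name",     StringType(),    True),
        StructField("error_message", StringType(),    True),
        StructField("error_type",    StringType(),    True),
        StructField("source_file",   StringType(),    True),
        StructField("table_name",    StringType(),    True),
        StructField("row_data",      StringType(),    True),   # JSON string of the failed row
        StructField("run_id",        StringType(),    True),
    ])

    # Create the (empty) partitioned Delta table once; "ignore" is a no-op if it already exists.
    (spark.createDataFrame([], error_log_schema)
          .write.format("delta")
          .partitionBy("date")
          .mode("ignore")
          .save("/mnt/silver/logs/error_logs/"))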

    Capturing Errors in PySpark

    Wrap your processing steps with try-except blocks and append to the Delta log table:

    
    from datetime import datetime

    try:
        result_df = transform(source_df)  # placeholder for your transformation logic
    except Exception as e:
        # row_json and run_id are assumed to be captured earlier in the step
        log_df = spark.createDataFrame([{
            "timestamp": datetime.now(),
            "pipeline_name": "CI_EDLAP",
            "step_name": "transform_step_1",
            "error_message": str(e),
            "error_type": type(e).__name__,
            "source_file": "file.json",
            "table_name": "target_table",
            "row_data": row_json,
            "run_id": run_id
        }])
        log_df.write.mode("append").format("delta").save("/mnt/silver/logs/error_logs/")
    
    
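    Once errors are being appended, you can read the same path back for monitoring. A small usage example (field names as above):

    # Most recent errors for a given pipeline, newest first.
    (spark.read.format("delta").load("/mnt/silver/logs/error_logs/")
          .where("pipeline_name = 'CI_EDLAP'")
          .orderBy("timestamp", ascending=False)
          .limit(50)
          .show(truncate=False))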

    Hope this helps. If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful". And if you have any further queries, do let us know.

