Cleanse, transform, and load data into Unity Catalog
Intermediate
Data Engineer
Azure Databricks
Data engineering requires transforming raw data into clean, well-structured formats ready for analysis. This module explores techniques for profiling data quality, selecting appropriate column types, resolving duplicates and null values, applying filtering and aggregation transformations, combining datasets with joins and set operators, reshaping data through pivoting and denormalization, and loading transformed data using append, overwrite, and merge strategies.
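The cleanse-transform-load flow described above can be sketched end to end. This is a minimal illustration only: an in-memory SQLite database stands in for a Unity Catalog schema so the SQL patterns run anywhere, and the table and column names (`raw_orders`, `clean_orders`) are hypothetical, not part of the module.

```python
import sqlite3

# Hypothetical example data; SQLite stands in for Databricks SQL here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "east", 10.0),
        (1, "east", 10.0),   # exact duplicate to be removed
        (2, "west", None),   # null amount to be filtered out
        (3, "east", 5.0),
    ],
)

# Cleanse: remove exact duplicates with DISTINCT and filter out null amounts,
# loading the result into a clean table with an append-style INSERT.
cur.execute("CREATE TABLE clean_orders (order_id INTEGER, region TEXT, amount REAL)")
cur.execute(
    """
    INSERT INTO clean_orders
    SELECT DISTINCT order_id, region, amount
    FROM raw_orders
    WHERE amount IS NOT NULL
    """
)

# Transform: group and aggregate the cleaned rows.
totals = cur.execute(
    "SELECT region, SUM(amount) FROM clean_orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('east', 15.0)]
```

The same `SELECT DISTINCT … WHERE … IS NOT NULL` and `GROUP BY` patterns carry over directly to Databricks SQL against Unity Catalog tables.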
Learning objectives
By the end of this module, you'll be able to:
- Profile data using SQL commands and data profiling features to assess data quality
- Choose appropriate column data types to optimize storage and ensure data integrity
- Identify and resolve duplicate, missing, and null values in datasets
- Apply filtering, grouping, and aggregation operations to transform data
- Combine datasets using joins and set operators like UNION, INTERSECT, and EXCEPT
- Reshape data using denormalization, pivot, and unpivot techniques
- Load transformed data into Unity Catalog tables using INSERT, MERGE, and overwrite operations
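To make the set operators named in the objectives concrete, here is a small runnable sketch. It uses in-memory SQLite purely for portability; `UNION`, `INTERSECT`, and `EXCEPT` behave the same way in Databricks SQL. The table names (`store_a`, `store_b`) are invented for the example.

```python
import sqlite3

# Hypothetical product lists for two stores; SQLite stands in for Databricks SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE store_a (sku TEXT)")
cur.execute("CREATE TABLE store_b (sku TEXT)")
cur.executemany("INSERT INTO store_a VALUES (?)", [("s1",), ("s2",), ("s3",)])
cur.executemany("INSERT INTO store_b VALUES (?)", [("s2",), ("s3",), ("s4",)])

# UNION: distinct SKUs appearing in either table.
union = cur.execute(
    "SELECT sku FROM store_a UNION SELECT sku FROM store_b ORDER BY sku"
).fetchall()

# INTERSECT: SKUs present in both tables.
both = cur.execute(
    "SELECT sku FROM store_a INTERSECT SELECT sku FROM store_b ORDER BY sku"
).fetchall()

# EXCEPT: SKUs in store_a that are absent from store_b.
only_a = cur.execute(
    "SELECT sku FROM store_a EXCEPT SELECT sku FROM store_b"
).fetchall()

print(union)   # [('s1',), ('s2',), ('s3',), ('s4',)]
print(both)    # [('s2',), ('s3',)]
print(only_a)  # [('s1',)]
```

All three operators deduplicate their output by default, which is why the duplicate `s2` and `s3` rows appear only once in the union.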
Prerequisites
Before starting this module, you should have:
- Basic understanding of Azure Databricks and Unity Catalog concepts
- Familiarity with SQL and Python programming
- Knowledge of data engineering concepts such as data quality and transformations