Shape and clean data
Raw data rarely arrives ready for analysis. Tables might contain duplicate rows from repeated data loads, null values from optional form fields, inconsistent column names from source systems, or columns in the wrong data type for downstream calculations. If you query this data directly, aggregations overcount, joins fail silently, and reports display misleading results.
Shaping and cleaning data addresses these issues early in the pipeline, before they compound into bigger problems. In this unit, you walk through a realistic sequence of transformations on a sales dataset, using both Spark SQL and PySpark. Select your preferred syntax using the tabs. When you choose a tab, all code samples on this page switch to that language.
Read data from a lakehouse table
The first step in any notebook transformation is reading the data. With a default lakehouse pinned, you can reference tables by name without specifying a full path. This makes your code portable across environments.
Imagine the results show a sales table with 11 rows, including some that look identical and a few cells showing null where a region wasn't recorded. The following transformations clean up these issues one at a time.
Remove duplicate rows
Duplicate rows are common when data is loaded from multiple batches or when a source system retries failed writes. Duplicates inflate counts, sums, and averages. Removing them first ensures that every subsequent transformation operates on accurate data.
If you started with 11 rows and two were identical, deduplication reduces the count to 10. When rows differ in some fields but match on a business key like order_id, you can deduplicate on that specific column instead.
-- Remove exact duplicate rows
SELECT DISTINCT * FROM raw_sales
-- Or deduplicate by a business key (keeps one row per order_id)
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_date) AS rn
FROM raw_sales
) WHERE rn = 1
Handle null values
Null values cause different problems depending on where they appear. A null in a numeric column produces null results for any calculation that involves it. A null in a grouping column creates an "Unknown" group that's easy to overlook. Handling nulls explicitly ensures predictable behavior.
You have two main strategies: replace nulls with meaningful defaults, or drop rows that are unusable without a value. For example, a missing region can be filled with "Unknown," but a sales record without a customer_id can't be joined to customer data and might be better removed.
-- Replace nulls with default values
SELECT
order_id,
customer_id,
amount,
COALESCE(region, 'Unknown') AS region,
COALESCE(discount, 0) AS discount
FROM raw_sales
-- Or drop rows where a required column is null
SELECT * FROM raw_sales
WHERE customer_id IS NOT NULL
After replacing nulls, rows that previously showed blank regions now display "Unknown," and discount calculations use zero instead of propagating nulls.
Filter rows
Filtering narrows your dataset to the rows that matter for your analysis. This reduces data volume for downstream steps and ensures your results reflect the correct scope. For example, you might filter to a specific year, exclude test transactions, or focus on a region.
In PySpark, wrap each condition in parentheses and use & for AND and | for OR. This avoids operator precedence issues that can produce wrong results.
Select and rename columns
Source systems often use cryptic or inconsistent column names like cust_id, CUST_ID, or CustomerIdentifier across different tables. Selecting only the columns you need and renaming them to a consistent convention simplifies downstream joins and makes your data easier for colleagues (and AI tools like Copilot) to interpret.
SELECT
order_id,
customer_id,
product_id,
quantity,
unit_price,
order_date,
region
FROM raw_sales
Add calculated columns
Calculated columns derive new values that don't exist in the source data but are important for analysis. A line_total column (quantity times unit price) saves every downstream query from repeating the same calculation. Extracting date parts like year or month creates convenient grouping columns.
SELECT *,
quantity * unit_price AS line_total,
YEAR(order_date) AS order_year,
MONTH(order_date) AS order_month
FROM clean_sales
After this step, each row includes a pre-computed line_total and extracted date parts. A row with quantity 5 and unit_price 29.99 now shows a line_total of 149.95.
Create conditional columns
Conditional columns categorize data based on business rules. Instead of repeating CASE logic in every report, you define the categorization once in the transformation layer. This ensures consistent classification everywhere the data is consumed.
SELECT *,
CASE
WHEN line_total > 500 THEN 'High'
WHEN line_total > 100 THEN 'Medium'
ELSE 'Low'
END AS value_tier
FROM clean_sales
An order with a line_total of 149.95 is now categorized as "Medium," while a 199.99 order for a single premium device falls into the same tier. The thresholds reflect your business definitions and stay consistent across every downstream use.
Convert data types
Type mismatches happen when data is loaded from CSV files (where everything is a string) or when source systems use different precision than you need. A string column that looks like a number causes errors in math operations. A low-precision decimal rounds off values that matter for financial reporting. Converting types explicitly catches these issues at transformation time rather than in a downstream report.
SELECT
CAST(amount AS DECIMAL(10,2)) AS amount,
CAST(order_date AS DATE) AS order_date,
CAST(quantity AS INT) AS quantity
FROM raw_sales
Note
Spark SQL and PySpark support the same underlying data types. You can mix both languages in the same notebook using the %%sql magic command. Choose whichever syntax feels more natural for each task.
The cleaned dataset
After applying these transformations in sequence, your dataset has changed substantially:
| Before | After |
|---|---|
| 11 rows with duplicates | 10 unique rows |
| Null values in region and discount | Filled with "Unknown" and 0 |
| All original columns | Only needed columns, consistently named |
| No derived values | line_total, order_year, order_month, value_tier added |
| Raw data types from source | Explicit types for decimals, dates, integers |
Each transformation addressed a specific data quality problem. In production, you typically chain these steps together in a single notebook, verifying results at each stage before moving on.
You can now shape individual tables into clean, consistent formats. In the next unit, you learn to combine data from multiple tables and calculate aggregated results.