Combine and aggregate data

Completed

After shaping individual tables, you often need to combine data from multiple sources and calculate summary statistics. Joins bring together related tables, aggregations summarize data into meaningful metrics, and window functions compute values across related rows without collapsing the detail. These patterns form the core of most analytics transformations.

Continuing with the cleaned sales data from the previous unit, you now have 10 unique rows with consistent columns, no nulls, and calculated fields like line_total and value_tier. The following transformations combine this data with customer and product tables, then summarize it for analysis.

Join tables

Joins combine rows from two or more tables based on a shared column. Use joins to enrich transaction data with customer details, product names, or geographic information.

SELECT
    s.order_id,
    s.amount,
    c.customer_name,
    c.region
FROM sales s
INNER JOIN customers c ON s.customer_id = c.customer_id

Before the join, the sales table had customer_id values like "C001" but no customer names. After the inner join, each row now includes the matching customer_name and region from the customers table. If a sales row had a customer_id that didn't exist in the customers table, an inner join would exclude it. The result might look like this:

order_id amount customer_name region
1 149.95 Contoso Ltd East
3 199.99 Contoso Ltd East
2 99.98 Fabrikam Inc West
4 299.90 Northwind Traders North

Spark supports the same join types you know from standard SQL:

Join type Returns
INNER Rows that match in both tables
LEFT All rows from the left table, with matches from the right
RIGHT All rows from the right table, with matches from the left
FULL OUTER All rows from both tables, with nulls where no match exists
CROSS Every combination of rows from both tables

Choose the join type based on your data requirements. For most fact-to-dimension joins, inner or left joins are appropriate.

Aggregate data with grouping

Aggregations collapse rows into summary values. Use GROUP BY to calculate totals, averages, counts, and other metrics by category. To filter aggregated results, use HAVING in Spark SQL or .filter() after aggregation in PySpark.

-- Basic aggregation by region
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_order_value
FROM sales
GROUP BY region

-- Filter aggregated results with HAVING
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000

The 10 detail rows are now collapsed into one row per region. A result set might look like this:

region order_count total_sales avg_order_value
East 3 509.92 169.97
West 3 195.94 65.31
North 2 359.84 179.92
South 1 399.98 399.98
Unknown 1 49.99 49.99

Notice how the row count changed from 10 individual orders to 5 summary rows. Each row represents a region with its aggregated metrics.

Apply window functions

Window functions calculate values across a set of rows related to the current row, without collapsing the data into groups. Use them for rankings, running totals, and comparisons between rows.

SELECT
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) AS order_sequence,
    SUM(amount) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) AS running_total
FROM orders

Unlike aggregation, window functions keep all original rows. The result adds new columns alongside the existing detail:

customer_id order_date amount order_sequence running_total
C001 2024-01-15 149.95 1 149.95
C001 2024-02-10 199.99 2 349.94
C001 2024-03-15 119.96 3 469.90
C002 2024-01-20 99.98 1 99.98
C002 2024-03-01 47.97 2 147.95

Each customer's orders are numbered sequentially, and the running total accumulates within each customer's partition. The original row count stays the same.

The PARTITION BY clause defines the group of rows for the calculation. The ORDER BY clause determines the sequence within each partition. Common window functions include:

  • ROW_NUMBER() assigns a sequential number within each partition
  • RANK() assigns rank with gaps for ties
  • LAG() and LEAD() access values from previous or next rows
  • SUM(), AVG(), COUNT() over a window calculate running or sliding aggregates

Use common table expressions

Common table expressions (CTEs) let you break complex queries into named, readable steps. Each CTE defines a temporary result set you can reference in subsequent steps. In PySpark, you achieve the same result by chaining DataFrame operations, where each variable holds an intermediate result.

WITH monthly_sales AS (
    SELECT
        YEAR(order_date) AS year,
        MONTH(order_date) AS month,
        SUM(amount) AS total
    FROM sales
    GROUP BY YEAR(order_date), MONTH(order_date)
),
ranked AS (
    SELECT *,
        RANK() OVER (ORDER BY total DESC) AS sales_rank
    FROM monthly_sales
)
SELECT * FROM ranked WHERE sales_rank <= 5

The CTE approach first summarizes sales by month, then ranks the months. Only the top 5 months by total sales survive the final filter. A result might look like:

year month total sales_rank
2024 2 499.89 1
2024 4 449.96 2
2024 3 217.93 3
2024 1 249.93 4

Pivot data

Pivoting reshapes data from rows to columns, which is useful for creating cross-tabulation summaries. For example, you can transform quarterly sales rows into columns for side-by-side comparison. Specifying the pivot values (like Q1 through Q4) improves performance because Spark doesn't need to scan the data first to discover the distinct values.

SELECT * FROM (
    SELECT product, quarter, amount FROM sales
)
PIVOT (
    SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
)

The pivot transforms multiple rows per product into a single row with one column per quarter:

product Q1 Q2 Q3 Q4
Widget A 389.87 89.97 null null
Widget B 99.98 null null null
Premium Device 199.99 399.98 null null

Null values appear where a product had no sales in that quarter. This format is useful for reports that compare performance across time periods side by side.

Tip

When building complex transformations, start simple and build up. Test each step in a separate cell before combining them into a complete pipeline.

Your transformations are ready. In the next unit, you learn how to write the results to Delta tables with appropriate sizing and partitioning.