Knowledge Check

1.

A data team wants to run Spark SQL queries in a PySpark notebook. What do they need to add at the top of a cell to run SQL?

%spark

%%sql

%%pyspark

2.

An analytics engineer needs to replace null values in a discount column with zero using PySpark. Which method should they use?

df.dropna(subset=["discount"])

df.fillna({"discount": 0})

df.filter(col("discount").isNotNull())

3.

A team writes a nightly transformation that replaces all data in a gold-layer table with freshly processed results. Which write mode should they use?

append

overwrite

merge

4.

What does a window function provide that a standard GROUP BY aggregation does not?

It calculates aggregated values while keeping the individual row detail.

It runs faster than GROUP BY on large datasets.

It supports more aggregation functions than GROUP BY.

5.

A table has grown to contain many small Parquet files after weeks of incremental appends. Which command consolidates these files to improve query performance?

VACUUM

OPTIMIZE

ANALYZE TABLE
