Spark stage high I/O
Next, look again at the I/O stats of the longest stage.
How much data needs to be in an I/O column to be considered high? To figure this out, start with the highest number in any of the I/O columns. Then consider the total number of CPU cores across all your workers. Generally, each core can read and write about 3 MB per second.
Divide your biggest I/O column by the number of worker cores in the cluster, then divide that by the stage's duration in seconds. If the result is around 3 MB per second, then you're probably I/O bound. That would be high I/O.
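The check above can be sketched in plain Python. The numbers here are hypothetical placeholders; plug in your own stage stats:

```python
# Back-of-the-envelope check: is this stage I/O bound?
# All values below are hypothetical examples, not real cluster stats.

def io_throughput_per_core_mb(largest_io_bytes, worker_cores, duration_seconds):
    """Per-core I/O throughput for a stage, in MB per second."""
    mb = largest_io_bytes / (1024 * 1024)
    return mb / worker_cores / duration_seconds

# Example: 360 GB read, 100 worker cores, 1,200-second stage.
throughput = io_throughput_per_core_mb(360 * 1024**3, 100, 1200)
print(round(throughput, 2))  # ~3.07 MB/s per core: close to 3, so likely I/O bound
```

If the result comes out well under 3 MB per second per core, I/O is probably not your bottleneck and you should look elsewhere.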
If you see a lot of input into your stage, that means you’re spending a lot of time reading data. First, identify what data this stage is reading. See Identifying an expensive read in Spark’s DAG.
After you identify the specific data, here are some approaches to speeding up your reads:
- Use Delta.
- Try Photon. It can help a lot with read speed, especially for wide tables.
- Make your query more selective so it doesn’t need to read as much data.
- Reconsider your data layout so that data skipping is more effective.
- If you’re reading the same data multiple times, use the Delta cache.
- If you’re doing a join, consider trying to get dynamic file pruning (DFP) working.
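Two of the approaches above, the Delta cache and a more selective read, can be sketched in a few lines of PySpark. This is a minimal sketch assuming a Databricks cluster with an existing `spark` session; the table name `events`, its columns, and the filter value are hypothetical placeholders:

```python
# Sketch only: assumes a Databricks cluster; `events` and its columns are
# hypothetical placeholders for your own table.
from pyspark.sql import functions as F

# Cache repeated reads of the same remote data on local disks (Delta cache).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Be selective: a data-skipping-friendly filter plus a narrow projection
# lets Delta skip files and columns instead of reading everything.
df = (
    spark.read.table("events")
    .where(F.col("event_date") >= "2024-01-01")  # prune files via data skipping
    .select("user_id", "event_type")             # read only the columns you need
)
```

The filter only helps data skipping if the filtered column correlates with how the data is laid out, which is why the layout bullet above matters.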
If you see a lot of output from your stage, that means you’re spending a lot of time writing data. Here are some approaches to resolving this:
- Are you rewriting a lot of data? See How to determine if Spark is rewriting data to check. If you are:
  - See if you have a merge that needs to be optimized.
  - Use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file.
- Enable Photon if it isn’t already. Photon can help a lot with write speed.
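Enabling deletion vectors is a one-time table property change. This is a sketch assuming a recent Delta Lake or Databricks Runtime version with deletion vector support; the table name `my_table` is a hypothetical placeholder:

```python
# Sketch only: `my_table` is a hypothetical Delta table name.
# With deletion vectors enabled, DELETE/UPDATE/MERGE can mark rows as
# removed instead of rewriting whole Parquet files.
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```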
If you see a lot of shuffle read or shuffle write, your stage is spending time redistributing data across the cluster. If you’re not familiar with shuffle, this is the time to learn.
If you don’t see high I/O in any of the columns, then you need to dig deeper. See Slow Spark stage with little I/O.