Setting spark.local.dir to a location with more disk space may not resolve the issue, because the job is already using 100+ GB of disk space.
What we are observing is that, as the match phase begins for this volume of data, RAM usage quickly reaches around 28 GB. After that, the system starts writing to disk, and the disk fills up. Once the disk reaches 100% usage, we encounter the error shared with you earlier.
At the start of the process, disk usage is minimal (around 3–4%), so the spike during execution clearly exceeds 100 GB of disk space.
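In case it helps to put an exact number on that spike, here is a minimal sketch of how the disk growth could be sampled while the match phase runs. It uses only the Python standard library. The function names, the sampling interval, and the path you pass in are assumptions for illustration, not part of our actual setup; the path should be whatever directory spark.local.dir points to.

```python
import shutil
import time


def used_bytes(path):
    """Return the bytes currently used on the filesystem that holds `path`."""
    return shutil.disk_usage(path).used


def measure_spill(path, samples, interval_s=5.0):
    """Sample disk usage `samples` times and return (baseline, peak) in bytes.

    `path` is assumed to be the directory Spark writes temporary files to.
    The baseline is taken once, before the first sleep.
    """
    baseline = used_bytes(path)
    peak = baseline
    for _ in range(samples):
        time.sleep(interval_s)
        peak = max(peak, used_bytes(path))
    return baseline, peak
```

Running measure_spill in a separate terminal for the duration of the match phase and subtracting the baseline from the peak gives the temporary space the job needs. Note that this measures the whole filesystem, so anything else writing to the same disk is included in the figure.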
Additionally, for context, the model we are using to process 4 million rows in the match phase was trained on a subset of 100k rows.