How to Reduce Data Skew When Scaling Spark Jobs to 20 Million Records

·Nov 21, 2024 05:04 PM

Getting some awful skews when trying to scale up to 20m records. I've tried:

1. Scaling up the partitions
2. Scaling down the partitions
3. Scaling up the number of workers
4. Using spark.sql.adaptive.forceOptimizeSkewedJoin

Is this expected? Are there any ideas for what we could do to reduce this skew?

3 comments


Luke B.
·
Skew might indicate that one (or a few) of the blocks of potential matches is too large. Any chance you are using a field with a bunch of missing values? Or maybe just a field with a very skewed distribution of values (i.e. some super common values)?
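One rough way to check this is to profile a candidate field on a sample of the data before running the full job. The sketch below is plain Python and not part of Zingg; `profile_field` and the `<missing>` placeholder are hypothetical names. It assumes, purely for illustration, that records sharing the same normalised value would end up being compared with each other, so a group of n records costs n(n-1)/2 comparisons. Zingg's learned blocking is more involved than that, so treat the numbers as a hint rather than a prediction.

```python
from collections import Counter
from typing import Iterable, Optional


def profile_field(values: Iterable[Optional[str]], top_n: int = 3) -> dict:
    """Summarise how unevenly one field's values are distributed.

    Hypothetical helper: None and blank strings are counted together
    under the placeholder "<missing>".
    """
    counts = Counter(
        "<missing>" if v is None or not v.strip() else v.strip().lower()
        for v in values
    )
    total = sum(counts.values())
    if total == 0:
        return {
            "total": 0,
            "missing_share": 0.0,
            "top_values": [],
            "largest_group_pair_share": 0.0,
        }

    # Assumption: every record in a group is compared with every other
    # record in that group, i.e. n * (n - 1) / 2 pairs per group.
    pairs = {value: n * (n - 1) // 2 for value, n in counts.items()}
    total_pairs = sum(pairs.values())
    largest_value, _ = counts.most_common(1)[0]

    return {
        "total": total,
        "missing_share": counts.get("<missing>", 0) / total,
        "top_values": counts.most_common(top_n),
        "largest_group_pair_share": (
            pairs[largest_value] / total_pairs if total_pairs else 0.0
        ),
    }


# Made-up sample values, only to show the call.
sample = ["Mr", "Mrs", None, None, None, None, "", "Dr", None, None]
print(profile_field(sample))
```

If `missing_share` or `largest_group_pair_share` is close to 1 for a field, that field is a likely suspect for the oversized block. On a Spark DataFrame the same counts can be obtained with `df.groupBy("field").count()` ordered by count descending.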
Sonal G.
·
What’s your training set match pair count, and have you run train again on the bigger set (assuming you built the model on a smaller dataset)?
Dara O.
·
Hmm... The similarity scores for "first name" and "salutation" after matching have been relatively high. Salutation has a lot of nulls, which may be an issue, but it doesn't change much for matches. The training data has only ~252 marked records, but I did run "train" over a 0.1 subset of the total records, as that ran very quickly.