Hey All, I'm currently working with large tables stored in Hive. Exploring the data through the interactive interface takes a long time, even though the environment is distributed. When I try to use a sample instead, the dataset tends to be imbalanced, which prevents me from proceeding to the training phase. Do you have any suggestions on how to address this? Also, is there a way to bypass the interactive interface and move directly to the training phase? PS: I'm currently working with Zingg version 3.4. Thank you in advance.
Thank you for your response. The dataset contains 140,312 rows and 10 columns. As for the cluster: it runs Spark jobs in client mode via YARN, with a total capacity for this job of approximately 30 CPU cores and 160 GB of RAM for the executors, plus 12 GB of RAM for the driver. The value of labelDataSampleSize is set to 0.5.
That is not a lot of records, and you have a great cluster. We run the 5M-record ncVoter dataset locally on our 4-core machines with 16 GB of RAM 😉 Try reducing labelDataSampleSize to 0.05 and see if that works. While adding trainingSamples manually is possible, the results are generally much poorer in terms of both accuracy and performance, so we do not recommend skipping the labeling. HTH
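To put the suggested change in numbers, here is a rough sketch of how many rows each labelDataSampleSize value would correspond to for the 140,312-row table mentioned above. It assumes the value is a simple fraction of the input rows; how Zingg actually draws the sample internally may differ, and the helper name `sampled_rows` is made up for illustration.

```python
# Rough estimate only: assumes labelDataSampleSize is a plain fraction
# of the input rows. Zingg's internal sampling may behave differently.
ROW_COUNT = 140_312  # row count reported earlier in the thread


def sampled_rows(row_count: int, fraction: float) -> int:
    """Approximate number of rows in a sample of the given fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(round(row_count * fraction))


for fraction in (0.5, 0.05, 0.01):
    rows = sampled_rows(ROW_COUNT, fraction)
    print(f"labelDataSampleSize={fraction}: about {rows:,} rows")
```

Under that assumption, moving from 0.5 to 0.05 cuts the sample that findTrainingData has to work through by a factor of ten.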
Thank you Sonal G., that worked. I just have one last question. When I switch to interactive learning, all the blocks generated during the labeling phase only contain non-matching pairs. No matched pairs are returned, which results in an imbalanced dataset and prevents me from proceeding properly to the training phase. Do you have any advice on how to overcome this issue or how to ensure a more balanced sample during interactive labeling?
Hello Sonal G., I set the labelDataSampleSize parameter to 0.05, which corresponds to a 5% sample of the dataset. Since making this change, the findTrainingData phase takes between 4 minutes (during the first cycle) and 34 minutes (in subsequent cycles). I have three questions: (1) Is it possible to further optimize the findTrainingData phase to reduce its duration? (2) When labelDataSampleSize is set to 5%, does the training phase (train) rely solely on this sample, or does it still process the entire dataset? (3) Since the 5% of data is randomly selected by the model to infer blocking rules, is this sample truly sufficient to contain between 40 and 50 matches? Thank you in advance 😄
The first run of findTrainingData is different from subsequent runs, since it has no previous labels to learn from. That’s why you are seeing the variation in run time. You can absolutely go down to 0.01, or experiment with a few different values, to get training samples that cover the breadth of the matches and non-matches you see.
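One way to try a few values side by side is to derive config variants that differ only in labelDataSampleSize. This is a minimal sketch: the `base_config` dict is a stand-in holding just that one key (a real Zingg config carries more settings, omitted here), and the helper `with_sample_size` is hypothetical, not part of Zingg.

```python
import json


def with_sample_size(config: dict, fraction: float) -> dict:
    """Return a copy of the config with labelDataSampleSize replaced."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    updated = dict(config)  # shallow copy, leaves the original untouched
    updated["labelDataSampleSize"] = fraction
    return updated


# Stand-in config: only the key discussed in this thread is shown.
base_config = {"labelDataSampleSize": 0.5}

# One variant per candidate value mentioned above.
variants = {f: with_sample_size(base_config, f) for f in (0.05, 0.01)}

for fraction, cfg in variants.items():
    print(fraction, json.dumps(cfg))
```

Each variant could then be written to its own file and used for a separate findTrainingData run, so the run times and the mix of matches and non-matches can be compared.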
With a well-trained model, a dataset of your size should be matched in minutes on a standard machine with 8 cores and 16 GB of RAM. How many match labels do you have?
45/643 matches
Did your previous run complete? You can absolutely label more if the time taken is very high or the job did not succeed.
Yes, the training phase completed successfully before moving on to the matching phase. I tried increasing the labels to 80/839, but it didn't help: the process still takes a long time without producing any output.
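For reference, the two label counts quoted in this thread can be compared as shares. The sketch below assumes "45/643" and "80/839" mean matches out of all labeled pairs; if the second number is instead the count of non-matches, the shares would be slightly lower. The helper `match_share` is illustrative only.

```python
def match_share(matches: int, total_labeled: int) -> float:
    """Fraction of labeled pairs that were marked as matches."""
    if total_labeled <= 0:
        raise ValueError("total_labeled must be positive")
    return matches / total_labeled


# Counts as quoted in the thread, read as (matches, total labeled pairs).
label_rounds = {
    "first labeling round": (45, 643),
    "after further labeling": (80, 839),
}

for name, (matches, total) in label_rounds.items():
    share = match_share(matches, total)
    print(f"{name}: {matches} of {total} labeled pairs are matches ({share:.1%})")
```

Read this way, the extra labeling raised both the number of matches and their share of the labeled pairs, yet the job still did not finish.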
What are the JVM settings for the cluster?