Meriem

Commented on Improving Hive Large Table Performance and Managin... · Posted in Help Zingg

--master yarn \
--deploy-mode client \
--executor-memory 16G \
--driver-memory 12G \
--executor-cores 3 \
--num-executors 10 \

Meriem

Yes, the training phase completed successfully before moving on to the matching phase. I tried increasing the labels up to 80/839, but it didn't help: the process takes a long time without producing any output.

Meriem

45/643 matches

Meriem

Thank you Sonal G. for your feedback and clarifications. I have now reached the matching stage, but I notice that it is taking too long: it has been running for over two hours without finishing. Is this normal?

Meriem

Alright, thank you Sonal G. for your response. Please correct me if I'm mistaken: from what I understand, the Zingg model is initially trained on manually labeled data pairs, and then it is applied to the entire dataset to identify matches (the match command). Is that accurate?

Meriem

Hello Sonal G., I configured the labelDataSampleSize parameter to 0.05, which corresponds to a 5% sample of the dataset. Since making this change, the duration of the findTrainingData phase now ranges from 4 minutes (during the first cycle) to 34 minutes in subsequent cycles. I have three questions regarding this:

1. Is it possible to further optimize the findTrainingData phase to reduce its duration?
2. When setting labelDataSampleSize to 5%, does the training phase (train) rely solely on this sample, or does it still process the entire dataset despite the limitation?
3. Since the 5% of data is randomly selected by the model to infer blocking rules, I still wonder whether this sample is truly sufficient to contain between 40 and 50 matches.

Thank you in advance 😄.
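To make the third question concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each row is kept independently with probability equal to the sample fraction, which is an assumption made purely for illustration and may not reflect how Zingg actually draws its sample. Under that assumption, a duplicate pair survives only if both of its records are sampled. The function name `duplicate_pairs_needed` is hypothetical; the row count (140,312) is the figure given elsewhere in this thread.

```python
# Back-of-the-envelope estimate; NOT Zingg's actual sampling logic.
# Assumption: every row is kept independently with probability `sample_fraction`.
rows = 140_312           # dataset size reported in this thread
sample_fraction = 0.05   # labelDataSampleSize

# Expected number of rows in the sample.
sampled_rows = rows * sample_fraction

# A duplicate pair is fully present only if both records are sampled.
pair_keep_prob = sample_fraction ** 2


def duplicate_pairs_needed(target_matches, fraction):
    """Duplicate pairs the full dataset would need so that, on average,
    `target_matches` complete pairs land in the sample (hypothetical helper)."""
    return target_matches / fraction ** 2


print(round(sampled_rows))                                   # 7016
print(round(duplicate_pairs_needed(40, sample_fraction)))    # 16000
print(round(duplicate_pairs_needed(50, sample_fraction)))    # 20000
```

If the independence assumption holds even roughly, a purely random 5% sample would need on the order of 16,000 to 20,000 duplicate pairs in the full table to contain 40 to 50 of them, which is why the question of how the sample is drawn matters.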

Meriem

Thank you Sonal G., that worked. I just have one last question. When I switch to interactive learning, all the blocks generated during the labeling phase only contain non-matching pairs. No matched pairs are returned, which results in an imbalanced dataset and prevents me from proceeding properly to the training phase. Do you have any advice on how to overcome this issue or how to ensure a more balanced sample during interactive labeling?

Meriem

Thank you for your response. The dataset contains 140,312 rows and 10 columns. As for the cluster: it runs Spark jobs in client mode via YARN, with a total capacity for this job of approximately 30 CPU cores and 160 GB of RAM for the executors, plus 12 GB of RAM for the driver. The value of labelDataSampleSize is set to 0.5.
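As a quick sanity check, the totals quoted here line up with the spark-submit flags posted in this thread (10 executors with 3 cores and 16G each). The short Python sketch below only reproduces that arithmetic; the variable names are illustrative and are not Spark or Zingg settings.

```python
# Values copied from this thread; variable names are illustrative only.
num_executors = 10             # --num-executors 10
executor_cores = 3             # --executor-cores 3
executor_memory_gb = 16        # --executor-memory 16G
rows = 140_312                 # rows in the dataset
label_data_sample_size = 0.5   # labelDataSampleSize at this point in the thread

total_cores = num_executors * executor_cores
total_executor_memory_gb = num_executors * executor_memory_gb
sampled_rows = round(rows * label_data_sample_size)

print(total_cores)               # 30
print(total_executor_memory_gb)  # 160
print(sampled_rows)              # 70156
```

With labelDataSampleSize at 0.5, roughly half the table (about 70,000 rows) is in play during labeling.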

Posted in Help Zingg

Meriem

Improving Hive Large Table Performance and Managing Imbalanced Samples in Zingg 3.4 Environment

Hey All, I'm currently working with large tables stored in Hive. When using the interactive interface to explore the data, the process takes quite a long time, even though the environment is distributed. Additionally, when I try to use a sample, the dataset tends to be imbalanced, which prevents me from proceeding to the training phase. Do you have any suggestions on how to address this issue? Also, is there a way to bypass the interactive interface and move directly to the training phase?

PS: I’m currently working with Zingg version 3.4.

Thank you in advance.

16 Comments