Mehul B.

Commented on Issues with Zing AI fuzzy matching and training da... · Posted in Help Zingg

Hi Sonal G. and Sania G., I am going through the Zingg codebase, and while the documentation mentions an incremental phase, I can't find anything related to it in the code. Could you please let me know whether the incremental phase is open-sourced?

Mehul B.

Sania G. Thanks for the information. We already have a VM running for our other operations (which we're paying for), and we’re currently running Zingg on the same VM. So, we’re not sure if moving to the Enterprise plan is necessary for our use case. We just need some clarity on whether it’s possible to pass an altered clustered output from a previous run into a new incremental run, and if so, how to go about it.

Mehul B.

Hey Sonal G., we are trying to add incremental data and run the 'runIncremental' phase, but we don't want to use the original output file from the previous run. Instead, we plan to post-process that output by cleaning up some false-positive cluster records and concatenating all the output part files into a single file. We would then place this cleaned, consolidated file at the output path specified in the config, and reference that config in the mergeconfig.json file used for the incremental phase (as described in the docs: https://docs.zingg.ai/latest/stepbystep/runincremental). Is this approach feasible, or do we have to use the original, unaltered output for the incremental phase to work?
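
For reference, the clean-and-concatenate step we have in mind looks roughly like the sketch below. It only covers the file handling and does not show whether the incremental phase accepts the result. It assumes the previous output is a folder of CSV part files named part-*.csv, each with its own header row, and that "cleaning" means dropping every row of the clusters we flagged; the function name and the bad_clusters argument are ours, not part of Zingg.

```python
import csv
import glob
import os


def consolidate_output(parts_dir, out_path, bad_clusters, cluster_col="z_cluster"):
    """Merge every part-*.csv in parts_dir into out_path, skipping rows whose
    cluster id is listed in bad_clusters. Returns the number of rows written.

    Assumptions (not confirmed by the Zingg docs): part files are CSV, each
    has a header row, and the cluster id column is named cluster_col.
    """
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    bad = {str(c) for c in bad_clusters}
    written = 0
    writer = None
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        for path in part_files:
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.DictReader(f)
                if reader.fieldnames is None:
                    continue  # empty part file, nothing to copy
                if writer is None:
                    # Write the header once, taken from the first non-empty part.
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    writer.writeheader()
                for row in reader:
                    if row[cluster_col] in bad:
                        continue  # drop rows of clusters flagged as false positives
                    writer.writerow(row)
                    written += 1
    return written
```

The output file should not itself be named part-*.csv inside parts_dir, otherwise a second run would pick it up as an input part.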

Mehul B.

Hi Sonal G. and Sania G. I'm encountering an issue with data processing. I sent 4,316,587 rows as input to Zingg, but the output only has 4,310,096 rows. I'm aware of a "non-standard CSV formatting" issue, which I resolved by manually adjusting the specific rows that were flagged. Despite this, I'm still missing 6,491 rows. Do you have any insights into what could be causing this discrepancy and how I can fix it?
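
To narrow this down, a check along the following lines lists which input IDs never reach the output and which IDs occur more than once in the input. It is a minimal sketch that assumes both files are CSV with a header and share a unique row-ID column; the column name "id" and both function names are placeholders for illustration. It only locates the affected rows and does not explain why they were dropped.

```python
import csv
from collections import Counter


def read_ids(path, id_col):
    """Return the values of id_col for every row of a CSV file with a header."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[id_col] for row in csv.DictReader(f)]


def compare_ids(input_path, output_path, id_col="id"):
    """Return (missing, duplicated).

    missing: input ids that do not appear in the output, in input order.
    duplicated: ids that occur more than once in the input.
    """
    input_ids = read_ids(input_path, id_col)
    output_ids = set(read_ids(output_path, id_col))
    missing = [i for i in input_ids if i not in output_ids]
    duplicated = [i for i, n in Counter(input_ids).items() if n > 1]
    return missing, duplicated
```

If the number of rows Python reads from the input already differs from 4,316,587, the gap is likely introduced when the file is parsed rather than during matching.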

Mehul B.

"Also check that your training data which is manually created has the right labels and z cluster values."

This is a sample of our training.csv file (it contains a unique ID for each row, the z_cluster ID, and z_match): [Excel screenshot attached below]

But in these logs (screenshot attached below), you can see:

Training on positive pairs – 12693
Training on negative pairs – 27930

In the training file, however, the actual counts for z_ismatch are:

0: 21309
1: 21129

We also have 20K unique clusters. However, we saw the same logs earlier with our small-dataset model, which gave us the correct output.

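A quick way to double-check the file itself is to count labels per row and rows per cluster, as in the sketch below. It assumes training.csv has a header, that rows sharing a z_cluster value form one labelled group, and that the label column is called z_isMatch; these column names are assumptions passed in as parameters, and the function is ours rather than anything from Zingg. It does not reproduce how the training phase derives its pair counts.

```python
import csv
from collections import Counter, defaultdict


def summarize_training(path, cluster_col="z_cluster", match_col="z_isMatch"):
    """Return (label_counts, size_histogram, mixed_clusters) for a training CSV.

    label_counts: rows per label value.
    size_histogram: number of clusters for each cluster size.
    mixed_clusters: cluster ids whose rows carry more than one label.
    """
    label_counts = Counter()
    labels_by_cluster = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            label = row[match_col].strip()
            label_counts[label] += 1
            labels_by_cluster[row[cluster_col]].append(label)
    size_histogram = Counter(len(v) for v in labels_by_cluster.values())
    mixed_clusters = sorted(
        c for c, v in labels_by_cluster.items() if len(set(v)) > 1
    )
    return label_counts, size_histogram, mixed_clusters
```

Clusters with a single row or with conflicting labels would be the first places to look when the row counts and the logged pair counts disagree.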

Mehul B.

Okay, so just to clarify: do we need to re-run the training phase using the same model on the whole dataset, or should we start from scratch by identifying new training data and going through the labeling phase again? Also, if we proceed with either the training or the labeling phase, can we expect to avoid the error we are currently encountering during the match phase?

Mehul B.

Setting spark.local.dir to a location with more disk space may not resolve the issue, because the job is already using 100+ GB of disk space. What we are observing is that as the match phase begins for this volume of data, RAM usage quickly reaches around 28 GB. After that, the system starts using disk space, which then fills up. Once the disk reaches 100% usage, we encounter the error shared with you earlier. At the start of the process, disk usage is minimal (around 3–4%), so the spike during execution is clearly more than 100 GB. Additionally, just to let you know, the model we are using to process 4 million rows in the match phase was trained on a subset of 100k rows.
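
To put a number on the spike, the disk can be sampled while the match phase runs. The sketch below is a generic monitor built on Python's shutil.disk_usage, not anything provided by Zingg or Spark; the function names, the sampling interval, and the path to watch are all placeholders.

```python
import shutil
import time


def used_gb(path):
    """Used space, in GB, on the filesystem that contains path."""
    return shutil.disk_usage(path).used / 1e9


def track_peak(path, samples, interval_s=5.0):
    """Sample disk usage `samples` times and return the highest value seen (GB)."""
    peak = 0.0
    for _ in range(samples):
        peak = max(peak, used_gb(path))
        time.sleep(interval_s)
    return peak
```

Running track_peak on the directory that spark.local.dir points to, and subtracting the usage measured before the job starts, gives a rough estimate of how much scratch space the match phase needs.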

Mehul B.

Hi Sonal G., Stepping in for Sanket here. We have completed labelling approximately 125 records, out of which we found 30 matches. Additionally, we are passing the training file, which contains 21,309 non-match rows and 21,130 match rows.