Sanket G.

Commented on Issues with Zing AI fuzzy matching and training da... · Posted in Help Zingg

Yes, as you can see in the screenshot, each cluster has a different z_cluster value.


Hi Sania G., we do not have any clusters with 3 or more records; each cluster has only 2 records. If those two are the same customer, we have set z_ismatch to 1 for both rows. If they are not the same customer, we have set z_ismatch to 0 for both rows. Is this the correct approach?
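The labelling scheme described above can be checked mechanically. This is a minimal sketch, assuming the labelled rows have been loaded as dictionaries carrying the z_cluster and z_ismatch columns named in this comment; the sample rows and the function name `check_labelled_pairs` are hypothetical. It only tests internal consistency (two rows per cluster, both with the same 0/1 label) and does not settle whether this is the scheme Zingg expects.

```python
from collections import defaultdict


def check_labelled_pairs(rows):
    """Return a list of human-readable problems found in labelled pair rows."""
    labels_by_cluster = defaultdict(list)
    for row in rows:
        labels_by_cluster[str(row["z_cluster"])].append(int(row["z_ismatch"]))

    problems = []
    for cluster_id, labels in sorted(labels_by_cluster.items()):
        if len(labels) != 2:
            problems.append(f"cluster {cluster_id}: expected 2 rows, found {len(labels)}")
        elif labels[0] != labels[1]:
            problems.append(f"cluster {cluster_id}: rows disagree on z_ismatch")
        elif labels[0] not in (0, 1):
            problems.append(f"cluster {cluster_id}: unexpected label {labels[0]}")
    return problems


# Hypothetical sample data: cluster "a" is consistent, "b" disagrees, "c" has one row.
sample_rows = [
    {"z_cluster": "a", "z_ismatch": 1},
    {"z_cluster": "a", "z_ismatch": 1},
    {"z_cluster": "b", "z_ismatch": 1},
    {"z_cluster": "b", "z_ismatch": 0},
    {"z_cluster": "c", "z_ismatch": 0},
]

for problem in check_labelled_pairs(sample_rows):
    print(problem)
```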


Sanket G.

"Is the smaller dataset representative of the larger one in terms of match fields and types?" Yes, Sonal G.


Sanket G.

Hi Sonal G., the reason we went with a smaller dataset for training the model was that we couldn't find any matches in the findTrainingData step when we used the entire dataset. Should we redo the training on the same model, or start over with an entirely new model?


Sanket G.

Hi Sonal G., I switched to 16 cores and 32 GB of RAM as you suggested. But when running the match phase, I see that RAM fills up first, then it starts writing to disk, and after some time the disk fills up too and the program fails. Currently I have 128 GB of disk space. Here are a few of the warnings and errors I am getting:

:~/zingg$ ./scripts/zingg.sh --phase match --conf examples/gs/config.json
WARN PipeUtil: Reading Pipe [name=full_data_cleaned, format=csv, preprocessors=null, props={header=false, location=examples/gs/full_data_cleaned.csv, delimiter=,}]
INFO Matcher: Read 4311050
WARN PipeUtil: Reading Pipe [name=null, format=parquet, preprocessors=null, props={location=models/1300/model/block/zingg.block}]
WARN BlockingTreeUtil: byte array back is [B@1dcc0bb8
WARN MemoryStore: Not enough space to cache rdd_73_57 in memory! (computed 586.2 MiB so far)
WARN BlockManager: Persisting block rdd_73_83 to disk instead.
WARN MemoryStore: Not enough space to cache rdd_73_74 in memory! (computed 4.6 GiB so far)
WARN BlockManager: Putting block rdd_73_63 failed due to exception java.io.IOException: No space left on device.
WARN BlockManager: Block rdd_73_63 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 63.0 in stage 36.0 (TID 943) java.io.IOException: No space left on device
WARN TaskSetManager: Lost task 20.0 in stage 36.0 (TID 900) (gs-testing-zingai.internal.cloudapp.net executor driver): java.io.IOException: No space left on device
Driver stacktrace:
zingg.common.client.ZinggClientException: Job aborted due to stage failure: Task 20 in stage 36.0 failed 1 times, most recent failure: Lost task 20.0 in stage 36.0 (TID 900) (gs-testing-zingai.internal.cloudapp.net executor driver): java.io.IOException: No space left on device

Do I need even more space, or am I doing something wrong?

In zingg.conf I have set the following values:

spark.default.parallelism=120
spark.debug.maxToStringFields=200
spark.driver.memory=24g
spark.executor.memory=24g

Thank you!
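Since the run dies with "No space left on device", a quick pre-flight check of free space on the directory Spark spills to may be useful before launching the match phase. This is a rough sketch, not part of Zingg: the scratch path (the system temp directory, standing in for Spark's spark.local.dir, which defaults to /tmp unless overridden) and the 200 GiB threshold are assumptions for illustration only.

```python
import shutil
import tempfile

GIB = 1024 ** 3


def free_gib(path):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path, required_gib):
    """True if at least `required_gib` GiB are free on the filesystem holding `path`."""
    return free_gib(path) >= required_gib


# Assumption: Spark spills to the system temp directory on this machine.
scratch_dir = tempfile.gettempdir()
# Hypothetical threshold; the space a match run actually needs is not known here.
required_gib = 200

print(f"{scratch_dir}: {free_gib(scratch_dir):.1f} GiB free")
if not has_headroom(scratch_dir, required_gib):
    print(f"Less than {required_gib} GiB free; the match phase may run out of disk.")
```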


Sanket G.

Hi Sonal G., I am now trying to process 5 million records with labelDataSampleSize set to 0.01, and I always run out of memory with this error: java.lang.OutOfMemoryError: Java heap space. Do I need to spin up a different VM with more memory, or are there any parameters I can change? Currently I am working on an EC2 instance with 16 GB of RAM. Thank you!


Sanket G.

The actual number of records is around 80k. I am using a labelDataSampleSize of 0.001 in my config file.
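For a sense of scale, the sample sizes implied by the figures mentioned in these comments can be worked out directly. This is back-of-the-envelope arithmetic, assuming labelDataSampleSize is simply the fraction of input records sampled when looking for training pairs; Zingg's actual sampling may not produce exactly these counts, and the function name is hypothetical.

```python
def approx_sample_size(total_records, label_data_sample_size):
    """Approximate number of records sampled, treating the setting as a plain fraction."""
    if not 0 < label_data_sample_size <= 1:
        raise ValueError("label_data_sample_size must be in (0, 1]")
    return round(total_records * label_data_sample_size)


# Figures taken from the comments: 80k records at 0.001, 5 million records at 0.01.
for total, fraction in [(80_000, 0.001), (5_000_000, 0.01)]:
    print(f"{total} records at {fraction}: about {approx_sample_size(total, fraction)} sampled")
```

Under this assumption, 80k records at 0.001 gives a sample of only about 80 records, while 5 million at 0.01 gives about 50,000.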


Sanket G.

Thank you for the reply, Sonal G.
1) Using exact match in Python, I found that the record count could be reduced by 1.73% after retaining only one entry per duplicate.
2) Later, I used the rapidfuzz library in Python and identified that the record count could be reduced by 5.72%. For this analysis, I only used the name and address lines of individuals within the same zip code (minimum 80% match on name and 60% match on address lines).
But I expect the number not to go higher than 10%. Would using the data from this second analysis as training data for Zingg help me identify more duplicates?
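The second analysis can be sketched roughly as follows. This is not the original script: it swaps in difflib.SequenceMatcher from the standard library in place of rapidfuzz so that it runs without extra installs, so the scores will differ from rapidfuzz's. The record layout, the sample records, and the function names are hypothetical; only the thresholds (80% on name, 60% on address, same zip code) come from the description above.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity score in the range 0-100 (difflib ratio, not a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_pairs(records, name_min=80.0, address_min=60.0):
    """Compare records only within the same zip code and return likely duplicate id pairs."""
    by_zip = defaultdict(list)
    for rec in records:
        by_zip[rec["zip"]].append(rec)

    pairs = []
    for group in by_zip.values():
        for left, right in combinations(group, 2):
            if (similarity(left["name"], right["name"]) >= name_min
                    and similarity(left["address"], right["address"]) >= address_min):
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical records for illustration only.
records = [
    {"id": 1, "name": "John Smith", "address": "12 Main Street", "zip": "10001"},
    {"id": 2, "name": "Jon Smith", "address": "12 Main St", "zip": "10001"},
    {"id": 3, "name": "Maria Lopez", "address": "98 Oak Avenue", "zip": "10001"},
    {"id": 4, "name": "John Smith", "address": "12 Main Street", "zip": "20002"},
]

print(candidate_pairs(records))
```

Records 1 and 4 are identical on name and address but sit in different zip codes, so this approach never compares them; that is one kind of duplicate a zip-restricted analysis cannot surface.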

Posted in Help Zingg

Sanket G.

Issues with Zing AI fuzzy matching and training data generation despite exact match config on province field

Hi Sonal G., I'm currently exploring Zingg and have set it up on a single-machine environment. I'm following the same process you demonstrated in your video, and my config file includes the fields shown below. Using an exact match on name, address lines, and pincode in Python, I was able to identify around 1.73% duplicate records. However, I expected Zingg's fuzzy matching to surface more duplicates. But at the very first step, while generating training data for labeling, I’m not seeing any valid matches. Despite setting "province" to "exact" in the training config, the examples returned are completely unrelated and clearly not matches.

Here’s a snapshot of my current config file for reference:

{
  "fieldDefinition":[
    { "fieldName" : "id", "matchType" : "dont_use", "fields" : "id", "dataType": "int" },
    { "fieldName" : "email", "matchType" : "fuzzy", "fields" : "email", "dataType": "string" },
    { "fieldName" : "name", "matchType" : "fuzzy", "fields" : "name", "dataType": "string" },
    { "fieldName" : "phone", "matchType" : "dont_use", "fields" : "phone", "dataType": "string" },
    { "fieldName" : "country", "matchType" : "exact", "fields" : "country", "dataType": "string" },
    { "fieldName" : "province", "matchType" : "exact", "fields" : "province", "dataType": "string" },
    { "fieldName" : "zip", "matchType" : "fuzzy", "fields" : "zip", "dataType": "string" },
    { "fieldName" : "city", "matchType" : "fuzzy", "fields" : "city", "dataType": "string" },
    { "fieldName" : "address", "matchType" : "fuzzy", "fields" : "address", "dataType": "string" }
  ],

I’ve retried the training data generation and labeling steps multiple times (by running the same commands again and again) with no success. Could you help me understand what I might be missing? Thanks.
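To make the declared match types easier to eyeball, here is a small sketch that groups the nine fields from the config above by matchType. The JSON is closed off after fieldDefinition only so that it parses; the rest of the real config file is not shown in this post. The function name is hypothetical, and the sketch says nothing about how Zingg itself interprets these settings.

```python
import json
from collections import defaultdict

# The fieldDefinition entries from the post; the closing "]}" is added only to make it parse.
CONFIG_SNIPPET = """
{
  "fieldDefinition": [
    {"fieldName": "id", "matchType": "dont_use", "fields": "id", "dataType": "int"},
    {"fieldName": "email", "matchType": "fuzzy", "fields": "email", "dataType": "string"},
    {"fieldName": "name", "matchType": "fuzzy", "fields": "name", "dataType": "string"},
    {"fieldName": "phone", "matchType": "dont_use", "fields": "phone", "dataType": "string"},
    {"fieldName": "country", "matchType": "exact", "fields": "country", "dataType": "string"},
    {"fieldName": "province", "matchType": "exact", "fields": "province", "dataType": "string"},
    {"fieldName": "zip", "matchType": "fuzzy", "fields": "zip", "dataType": "string"},
    {"fieldName": "city", "matchType": "fuzzy", "fields": "city", "dataType": "string"},
    {"fieldName": "address", "matchType": "fuzzy", "fields": "address", "dataType": "string"}
  ]
}
"""


def fields_by_match_type(config_text):
    """Group field names by their declared matchType, preserving config order."""
    config = json.loads(config_text)
    grouped = defaultdict(list)
    for field in config["fieldDefinition"]:
        grouped[field["matchType"]].append(field["fieldName"])
    return dict(grouped)


for match_type, names in fields_by_match_type(CONFIG_SNIPPET).items():
    print(f"{match_type}: {', '.join(names)}")
```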

38 Comments