Yes, as you can see in the screenshot, each cluster has a different z_cluster value.
Hi Sania G., we do not have any clusters with 3 or more records. Each cluster has only 2 records. If those two are the same customer, we have set z_ismatch to 1 for both rows. If those two are not the same customer, we have set z_ismatch to 0 for both rows. Is this the correct approach?
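For reference, the labelling described above could be sketched roughly as follows. This is only an illustration of the approach in the message, not Zingg's own code: the `label_pairs` helper, the `same_customer` mapping, and the sample rows are hypothetical; only the `z_cluster` and `z_ismatch` column names come from the thread.

```python
from collections import defaultdict


def label_pairs(records, same_customer):
    """Set z_ismatch on both rows of every two-record cluster.

    records: list of dicts, each with a "z_cluster" key.
    same_customer: dict mapping z_cluster -> bool (hypothetical reviewer decision).
    """
    # Group the rows by their cluster id.
    by_cluster = defaultdict(list)
    for row in records:
        by_cluster[row["z_cluster"]].append(row)

    labelled = []
    for cluster_id, rows in by_cluster.items():
        # The thread states that every cluster holds exactly two records.
        if len(rows) != 2:
            raise ValueError(f"cluster {cluster_id} has {len(rows)} rows, expected 2")
        flag = 1 if same_customer[cluster_id] else 0
        for row in rows:
            # Both rows of the pair receive the same flag.
            labelled.append({**row, "z_ismatch": flag})
    return labelled


# Hypothetical sample data for illustration only.
records = [
    {"z_cluster": "a", "name": "John Smith"},
    {"z_cluster": "a", "name": "Jon Smith"},
    {"z_cluster": "b", "name": "Mary Jones"},
    {"z_cluster": "b", "name": "Peter Brown"},
]
labelled = label_pairs(records, {"a": True, "b": False})
```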
Hi Sonal G., I switched to 16 cores and 32 GB of RAM as you suggested. But when running the match phase, I see that RAM fills up first, then it starts writing to disk, and after some time the disk also fills up and the program fails. Currently I have 128 GB of disk space. Here are a few of the warnings and errors I am getting:
:~/zingg$ ./scripts/zingg.sh --phase match --conf examples/gs/config.json
WARN PipeUtil: Reading Pipe [name=full_data_cleaned, format=csv, preprocessors=null, props={header=false, location=examples/gs/full_data_cleaned.csv, delimiter=,}]
INFO Matcher: Read 4311050
WARN PipeUtil: Reading Pipe [name=null, format=parquet, preprocessors=null, props={location=models/1300/model/block/zingg.block}]
WARN BlockingTreeUtil: byte array back is [B@1dcc0bb8
WARN MemoryStore: Not enough space to cache rdd_73_57 in memory! (computed 586.2 MiB so far)
WARN BlockManager: Persisting block rdd_73_83 to disk instead.
WARN MemoryStore: Not enough space to cache rdd_73_74 in memory! (computed 4.6 GiB so far)
WARN BlockManager: Putting block rdd_73_63 failed due to exception java.io.IOException: No space left on device.
WARN BlockManager: Block rdd_73_63 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 63.0 in stage 36.0 (TID 943)
java.io.IOException: No space left on device
WARN TaskSetManager: Lost task 20.0 in stage 36.0 (TID 900) (gs-testing-zingai.internal.cloudapp.net executor driver): java.io.IOException: No space left on device
Driver stacktrace:
zingg.common.client.ZinggClientException: Job aborted due to stage failure: Task 20 in stage 36.0 failed 1 times, most recent failure: Lost task 20.0 in stage 36.0 (TID 900) (gs-testing-zingai.internal.cloudapp.net executor driver): java.io.IOException: No space left on device
Do I need even more disk space, or am I doing something wrong?
In zingg.conf I have set the following values:
spark.default.parallelism=120
spark.debug.maxToStringFields=200
spark.driver.memory=24g
spark.executor.memory=24g
Thank you!
Hi Sonal G., I am now trying to process 5 million records with labelDataSampleSize set to 0.01, and I always run out of memory with this error:
java.lang.OutOfMemoryError: Java heap space
Do I need to spin up a different VM with more memory, or are there any parameters I can change? Currently I am working on an EC2 instance with 16 GB of RAM. Thank you!
The actual number of records is around 80k. I am using a labelDataSampleSize of 0.001 in my config file.
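As a rough sanity check on these numbers, the sketch below multiplies the record count by labelDataSampleSize. It assumes that labelDataSampleSize is simply the fraction of input rows that gets sampled; the exact sampling Zingg performs may differ, and the `approx_sample_rows` helper is hypothetical.

```python
def approx_sample_rows(record_count, label_data_sample_size):
    """Approximate number of rows sampled, assuming a plain fraction of the input."""
    return round(record_count * label_data_sample_size)


# The two combinations mentioned in this thread.
small_run = approx_sample_rows(80_000, 0.001)     # 80k records at 0.001
large_run = approx_sample_rows(5_000_000, 0.01)   # 5 million records at 0.01
print(small_run, large_run)
```

Under that assumption, the 80k run samples on the order of 80 rows, while the 5 million run samples on the order of 50,000.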
Thank you for the reply, Sonal G.
1) Using exact match in Python, I found that the record count could be reduced by 1.73% by retaining only one entry per duplicate.
2) Later, I used the rapidfuzz library in Python and found that the record count could be reduced by 5.72%. For this analysis, I only compared the names and address lines of individuals within the same zip code (minimum 80% match on name and 60% match on address lines).
But I expect the number not to go higher than 10%. Would using the data from this second analysis as training data for Zingg help me identify more duplicates?
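The second analysis could look roughly like the sketch below. To keep it dependency-free, it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a rapidfuzz scorer, so the scores will not be identical to the ones behind the 5.72% figure. The `candidate_duplicates` helper, the field names, and the sample rows are hypothetical; only the same-zip grouping and the 80% / 60% thresholds come from the message.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, name_min=80.0, address_min=60.0):
    """Return id pairs that share a zip code and pass both thresholds."""
    # Only records within the same zip code are compared with each other.
    by_zip = defaultdict(list)
    for rec in records:
        by_zip[rec["zip"]].append(rec)

    pairs = []
    for group in by_zip.values():
        for left, right in combinations(group, 2):
            name_ok = similarity(left["name"], right["name"]) >= name_min
            address_ok = similarity(left["address"], right["address"]) >= address_min
            if name_ok and address_ok:
                pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical sample data for illustration only.
records = [
    {"id": 1, "name": "John Smith", "address": "12 Main Street", "zip": "10001"},
    {"id": 2, "name": "Jon Smith", "address": "12 Main St", "zip": "10001"},
    {"id": 3, "name": "John Smith", "address": "12 Main Street", "zip": "20002"},
    {"id": 4, "name": "Mary Jones", "address": "7 Oak Avenue", "zip": "10001"},
]
pairs = candidate_duplicates(records)
print(pairs)
```

Comparing every pair inside a zip code is quadratic in the size of each group, which is manageable for small groups but can become slow when one zip code holds many records.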
Hi Sonal G., I'm currently exploring Zingg AI and have set it up in a single-machine environment. I'm following the same process you demonstrated in your video, and my config file includes the fields shown below. Using an exact match on name, address lines, and pincode, I was able to identify around 1.73% duplicate records in Python. However, I expected Zingg's fuzzy matching to surface more duplicates. But at the very first step, while generating training data for labeling, I'm not seeing any valid matches. Despite setting the matchType of "province" to "exact" in the config, the examples returned are completely unrelated and clearly not matches. Here's a snapshot of my current config file for reference:
{
  "fieldDefinition":[
    { "fieldName" : "id", "matchType" : "dont_use", "fields" : "id", "dataType": "int" },
    { "fieldName" : "email", "matchType" : "fuzzy", "fields" : "email", "dataType": "string" },
    { "fieldName" : "name", "matchType" : "fuzzy", "fields" : "name", "dataType": "string" },
    { "fieldName" : "phone", "matchType" : "dont_use", "fields" : "phone", "dataType": "string" },
    { "fieldName" : "country", "matchType" : "exact", "fields" : "country", "dataType": "string" },
    { "fieldName" : "province", "matchType" : "exact", "fields" : "province", "dataType": "string" },
    { "fieldName" : "zip", "matchType" : "fuzzy", "fields" : "zip", "dataType": "string" },
    { "fieldName" : "city", "matchType" : "fuzzy", "fields" : "city", "dataType": "string" },
    { "fieldName" : "address", "matchType" : "fuzzy", "fields" : "address", "dataType": "string" }
  ],
I've retried the training data generation and labeling steps multiple times (by running the same commands again and again) with no success. Could you help me understand what I might be missing? Thanks.