Hey, we are trying to migrate from AWS Glue Record Matching to Zingg. When we were using Glue, we trained a model with a small set of labels, but then we used negative enforcement labels when running FindMatches to make sure some records were never clustered together. Is it possible to do something like this in Zingg? Is it possible to send positive/negative enforcement labels when finding matches/clusters? We have more than 10M labels to ensure the right items are clustered together or separated depending on the case. Would it be feasible to train a model with that amount of labels? Thank you!
Hi Alex, that's a great question! The capability of negative and positive enforcement is available in the enterprise version but not in the community version. The 10M-label dataset sounds wonderful. We have never tried training Zingg on such large data, but there is nothing in the algorithms that should stop or fail at these sizes. The labelDataSampleSize may need some adjustments, and I will be curious to hear how it goes for you. HTH
Thanks! We tried to use the 10M labels in Glue with 64 G.8X workers but it failed after 10min:
An error occurred while calling o166.execute. Job aborted due to stage failure: ResultStage 51 (collectAsList at BlockingTreeUtil.java:52) has failed the maximum allowable number of times: 4
We tried with a much smaller set of labels, around 70k, using 32 G.4X workers and it was still running after 11h, so we stopped it. Our label dataset contains multiple labels per cluster. Would that be an issue? Should we have our labels organized in pairs?
Hi Alex, I'd suggest tweaking the number of Spark partitions and the label data sample size if you haven't already: https://docs.zingg.ai/zingg0.4.0/stepbystep/configuration/tuning-label-match-and-link-jobs
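For anyone following along, here is a minimal sketch of what that tweak could look like, assuming the job reads a Zingg JSON config with `numPartitions` and `labelDataSampleSize` fields. The starting config, the `tune` helper, the range check, and the values passed in are all illustrative assumptions, not recommendations from this thread; the linked tuning page is the place to pick real values.

```python
import json

# Hypothetical starting config; only the two tuning fields matter here.
config = {
    "modelId": 100,
    "zinggDir": "models",
    "numPartitions": 4,
    "labelDataSampleSize": 0.5,
}


def tune(config, num_partitions, label_data_sample_size):
    """Return a copy of the config with the two tuning fields replaced."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Assumption: the sample size is a fraction of the input data.
    if not 0 < label_data_sample_size <= 1:
        raise ValueError("label_data_sample_size must be in (0, 1]")
    tuned = dict(config)  # shallow copy, the original stays untouched
    tuned["numPartitions"] = num_partitions
    tuned["labelDataSampleSize"] = label_data_sample_size
    return tuned


# Illustrative values only: more partitions, a much smaller sample.
tuned = tune(config, num_partitions=512, label_data_sample_size=0.01)
print(json.dumps(tuned, indent=2))
```

The printed JSON can be saved as the config file passed to the Zingg job; the rest of the config (field definitions, data sources, output) is left out here.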
thank you! we are going to try
The error above indicates an issue with the blocking model creation. As suggested by Aniello G., you can try reducing the labelDataSampleSize and/or increasing the config of the driver machine. Zingg does not need a GPU unless you do semantic matching, so you may want to recheck the cluster and assign cores appropriately.
Thank you, we tried tuning the labelDataSampleSize but it didn't help. We are interested in the feature of enforcing matches that you mentioned is available in the enterprise version. When could we meet to discuss our options?
thank you, scheduled some time!
Perfect, looking forward to meeting you
this is how the metrics look for the glue job that we ran to train the model with 70k labels, it's close to 0% CPU usage all the time
the blocking model does not get trained on a cluster due to the iterative structure of the algorithms, so I suspect that's what is not working. we recently ran a test training on 50k samples locally and had good results, but it was the enterprise version, where the blocking model is more advanced, so maybe that led to the success there.