Thanks for your message, Aniello G. I have been meaning to update this document. Are there any specific doubts or questions you have that I can answer and also add to this document for other users? I also see you referred to the 0.3.4 release, but 0.4.0 has been out for a while and it would be better to use it.
Thanks for your reply. So in terms of version, we’re using your library within a Glue 4.0 job, which doesn’t support the Spark version you’re based on in 0.4.0 as far as I understand. We couldn’t get it to work unless we moved to EMR, which we’d prefer not to do in order to minimise the amount of redesign involved.
In terms of questions, it’s more about gaining a general understanding of how the min and max scores relate to each other within a group, as it becomes a bit confusing in scenarios where you have more than 2 or 3 records per matched group. The ultimate goal is, of course, to formulate an effective deduplication logic for our needs, but we wanted to ensure we fully understand the min and max scores before moving forward with development.
I see. I did not know about the incompatibility between 0.4.0 and Glue. Will check that. Here is a bit more info on the scores.
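To make the min/max question concrete for groups with more than two records, here is a small sketch. It assumes (my reading of the Zingg output docs, not something confirmed in this thread) that z_minScore is the lowest score a record got against any other record in its cluster and z_maxScore is the highest. The record names, the pairwise scores and the helper `min_max_scores` are made up for illustration.

```python
from collections import defaultdict


def min_max_scores(pair_scores):
    """Return {record: (min_score, max_score)} over the pairs each record appears in."""
    seen = defaultdict(list)
    for (a, b), score in pair_scores.items():
        # Each pairwise score counts for both records in the pair.
        seen[a].append(score)
        seen[b].append(score)
    return {rec: (min(vals), max(vals)) for rec, vals in seen.items()}


# Hypothetical pairwise scores inside one matched group of three records.
pairs = {("A", "B"): 0.95, ("B", "C"): 0.80, ("A", "C"): 0.40}
scores = min_max_scores(pairs)

for rec in sorted(scores):
    print(rec, scores[rec])
```

Under that assumption, B sits firmly in the group (min 0.80), while A and C both carry the weak A-C pair as their minimum. A low minimum next to a high maximum is the kind of record a deduplication rule might send for review rather than merge automatically.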
Thanks! If we can get 0.4.0 to work on Glue 4.0, that would be great, especially since we also need to run it incrementally over daily deltas. The linking phase of 0.3.4 requires the two datasets to be duplicates-free, meaning we need to run an earlier step to remove duplicates from the delta before linking, as far as I understand, which isn’t necessary with FindIncrementalMatches. Am I right in thinking that 0.4.0’s incremental feature handles everything natively, similar to AWS’s FindIncrementalMatches?
Let us check the issue with 0.4.0 and Glue and come back. Nitish J., can you please take a look here? incrementalRun is an Enterprise-only feature and is not freely available in Zingg open source. It matches/links previously matched records against new, updated and deleted records and manages the persistent ZINGG_ID. You can read more about it at https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow
Yeah, I appreciate it’s an Enterprise feature, but we would consider it if it proves to better fit our needs. I feel like the linking phase could potentially work as a workaround for our incremental needs, but it would require some adaptations and additional work to make it fit. Meantime I’ll wait for your update on Glue. Thanks again.
Sounds good. To add to the discussion: link and incrementalRun are not replacements for each other. incrementalRun takes care of complex cluster management when clusters break and merge. Nevertheless, link is a great way to quickly look up one data source against the other. What is your reason for moving away from AWS FindMatches? Have you trained a model yet? What kind of data and data size do you have? If you face trouble running the Zingg job, we have an upcoming release which can help verify the blocking and help ensure the load is balanced. Let us know!
The reason is that AWS is discontinuing support for FM due to the low number of users, so we’re forced to migrate elsewhere. They suggested various alternatives, and Zingg seems to be a good candidate for us. Yes, we have a trained model with a few million contacts and we have been using FM/FIM in production for a year and a half. By the way, just as an FYI, AWS recently wrote a blog post about your tool based on our input 🙂 https://aws.amazon.com/blogs/big-data/entity-resolution-and-fuzzy-matches-in-aws-glue-using-the-zingg-open-source-library/
Hi Aniello G., can you please let us know the exact issue you are facing with Zingg 0.4.0 on Glue?
Hi Nitish J., sorry, I can’t recall the exact issue, but it seemed related to the Spark version, as 0.4.0 is compiled for Spark 3.5.0, whereas Glue 4.0 supports 3.3.0. I don’t think it will work unless we can recompile your package for this version. Could you help me with that, please?
I do recall, though, another issue with 0.4.0 where it couldn’t parse the config file correctly due to a syntax error, while the same file worked fine with 0.3.4. I’m not entirely sure if that was still related to the Spark libraries incompatibility or something else, but I thought it was worth mentioning…
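For anyone else hitting the Spark mismatch mentioned above, a quick pre-flight check can flag it before the job runs. This is only a sketch: the two version strings are the ones quoted in this thread, the helper names are hypothetical, and treating "same major.minor" as compatible is a rough heuristic rather than a guarantee.

```python
def major_minor(version):
    """Parse '3.3.0' into (3, 3); any patch level or suffix is ignored."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_spark_line(runtime_version, built_for_version):
    """Heuristic: treat builds as compatible only when major.minor match."""
    return major_minor(runtime_version) == major_minor(built_for_version)


# Versions as quoted in this thread (not independently verified here).
glue_4_spark = "3.3.0"     # Spark version reported for Glue 4.0
zingg_040_spark = "3.5.0"  # Spark version 0.4.0 is said to be compiled for

compatible = same_spark_line(glue_4_spark, zingg_040_spark)
print("compatible:", compatible)
```

Inside a Glue job, the runtime value could be read from `spark.version` on the active SparkSession instead of being hard-coded.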
Hi Aniello G., sure! We can provide you with a 0.4.0 jar compiled against Spark 3.3.x, if that would help.