Aniello G.

Commented on How to Diagnose Zingg Matching Phase Errors When R... · Posted in Help Zingg

Aniello G.

Sorry, I don't have a quick answer on that as we're running Zingg in Glue instead.

Commented on Can We Dynamically Change MatchType Without Retrai... · Posted in Help Zingg

Aniello G.

Thanks, I can confirm that fixed our issue.

Commented on Best Practice for Grouping No Match Entries in Lab... · Posted in Help Zingg

Aniello G.

Ok, I’ll go for the former then. Thanks for confirming!

Posted in Help Zingg

Aniello G.

Best Practice for Grouping No Match Entries in Labeling Files for Effective Training

Hi all, I have a question about the labelling file. Regarding the no match entries, is it better to group them into similar pairs that don’t match, for example:

z_cluster,z_isMatch,firstname,lastname,jobtitle,..
123,0,James,Sallivan,IT Specialist,..
123,0,Anthony,Sallivan,Data Engineer,..
456,0,Frank,Williams,Project Manager,..
456,0,Franco,William,Sr Project Manager,..

or is it also fine to have larger no match groups within the same z_cluster, since none of them match anyway, for example:

z_cluster,z_isMatch,firstname,lastname,jobtitle,..
123,0,James,Sallivan,IT Specialist,..
123,0,Anthony,Sallivan,Data Engineer,..
123,0,Frank,Williams,Project Manager,..
123,0,Franco,William,Sr Project Manager,..

Which of the two options would make the training more effective? Intuitively, I would go for the first option, but I’m wondering just in case... Thanks.
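To make the difference between the two layouts concrete, here is a minimal sketch that groups labelling rows by z_cluster and flags any group that does not hold exactly two rows. It assumes each z_cluster is meant to represent a single candidate pair, which is how the first option is laid out; that is an assumption, not something confirmed here. The helper names `summarize_clusters` and `oversized_clusters` are hypothetical, and the elided trailing columns (`,..`) are dropped from the sample rows.

```python
import csv
import io
from collections import defaultdict


def summarize_clusters(csv_text):
    """Group labelling rows by z_cluster and collect each group's z_isMatch values."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["z_cluster"]].append(row["z_isMatch"])
    return dict(groups)


def oversized_clusters(csv_text, expected_size=2):
    """Return z_cluster ids whose row count differs from expected_size.

    expected_size=2 encodes the assumption that one z_cluster is one pair.
    """
    summary = summarize_clusters(csv_text)
    return sorted(cid for cid, labels in summary.items() if len(labels) != expected_size)


# Option 1: one pair of similar, non-matching records per z_cluster.
option_1 = """z_cluster,z_isMatch,firstname,lastname,jobtitle
123,0,James,Sallivan,IT Specialist
123,0,Anthony,Sallivan,Data Engineer
456,0,Frank,Williams,Project Manager
456,0,Franco,William,Sr Project Manager
"""

# Option 2: all non-matching records lumped into a single z_cluster.
option_2 = """z_cluster,z_isMatch,firstname,lastname,jobtitle
123,0,James,Sallivan,IT Specialist
123,0,Anthony,Sallivan,Data Engineer
123,0,Frank,Williams,Project Manager
123,0,Franco,William,Sr Project Manager
"""

print(oversized_clusters(option_1))  # []
print(oversized_clusters(option_2))  # ['123']
```

A check like this only reports the shape of the file; it says nothing about which layout trains better.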

2 Comments

Commented on Migrating from AWS Glue Record Matching to Zingg:... · Posted in Help Zingg

Aniello G.

Hi Alex, I’d suggest tweaking the number of Spark partitions and the label data sample size if you haven't already: https://docs.zingg.ai/zingg0.4.0/stepbystep/configuration/tuning-label-match-and-link-jobs
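As a rough illustration of those two knobs, the sketch below sets them on a config dictionary before it is written out as JSON. It assumes the settings are the `numPartitions` and `labelDataSampleSize` keys of the Zingg JSON config; the helper `tune_config`, the other keys shown, and all the values are placeholders for illustration, not recommendations.

```python
import json


def tune_config(config, num_partitions, label_sample_size):
    """Return a copy of a Zingg-style config dict with the two tuning values set."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be a positive integer")
    if not 0 < label_sample_size <= 1:
        raise ValueError("label_sample_size must be a fraction in (0, 1]")
    tuned = dict(config)  # shallow copy so the original config is left untouched
    tuned["numPartitions"] = num_partitions
    tuned["labelDataSampleSize"] = label_sample_size
    return tuned


# Placeholder config; a real one would also carry the data and field definitions.
base = {
    "modelId": 100,
    "zinggDir": "models",
    "numPartitions": 4,
    "labelDataSampleSize": 0.5,
}

# Illustrative values only; the right numbers depend on cluster size and data volume.
tuned = tune_config(base, num_partitions=200, label_sample_size=0.05)
print(json.dumps(tuned, indent=2))
```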

Posted in Help Zingg

Aniello G.

Issues with Zingg 0.4.0 Pipes on AWS Glue 5.0 After TrainMatch Phase Causing SparkSession Failures

Hi all, we are attempting to run Zingg 0.4.0 on the newly released AWS Glue 5.0, but we are running into issues with the Zingg pipes, specifically after the trainMatch phase, when accessing the output pipe. The logs indicate that, with large datasets, the workers fail to communicate with the driver, causing the SparkSession to terminate. We have tried fine-tuning various Spark parameters without success so far. This looks related to the Spark setup, although the same issues did not occur on Glue 4.0 when using the package compiled for Spark 3.3.x. It would be great to hear if anyone in the community has faced similar issues with Glue 5.0 and managed to resolve them. Thank you.

0 Comments

Commented on Request for Feedback on Documentation and Website... · Posted in General

Aniello G.

FYI, I was trying to run the recommend phase to generate some stopwords and realised it requires specifying one column at a time using --column, rather than --columns <...> as indicated in your documentation: https://docs.zingg.ai/zingg0.4.0/improving-accuracy/stopwordsremoval
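In practice that means one invocation per column. A minimal sketch of building those invocations is below; only the `--column` flag comes from what I observed, while the script path `./scripts/zingg.sh`, the `--phase` and `--conf` arguments, the config file name, and the column names are assumptions for illustration. The helper `recommend_commands` is hypothetical.

```python
import shlex


def recommend_commands(columns, conf_path, script="./scripts/zingg.sh"):
    """Build one recommend-phase argument list per column.

    The script path and the --phase/--conf arguments are assumed, not verified here.
    """
    return [
        [script, "--phase", "recommend", "--conf", conf_path, "--column", column]
        for column in columns
    ]


# Hypothetical column names and config path.
commands = recommend_commands(["firstname", "lastname"], "config.json")
for argv in commands:
    # Print the command line; pass argv to subprocess.run to actually execute it.
    print(shlex.join(argv))
```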

Commented on Updating the Document with Answers and Recommendin... · Posted in Help Zingg

Aniello G.

I resolved it by converting the resulting Dataset<Row> into a PySpark df...

Commented on Updating the Document with Answers and Recommendin... · Posted in Help Zingg

Aniello G.

Thank you both. One more thing: this is failing when accessing the df.

df_output=InMemoryPipe("df_output")
args.setOutput(df_output)
...
df_output.getDataset().show()

I can't find an example for it, so I’m trying to reverse-engineer your source code, but perhaps you have a straightforward answer for me 🙂

TypeError: JavaMember.__call__() got an unexpected keyword argument 'n'

Commented on Updating the Document with Answers and Recommendin... · Posted in Help Zingg

Aniello G.

It finally worked 🎉 I'm now trying to save the output to an InMemoryPipe and call .getDataset() from there...