Best Practice for Grouping No Match Entries in Labeling Files for Effective Training

Jan 22, 2025 12:48 PM

Hi all, I have a question about the labeling file. For the no-match entries, is it better to group them into pairs of similar records that don’t match, for example:

z_cluster,z_isMatch,firstname,lastname,jobtitle,..
123,0,James,Sallivan,IT Specialist,..
123,0,Anthony,Sallivan,Data Engineer,..
456,0,Frank,Williams,Project Manager,..
456,0,Franco,William,Sr Project Manager,..

Or is it also fine to put larger no-match groups under the same z_cluster, since none of them match anyway, for example:

z_cluster,z_isMatch,firstname,lastname,jobtitle,..
123,0,James,Sallivan,IT Specialist,..
123,0,Anthony,Sallivan,Data Engineer,..
123,0,Frank,Williams,Project Manager,..
123,0,Franco,William,Sr Project Manager,..

Which of the two options would make training more effective? Intuitively, I would go for the first option, but I wanted to check before labeling more data. Thanks.
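
To make the difference concrete, here is a small Python sketch of the record pairs each layout would imply. It rests on an assumption I have not verified: that every pair of rows sharing a z_cluster is read as one labeled pair. The helper name implied_pairs is my own and not part of any tool, and the sample data uses only the columns shown above (the elided ones are left out).

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Option 1: each z_cluster holds exactly one non-matching pair.
OPTION_1 = """z_cluster,z_isMatch,firstname,lastname,jobtitle
123,0,James,Sallivan,IT Specialist
123,0,Anthony,Sallivan,Data Engineer
456,0,Frank,Williams,Project Manager
456,0,Franco,William,Sr Project Manager
"""

# Option 2: all four non-matching records share one z_cluster.
OPTION_2 = """z_cluster,z_isMatch,firstname,lastname,jobtitle
123,0,James,Sallivan,IT Specialist
123,0,Anthony,Sallivan,Data Engineer
123,0,Frank,Williams,Project Manager
123,0,Franco,William,Sr Project Manager
"""


def implied_pairs(csv_text):
    """Return every pair of names that share a z_cluster.

    ASSUMPTION: all rows within one z_cluster are compared pairwise.
    This is a hypothetical reading of the file, not confirmed behavior.
    """
    clusters = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = f'{row["firstname"]} {row["lastname"]}'
        clusters[row["z_cluster"]].append(name)

    pairs = []
    for names in clusters.values():
        pairs.extend(combinations(names, 2))
    return pairs


if __name__ == "__main__":
    for label, text in (("Option 1", OPTION_1), ("Option 2", OPTION_2)):
        pairs = implied_pairs(text)
        print(label, "implies", len(pairs), "pair(s):")
        for left, right in pairs:
            print("  ", left, "<->", right)
```

Under that assumption, the first layout yields only the two near-duplicate pairs I actually want to label, while the second also adds pairs such as James Sallivan and Frank Williams, which are trivially different and probably tell the model little about borderline cases.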
