Hi all, I have a question about using MatchType.EXACT. Suppose I want to apply an exact match on a company domain (e.g. slack.com), but if a company doesn’t have a domain, I’d still like it to be considered as a potential match for Slack. Ideally, the domain would simply be ignored as a matching criterion in such cases. Is it essentially the case that when the domain matches exactly, it would result in an instant match (or a significant boost to the match score), whereas if the domain does not match, or if one record has a null domain, the domain field would simply be ignored, and the match would rely entirely on the other attributes? Is there a way to configure this behaviour?
Zingg’s behaviour comes from the training and the configuration. Let’s take a hypothetical case. Say you define an attribute, for example country, as EXACT, it has no null values, and you label records as matches whenever only this field matches, even if all the others don’t. In that case, Zingg will learn to match on it, and all the records of that country would match up, even if they were different people. In real cases, the labelling would be done in accordance with the entity. Zingg looks at all the attributes in the field definitions, and the match types for all fields are considered in conjunction. By default, null is signalled as a possible match for that field.
In deterministic matching on the Zingg Enterprise product, Zingg treats records as matches if they fulfil the match criteria, irrespective of what may be present in the other columns.
There is also the NULL_OR_BLANK match type, which you can use to signal to Zingg to handle null values differently.
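To make the idea of handling nulls separately from mismatches concrete, here is a toy sketch in plain Python. It is not Zingg's implementation and does not use Zingg's API; the function names and the case-insensitive comparison are assumptions made purely for illustration. The point is that a pair can carry two signals for one field: whether the values agree exactly, and whether either value was missing, so that a learned model is free to weigh "missing" differently from "different".

```python
from typing import Optional, Tuple


def is_blank(value: Optional[str]) -> bool:
    """True when the value is None, empty, or whitespace only."""
    return value is None or value.strip() == ""


def exact_with_null_flag(a: Optional[str], b: Optional[str]) -> Tuple[float, float]:
    """Return (exact_score, null_flag) for one field of a record pair.

    exact_score is 1.0 only when both values are present and equal.
    null_flag is 1.0 when at least one value is missing, so a missing
    domain is distinguishable from a domain that actually differs.
    """
    if is_blank(a) or is_blank(b):
        return (0.0, 1.0)
    # Assumption for this sketch: compare trimmed, lower-cased values.
    same = a.strip().lower() == b.strip().lower()
    return (1.0 if same else 0.0, 0.0)
```

With two signals like these, a pair such as ("slack.com", null) is not penalised the way ("slack.com", "salesforce.com") would be, and the decision falls back on the remaining attributes.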
Thanks Sonal G. for your help! I had another quick question. If I use different pipes for different data sources rather than joining them into one file, will I be unable to link records from within a single data source? Does Zingg assume no entities need to be resolved within a single pipe? What is the advantage of doing one or the other?
No, Zingg does not assume that records within a source should not be matched. If you can tell me a bit more about your data and requirements, I can try to suggest something.
Currently, I have three data sources of firmographic data for this POC. If this approach shows promise, we plan to add more data sources in the future. For now, I’ve concatenated all three data sources into a single file with a unified schema, resulting in approximately 6 million records (only for Australia). I currently have one input pipe, but modifying the load step to accommodate three separate pipes wouldn’t be an issue. These are the current match fields and strategies I’m aiming for:
source_id = FieldDefinition("source_id", "string", MatchType.DONT_USE)
name = FieldDefinition("name", "string", MatchType.FUZZY)
domain = FieldDefinition("domain", "string", MatchType.EXACT)
country = FieldDefinition("country", "string", MatchType.DONT_USE)
jurisdiction_code = FieldDefinition("jurisdiction_code", "string", MatchType.DONT_USE)
location = FieldDefinition("location", "string", MatchType.FUZZY)
postcode = FieldDefinition("postcode", "string", MatchType.FUZZY)
latitude = FieldDefinition("latitude", "double", MatchType.DONT_USE)
longitude = FieldDefinition("longitude", "double", MatchType.DONT_USE)
source = FieldDefinition("source", "string", MatchType.DONT_USE)
registration_number = FieldDefinition("registration_number", "string", MatchType.EXACT)
FirmId = FieldDefinition("FirmId", "string", MatchType.DONT_USE)
# Put these in order from most important to least important! Zingg prioritizes them in that order.
field_defs = [
    registration_number,
    name,
    domain,
    country,
    postcode,
    location,
    jurisdiction_code,
    latitude,
    longitude,
    source,
    source_id,
    FirmId,
]
To prioritise fields, I’ve ordered them from most important to least important based on our current matching rules, although perhaps having e.g. exact registration number as the first field wouldn't be the best blocking strategy. I’m not currently using latitude and longitude as I don’t believe the numeric MatchType is a good fit. However, I’m open to suggestions if there’s a better way to leverage these fields effectively. I was also curious if "balancing the data" by including some known matches (for example, reducing the 6 million records to 2 million, where approximately 800,000 are known matches) would be an effective strategy?
The config looks good to me. It will find all the matches within the 6m, irrespective of source.
Regarding the 800k, are they identical records, with all attributes exactly the same? If so, you may choose to send only one copy to Zingg. However, if there are variations, it is better to send everything so that all possible matches with other records can be found.
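If some records really are identical on every attribute used for matching, keeping one copy could be done in a pre-processing step along these lines. This is a minimal sketch in plain Python rather than anything Zingg provides; `drop_exact_duplicates` and the sample field names are hypothetical, and records with any variation are deliberately left in.

```python
from typing import Any, Dict, Iterable, List, Sequence


def drop_exact_duplicates(
    records: Iterable[Dict[str, Any]], fields: Sequence[str]
) -> List[Dict[str, Any]]:
    """Keep the first record for each distinct combination of `fields`.

    Only records that agree on every listed field are collapsed; records
    that vary in any of them are all kept so they can still be matched.
    """
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(f) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Note that collapsing copies this way discards the identifiers (such as source_id) of the dropped rows, so a mapping back to them would need to be kept separately if it matters downstream.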
So is it better to have one main pipe rather than three separate ones in this case, or is there no difference? No, the 800k are not the same records, but they are records that have been matched previously with our existing matching logic, so we are fairly confident that they are matches. Some will have matched on e.g. name and domain, or on exact registration number. And on the lat/long question, do you know of a way to accommodate this, other than writing custom code? Thanks!
Something I've noticed today while labelling the matches is that one of the three sources (the biggest one) hasn't shown up at all during the labelling stage. What could be the reason for that? And is it something that can be fixed with either the sample size or the order of the blocking strategies?
You can try changing the labelDataSampleSize to force it to look at a different sample.
There is no difference between using many pipes and using one during the match phase. Generally, if your data is not all in the same place, it’s more convenient to have multiple pipes, but all of them will be unified.
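As a rough sketch of what three separate input pipes plus a changed label sample size might look like with the Zingg Python API: the pipe names, file paths, and the sample-size value below are hypothetical, the schema string is assembled from the field list earlier in this thread, and the snippet assumes it is run through Zingg's own launcher so that a Spark session is available. Treat it as a configuration fragment to adapt, not a tested setup.

```python
from zingg.client import Arguments
from zingg.pipes import CsvPipe

# Unified schema shared by all three sources (taken from the field list above).
schema = (
    "source_id string, name string, domain string, country string, "
    "jurisdiction_code string, location string, postcode string, "
    "latitude double, longitude double, source string, "
    "registration_number string, FirmId string"
)

args = Arguments()

# Hypothetical value: changing the fraction changes which sample is drawn for labelling.
args.setLabelDataSampleSize(0.01)

# Hypothetical pipe names and paths, one pipe per data source.
source_a = CsvPipe("sourceA", "/data/source_a.csv", schema)
source_b = CsvPipe("sourceB", "/data/source_b.csv", schema)
source_c = CsvPipe("sourceC", "/data/source_c.csv", schema)

# All pipes are passed together; records are matched across and within sources.
args.setData(source_a, source_b, source_c)
```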
Lat/long: in the community version, treating them as numeric fields with FUZZY is likely to work. You can try it as an experiment once you have this model working with the fields you are more confident in.