I currently have three sources of firmographic data for this POC. If the approach shows promise, we plan to add more sources later.
For now, I’ve concatenated all three data sources into a single file with a unified schema, resulting in approximately 6 million records (only for Australia). I currently have one input pipe, but modifying the load step to accommodate three separate pipes wouldn’t be an issue.
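A minimal sketch of what concatenating several sources into one unified schema could look like. The unified column names come from the field definitions in this post; the raw column names, the per-source mappings, and the sample rows are hypothetical placeholders, not the real source schemas.

```python
# Unified column order, taken from the field definitions in this post.
UNIFIED_COLUMNS = [
    "source_id", "name", "domain", "country", "jurisdiction_code",
    "location", "postcode", "latitude", "longitude", "source",
    "registration_number", "FirmId",
]


def to_unified(record, column_map, source_name):
    """Rename one raw record's columns to the unified schema and tag its source."""
    unified = {col: None for col in UNIFIED_COLUMNS}
    for raw_col, unified_col in column_map.items():
        if raw_col in record:
            unified[unified_col] = record[raw_col]
    unified["source"] = source_name
    return unified


def concatenate_sources(sources):
    """Combine (source_name, column_map, records) triples into one list of rows."""
    combined = []
    for source_name, column_map, records in sources:
        combined.extend(to_unified(r, column_map, source_name) for r in records)
    return combined


# Hypothetical raw rows and mappings, for illustration only.
registry_rows = [{"reg_no": "12345678901", "entity_name": "Acme Pty Ltd", "pc": "2000"}]
web_rows = [{"id": "w-1", "company": "ACME", "website": "acme.com.au"}]

combined = concatenate_sources([
    ("registry", {"reg_no": "registration_number", "entity_name": "name", "pc": "postcode"}, registry_rows),
    ("web", {"id": "source_id", "company": "name", "website": "domain"}, web_rows),
])
print(len(combined), "rows with columns", list(combined[0].keys()))
```

Columns a source does not provide stay as None, so every row has the same shape regardless of origin.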
These are the current match fields and strategies I’m aiming for:
from zingg.client import *  # provides FieldDefinition and MatchType
source_id = FieldDefinition("source_id", "string", MatchType.DONT_USE)
name = FieldDefinition("name", "string", MatchType.FUZZY)
domain = FieldDefinition("domain", "string", MatchType.EXACT)
country = FieldDefinition("country", "string", MatchType.DONT_USE)
jurisdiction_code = FieldDefinition("jurisdiction_code", "string", MatchType.DONT_USE)
location = FieldDefinition("location", "string", MatchType.FUZZY)
postcode = FieldDefinition("postcode", "string", MatchType.FUZZY)
latitude = FieldDefinition("latitude", "double", MatchType.DONT_USE)
longitude = FieldDefinition("longitude", "double", MatchType.DONT_USE)
source = FieldDefinition("source", "string", MatchType.DONT_USE)
registration_number = FieldDefinition("registration_number", "string", MatchType.EXACT)
FirmId = FieldDefinition("FirmId", "string", MatchType.DONT_USE)
# Put these in order from most important to least important! Zingg prioritizes them in that order.
field_defs = [
    registration_number,
    name,
    domain,
    country,
    postcode,
    location,
    jurisdiction_code,
    latitude,
    longitude,
    source,
    source_id,
    FirmId,
]
To prioritise fields, I’ve ordered them from most important to least important based on our current matching rules, although putting an exact match on registration number first may not be the best blocking strategy.
I’m not currently using latitude and longitude as I don’t believe the numeric MatchType is a good fit. However, I’m open to suggestions if there’s a better way to leverage these fields effectively.
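To illustrate the concern with treating the two coordinates as independent numbers: geographic closeness depends on latitude and longitude jointly, which a great-circle distance captures. The sketch below is plain Python and not part of Zingg; the function name and the example coordinates (approximate city centres) are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, used only as an example.
sydney = (-33.8688, 151.2093)
melbourne = (-37.8136, 144.9631)
print(f"{haversine_km(*sydney, *melbourne):.0f} km")
```

A single derived distance like this is easier to threshold than two separate numeric differences, but whether and how it could be fed into the matcher is an open question here.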
I was also curious whether "balancing the data" by including some known matches would be an effective strategy, for example reducing the 6 million records to 2 million, of which approximately 800,000 are known matches.
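The reduction described above could be sketched as follows: keep every record that belongs to a known match and fill the rest of the target size with a random sample of the remaining records. This is only an illustration; the function name is hypothetical, and it assumes source_id uniquely identifies a record, which may not hold across sources.

```python
import random


def build_balanced_sample(records, known_match_ids, target_size, seed=42):
    """Keep all known-match records, then pad with a random sample of the others."""
    known = [r for r in records if r["source_id"] in known_match_ids]
    rest = [r for r in records if r["source_id"] not in known_match_ids]
    n_fill = max(0, target_size - len(known))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    filler = rng.sample(rest, min(n_fill, len(rest)))
    return known + filler


# Scaled-down stand-in for 6,000,000 records -> 2,000,000 with 800,000 known matches.
records = [{"source_id": f"r{i}"} for i in range(30)]
known_match_ids = {"r0", "r1", "r2", "r3"}

sample = build_balanced_sample(records, known_match_ids, target_size=10)
n_known = sum(r["source_id"] in known_match_ids for r in sample)
print(len(sample), "records,", n_known, "from known matches")
```

Note that a sample built this way has a much higher match rate than the full data set, so it is worth checking whether a model trained on it behaves sensibly on the unsampled records.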