Hi, I am trying to run Zingg on Databricks using the link option to match different sources. In my case, I have six different sources, but I am struggling to set up the code. In the labeling phase, it is comparing companies from the same system. My knowledge of machine learning is very limited, and I am trying to set up the pipeline using the samples in the repository. Is there any additional video or tutorial that I could follow?
The repo has an example under databricks with a self-explanatory notebook. Have you been able to check that?
ah ok, so I will keep labeling and see how that goes.
if you prefer, you can add some samples through trainingSamples
Let us know how it goes and if you need anything
It worked, almost. It worked well to fuzzy match the different attributes, but it missed cases where the name is exactly the same and other attributes are null. How can I train the model to match specific cases like these?
If you have a pattern of cases it doesn't match, you can select 5-10 representative pairs and use the trainingSamples to feed that to the model.
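A minimal sketch of how such representative pairs could be written out as a sample file, in plain Python. It assumes the labeled-pair layout described in the Zingg docs for pre-existing training data (a `z_cluster` column that groups the two records of a pair, and a `z_isMatch` column that is 1 for a match and 0 for a non-match); please verify those column names against your Zingg version. The field names `name`, `address`, `system`, the helper names, and all sample rows (including the `SAP` and `Oracle` system names) are hypothetical.

```python
import csv
import io

# Hypothetical field list: the two label columns plus the fields being matched.
FIELDS = ["z_cluster", "z_isMatch", "name", "address", "system"]


def build_training_pairs(pairs):
    """Turn (record_a, record_b, is_match) tuples into labeled rows.

    Both records of a pair share one z_cluster id and one z_isMatch label.
    """
    rows = []
    for pair_id, (rec_a, rec_b, is_match) in enumerate(pairs):
        for rec in (rec_a, rec_b):
            rows.append({"z_cluster": pair_id, "z_isMatch": int(is_match), **rec})
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text; None values become empty fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


pairs = [
    # Same name, other attributes null on one side: label as a match.
    ({"name": "Acme Corp", "address": None, "system": "Salesforce"},
     {"name": "Acme Corp", "address": "1 Main St", "system": "SAP"}, True),
    # Same name, different address: label as a non-match.
    ({"name": "Acme Corp", "address": "1 Main St", "system": "SAP"},
     {"name": "Acme Corp", "address": "99 Other Rd", "system": "Oracle"}, False),
]

rows = build_training_pairs(pairs)
csv_text = to_csv(rows)
print(csv_text)
```

The resulting file would then be referenced from the `trainingSamples` entry of the Zingg configuration; how that entry is declared depends on your setup, so it is left out here.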
Yep, that is better now. But I got duplicated clusters with the same or different companies.
Cool! Can you give an example, please?
sharing an image is ok?
Here I got 4 clusters: the same 3 companies are repeated in 3 clusters, and one cluster has 4 companies.
Oh, any chance you ran Zingg match multiple times and the output got appended? Otherwise it looks like a bug to me, if your input had distinct rows.
Not really, this is before writing the final output:
============================================================
Found 9239 records that appear in multiple clusters
Sample of records in multiple clusters:
+----------+------------------+--------------------------------------------------------------------------------------------------------------------------------+------------------+
|system    |customer_id       |all_clusters                                                                                                                    |max_score         |
+----------+------------------+--------------------------------------------------------------------------------------------------------------------------------+------------------+
|Salesforce|0015G00001VXX79QAH|[1766152039827:127504, 1766152039827:45071]                                                                                     |0.8929810880954289|
|Salesforce|0015G00001VXbmsQAD|[1766152039827:74971, 1766152039827:81991]                                                                                      |0.9956035871827682|
|Salesforce|0015G00001VZpBSQA1|[1766152039827:105300, 1766152039827:10894, 1766152039827:55549, 1766152039827:6500]                                            |0.9881827003423975|
|Salesforce|0015G00001VZpDnQAL|[1766152039827:139932, 1766152039827:74607, 1766152039827:124202]                                                               |0.7169362231998548|
|Salesforce|0015G00001VZuuTQAT|[1766152039827:141609, 1766152039827:148174]                                                                                    |0.819534337404906 |
|Salesforce|0015G00001Va59OQAR|[1766152039827:141414, 1766152039827:79612]                                                                                     |0.9997608785006721|
|Salesforce|0015G00001Va5GoQAJ|[1766152039827:77935, 1766152039827:33293]                                                                                      |0.7672079743675397|
|Salesforce|0015G00001YJXZRQA5|[1766152039827:148135, 1766152039827:38454]                                                                                     |0.8880569815816263|
|Salesforce|0015G00001YJhImQAL|[1766152039827:148850, 1766152039827:3393, 1766152039827:123630, 1766152039827:123617, 1766152039827:3380, 1766152039827:112567]|0.9960268398654779|
|Salesforce|0015G00001YK7okQAD|[1766152039827:77194, 1766152039827:89583]                                                                                      |0.9999920329355999|
+----------+------------------+--------------------------------------------------------------------------------------------------------------------------------+------------------+
only showing top 10 rows
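For anyone wanting to reproduce this kind of check, here is a rough sketch in plain Python of how records that appear in more than one cluster could be found. It is not the code that produced the report above; the function name, the `(system, customer_id, cluster_id)` row shape, and the sample IDs are all hypothetical.

```python
from collections import defaultdict


def records_in_multiple_clusters(rows):
    """Map each (system, customer_id) to its cluster ids, keeping only
    the records that were assigned to more than one distinct cluster."""
    clusters_by_record = defaultdict(set)
    for system, customer_id, cluster_id in rows:
        clusters_by_record[(system, customer_id)].add(cluster_id)
    return {
        record: sorted(cluster_ids)
        for record, cluster_ids in clusters_by_record.items()
        if len(cluster_ids) > 1
    }


# Hypothetical match output: one row per (record, cluster) assignment.
rows = [
    ("Salesforce", "A1", "c1"),
    ("Salesforce", "A1", "c2"),
    ("Salesforce", "B2", "c1"),
    ("Salesforce", "A1", "c2"),  # exact repeat, counted only once
]

dupes = records_in_multiple_clusters(rows)
print(f"Found {len(dupes)} records that appear in multiple clusters")
for record, cluster_ids in dupes.items():
    print(record, cluster_ids)
```

Using a set per record means an appended duplicate of the same row does not count as a second cluster, which helps separate "output written twice" from "record genuinely in two clusters".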
Hi, happy new year. There are a few issues I am not sure how to handle.
I am still getting duplicated clusters. I was able to identify and deduplicate them, but it took a while.
I also still get duplicated companies across clusters. For example: Cluster 1 contains Company A and Company B, and Cluster 2 contains Company A and Company C. Is this the expected behavior? (Not according to the documentation, right? https://docs.zingg.ai/latest/scoring)
I also have a list of companies with the same name and different addresses, which should be treated as different companies. I added them manually as training samples, but they are still being matched into the same cluster in the final output. Any instructions on how to deal with this? Thank you!
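As a possible post-processing workaround for the overlapping-cluster case (Cluster 1 with A and B, Cluster 2 with A and C), clusters that share a record can be merged with a union-find pass. This is only a sketch in plain Python under the assumption that overlapping clusters should be collapsed into one; it is not something Zingg does for you, and the function name and sample data are hypothetical.

```python
def merge_overlapping_clusters(clusters):
    """Merge clusters that share at least one record.

    clusters: dict mapping cluster id -> iterable of record ids.
    Returns a sorted list of sorted record lists, one per merged group.
    """
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for members in clusters.values():
        members = list(members)
        if members:
            find(members[0])  # make sure single-record clusters are kept
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


clusters = {
    "c1": ["Company A", "Company B"],
    "c2": ["Company A", "Company C"],
    "c3": ["Company D"],
}
merged = merge_overlapping_clusters(clusters)
print(merged)
```

Note that the merge is transitive: Company B and Company C end up in the same group even though they were never matched to each other directly. For the same-name, different-address companies that must stay separate, that side effect may be exactly what you do not want, so check the merged groups before relying on them.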