hi @here I am exploring Zingg with Databricks UC Delta tables. I understand from the PR that this is supported only through the enterprise edition. Is there any way we can read Delta tables in volumes as input?
you can use the Pipe class directly and pass format as delta
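A minimal sketch of that suggestion, assuming the Zingg Python package is installed on the cluster. `Pipe(name, format)` and `addProperty` come from `zingg.pipes`. The `location` property key, the pipe name, and the volume path are assumptions for illustration, so check them against your Zingg version.

```python
def delta_pipe_spec(name, path):
    """Describe a Delta input pipe as plain data (no Zingg import needed)."""
    # "location" as the property key is an assumption; verify it for your version.
    return {"name": name, "format": "delta", "props": {"location": path}}


def build_pipe(spec):
    """Turn the spec into a Zingg Pipe. Requires the zingg package and a JVM."""
    from zingg.pipes import Pipe  # imported lazily so the spec helper runs anywhere

    pipe = Pipe(spec["name"], spec["format"])
    for key, value in spec["props"].items():
        pipe.addProperty(key, value)
    return pipe


# Hypothetical volume path, for illustration only.
spec = delta_pipe_spec("customers", "/Volumes/my_catalog/my_schema/my_volume/customers")
print(spec["format"])
# On a cluster with Zingg installed: args.setData(build_pipe(spec))
```

Keeping the spec as plain data makes it easy to print and compare against the JSON config before handing it to Zingg.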
hi Sonal G., thanks. Also, assume we already have some ground truth added per https://docs.zingg.ai/latest/stepbystep/createtrainingdata/addowntrainingdata . Can we directly use the match phase, or should we run both the train and match phases even when we have training data? Our idea is to accumulate the training data (ground truth) with more and more rows and do entity resolution incrementally, e.g. (100 rows) -> 1000K (Zingg identifies and adds to ground truth), then do ER on 100K records, and so on. So can we use match alone?
Also, a final question: is there any doc that covers DBR / Zingg jar compatibility?
Zingg 0.5.0 works well with DBR 16.4 (Scala 2.12).
To your question about training: the best way to use Zingg is to label the edge cases through the labeler. If you have handy labels, add representative cases as training samples and run the findTrainingData and label phases a few times to let Zingg discover the right parameters. You should always train on all your data so that the right scenarios can be picked up.
Hope that helps
A question on that: what percentage of the whole dataset should be labelled for the model to learn enough to generate correctly matched and non-matched pairs?
40-50 matching pairs to start with. Refine with a few more rounds and trainingSamples as needed.
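To see how close existing labels are to that 40-50 figure, the pairs can be counted per label. This is a sketch on plain Python rows, assuming the labelled data follows the layout in the Zingg docs for bringing your own training data: both rows of a pair share a `z_cluster` id, and `z_isMatch` is 1 for a match and 0 for a non-match. The function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def count_labelled_pairs(rows):
    """Count labelled pairs per z_isMatch value, one count per z_cluster."""
    labels_by_cluster = defaultdict(set)
    for row in rows:
        labels_by_cluster[row["z_cluster"]].add(row["z_isMatch"])

    counts = defaultdict(int)
    for labels in labels_by_cluster.values():
        # A well-formed pair carries the same label on both of its rows;
        # clusters with conflicting labels are skipped.
        if len(labels) == 1:
            counts[next(iter(labels))] += 1
    return dict(counts)


# Illustrative rows only: three labelled pairs, two matches and one non-match.
rows = [
    {"z_cluster": "a", "z_isMatch": 1}, {"z_cluster": "a", "z_isMatch": 1},
    {"z_cluster": "b", "z_isMatch": 0}, {"z_cluster": "b", "z_isMatch": 0},
    {"z_cluster": "c", "z_isMatch": 1}, {"z_cluster": "c", "z_isMatch": 1},
]
counts = count_labelled_pairs(rows)
print("matching pairs:", counts.get(1, 0))
```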
Ok, a few more -
So doesn't the model tell you when it is trained?
Also, I couldn't get any output from the 'match' phase but could see some output from the 'link' phase. Is the match phase not supposed to produce an output, or am I missing something here?
I followed Databricks tutorial for the spike that I am doing for Zingg. Is there a detailed doco for all the available features across community and enterprise edition so that I can get myself familiar with Zingg terms better?
During labeling, you do get the pairs to label and what Zingg is predicting, so that you can judge if it is converging. Can you share what you mean by not getting any output?
detailed product edition info is at https://www.zingg.ai/product/zingg-entity-resolution-compare-versions
I executed match phase (seems like intermediate results cannot be stored on UC volumes and hence, needed to copy the trained model into a local path):
options = ClientOptions([ClientOptions.PHASE, "match"])
# Run match against the DBFS-staged model (see DBFS workaround section above)
with zingg_dbfs_context(args):
    zingg = ZinggWithSpark(args, options)
    zingg.initAndExecute()
And after that, this code, which goes into the except block:
try:
    outputDF = spark.read.csv(ZINGG_OUTPUT_PATH)
    colNames = ["z_minScore", "z_maxScore", "z_cluster", "rec_id", "fname", "lname", "stNo", "add1", "add2", "city", "state", "dob", "ssn"]
    outputDF.toDF(*colNames).show(100)
except Exception as e:
    if "PATH_NOT_FOUND" in str(e):
        print(f"No match output found at: {ZINGG_OUTPUT_PATH}")
    else:
        raise
Not sure what I am missing?
Hard to say, check the logs for the match phase?
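Alongside the logs, a quick check is whether anything was written at the output location at all, and with which file extension. This is a sketch in plain Python, assuming the path is visible on the local filesystem (on Databricks, for example through a mounted path). `summarize_output` is a hypothetical helper, not part of Zingg, and the temporary directory only stands in for the real output path.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_output(path):
    """Report whether an output path exists and count its data files by extension."""
    root = Path(path)
    if not root.exists():
        return {"exists": False, "files": {}}
    if root.is_file():
        candidates = [root]
    else:
        candidates = []
        for p in root.rglob("*"):
            parts = p.relative_to(root).parts
            # Skip Spark markers and metadata such as _SUCCESS, _delta_log, .crc files.
            if p.is_file() and not any(part.startswith(("_", ".")) for part in parts):
                candidates.append(p)
    return {"exists": True, "files": dict(Counter(p.suffix for p in candidates))}


# Demo on a temporary directory standing in for the real output location.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "zinggOutput"
    missing = summarize_output(out)
    out.mkdir()
    (out / "_SUCCESS").touch()
    (out / "part-00000.csv").touch()
    found = summarize_output(out)

print(missing)
print(found)
```

If the path is missing, the match phase likely wrote elsewhere or failed before writing. If files exist under a different extension, the reader in the try block may need to match the output pipe's format.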