How to Train and Label Data with Custom Schemas and Understand Model IDs in Spark

·Feb 20, 2025 04:59 PM

Hi all, I am going through the training and labeling examples, but they only cover specific fields. I want to train on my own data with a different schema, for example with extra fields such as email or phone. When I try to add those fields, I get "field not found" errors in Spark. Is there an example of training and labeling data when the schema varies? Also, I see model IDs such as 100, 101, etc., which I am placing in the config file, but I don't see any documentation that explains the difference between them. Appreciate any help. Thanks

14 comments


Sania G.
·
Mickey, see https://github.com/zinggAI/zingg/tree/main/examples for examples with varied schemas, one per dataset. You can go through these and let me know in case of any queries or concerns.
Sania G.
·
Models with id 100, 101, etc. are pre-trained models that we have for the above examples. Each config has a different model id: for example, the febrl dataset has modelId 100, the febrl120k dataset has modelId 101, and so on.
Mickey
·
Sonal G. thanks for the reply! Do we have any documents on how to select these models based on the use case?
Mickey
·
I think I found it. It is modelIds.txt
✅1
Mickey
·
I went through the examples, but I don't see a single one using phone and email addresses. Do I need to point to a specific model for that? An example config would be great to have!
Sania G.
·
You can refer to this for field definitions on phone and email ID: https://docs.zingg.ai/latest/stepbystep/configuration/field-definitions
Mickey
·
I actually tried that earlier, which is why I raised this issue.
Mickey
·
It gives a Spark SQL AnalysisException: the email_address field can't be resolved.
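Spark raises an AnalysisException of this kind when a query refers to a column that is not present in the data it is reading. One cause worth ruling out is a mismatch between the field names in the config and the column names in the input file. The sketch below is a minimal check under stated assumptions: the `fieldDefinition` / `fieldName` keys follow the layout of the example configs, while the helper `missing_fields`, the sample config, and the sample CSV header are hypothetical and only for illustration.

```python
import csv
import io


def missing_fields(config, header_line):
    """Return config field names that do not appear in the CSV header.

    `config` is a dict shaped like a Zingg JSON config (assumed layout);
    `header_line` is the first line of the input CSV file.
    """
    # Parse the header with the csv module so quoted column names are handled.
    columns = {c.strip() for c in next(csv.reader(io.StringIO(header_line)))}
    return [
        f["fieldName"]
        for f in config["fieldDefinition"]
        if f["fieldName"] not in columns
    ]


# Hypothetical config fragment and header, for illustration only.
sample_config = {
    "fieldDefinition": [
        {"fieldName": "fname"},
        {"fieldName": "email_address"},
        {"fieldName": "phone"},
    ]
}
sample_header = "id,fname,lname,email,phone"

unresolved = missing_fields(sample_config, sample_header)
print(unresolved)  # ['email_address']
```

If the list comes back empty, the column names line up and the cause is likely elsewhere.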
Sonal G.
·
To build a new model with your fields, please change the modelId
❤️1
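As a rough illustration of that advice, the sketch below generates a config that adds email and phone fields and uses a modelId not already taken by the bundled examples. It is a sketch under assumptions, not a verified Zingg config: the key names (`fieldDefinition`, `data`, `output`, `modelId`, `zinggDir`, and so on) follow the layout of the example configs in the repository, the `matchType` values should be checked against the field-definitions page linked above, and the column names, file paths, and the value 200 are hypothetical.

```python
import json


def build_config(model_id, zingg_dir="models"):
    """Build a Zingg-style config dict with extra email and phone fields.

    All column names, paths, and matchType choices are illustrative
    assumptions; adjust them to the actual data and the Zingg docs.
    """
    # (column name, matchType) pairs; matchType values are assumptions.
    fields = [
        ("id", "dont_use"),
        ("fname", "fuzzy"),
        ("lname", "fuzzy"),
        ("email_address", "email"),
        ("phone", "fuzzy"),
    ]
    field_definition = [
        {"fieldName": name, "fields": name, "dataType": "string", "matchType": mt}
        for name, mt in fields
    ]
    return {
        "fieldDefinition": field_definition,
        "data": [{
            "name": "customers",  # hypothetical input name
            "format": "csv",
            "props": {"location": "data/customers.csv", "delimiter": ",", "header": True},
        }],
        "output": [{
            "name": "output",
            "format": "csv",
            "props": {"location": "/tmp/zinggOutput", "delimiter": ",", "header": True},
        }],
        "labelDataSampleSize": 0.5,
        "numPartitions": 4,
        # A fresh id, so training starts from scratch with the new fields
        # instead of reusing a model built for another schema.
        "modelId": model_id,
        "zinggDir": zingg_dir,
    }


config = build_config(model_id=200)
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON can be saved to a file and passed to Zingg as its config. The only point the thread itself confirms is the modelId change; everything else here is scaffolding.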
Mickey
·
Thanks, that worked. I could run trainMatch easily on my local machine with 200 records, but when I tried with 20k it simply hung. Maybe I need to try this on AWS and see.
Sonal G.
·
20k should be doable locally, even half a million. What's the setup you have?
Mickey
·
I just used the default settings in zingg.conf and am running via Docker on a Mac M1.
Mickey
·
Forgot to update: yes, it worked fine. I just restarted my laptop. It ran fine with 20k records and gave nice results :)
Sonal G.
·
That’s cool, glad to hear that!