Best Practices for Training Multiple Models in Databricks Using Shared Initial and Incremental Datasets

Nov 17, 2024 06:14 AM

What is the best way to train different models in Databricks on the same initial and incremental datasets, with the goal of finding the best-performing model?

I've modified the ncvoters notebooks to work with my data. I created 3 separate directories and changed the names of the initial and incremental tables so that each is saved under the correct path for its directory, but I'm running into an issue. I ran the Label Training section 5 times: I labeled 25 negative pairs in the first round and a total of 80 positive pairs across the following rounds. When I call trainMatch.execute(), I get the following error: "zingg.common.client.ZinggClientException: Unable to train as insufficient training data found. Training data has 0 matches and 25 non matches. Please run findTrainingData and label till you have sufficient labelled data to build the models".

It seems that when Zingg looks for the labeled pairs, it finds only the first set that I labeled and none of the later ones. I checked MARKED_DIR and it is correct; I can load the data from it and see all the pairs, labeled correctly.

Do the tables always have to be named initial and incremental? What am I missing?
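For anyone who wants to reproduce the check described above, here is a minimal sketch of how the labels in a marked directory could be tallied and compared with the counts in the error message. It assumes the labels live in a column named `z_isMatch` with 0 for non-match, 1 for match, and 2 for unsure; that encoding, the helper `summarize_labels`, and the sample rows are illustrative assumptions, not something confirmed for this setup. Note that it counts records, which may not be the same unit as the pairs reported in the error.

```python
from collections import Counter

# ASSUMPTION: label encoding of the z_isMatch column; verify it for your Zingg version.
LABEL_NAMES = {0: "non_matches", 1: "matches", 2: "unsure"}


def summarize_labels(rows, label_col="z_isMatch"):
    """Count labeled records per class from an iterable of dict-like rows.

    Hypothetical helper for illustration; any row type that supports
    row[label_col] works (plain dicts, PySpark Row objects, ...).
    """
    counts = Counter(
        LABEL_NAMES.get(int(row[label_col]), "other") for row in rows
    )
    # Always report every class so that a missing class shows up as 0.
    return {
        name: counts.get(name, 0)
        for name in ("matches", "non_matches", "unsure", "other")
    }


# Hypothetical sample rows standing in for the contents of one marked directory.
sample_rows = [
    {"z_isMatch": 0}, {"z_isMatch": 0},
    {"z_isMatch": 1}, {"z_isMatch": 1},
    {"z_isMatch": 2}, {"z_isMatch": 2},
]

summary = summarize_labels(sample_rows)
print(summary)
```

On Databricks, the rows could come from however the marked data is already being loaded, for example by collecting the `z_isMatch` column of the DataFrame read from each model's MARKED_DIR. Running the same tally once per model directory shows whether the later labeling rounds were written to the location that the training phase actually reads.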
