Hi, I’ve been experimenting with Zingg on Databricks for the past week or two, and I just want to confirm that I’m using it correctly.
I have been loading data into a DataFrame, passing it to a Zingg pipe, and specifying an S3 path where the matched output is saved in Delta format, like so:
df = spark.read.load(sample_path)
inputPipe = InMemoryPipe("sample_input", df)
outputPipe = Pipe(name="sample_output", format="delta")
outputPipe.addProperty("path", output_path)
args.setData(inputPipe)
args.setOutput(outputPipe)
I run multiple rounds of findTrainingData, label and save until I have at least 30-40 matched pairs.
After this I run trainMatch, which trains the model and matches all the data from the inputPipe, writing the results to the outputPipe.
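To check the count between rounds, something like the sketch below could be used. It assumes the labelled records have been collected into a list of dicts, and that they carry the z_cluster (pair id) and z_isMatch (1 = match, 0 = no match, 2 = not sure) columns; count_matched_pairs is a hypothetical helper, not part of Zingg.

```python
from collections import defaultdict


def count_matched_pairs(labelled_rows):
    """Count labelled pairs that were marked as matches.

    Assumption: each row is a dict with 'z_cluster' (shared by the two
    records of a pair) and 'z_isMatch' (1 = match, 0 = no match, 2 = not sure).
    """
    labels_by_pair = defaultdict(list)
    for row in labelled_rows:
        labels_by_pair[row["z_cluster"]].append(row["z_isMatch"])

    # A pair counts only if it has exactly two records and both are labelled 1.
    return sum(
        1
        for labels in labels_by_pair.values()
        if len(labels) == 2 and all(label == 1 for label in labels)
    )
```

With a Spark DataFrame of labelled records, the rows could be obtained with something like [r.asDict() for r in labelled_df.collect()], which is fine for the small labelled set but not for the full data.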
If I want to run the previously trained model on a different set of data, I have been changing the input pipe like so:
full_data_df = spark.read.load(full_data_path)
inputPipe = InMemoryPipe("sample_input", full_data_df)
and then running just the match phase:
### just run matches
options = ClientOptions([ClientOptions.PHASE, "match"])
#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()
Is this the correct process for working with Zingg?