Is this the correct process for using Zingg with Spark and Delta format on Databricks?
Hi, I've been experimenting with Zingg on Databricks for the past week or two, and I just want to confirm that I'm using it correctly. I have been loading data into a DataFrame, passing that to a Zingg pipe, and specifying an S3 path where the matched output is saved in Delta format, like so:
df = spark.read.load(sample_path)
inputPipe = InMemoryPipe("sample_input", df)
outputPipe = Pipe(name="sample_output", format="delta")
outputPipe.addProperty("path", output_path)
args.setData(inputPipe)
args.setOutput(outputPipe)
I run multiple rounds of findTrainingData, label, and save until I have at least 30-40 matched pairs. After this I run trainMatch, which trains the model and fully matches the data from the inputPipe into the outputPipe. If I want to run the previously trained model on a different set of data, I have been changing the input pipe like so:
full_data_df = spark.read.load(full_data_path)
inputPipe = InMemoryPipe("sample_input", full_data_df)
and then running just match:
### just run matches
options = ClientOptions([ClientOptions.PHASE, "match"])
#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()
Is this the correct process for working with Zingg?
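For the labeling step, here is a minimal sketch of one way to check whether enough matched pairs have been labeled before moving on to trainMatch. It assumes the labeled rows carry the `z_cluster` and `z_isMatch` columns (with `1` marking a match) and that each labeled pair shares one `z_cluster` value. The helper name `count_matched_pairs`, the `min_pairs` default, and the sample rows are hypothetical and not part of Zingg.

```python
from collections import defaultdict


def count_matched_pairs(rows, min_pairs=30):
    """Count labeled pairs marked as matches and report whether the target is met.

    rows: iterable of dicts with "z_cluster" and "z_isMatch" keys
          (assumed layout of the labeled data; 1 means match).
    Returns (number_of_matched_pairs, target_reached).
    """
    # Group the match labels by the cluster id that ties a pair together.
    labels_by_cluster = defaultdict(list)
    for row in rows:
        labels_by_cluster[row["z_cluster"]].append(row["z_isMatch"])

    # A pair counts as matched only if it has two rows and both are labeled 1.
    matched = sum(
        1
        for labels in labels_by_cluster.values()
        if len(labels) == 2 and all(label == 1 for label in labels)
    )
    return matched, matched >= min_pairs


# Hypothetical sample: one matched pair and one non-matching pair.
sample_rows = [
    {"z_cluster": "a", "z_isMatch": 1},
    {"z_cluster": "a", "z_isMatch": 1},
    {"z_cluster": "b", "z_isMatch": 0},
    {"z_cluster": "b", "z_isMatch": 0},
]
matched, enough = count_matched_pairs(sample_rows)
print(matched, enough)
```

On Databricks the rows could come from a DataFrame of the labeled data, for example `[r.asDict() for r in marked_df.select("z_cluster", "z_isMatch").collect()]`, where `marked_df` is a placeholder for wherever the labeled pairs were saved.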