Is this the correct process for using Zingg with Spark and Delta format on Databricks?

Apr 24, 2024 01:30 PM

Hi, I’ve been experimenting with Zingg on Databricks for the past week or two, and I want to confirm that I’m using it correctly. I load data into a DataFrame, pass it to a Zingg input pipe, and specify an S3 path where the matched output is saved in Delta format, like so:

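from zingg.client import *
from zingg.pipes import *

# Load the sample data and wrap the DataFrame in an in-memory input pipe
df = spark.read.load(sample_path)
inputPipe = InMemoryPipe("sample_input", df)

# Write the matched output in Delta format to output_path (an S3 path)
outputPipe = Pipe(name="sample_output", format="delta")
outputPipe.addProperty("path", output_path)

# args is assumed to be the Zingg Arguments object configured earlier
args.setData(inputPipe)
args.setOutput(outputPipe)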

I run multiple rounds of findTrainingData, label, and save until I have at least 30-40 matched pairs. After this I run trainMatch, which trains the model and matches all of the data from the inputPipe into the outputPipe. If I want to run the previously trained model on a different set of data, I change the input pipe like so:

full_data_df = spark.read.load(full_data_path)
inputPipe = InMemoryPipe("sample_input", full_data_df)
args.setData(inputPipe)  # register the new input pipe with the arguments

and then run just the match phase:

# Run only the match phase
options = ClientOptions([ClientOptions.PHASE, "match"])

# Execute Zingg for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()
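
The same phase switch can be wrapped in a small helper so that each run differs only in the phase name. This is a rough sketch rather than anything Zingg itself provides: `run_phases` and its `execute` parameter are hypothetical names, and the default path assumes the `zingg.client` module is importable on the cluster. The import is done lazily so the helper can be exercised with a stand-in function when Zingg is not installed.

```python
def run_phases(args, phases, execute=None):
    """Run the given Zingg phases in order against one arguments object."""
    if execute is None:
        # Imported lazily so this sketch can be tested without Zingg installed.
        from zingg.client import ClientOptions, ZinggWithSpark

        def execute(args, phase):
            options = ClientOptions([ClientOptions.PHASE, phase])
            ZinggWithSpark(args, options).initAndExecute()

    completed = []
    for phase in phases:
        execute(args, phase)
        completed.append(phase)
    return completed


# Dry run with a stand-in executor that only records the phase names.
recorded = []
done = run_phases("args-placeholder", ["trainMatch", "match"],
                  execute=lambda a, p: recorded.append(p))
print(done)
```

On a cluster the call would be `run_phases(args, ["match"])` after the new input pipe has been set on `args`.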

Is this the correct process for working with Zingg?
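
The "at least 30-40 matched pairs" stopping rule mentioned above can be checked with a small count after each labeling round. The sketch below rests on assumptions: it takes the labeled records to carry a `z_cluster` value shared by both records of a pair and a `z_isMatch` value of 1 for a match, and `count_matched_pairs` and `needs_more_labeling` are hypothetical helper names. It works on plain dictionaries so it can run without Spark; with a DataFrame, the rows would first need to be collected or the same logic expressed as a distinct count.

```python
def count_matched_pairs(rows, cluster_col="z_cluster", label_col="z_isMatch"):
    """Count distinct pairs whose label marks them as a match (assumed: 1)."""
    matched_clusters = {row[cluster_col] for row in rows if row[label_col] == 1}
    return len(matched_clusters)


def needs_more_labeling(rows, minimum=30):
    """Return True while fewer than `minimum` matched pairs have been labeled."""
    return count_matched_pairs(rows) < minimum


# Three labeled pairs (two records each): two matches and one non-match.
labeled = [
    {"z_cluster": "a", "z_isMatch": 1}, {"z_cluster": "a", "z_isMatch": 1},
    {"z_cluster": "b", "z_isMatch": 0}, {"z_cluster": "b", "z_isMatch": 0},
    {"z_cluster": "c", "z_isMatch": 1}, {"z_cluster": "c", "z_isMatch": 1},
]
print(count_matched_pairs(labeled))
print(needs_more_labeling(labeled))
```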

11 comments