Jesse S.

Commented on How to Link Two Datasets Using PySpark on Databric... · Posted in Help Zingg

Just for the sake of sharing and searchability, here is a code snippet for running linking between 3 data sources:

from zingg.client import *
from zingg.pipes import *


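# modelId, zinggDir, source_path, and output_path are assumed to be defined earlier in the notebook
args = Arguments()
args.setModelId(modelId)
args.setZinggDir(zinggDir)

# For demonstration, three samples of the same file stand in for three separate sources
df = spark.read.option("header", "true").csv(source_path).select("FirstName", "LastName")
df1 = df.limit(1000)
df2 = df.limit(100)
df3 = df.limit(200)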
inputPipe1 = InMemoryPipe("source1", df1)
inputPipe2 = InMemoryPipe("source2", df2)
inputPipe3 = InMemoryPipe("source3", df3)


outputPipe = Pipe(name="output", format="delta")
outputPipe.addProperty("path", output_path)

args.setData(inputPipe1, inputPipe2, inputPipe3)
args.setOutput(outputPipe)

field_FirstName = FieldDefinition("FirstName", "string", MatchType.FUZZY)
field_LastName = FieldDefinition("LastName", "string", MatchType.FUZZY)
fieldDefs = [field_FirstName, field_LastName]
args.setFieldDefinition(fieldDefs)

options = ClientOptions([ClientOptions.PHASE, "link"])
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

Feel free to let me know if something about this can be done better.
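
One possible tidy-up of the snippet above, offered as a sketch rather than an official Zingg pattern: the per-source pipes and per-column field definitions can be built in loops. `configure_link_inputs`, `frames`, and `columns` are hypothetical names introduced here. The sketch assumes `InMemoryPipe` is importable from `zingg.pipes` and `FieldDefinition` and `MatchType` from `zingg.client` (the snippet above only uses wildcard imports), and that every column is a fuzzy-matched string as in the original.

```python
def configure_link_inputs(args, frames, columns):
    """Attach one InMemoryPipe per source and one fuzzy string field per column.

    args    -- a Zingg Arguments object, as created in the snippet above
    frames  -- dict mapping a source name to its Spark DataFrame
    columns -- column names to keep and match on
    """
    # Imported inside the function (module locations are an assumption) so the
    # helper can be defined before Zingg is available on the cluster.
    from zingg.client import FieldDefinition, MatchType
    from zingg.pipes import InMemoryPipe

    # One in-memory pipe per named source, restricted to the matching columns.
    pipes = [InMemoryPipe(name, df.select(*columns)) for name, df in frames.items()]
    # Same varargs call as setData(inputPipe1, inputPipe2, inputPipe3) above.
    args.setData(*pipes)

    # One fuzzy string field definition per column.
    field_defs = [FieldDefinition(col, "string", MatchType.FUZZY) for col in columns]
    args.setFieldDefinition(field_defs)
    return pipes, field_defs
```

With this helper, `configure_link_inputs(args, {"source1": df1, "source2": df2, "source3": df3}, ["FirstName", "LastName"])` would stand in for the pipe and field-definition lines; the output pipe and the `link` phase stay as they are.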

Commented on How to Link Two Datasets Using PySpark on Databric... · Posted in Help Zingg

Jesse S.

Well, of course I answered my own question right after finally giving up and asking. This post has a code snippet showing that we can pass multiple input pipes to the data field, which I hadn’t realized.

Posted in Help Zingg

Jesse S.

How to Link Two Datasets Using PySpark on Databricks: Syntax and Examples

What is the syntax to link two datasets? Preferably in terms of PySpark on Databricks (but I’ll take what I can get). I’ve seen some references to this feature in this channel, as well as a page in the docs, but the specifics are a bit beyond me.

5 Comments

Commented on Is this the correct process for using Zingg with S... · Posted in Help Zingg

Jesse S.

Vikas G. Ok thanks, that’s good to know. Is there some documentation on what the enterprise version offers compared to open source?

Commented on Is this the correct process for using Zingg with S... · Posted in Help Zingg

Jesse S.

Thanks, I ran into that a couple times initially. Is there any way to configure it to use overwrite in the case of delta or possibly other formats?

Commented on Is this the correct process for using Zingg with S... · Posted in Help Zingg

Jesse S.

Ok perfect, thanks for confirming

Posted in Help Zingg

Jesse S.

Is this the correct process for using Zingg with Spark and Delta format on Databricks?

Hi, I’ve been experimenting with Zingg for the past week or two on Databricks, and I just want to confirm that I’m using it correctly. I load data into a dataframe, pass that to a Zingg pipe, and specify an S3 path where the matched output is saved in Delta format, like so:

df = spark.read.load(sample_path)
inputPipe = InMemoryPipe("sample_input", df)

outputPipe = Pipe(name="sample_output", format="delta")
outputPipe.addProperty("path", output_path)

args.setData(inputPipe)
args.setOutput(outputPipe)

I run multiple rounds of findTrainingData, label, and save until I have at least 30-40 matched pairs. After this I run trainMatch, which trains the model and fully matches the data from the inputPipe into the outputPipe. If I want to run the previously trained model on a different set of data, I have been changing the input pipe like so:

full_data_df = spark.read.load(full_data_path)
inputPipe = InMemoryPipe("sample_input", full_data_df)

and then running just match:

### just run matches
options = ClientOptions([ClientOptions.PHASE, "match"])

#Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()

Is this the correct process for working with Zingg?
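
For reference, the phase-running boilerplate in the workflow above could be wrapped in a small helper. This is only a sketch of the steps as described in this post, not an official Zingg recipe: `run_phase` is a hypothetical name, and it assumes `ClientOptions` and `ZinggWithSpark` are importable from `zingg.client` and behave as in the snippet above. The interactive labeling step is left out.

```python
def run_phase(args, phase):
    """Run a single Zingg phase (e.g. "findTrainingData", "trainMatch", "match").

    args  -- a configured Zingg Arguments object (model id, data, output, fields)
    phase -- the phase name passed to ClientOptions.PHASE
    """
    # Imported inside the function (module location is an assumption) so the
    # helper can be defined before Zingg is available on the cluster.
    from zingg.client import ClientOptions, ZinggWithSpark

    options = ClientOptions([ClientOptions.PHASE, phase])
    zingg = ZinggWithSpark(args, options)
    zingg.initAndExecute()
    return zingg
```

Under those assumptions, the process described above becomes `run_phase(args, "findTrainingData")`, then labeling, repeated until there are enough matched pairs, then `run_phase(args, "trainMatch")`. To reuse the trained model on new data, swap the input pipe with `args.setData(...)` and call `run_phase(args, "match")`.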

11 Comments

Commented on Welcome Jesse S. to the Team! · Posted in Introduce Yourself

Jesse S.

Hi Sonal G.! I am a senior data engineer at Conde Nast. I'm currently exploring Zingg to see if it is a viable solution for some of our initiatives. Thanks for the Slack channel!