Troubleshooting Zingg execute performance and ambiguous reference error with Spark 3.5.0 integration

Wayne F. · 2025-02-20T14:02:27.090Z

As I mentioned in the other thread, trying to %pip install zingg causes issues with versions in pip. I can do a plain %pip install zingg, which of course doesn't have the compatibility view and works, leaving Zingg at 0.4.0 and Spark at 3.5.0, but there is a mis-specification in the requirements somewhere. However, we are again running into a problem where zingg.execute() runs for 11 minutes (on a truncated input dataset) and ends without indicating an error. Looking at the cluster driver logs, I get: org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704

Wayne F.
·
Input table info:
Input table has 487,072 rows
Input columns: ['ID', 'FIXED_NAME', 'FIXED_STREET_ADDRESS', 'FIXED_BUS_ID']
Wayne F.
·
model definitions: [['ID', 'string', JavaObject id=o460], ['FIXED_NAME', 'string', JavaObject id=o458], ['FIXED_STREET_ADDRESS', 'string', JavaObject id=o458], ['FIXED_BUS_ID', 'string', JavaObject id=o458]]
Wayne F.
·
- - ID
  - DONT_USE
- - FIXED_NAME
  - FUZZY
- - FIXED_STREET_ADDRESS
  - FUZZY
- - FIXED_BUS_ID
  - FUZZY
Wayne F.
·
So we're neither creating nor specifying a duplicate id field. Spark SQL lowercases ID to id, in my experience. Does Zingg create an internal column named id or ID?
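Editor's sketch: one quick way to rule out a clash inside the input itself is to compare the column names case-insensitively, since Spark SQL resolves column names case-insensitively by default (spark.sql.caseSensitive is false). The helper name below is hypothetical and not part of Zingg or Spark; the column list is the one posted above.

```python
from collections import defaultdict


def case_insensitive_duplicates(columns):
    """Group column names that collide when compared case-insensitively."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    # Keep only the groups that contain more than one original name.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Column list taken from the post above.
input_columns = ["ID", "FIXED_NAME", "FIXED_STREET_ADDRESS", "FIXED_BUS_ID"]
print(case_insensitive_duplicates(input_columns))  # an empty dict means no clash
```

An empty result here only shows that the input columns do not clash with each other; it says nothing about columns added later in the pipeline.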
Luke B.
·
yes, I believe you should be considering "id" as a protected/reserved column in zingg
Luke B.
·
I've hit the same issue recently and renaming my source column to "sourceid" or something else resolved the issue
👍1
Luke B.
·
note: IANAZE (I am not a zingg employee)
🤔1
Wayne F.
·
Will do. I had inserted kludgey, on-the-fly code to rename it and then ran into a hard crash but I might've done that wrong. Thanks so much for the confirmation!
Sonal G.
·
This is strange. The febrl example does have an id. The code does not assume id as a protected field either. By any chance do you have graph frames preinstalled on the cluster? They use id internally, but Zingg works around that.
Luke B.
·
I don't think so, Sonal. In my example (I just hit this on Tuesday, actually) I had two CSVs that I uploaded to my Databricks workspace, directly to the hive_metastore. This created Delta Lake versions of the two tables. They both had the same schema:
id - bigint
name - string
address - string
zip - bigint
I configured Zingg with 2 input pipes, one for each table (although pointing to the underlying files, of course). I was able to get all the way to the trainMatch stage... which, interestingly, did not fail in the notebook. It actually showed up as a successful completion. But when I went to look at the data in the outputPipe, no file had been created. I looked at the driver logs and found the same (or at least a similar) stack trace:
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704
    at org.apache.spark.sql.errors.QueryCompilationErrors$.ambiguousReferenceError(QueryCompilationErrors.scala:2627)
    at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:407)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:183)
👍🏻1
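Editor's sketch: because the notebook reports success and the error only appears in the driver logs, it can help to search the log text for this specific message. The snippet below assumes you have already copied the driver log into a string; how you fetch it from the cluster is left out, and the function name is hypothetical.

```python
import re

# Matches the Spark error text quoted in this thread and captures the column name.
AMBIGUOUS = re.compile(r"\[AMBIGUOUS_REFERENCE\] Reference `([^`]+)` is ambiguous")


def ambiguous_columns(log_text):
    """Return the distinct ambiguous column names, in order of first appearance."""
    seen = []
    for match in AMBIGUOUS.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Sample text is the error line quoted above.
sample = (
    "org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] "
    "Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704"
)
print(ambiguous_columns(sample))  # ['id']
```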
Luke B.
·
I then renamed the id column in both tables, updated the relevant FieldDefinition() and started from scratch and everything worked as expected
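Editor's sketch of the rename workaround described above. Whether `id` really is reserved is disputed later in this thread, so treat this as a workaround that helped two users, not as documented Zingg behavior. The helper names and the `source_` prefix are assumptions; `withColumnRenamed` is the standard PySpark DataFrame method.

```python
def plan_renames(columns, clashing=("id",), prefix="source_"):
    """Map each column whose lowercased name is in `clashing` to a prefixed name."""
    clashing = {c.lower() for c in clashing}
    return {c: prefix + c.lower() for c in columns if c.lower() in clashing}


def apply_renames(df, renames):
    """Apply the planned renames to a Spark DataFrame, one column at a time."""
    for old, new in renames.items():
        df = df.withColumnRenamed(old, new)
    return df


# Schema taken from the post above.
renames = plan_renames(["id", "name", "address", "zip"])
print(renames)  # {'id': 'source_id'}
```

After renaming, the matching FieldDefinition() has to use the new column name as well, as noted above.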
Wayne F.
·
Sonal G. It appears that Zingg uses GraphFrames internally, based on my error message in the next thread. (As I mentioned to Luke, renaming the column gets past this mysterious error -- only visible in the cluster logs -- and I then get a hard Java error in the notebook.)
Sonal G.
·
Yes we do use graph frames but we ensure that id clash doesn’t happen. Sania G. can you please run the febrl example by changing the recid to id in both main and 0.4.0 and report back what happens?
👍1
Sania G.
·
on using id - everything runs perfectly in both cases
Sonal G.
·
Wayne F. our local tests with spark 3.5.0 run fine with id

27 comments
