As I mentioned in the other thread, trying to %pip install zingg causes issues with versions in pip. I can do just %pip install zingg, which of course doesn't have the compatibility view and works, leaving Zingg at 0.4.0 and Spark at 3.5.0, but there is a mis-specification in the requirements somewhere. However, we are again running into a problem where zingg.execute() runs for 11 minutes (on a truncated input dataset) and ends without indicating an error. Looking at the cluster driver logs, I get:
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704
Input table info:
Input table has 487,072 rows
Input columns: ['ID', 'FIXED_NAME', 'FIXED_STREET_ADDRESS', 'FIXED_BUS_ID']
model definitions: [['ID', 'string', JavaObject id=o460], ['FIXED_NAME', 'string', JavaObject id=o458], ['FIXED_STREET_ADDRESS', 'string', JavaObject id=o458], ['FIXED_BUS_ID', 'string', JavaObject id=o458]]
- - ID
  - DONT_USE
- - FIXED_NAME
  - FUZZY
- - FIXED_STREET_ADDRESS
  - FUZZY
- - FIXED_BUS_ID
  - FUZZY
So we're neither creating nor specifying a duplicate id field. In my experience, Spark SQL lowercases ID to id. Does Zingg create an internal column named id or ID?
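For context on the ID vs. id point: by default Spark resolves column names case-insensitively (spark.sql.caseSensitive is false), so ID and id would count as the same reference. A rough way to check a column list for that kind of clash before running anything is sketched below. The helper name find_case_insensitive_collisions and the extra 'id' column in the second call are made up for illustration, the latter standing in for whatever column another step might add.

```python
from collections import defaultdict


def find_case_insensitive_collisions(columns):
    """Group column names that become identical once case is ignored."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one original name maps to the same key.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Column list copied from the input table info above.
input_columns = ['ID', 'FIXED_NAME', 'FIXED_STREET_ADDRESS', 'FIXED_BUS_ID']

# No clash in the input on its own.
print(find_case_insensitive_collisions(input_columns))

# Hypothetical: a second 'id' column appears somewhere downstream.
print(find_case_insensitive_collisions(input_columns + ['id']))
```

The input columns alone produce no collision, which fits the observation that the duplicate is not coming from the table itself.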
Yes, I believe you should treat "id" as a protected/reserved column in Zingg.
Will do. I had inserted kludgey, on-the-fly code to rename it and then ran into a hard crash but I might've done that wrong. Thanks so much for the confirmation!
This is strange. The febrl example does have an id. The code does not assume id as a protected field either. By any chance do you have graph frames preinstalled on the cluster? They use id internally, but Zingg works around that.
I don't think so, Sonal. In my example (I just hit this on Tuesday, actually) I had two CSVs that I uploaded to my Databricks workspace, directly to the hive_metastore. This created Delta Lake versions of the two tables. They both had the same schema:
id - bigint
name - string
address - string
zip - bigint
I configured Zingg with 2 input pipes, one for each table (although pointing to the underlying files, of course). I was able to get all the way to the trainMatch stage, which, interestingly, did not fail in the notebook. It actually showed up as a successful completion. But when I went to look at the data in the outputPipe, no file was created. I looked at the driver logs and found the same (or at least similar) stack trace:
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704
at org.apache.spark.sql.errors.QueryCompilationErrors$.ambiguousReferenceError(QueryCompilationErrors.scala:2627)
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:407)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:183)
I then renamed the id column in both tables, updated the relevant FieldDefinition(), and started from scratch, and everything worked as expected.
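For anyone wanting to try the same rename workaround, here is a minimal sketch of how the new names could be worked out. It is plain Python with no Spark dependency. The helper name build_rename_map, the reserved list, and the src_ prefix are all illustrative assumptions, not anything Zingg defines.

```python
def build_rename_map(columns, reserved=("id",), prefix="src_"):
    """Map each column whose lowercased name is reserved to a prefixed name
    that does not clash with any existing column (ignoring case)."""
    reserved_lower = {r.lower() for r in reserved}
    taken = {c.lower() for c in columns}
    mapping = {}
    for col in columns:
        if col.lower() in reserved_lower:
            new_name = prefix + col
            # Keep prefixing until the candidate is free and not itself reserved.
            while new_name.lower() in taken or new_name.lower() in reserved_lower:
                new_name = prefix + new_name
            taken.add(new_name.lower())
            mapping[col] = new_name
    return mapping


# Schema from the two tables described above.
schema = ["id", "name", "address", "zip"]
rename_map = build_rename_map(schema)
print(rename_map)  # {'id': 'src_id'}
```

In a notebook, each pair in the map could then be applied to every input DataFrame with df.withColumnRenamed(old, new) before the data is written out for the input pipes, and the new name used in the matching FieldDefinition().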
on using id - everything runs perfectly in both cases