OK, we relabeled and retrained both of our models and are beyond the Java errors. (I'd also add that 0.4.0 seems to be improved in the labeling step.) Now I'm not getting an error, but Zingg is also not outputting the score table. No indications in the notebook running Zingg, but looking at the cluster stderr I see:
org.apache.spark.sql.AnalysisException: [AMBIGUOUS_REFERENCE] Reference `id` is ambiguous, could be: [`id`, `id`]. SQLSTATE: 42704
We do in fact have a column ID in our input:
- ('ID', 'string', 'DONT_USE')
so is that a reserved column name and might Zingg be adding another ID in its processing? (Spark or Databricks wants to lowercase column names so there's no difference between ID and id.)
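One quick way to test that theory before renaming anything is to look for names that collide once case is ignored. This is a minimal sketch in plain Python. `find_case_insensitive_collisions` is a hypothetical helper, and the extra lowercase `id` in the example list is only an assumption standing in for whatever column the processing might add:

```python
from collections import defaultdict


def find_case_insensitive_collisions(columns):
    """Group column names that become identical once lowercased."""
    groups = defaultdict(list)
    for name in columns:
        groups[name.lower()].append(name)
    # Keep only the groups that contain more than one column name.
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical column list: an input schema containing ID, plus an `id`
# column that some processing step might add (an assumption, not confirmed).
columns = ["ID", "FIRST_NAME", "LAST_NAME", "id"]
print(find_case_insensitive_collisions(columns))
```

On a real DataFrame the list would come from `df.columns`; an empty result means no two columns differ only by case.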
So I renamed the ID column and changed the parameters on the fly, and now get an error similar to what I was getting in the past:
Py4JJavaError: An error occurred while calling o1058.execute.
: java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:277)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:154)
at zingg.spark.core.util.SparkGraphUtil.buildGraph(SparkGraphUtil.java:39)
at zingg.common.core.executor.Matcher.getOutput(Matcher.java:175)
at zingg.common.core.executor.Matcher.writeOutput(Matcher.java:151)
at zingg.common.core.executor.Matcher.execute(Matcher.java:131)
at zingg.common.client.Client.execute(Client.java:251)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
File <command-226341090012288>, line 59
57 # rc = zingg.initAndExecute()
58 rc = zingg.init()
---> 59 rc = zingg.execute()
Which sounds suspiciously like something going wrong while using GraphFrames to do connected component detection.
%pip install zingg==0.4.0 pyspark==3.1.2 py4j==0.10.9 pyyaml
Is that what is available on your cluster? The pyspark version you just sent points to an older version.
OK, I was following documentation or error messages. The current version, I think, is 3.5.1, so maybe it's restricting that.
%pip install zingg==0.4.0 pyspark==3.5 py4j pyyaml
%pip install zingg==0.4.0 pyspark==3.5 pyyaml
The conflict is caused by:
    zingg 0.4.0 depends on py4j==0.10.9
    pyspark 3.5.0 depends on py4j==0.10.9.7
If I specify no versions at all, pip installs PySpark 3.1.3, which is essentially what I did by hand above:
%pip install zingg pyspark pyyaml
yields:
Successfully installed py4j-0.10.9 pyspark-3.1.3 zingg-0.4.0
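To confirm what actually ended up in the notebook environment after an install like that, the versions can be read back from package metadata. This is a small sketch using only the standard library, and `installed_versions` is a hypothetical helper name. It reports the pip-installed packages, which may differ from the Spark runtime the cluster itself ships:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The three packages involved in the py4j conflict above.
print(installed_versions(["zingg", "pyspark", "py4j"]))
```

Running this in the same notebook right after the `%pip install` cell shows whether the py4j pin from zingg or the one from pyspark won.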