As an experiment, I am trying with just 1k records, and my label data sample size (labelDataSampleSize) is 0.0001. This is the log from the run:
2024-06-27 09:04:53,122 [main] INFO zingg.client.Client -
2024-06-27 09:04:53,122 [main] INFO zingg.client.Client - **************************************************************************
2024-06-27 09:04:53,122 [main] INFO zingg.client.Client - * ** Note about analytics collection by Zingg AI ** *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - * *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - * Please note that Zingg captures a few metrics about application's *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - * runtime parameters. However, no user's personal data or application *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - * data is captured. If you want to switch off this feature, please *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - * set the flag collectMetrics to false in config. *
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client - **************************************************************************
2024-06-27 09:04:53,123 [main] INFO zingg.client.Client -
2024-06-27 09:04:54,033 [main] WARN org.apache.spark.util.Utils - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2024-06-27 09:04:54,034 [main] WARN org.apache.spark.util.Utils - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
2024-06-27 09:04:54,035 [main] WARN org.apache.spark.util.Utils - Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
2024-06-27 09:04:54,035 [main] WARN org.apache.spark.util.Utils - Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
2024-06-27 09:04:55,649 [main] WARN org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
2024-06-27 09:04:55,649 [main] INFO zingg.ZinggBase - Start reading internal configurations and functions
2024-06-27 09:04:55,664 [main] INFO zingg.ZinggBase - Finished reading internal configurations and functions
2024-06-27 09:04:55,684 [main] WARN zingg.util.PipeUtil - Reading input jdbc
2024-06-27 09:04:55,685 [main] WARN zingg.util.PipeUtil - Reading Pipe [name=test, format=jdbc, preprocessors=null, props={url=jdbc:postgresql://localhost:5432/tmp, dbtable=tbl_src, driver=org.postgresql.Driver, user=postgres, password=12345678}, schema=null]
2024-06-27 09:05:00,843 [main] WARN zingg.TrainingDataFinder - Read input data 1001
2024-06-27 09:05:00,846 [main] WARN zingg.util.PipeUtil - Reading input parquet
2024-06-27 09:05:00,847 [main] WARN zingg.util.PipeUtil - Reading Pipe [name=null, format=parquet, preprocessors=null, props={location=models/10/trainingData//marked/}, schema=null]
2024-06-27 09:05:00,877 [main] WARN zingg.util.PipeUtil - Path does not exist: file:/zingg/models/10/trainingData/marked
2024-06-27 09:05:00,878 [main] WARN zingg.util.DSUtil - No preexisting marked training samples
2024-06-27 09:05:00,878 [main] WARN zingg.util.DSUtil - No configured training samples
2024-06-27 09:05:00,878 [main] WARN zingg.util.DSUtil - No training data found
2024-06-27 09:05:00,986 [main] INFO zingg.TrainingDataFinder - Created positive sample pairs
2024-06-27 09:05:01,116 [main] INFO zingg.TrainingDataFinder - Preprocessing DS for stopWords
2024-06-27 09:05:01,424 [main] INFO zingg.util.Heuristics - **Block size **8 and total count was 0
2024-06-27 09:05:01,424 [main] INFO zingg.util.Heuristics - Heuristics suggest 8
2024-06-27 09:05:01,424 [main] INFO zingg.util.BlockingTreeUtil - Learning indexing rules for block size 8
2024-06-27 09:05:01,948 [main] WARN zingg.block.Block - Ran out of training at size 0 for node Canopy [function=null, context=null, elimCount=0, hash=null, training=0]
2024-06-27 09:05:02,138 [main] INFO zingg.TrainingDataFinder - Writing uncertain pairs when either positive or negative samples not provided
java.lang.IllegalArgumentException: requirement failed: Sampling fraction (Infinity) must be on interval [0, 1] without replacement
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.plans.logical.Sample.<init>(basicLogicalOperators.scala:1001)
at org.apache.spark.sql.Dataset.sample(Dataset.scala:2200)
at org.apache.spark.sql.Dataset.sample(Dataset.scala:2217)
at zingg.TrainingDataFinder.execute(TrainingDataFinder.java:128)
at zingg.client.Client.execute(Client.java:241)
at zingg.client.Client.main(Client.java:182)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-27 09:05:02,605 [main] INFO zingg.client.util.Email - Email message sent.
2024-06-27 09:05:02,605 [main] WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. requirement failed: Sampling fraction (Infinity) must be on interval [0, 1] without replacement.
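
For context, here is a rough check of the numbers in the log above. It is only a sketch, assuming the label data sample size is applied as a fraction of the input rows. The variable names and the `min_fraction` helper are mine, not part of Zingg. If that assumption holds, the sample would contain well under one row on average, which would be consistent with the "total count was 0" line. I have not confirmed whether this is what leads to the Infinity sampling fraction.

```python
# Values taken from the log and my config.
input_rows = 1001                # "Read input data 1001"
label_sample_fraction = 0.0001   # my label data sample size

# Assumption: the sample size is used as a fraction of the input rows.
expected_sample_rows = input_rows * label_sample_fraction
print(f"expected rows in sample: {expected_sample_rows:.4f}")


def min_fraction(total_rows: int, min_rows: int) -> float:
    """Hypothetical helper (not Zingg code): the smallest fraction that
    yields at least `min_rows` rows on average, capped at 1.0."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, min_rows / total_rows)


# Fraction needed to expect at least 10 sampled rows from this input.
print(f"fraction for ~10 rows: {min_fraction(input_rows, 10):.4f}")
```

With only about 1k records, a much larger fraction would be needed before the sample is expected to contain any rows at all.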