Hi Zingg Team, I am trying to run the match phase for a 1.2M-record dataset on my local machine using Docker, but the job is stuck after this log line:
26/06/22 04:44:33 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
I tried increasing the executor memory to 16G while keeping the driver memory at 4G, but it did not help much. The other phases, label and train, completed. Any help would be appreciated.
Logs when I run the trainMatch phase:
26/06/22 04:56:12 INFO Client:
26/06/22 04:56:12 INFO Trainer: Reading inputs for training phase ...
26/06/22 04:56:12 INFO Trainer: Initializing learning similarity rules
26/06/22 04:56:12 WARN PipeUtilReader: Reading Pipe [name=null, format=parquet, preprocessors=null, props={path=/tmp/zingg_dir/app_models/100/trainingData//marked/}]
26/06/22 04:56:14 WARN DSUtil: Read marked training samples
26/06/22 04:56:14 WARN DSUtil: No configured training samples
26/06/22 04:56:15 WARN BlockManager: Block rdd_12_0 already exists on this machine; not re-adding it
26/06/22 04:56:15 WARN Trainer: Training on positive pairs - 23
26/06/22 04:56:15 WARN Trainer: Training on negative pairs - 58
26/06/22 04:56:15 WARN PipeUtilReader: Reading Pipe [name=app, format=csv, preprocessors=null, props={path=/tmp/zingg_dir/nobids_apps.csv, header=false, delimiter=,}]
26/06/22 04:56:19 INFO Heuristics: Block size 100 and total count was 279411
26/06/22 04:56:19 INFO Heuristics: Heuristics suggest 100
26/06/22 04:56:19 INFO BlockingTreeUtil: Learning indexing rules for block size 100
26/06/22 04:56:20 WARN PipeUtilWriter: Writing output Pipe [name=null, format=parquet, preprocessors=null, props={path=/tmp/zingg_dir/app_models/100/model/block/zingg.block}]
26/06/22 04:56:20 WARN TaskSetManager: Stage 26 contains a task of very large size (1193 KiB). The maximum recommended task size is 1000 KiB.
26/06/22 04:56:20 INFO Trainer: Learnt indexing rules and saved output at /tmp/zingg_dir/app_models
26/06/22 04:56:20 INFO ModelUtil: Learning similarity rules
26/06/22 04:56:20 INFO ModelUtil: Start reading internal configurations and functions
26/06/22 04:56:20 INFO ModelUtil: Finished reading internal configurations and functions
26/06/22 04:56:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
26/06/22 04:56:43 INFO Trainer: Learnt similarity rules and saved output at /tmp/zingg_dir/app_models
26/06/22 04:56:43 INFO Trainer: Finished Learning phase
26/06/22 04:56:43 WARN PipeUtilReader: Reading Pipe [name=app, format=csv, preprocessors=null, props={path=/tmp/zingg_dir/nobids_apps.csv, header=false, delimiter=,}]
26/06/22 04:56:47 INFO Matcher: Read 1116587
26/06/22 04:56:47 WARN Blocker: Blocking model location is Pipe [name=null, format=parquet, preprocessors=null, props={path=/tmp/zingg_dir/app_models/100/model/block/zingg.block}]
26/06/22 04:56:47 WARN PipeUtilReader: Reading Pipe [name=null, format=parquet, preprocessors=null, props={path=/tmp/zingg_dir/app_models/100/model/block/zingg.block}]
26/06/22 04:56:47 INFO Matcher: Blocked
26/06/22 04:56:48 INFO SparkModel: threshold while predicting is 0.5
26/06/22 04:56:48 WARN CacheManager: Asked to cache already cached data.
26/06/22 04:56:48 WARN DAGScheduler: Broadcasting large task binary with size 1229.9 KiB
More matching samples help build a better blocking model.
Most Zingg jobs are compute-intensive, so more cores always help.
Is there any way to get more logs to understand what's happening?
Are all your cores occupied? Do you have 8 cores or more?
You could definitely try the debug logs for the Matcher class and see how many comparisons it is doing.
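To get a feel for why the number of comparisons matters here, the following is a rough back-of-the-envelope sketch, not Zingg's actual logic. It assumes records are only compared within their own block, takes the record count (1116587) and block size (100) from the logs above, and invents two hypothetical block layouts: one perfectly even, and one where 1% of the records fall into a single oversized block. The function name `comparisons_in_blocks` and both layouts are illustrative assumptions.

```python
from math import comb


def comparisons_in_blocks(block_sizes):
    """Total pairwise comparisons if records are compared only within their block."""
    return sum(comb(n, 2) for n in block_sizes)


def split_evenly(record_count, block_size):
    """Hypothetical layout: full blocks of block_size plus one smaller leftover block."""
    full_blocks, remainder = divmod(record_count, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


total_records = 1_116_587  # from "Matcher: Read 1116587" in the log
target_block_size = 100    # from "Heuristics suggest 100" in the log

# Assumed best case: every block is at the target size.
even_blocks = split_evenly(total_records, target_block_size)

# Assumed skewed case: 1% of the records end up in one oversized block.
big_block = total_records // 100
skewed_blocks = [big_block] + split_evenly(total_records - big_block, target_block_size)

even_total = comparisons_in_blocks(even_blocks)
skewed_total = comparisons_in_blocks(skewed_blocks)
unblocked_total = comb(total_records, 2)  # every record against every other record

print(f"even blocks:   {even_total:,} comparisons")
print(f"skewed blocks: {skewed_total:,} comparisons")
print(f"no blocking:   {unblocked_total:,} comparisons")
```

Under these assumptions a single oversized block can account for more comparisons than all the well-sized blocks combined, which is one plausible reason a match job appears stuck while the cores are busy. The real block sizes can only come from the Matcher debug logs or from inspecting the data.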