How did you spot this? Thank you so much! spark.setCheckpointDir("/tmp/checkpoint") gave me: AttributeError: 'SparkSession' object has no attribute 'setCheckpointDir', so I changed it to: spark.sparkContext.setCheckpointDir("/tmp/checkpoint"). It seems to work now. Thank you so much for your help, Sonal and Padam. Thank you, team!
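For anyone who lands here with the same AttributeError: `setCheckpointDir` is a method of the SparkContext, not of the SparkSession. A minimal sketch, assuming `spark` is the session a Databricks notebook already provides; the helper name `set_checkpoint_dir` is made up for illustration:

```python
def set_checkpoint_dir(spark, path="/tmp/checkpoint"):
    """Set the Spark checkpoint directory through the SparkContext.

    `spark` is assumed to be an existing SparkSession (for example the one
    a Databricks notebook provides). The helper name is hypothetical.
    """
    # spark.setCheckpointDir(path) raises AttributeError, because the
    # method is defined on SparkContext, not on SparkSession.
    spark.sparkContext.setCheckpointDir(path)
    return path
```

In a notebook this would be called once as `set_checkpoint_dir(spark)`, presumably before running the Zingg phases.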
I actually think that the zipped .gz files might help us here more, so if you meant those, I merged them and attached them here:
Great, thanks for the pointer! I think only '../executor/1/stderr' contains valuable info, plus possibly an allocation error in '../driver/stdout/' (the other ones are just two lines each, but I attached them all).
Right! I enabled log storage in the cluster and now I can see both the executor and the driver logs. In the driver path you can see image 1, and in the executor path there are two folders, '/0/' and '/1/'. Which logs should I access? And what should I look for? I can also attach them here if you are used to reading these logs and can quickly spot where it works fine and where there is trouble.
Good week Sonal, thank you! Inside the stages file I see three folders:
0_vecAssembler_db74d1b2b6dd
1_poly_85dcbb8f817d
2_logreg_15d7e89c81c1
Each of those folders has a metadata folder inside it containing non-empty files, as you can see in the first image below. Only the last one (the 2_logreg.... folder) has an extra folder apart from metadata, called data, which contains a big snappy.parquet file (second image). Finally, how can I access the logs for the training? Are they under a specific metadata folder? I also attach one more image (image 3) of trainingData/marked so you can see that it has more than 60 parquet files, in case it helps.
Thanks Sonal! After saving the labels I ran trainMatch:

    options = ClientOptions([ClientOptions.PHASE, "trainMatch"])
    # Zingg execution for the given phase
    zingg = ZinggWithSpark(args, options)
    zingg.initAndExecute()

I also find it odd that every time I run findTrainingData it takes 4-5 min, while trainMatch takes around 2 min. Isn't that odd?

Inside zinggdir/model/ I see 'block/' and 'classifier/'. Inside classifier I see best.model and then multiple folders that you can see in the screenshot, and further down each one splits into metadata and stages. The same goes for zinggdir/model/block, for which I attached one more screenshot.

What do you think is going on here? Could it be something else in my code, or something in Databricks, or something permissions-related (I have admin rights on the cluster, but I know Azure has many layers of permissions that I might not know about), or anything else that fails without producing a direct error? It just doesn't seem to create anything in the output location, and there is no error. Again, thank you; I can print anything you would want to see!
Thanks a lot for your response Sonal! I ran it around 12-13 times (I also noticed that after the first run, where I got 23 pairs, I always got 20 pairs in every run; is that how it is supposed to go?) and the message I got was:

    You have accumulated 108 pairs labeled as positive matches.
    You have accumulated 128 pairs labeled as not matches.
    If you need more pairs to label, re-run the cell for 'findTrainingData'

Not sure if you want these logs, but printing labels and matches looks like the image I've attached. I also gave you the types of my columns in the second image.

Training surprisingly happened very quickly, in a matter of seconds (compared to the findTrainingData step, which needed 4-5 minutes every time for the 100k dataset). After training I ran:

    %fs ls dbfs:/FileStore/tables/zingg/output05_09Oct_1_subset100k_cleanPhone

and got:

    FileNotFoundException: No such file or directory dbfs:/FileStore/tables/zingg/output05_09Oct_1_subset100k_cleanPhone

I don't think it is caused by storing the output in FileStore/tables, because I also tried storing it in tmp/zingg. Please let me know anything you want to see (code, or something for me to print out) to find what is causing this. Thank you very much for this!
Any ideas on this, team?
Thanks a lot for the Reply Padam!
This doesn't work, gives me the same result to my query.
& 3. You are right. I ran this and I got:

| path | name | size | modificationTime |
| --- | --- | --- | --- |
| dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/model/ | model/ | 0 | 1.76E+12 |
| dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/trainingData/ | trainingData/ | 0 | 1.76E+12 |

But then I cleared the state of the cluster, reran everything in a new notebook, refreshed the cluster, and I get: FileNotFoundException: No such file or directory dbfs:/FileStore/tables/zingg/output05_30Sep_2_subset100k_cleanPhone. Why does this happen? I can clearly see that Zingg learns after each run in which I label candidate pairs, but it doesn't seem to be able to save the output. Do you know what the issue might be? I can run any command and paste the result here if you like, because up until the training it seemed to work fine!
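To make the "is there actually any output?" check concrete, here is a small plain-Python sketch that filters a directory listing down to real data files. The `Entry` tuple and the `data_files` helper are made up for illustration; `Entry` only mimics the `path`, `name` and `size` fields that `dbutils.fs.ls` entries carry.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (path, name, size).
Entry = namedtuple("Entry", ["path", "name", "size"])

def data_files(entries):
    """Return the entries that look like real output data.

    Directories (names ending in '/'), marker files such as _SUCCESS
    (names starting with '_' or '.'), and zero-byte files are skipped.
    """
    return [
        e for e in entries
        if not e.name.endswith("/")
        and not e.name.startswith(("_", "."))
        and e.size > 0
    ]

# The listing quoted above, rewritten as Entry tuples.
listing = [
    Entry("dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/model/", "model/", 0),
    Entry("dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/trainingData/", "trainingData/", 0),
]

print(len(data_files(listing)))  # prints 0: only sub-folders, no data files
```

On Databricks the real listing would come from `dbutils.fs.ls(path)`, so the same check could be pointed at the configured output location once it exists.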
Hello there team, and thanks a ton for building Zingg! What an amazing tool :) I managed to install it through UC in Databricks and it works! It also seems to learn after each set of labels I input, which is a good sign. However, when I try to predict matching records at the end:

    outputDF = spark.read.csv("dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone")
    colNames = ["z_minScore", "z_maxScore", "z_cluster", 'CompanyID', 'CompanyName', 'Email', 'Phone', 'Address1', 'LastName', 'FirstName', 'Address2', 'Country']
    outputDF.toDF(*colNames).show(100)

I get:

[[UNABLE_TO_INFER_SCHEMA](https://learn.microsoft.com/azure/databricks/error-messages/error-classes#unable_to_infer_schema)] Unable to infer schema for CSV. It must be specified manually. SQLSTATE: 42KD9

    File <command-8616479842645297>, line 2
    ----> 2 outputDF = spark.read.csv("dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone")

So I pass the schema explicitly:

    schema_out = "z_minScore float, z_maxScore float, z_cluster int, CompanyName string, CompanyID string, FirstName string, LastName string, Email string, Phone string, Country string, Address1 string, Address2 string"
    outputDF = spark.read.csv("dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone", schema_out)

but then I get an empty DF:

    +----------+----------+---------+---------+-----------+-----+-----+--------+--------+---------+--------+-------+
    |z_minScore|z_maxScore|z_cluster|CompanyID|CompanyName|Email|Phone|Address1|LastName|FirstName|Address2|Country|
    +----------+----------+---------+---------+-----------+-----+-----+--------+--------+---------+--------+-------+
    +----------+----------+---------+---------+-----------+-----+-----+--------+--------+---------+--------+-------+

Please help me figure out what might have gone wrong here.
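A side note on the snippet above, separate from the empty result: `DataFrame.toDF(*cols)` renames columns by position, so if the order in `schema_out` differs from the order in `colNames`, values would end up under the wrong header once there is data. A small plain-Python sketch of that check follows; the helper `ddl_column_names` is made up for illustration and only handles simple `name type` lists.

```python
def ddl_column_names(ddl):
    """Extract column names, in order, from a simple 'name type, ...' string."""
    return [field.split()[0] for field in ddl.split(",") if field.strip()]

schema_out = ("z_minScore float, z_maxScore float, z_cluster int, "
              "CompanyName string, CompanyID string, FirstName string, "
              "LastName string, Email string, Phone string, Country string, "
              "Address1 string, Address2 string")
colNames = ["z_minScore", "z_maxScore", "z_cluster", "CompanyID", "CompanyName",
            "Email", "Phone", "Address1", "LastName", "FirstName", "Address2",
            "Country"]

schema_names = ddl_column_names(schema_out)

# Positions where the schema name and the name passed to toDF disagree.
mismatches = [
    (i, s, c)
    for i, (s, c) in enumerate(zip(schema_names, colNames))
    if s != c
]

print(sorted(schema_names) == sorted(colNames))  # True: same names overall
print(len(mismatches))                           # 9 positions are out of order
```

If the schema already names the columns, the `toDF(*colNames)` call could probably be dropped altogether, or `colNames` reordered to match `schema_out`.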