Troubleshooting Schema Inference and Empty DataFrame Issue When Reading CSV in Databricks with Zingg Output | Zingg Community


Padam P.
·
Hi, here are a few things you can try:
1.
outputDF = spark.read.csv('/dbfs/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone')
colNames = ["z_minScore", "z_maxScore", "z_cluster", "CompanyName", "CompanyID", "FirstName", "LastName", "Email", "Phone", "Country", "Address1", "Address2"]
outputDF.toDF(*colNames).show(100)
2.
Check if the file is empty:
%fs ls dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone
3.
Check a sample of the file content:
%fs head dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone
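Steps 2 and 3 above can also be done from a Python cell. The following is a minimal sketch, assuming the cluster exposes DBFS through the local `/dbfs` mount: Spark readers and `%fs` address DBFS as `dbfs:/...`, while local file APIs such as `os` see the same storage under `/dbfs/...`, and that mount may not be available on every cluster mode. The helper names `data_files` and `preview` are made up for illustration and are not part of Zingg or Databricks.

```python
import os
from itertools import islice


def data_files(output_dir, suffixes=(".csv",)):
    """Return (path, size) for every non-empty data file under output_dir.

    Spark bookkeeping files such as _SUCCESS and hidden .crc files are skipped,
    as is anything whose name does not end with one of the given suffixes.
    """
    found = []
    for root, _dirs, names in os.walk(output_dir):
        for name in sorted(names):
            if name.startswith(("_", ".")):
                continue  # Spark marker or checksum file, not data
            if not name.endswith(tuple(suffixes)):
                continue
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if size > 0:
                found.append((path, size))
    return found


def preview(path, n=5):
    """Return the first n lines of a text file, similar in spirit to `%fs head`."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]


# Hypothetical usage on Databricks (path taken from the message above):
# files = data_files("/dbfs/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone")
# print(files[:3])
# print(preview(files[0][0]) if files else "no non-empty CSV files found")
```

An empty list here means the directory holds no non-empty CSV part files, which would explain an empty DataFrame regardless of how the schema is inferred.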
George P.
·
Thanks a lot for the reply, Padam!
1.
This doesn't work; it gives me the same result as my original query.
2. & 3.
You are right. I ran these and got:
path name size modificationTime
dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/model/ model/ 0 1.76E+12
dbfs:/FileStore/zingg/zingg05Trial29Sep_2_cleanPhone/trainingData/ trainingData/ 0 1.76E+12
But then I cleared the state of the cluster, reran in a new notebook, refreshed the cluster, and now I get:
FileNotFoundException: No such file or directory dbfs:/FileStore/tables/zingg/output05_30Sep_2_subset100k_cleanPhone
Why does this happen? I can clearly see that Zingg learns after each run in which I provide candidate pairs, but it doesn't seem to be able to save them. Do you know what the issue might be? I can run any command and paste the result here if you like, because up until the training it seemed to work fine!
George P.
·
Any ideas on this, team?
Sonal G.
·
George P. can you please share the logs of label and match?
Sonal G.
·
also, how many times did you run the findTrainingData and label runs?
George P.
·
Thanks a lot for your response, Sonal! I ran it around 12-13 times. (I also noticed that the first run gave me 23 pairs and every run after that gave me 20 pairs. Is that how it is supposed to go?) The message I got was:
You have accumulated 108 pairs labeled as positive matches.
You have accumulated 128 pairs labeled as not matches.
If you need more pairs to label, re-run the cell for 'findTrainingData'
I'm not sure if these are the logs you want, but printing the labels and matches looks like the image I've attached. The second image shows the types of my columns. Training, surprisingly, finished very quickly, in a matter of seconds (compared to the findTrainingData step, which needed 4-5 minutes every time for the 100k dataset). After training I ran:
%fs ls dbfs:/FileStore/tables/zingg/output05_09Oct_1_subset100k_cleanPhone
and got:
FileNotFoundException: No such file or directory dbfs:/FileStore/tables/zingg/output05_09Oct_1_subset100k_cleanPhone
I don't think this is caused by storing the output in FileStore/tables; I also tried storing it in tmp/zingg. Please let me know if there is anything you want to see (code, or something for me to print out) to find what is causing this. Thank you very much for this!
Sonal G.
·
Hmm, the marked records seem to be OK. The first round is around 20, but we don't cap it and let it select a few extra. Most other rounds are 20, so nothing abnormal there. After saving the labels, did you run trainMatch or train?
Sonal G.
·
also, if you look under <zinggdir>/<modelId> what do you see?
George P.
·
Thanks, Sonal! After saving the labels I ran trainMatch:
options = ClientOptions([ClientOptions.PHASE,"trainMatch"])
# Zingg execution for the given phase
zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()
I also find it odd that every time I run findTrainingData it takes 4-5 minutes, while trainMatch takes around 2 minutes. Isn't that odd?
Inside zinggdir/model/ I see 'block/' and 'classifier/'. Inside classifier I see best.model and then multiple folders that you can see in the screenshot. Further down, each one splits into metadata and stages. The same goes for zinggdir/model/block, for which I attached one more screenshot.
What do you think is going on here? Could something else in my code, within Databricks, or related to permissions be causing this without producing a direct error? (I have admin rights on the cluster, but I know Azure has many layers of permissions that I might not know about.) It just doesn't seem to create anything in the output location, and there is no error. Again, thank you, and I can print anything you would want to see!
Sonal G.
·
all the folders under the best.model have 0 size?
Sonal G.
·
under best.model/bestModel/stages, do you see non empty files?
Sonal G.
·
one possible explanation could be that the trainMatch job is erroring out, but the exception is not being thrown and it appears to have run. looking at the logs will help here
George P.
·
Good week, Sonal, and thank you! Inside the stages folder I see three folders:
1.
0_vecAssembler_db74d1b2b6dd
2.
1_poly_85dcbb8f817d
3.
2_logreg_15d7e89c81c1
Each of those folders has a metadata folder inside it containing files of non-zero size, as you can see in the first image below. Only the last one (the 2_logreg... folder) has an extra folder besides metadata, called data, which contains a large snappy.parquet file (second image). Finally, how can I access the logs for the training? Are they under a specific metadata folder? I have also attached one more image (image 3) from inside trainingData/marked, so you can see that it has more than 60 parquet files, in case that helps.
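The manual check described above (is there anything of non-zero size under each stage folder?) can be scripted. This is a rough sketch, again assuming the model directory is reachable through the local `/dbfs` mount; `size_by_subfolder` is a hypothetical helper name, not a Zingg or Databricks function.

```python
import os


def size_by_subfolder(parent_dir):
    """Map each immediate subfolder of parent_dir to (file_count, total_bytes)."""
    summary = {}
    for entry in sorted(os.listdir(parent_dir)):
        top = os.path.join(parent_dir, entry)
        if not os.path.isdir(top):
            continue  # only summarise folders, e.g. one per pipeline stage
        count, total = 0, 0
        for root, _dirs, names in os.walk(top):
            for name in names:
                count += 1
                total += os.path.getsize(os.path.join(root, name))
        summary[entry] = (count, total)
    return summary


# Hypothetical usage; <zinggdir> and <modelId> are placeholders to fill in:
# stages = "/dbfs/<zinggdir>/<modelId>/model/classifier/best.model/bestModel/stages"
# for stage, (count, total) in size_by_subfolder(stages).items():
#     print(stage, count, "files,", total, "bytes")
```

A stage reporting zero files or zero bytes would point at the training step; non-zero sizes for every stage suggest the model was written and the problem lies later.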
Sonal G.
·
Thanks, George P. The model is surely getting created, so I suspect we have to dig into match now. trainMatch is made of two phases, train and match, and they can be run individually as well. The logs would be the driver and executor logs of your cluster.
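Running the two phases one at a time could look like the sketch below. It reuses only the calls already shown in this thread (`ClientOptions`, `ZinggWithSpark`, `initAndExecute`) and the phase names `train` and `match`; the `run_phases` helper and its return format are made up for illustration. It can only surface errors that actually reach Python, so the driver logs remain the place to confirm what happened.

```python
import time


def run_phases(phases, run_phase):
    """Run each phase in order and stop at the first one that raises.

    run_phase is any callable taking a phase name. Returns a list of
    (phase, seconds, error) tuples, where error is None on success.
    """
    results = []
    for phase in phases:
        start = time.perf_counter()
        try:
            run_phase(phase)
            error = None
        except Exception as exc:  # keep the failure visible instead of losing it
            error = exc
        results.append((phase, time.perf_counter() - start, error))
        if error is not None:
            break  # later phases depend on earlier ones
    return results


# On the cluster, run_phase could wrap the calls used earlier in the thread
# (args is assumed to be the Arguments object already built in the notebook):
#
# def run_phase(phase):
#     options = ClientOptions([ClientOptions.PHASE, phase])
#     ZinggWithSpark(args, options).initAndExecute()
#
# for phase, seconds, error in run_phases(["train", "match"], run_phase):
#     print(phase, round(seconds, 1), "s", "FAILED: %r" % error if error else "ok")
```

Timing each phase separately also shows whether match is finishing suspiciously fast, which would be consistent with it stopping before any output is written.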
George P.
·
Right! I enabled log storage on the cluster and now I can see inside both the executor and the driver logs. The driver path is shown in image 1, and the executor path has two folders, '/0/' and '/1/'. Which logs should I access, and what should I look for? I can also attach them here if you are used to reading these logs and can quickly spot where it works fine and where there is trouble.
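One way to narrow down a large set of log files before sharing them is to scan for error markers. The sketch below is only a starting point under stated assumptions: the search pattern (`ERROR`, `Exception`, `Caused by:`) is a guess at what a failed match phase might leave behind, the log directory is assumed to be readable through the local file system, and `find_errors` is a hypothetical helper. The driver folder is a reasonable first place to look, since the notebook code that launches the phase runs there.

```python
import gzip
import os
import re

# Assumed markers of a failure; adjust after seeing what the logs contain.
PATTERN = re.compile(r"\bERROR\b|Exception|Caused by:")


def find_errors(log_dir, pattern=PATTERN, limit=50):
    """Return up to `limit` (file, line_number, text) hits under log_dir.

    Plain-text files and .gz archives are both read as text.
    """
    hits = []
    for root, _dirs, names in os.walk(log_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            opener = gzip.open if name.endswith(".gz") else open
            with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
                for number, line in enumerate(fh, start=1):
                    if pattern.search(line):
                        hits.append((path, number, line.rstrip()))
                        if len(hits) >= limit:
                            return hits
    return hits


# Hypothetical usage; the path is a placeholder for the cluster's log location:
# for path, number, text in find_errors("/dbfs/<cluster-log-path>/driver"):
#     print(f"{path}:{number}: {text}")
```

The lines around the first hit, together with the timestamp of the trainMatch run, are usually the most useful part to paste into the thread.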