Help with Stopword Optimization on Zingg Databricks Without Using Shell Commands

Mrudula K. · 2024-07-16T19:26:38.901Z

Hi! I'm trying to do some stopword optimization with Zingg on Databricks (without sh commands). I'm unable to find documentation on the modules and libraries I can use. Kindly help me! Thank you!

Zingg Community

Sonal G.
·
The Zingg stopWords functionality is documented at docs.zingg.ai. Does that not help?
Mrudula K.
·
The documentation talks about running the shell script to generate stopWords. I wanted to know if there are libraries/modules that can do the job directly, as I want to do it on Databricks. The documentation is definitely helpful :)
Sonal G.
·
Ah, got it! You can run the phase in the Python notebook as well.
Sonal G.
·
https://github.com/zinggAI/zingg/blob/main/examples/databricks/FebrlExample.ipynb
Sonal G.
·
The notebook above has the stop word generation part at the end.
Mrudula K.
·
Got it. Thank you so much!
Mrudula K.
·
Hi Sonal G., I tried to implement the stopwords, but I'm facing an issue with the recommender. It creates folders for the columns I want to generate stopwords for, but the folders are empty. I tried running the code on the sample data in your GitHub repo as well, and I see the same issue. Could you please help me out with this? Thank you!
Sonal G.
·
What’s the cutoff value you have set?
Mrudula K.
·
I tried with 0.2, 0.5 and 0.9. None of them returns anything.
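For readers following along, here is a rough sketch of how a cutoff between 0 and 1 could interact with data size. It assumes the cutoff is the fraction of unique words to keep, ranked by frequency, and that the count is rounded down. Both are assumptions made for illustration; `recommend_stopwords` is a hypothetical helper and not Zingg's implementation.

```python
from collections import Counter
import math


def recommend_stopwords(values, cutoff):
    """Return the top `cutoff` fraction of unique tokens, ranked by frequency.

    Illustration only: the ranking and rounding rules here are assumptions,
    not Zingg's actual logic.
    """
    counts = Counter(
        token
        for value in values
        if value  # skip None and empty strings
        for token in value.lower().split()
    )
    # Number of unique tokens to keep; rounding down means a small
    # vocabulary combined with a low cutoff can yield nothing.
    keep = math.floor(len(counts) * cutoff)
    return [token for token, _ in counts.most_common(keep)]


names = ["john smith", "john doe", "mary smith", "john roe"]
print(recommend_stopwords(names, 0.5))  # ['john', 'smith']
print(recommend_stopwords(names, 0.1))  # []
```

In this sketch, five unique tokens with a cutoff of 0.1 round down to zero recommendations, while 0.5 keeps the two most frequent tokens. Whether Zingg rounds the same way is not confirmed in this thread.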
Sonal G.
·
The febrl example with only 65 records is way too small, but the notebook I shared earlier does return values for the febrl sample of 120k records. Can you check the logs and let us know if you see something wrong?
Mrudula K.
·
Sure. I'll get back to you with the logs.
Mrudula K.
·
I don't see anything wrong in the logs. And it is not showing me anything for the 120k records either.
Sonal G.
·
Vikas G.
Vikas G.
·
Mrudula K., could you please share the code you used to generate the stop words? I will try to run it locally and see what the problem is.
Mrudula K.
·
Sure. Thank you.

zinggDir = "/models"
modelId = "databricksdemotrial_120k"
input_file = "/febrl120k/test.csv"

try:
    # Stopwords recommendation phase
    options = ClientOptions([ClientOptions.PHASE, "recommend", "--column", "firstName"])
    args.setStopWordsCutoff(0.5)
    zingg = ZinggWithSpark(args, options)

    # Log the options generated
    # LOG.debug(f"Zingg options generated for stopwords recommendation: {vars(options)}")
    print(options)
    options_dict = vars(options)
    formatted_options = {key: str(value.getOptionValue) for key, value in options_dict.items()}
    LOG.debug(f"Zingg options generated for stopwords recommendation: {formatted_options}")

    zingg.initAndExecute()

    # Log the stopwords recommendations
    stopwordsForfname = spark.read.csv(zinggDir+"/"+modelId+"/stopWords/firstName")
    stopwordsForfname_list = stopwordsForfname.collect()
    LOG.info(f"Recommended stopwords for 'firstName': {stopwordsForfname_list}")
except Exception as e:
    # Log any errors that occur during the stopwords recommendation phase
    LOG.error("Error occurred during stopwords recommendation:", exc_info=True)
    LOG.error(f"Error details: {e}")

I'm running this on Databricks (just for context).
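Since the code above reads its results from the stopWords/firstName folder under the model directory, a quick way to confirm whether that folder really holds no data is to list its non-empty data files. The sketch below is a plain-Python check, not part of Zingg; `list_data_files` is a hypothetical helper, and it assumes the output folder is reachable through the local file system (on Databricks this is often the DBFS path prefixed with /dbfs, depending on the workspace setup).

```python
from pathlib import Path


def list_data_files(stopwords_dir):
    """Return sorted (file name, size in bytes) pairs for non-empty data files.

    Marker files such as _SUCCESS and hidden checksum files (names starting
    with "_" or ".") are skipped, since they carry no stopword rows.
    """
    root = Path(stopwords_dir)
    if not root.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size)
        for path in root.iterdir()
        if path.is_file()
        and not path.name.startswith(("_", "."))
        and path.stat().st_size > 0
    )
```

A call such as list_data_files("/dbfs/models/databricksdemotrial_120k/stopWords/firstName") would then return an empty list if the folder is missing or holds only marker files; the /dbfs prefix in that path is an assumption about how the workspace mounts DBFS.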
