Hi, I am trying to run a Spark job using the Zingg properties file, like below:
spark-submit
--properties-file zingg.conf
--
..I checked the configuration settings in https://github.com/zinggAI/zingg/blob/main/config/zingg.conf. When this file is passed, not all of its properties are picked up, for example spark.serializer and spark.kryoserializer.buffer.max, although some other properties do show up in the log. This is where I am confused: why is Zingg picking up only a few of the properties from the zingg.conf file? Can anyone give any leads on this?
Looks like you might want to try the --conf option
Okay, doesn't look like that is the case
Yes, --conf is specific to the config.json.env file that we are passing, which contains the field definitions. --properties-file is for the zingg.conf file, as mentioned in this blog post: https://blog.infostrux.com/identity-resolution-with-zingg-ai-snowflake-and-aws-emr-for-the-canadian-football-league-22cf0850ab53
Yup, my bad.
Btw Rajesh P., the blog you're referencing uses the zingg.sh executable, not spark-submit directly. You might want to recheck that.
Are you running it on AWS EMR?
Yes, I am running on AWS EMR. Regarding the note above, I believe arguments like --conf and --properties-file would still hold when running the Zingg job, right?
Yeah, should be fine. The zingg executable just passes the --properties-file argument through as is, but I don't think it really has a --conf option.
See if there are EMR-specific ways of passing Spark configuration values?
Yeah Sonal G., EMR also takes properties the same way the spark-submit command does. I checked here: https://repost.aws/knowledge-center/emr-set-spark-parameters
Sorry, I don't see the --properties-file option in the link you shared.
Sonal G., I have tried passing the zingg.conf properties file with the --conf option as well, and the job does not run. But when I pass the zingg.conf file with --properties-file, it picks up a few Spark properties (like spark.executor.memory and spark.executor.cores) but not all of the Spark properties in the zingg.conf file (like spark.serializer). You can see this in the screenshot above.
From what I understand, the --conf option is only for a limited number of properties, not a whole file. The link you shared has clear instructions on setting the default properties. Have you tried that?
yes...
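For reference, a minimal sketch of the per-property route discussed above: since spark-submit's --conf takes one key=value pair at a time rather than a file, a small script can turn a properties file into repeated --conf arguments. The parser is a simplification (it ignores escapes and line continuations), the function names are hypothetical, and the sample values are placeholders, not the actual contents of zingg.conf.

```python
import re
import shlex

# key, then an optional '=' or ':' (or just whitespace), then the value
_LINE = re.compile(r"^([^\s=:]+)[ \t]*[=:]?[ \t]*(.*)$")


def parse_spark_properties(text):
    """Parse spark-defaults style text into a dict, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            props[match.group(1)] = match.group(2).strip()
    return props


def to_conf_args(props):
    """Build repeated '--conf key=value' arguments, keeping only spark.* keys."""
    args = []
    for key, value in props.items():
        if key.startswith("spark."):
            args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical excerpt; the values are placeholders for illustration only.
SAMPLE = """\
# serializer settings
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
spark.executor.memory 8g
"""

conf_args = to_conf_args(parse_spark_properties(SAMPLE))
# Print a shell-quoted command prefix; the job jar and its arguments would follow.
print(shlex.join(["spark-submit", *conf_args]))
```

Whether passing the properties this way changes which ones take effect on EMR is not confirmed in this thread; it only gives a second way to supply the same values for comparison.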
There is nothing in the Zingg code that selectively loads Spark configuration. I would recommend checking the EMR docs and support to see what's going wrong.
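When raising this with EMR support, it may help to list exactly which properties differ. The sketch below compares the properties expected from zingg.conf with the ones the job actually reports. It assumes the effective (key, value) pairs have been collected separately, for example from the Environment tab of the Spark UI or from spark.sparkContext.getConf().getAll() in a PySpark session; the function name and all values here are hypothetical placeholders.

```python
def missing_or_changed(expected, effective_pairs):
    """Return {key: (expected_value, effective_value)} for every mismatch.

    effective_value is None when the key is absent from the running job.
    """
    effective = dict(effective_pairs)
    report = {}
    for key, want in expected.items():
        got = effective.get(key)
        if got != want:
            report[key] = (want, got)
    return report


# Placeholder values: what the properties file asks for.
expected = {
    "spark.executor.memory": "8g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "512m",
}

# Placeholder values: what the running job reports.
effective_pairs = [
    ("spark.executor.memory", "8g"),
    ("spark.executor.cores", "4"),
]

for key, (want, got) in missing_or_changed(expected, effective_pairs).items():
    print(f"{key}: expected {want!r}, effective {got!r}")
```

A key reported with an effective value of None never reached the job, while a key with a different value was overridden somewhere along the way; the two cases usually point to different causes.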