Hi. I have a client looking to use Zingg. They have two datasets initially (with two more coming online soon), containing ~450 million and ~60 million records, respectively. They wish to model on 5-15 fields, though the exact number will depend on their join condition. They're looking for a rough estimate of the run time, and therefore the cost, of running the model. Can anyone make a suggestion? The current hardware sizing documentation states:
80m records with 8-10 fields took less than 2 hours on 1 driver (128 GB RAM, 32 cores), 8 workers (224 GB RAM, 64 cores). This is a user-reported stat without any optimization.
So we can (roughly) extrapolate from that. My guess is ~2-5 hours, but honestly I cannot be certain. Any guidance here would be greatly appreciated.
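
For reference, here is a minimal sketch of the naive extrapolation from the documented figure. It assumes the same cluster as in the quoted stat, treats "less than 2 hours" as exactly 2 hours (so the result is an upper bound under these assumptions), and assumes run time scales with record count raised to some exponent. Linear scaling (exponent 1.0) is purely my assumption, not something the Zingg documentation states; the function names and the hourly rate parameter are hypothetical.

```python
# Reference point from the hardware sizing documentation:
# 80m records took less than 2 hours (treated here as exactly 2 hours).
REF_RECORDS = 80_000_000
REF_HOURS = 2.0


def extrapolate_hours(target_records, exponent=1.0):
    """Scale the reference run time by (target / reference) ** exponent.

    exponent=1.0 means linear scaling, which is an assumption;
    real scaling depends on blocking, field count, and data skew.
    """
    return REF_HOURS * (target_records / REF_RECORDS) ** exponent


def estimate_cost(hours, hourly_cluster_rate):
    """Cost = hours * cluster rate (rate is whatever the provider charges)."""
    return hours * hourly_cluster_rate


total_records = 450_000_000 + 60_000_000  # both initial datasets combined
hours = extrapolate_hours(total_records)
print(f"{hours:.2f} hours")  # prints "12.75 hours" under linear scaling
```

With these assumptions the same-hardware estimate comes out well above 2-5 hours, so that guess would only hold with a larger cluster or better-than-linear scaling. I'd be glad to hear which of those is realistic.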