Hey,
We have 90M records as input to Zingg, and it's taking too much time (more than 11 hours). Could you suggest a better approach? Can we split the data, or is there anything else we can try?
Splitting the data defeats the purpose of resolution, since you want to find matches across the entire dataset. You should look at doing more labelling and training, as well as increasing the cluster size. For datasets this large, Spark tuning will also help.
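
To give a rough feel for why splitting hurts, here is a small back-of-the-envelope sketch in plain Python. It is not Zingg code, and the function name `within_chunk_pair_fraction` is made up for illustration. It counts how many of all possible record pairs still land in the same chunk after an even split. It assumes records are assigned to chunks without regard to who matches whom, and it ignores blocking entirely, so treat the number as an intuition rather than a measurement.

```python
from math import comb


def within_chunk_pair_fraction(n_records: int, n_chunks: int) -> float:
    """Fraction of all record pairs that fall inside the same chunk
    when n_records are split as evenly as possible into n_chunks.

    Hypothetical helper for illustration only; not part of Zingg.
    """
    # 'extra' chunks receive one more record than the others.
    base, extra = divmod(n_records, n_chunks)
    within = extra * comb(base + 1, 2) + (n_chunks - extra) * comb(base, 2)
    total = comb(n_records, 2)
    return within / total


if __name__ == "__main__":
    # 90M records split into 10 chunks (the chunk count is an assumption).
    fraction = within_chunk_pair_fraction(90_000_000, 10)
    print(f"Pairs still comparable after splitting: {fraction:.2%}")
```

With 10 equal chunks only about a tenth of the pairs stay within a chunk, so any match whose two records end up in different chunks can never be found, no matter how well the model is trained.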