Using Zingg Pre-Trained Model 101 for Entity Matching on PostgreSQL Tables with Different Field Names
Hello everyone,

I am discovering Zingg, which could potentially solve an entity matching task for my company. I need to match two tables in PostgreSQL. One has 300K rows; the other has far fewer, so I think I am supposed to use the link phase. I am using the Docker image of Zingg version 0.5.0. I would like to use the pre-trained model 101 (or better) to see if the results are good enough. I am currently blocked because the pre-trained model doesn't have the same field names as my data set. Thank you very much in advance.

This is the command I use to launch the application:

docker run --rm --network host \
  -v $(pwd)/configLink.json:/config.json \
  -v $(pwd)/zingg.properties:/zingg.properties \
  -v /tmp/zingg/jars/postgresql-42.5.1.jar:/zingg/spark-3.5.0-bin-hadoop3/jars/postgresql-42.5.1.jar \
  -v /tmp/zingg/models:/zingg/models \
  zingg/zingg:0.5.0 \
  ./scripts/zingg.sh --properties-file /zingg.properties --phase link --conf /config.json

This is my config file:

{
  "modelId": "101",
  "zinggDir": "/zingg/models",
  "numPartitions": 4,
  "labelDataSampleSize": 0.1,
  "collectMetrics": false,
  "data": [
    {
      "name": "table1_input",
      "format": "jdbc",
      "props": {
        "url": "jdbc:postgresql://xxx:xxx/xxx",
        "dbtable": "xxx.xxx",
        "user": "xxx",
        "password": "xxx",
        "driver": "org.postgresql.Driver"
      },
      "schema": "LOGIN DECIMAL(38,18),NUMACH DECIMAL(38,18),COMPANY_ID DECIMAL(38,18),NAME STRING,ADDRESS STRING,POSTCODE STRING,CITY STRING,COUNTRY_ISO STRING,LAST_ACTIVITY DATE,MAP_ORIGIN DECIMAL(38,18),SYS_MODIFY_DATE DATE,SYS_CREATE_DATE DATE,IS_INDIVIDUAL DECIMAL(38,18)"
    },
    {
      "name": "table2_input",
      "format": "jdbc",
      "props": {
        "url": "jdbc:postgresql://xxx:xxx/xxx",
        "dbtable": "xxx.xxx",
        "user": "xxx",
        "password": "xxx",
        "driver": "org.postgresql.Driver"
      },
      "schema": "LOGIN DECIMAL(38,18),NUMACH DECIMAL(38,18),COMPANY_ID DECIMAL(38,18),NAME STRING,ADDRESS STRING,POSTCODE STRING,CITY STRING,COUNTRY_ISO STRING,LAST_ACTIVITY DATE,MAP_ORIGIN DECIMAL(38,18),SYS_MODIFY_DATE DATE,SYS_CREATE_DATE DATE,IS_INDIVIDUAL DECIMAL(38,18)"
    }
  ],
  "output": [
    {
      "name": "matched_output",
      "format": "jdbc",
      "props": {
        "url": "jdbc:postgresql://xxx:xxx/xxx",
        "dbtable": "xxx.xxx",
        "user": "xxx",
        "password": "xxx",
        "driver": "org.postgresql.Driver",
        "mode": "overwrite"
      }
    }
  ],
  "fieldDefinition": [
    { "fieldName": "NAME", "fields": "NAME", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "ADDRESS", "fields": "ADDRESS", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "POSTCODE", "fields": "POSTCODE", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "CITY", "fields": "CITY", "dataType": "string", "matchType": "fuzzy" },
    { "fieldName": "COUNTRY_ISO", "fields": "COUNTRY_ISO", "dataType": "string", "matchType": "exact" }
  ]
}
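
One thing that can be checked independently of Zingg is whether every "fieldName" in "fieldDefinition" actually appears as a column in each input's "schema" string. The following is only a minimal sketch of such a check: `schema_columns` and `missing_fields` are hypothetical helper names, not part of Zingg, and the sample config is a shortened, made-up variant of the one above. It says nothing about which field names model 101 itself expects; it only rules out a naming mismatch inside the config.

```python
import json


def schema_columns(schema: str) -> list[str]:
    """Return column names from a 'NAME TYPE,NAME TYPE' schema string.

    Commas inside parentheses, as in DECIMAL(38,18), do not split columns.
    """
    columns, current, depth = [], [], 0
    for ch in schema:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            columns.append("".join(current))
            current = []
        else:
            current.append(ch)
    columns.append("".join(current))
    # The column name is the first whitespace-separated token of each entry.
    return [c.split()[0] for c in columns if c.strip()]


def missing_fields(config: dict) -> dict[str, list[str]]:
    """Map each input name to the fieldDefinition names absent from its schema."""
    wanted = [f["fieldName"] for f in config.get("fieldDefinition", [])]
    report = {}
    for source in config.get("data", []):
        present = set(schema_columns(source.get("schema", "")))
        report[source["name"]] = [name for name in wanted if name not in present]
    return report


# Shortened, illustrative config (assumption: same layout as the config above).
sample = json.loads("""
{
  "data": [
    {"name": "table1_input", "schema": "LOGIN DECIMAL(38,18),NAME STRING"}
  ],
  "fieldDefinition": [
    {"fieldName": "NAME"},
    {"fieldName": "CITY"}
  ]
}
""")

print(missing_fields(sample))
```

For the real file, the sample can be replaced with `json.load(open("configLink.json"))`; an empty list for both inputs means every matched field is declared in both schemas.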