Setting up the Job
Before you begin
- You have generated the suspect data pairs by using the tMatchPairing component.
- You have added a label next to the second record in each suspect pair to indicate whether it is a duplicate, not a duplicate, or a possible duplicate:
480060609;DFSS_AgencySiteLies_2012.csv;Catholic Charities of the Archdiocese of Chicago St. Joseph;4800 S. Paulina; st. joseph_1;;
480060609;purple_binder_early_childhood.csv;Catholic Charities Chicago - St. Joseph;4800 S Paulina Street; st. joseph_1;0.8058642705131237;YES
425760624;chapin_dfss_providers_2011_070212.csv;CHICAGO PUBLIC SCHOOLS GOLDBLATT, NATHAN R.;4257 W ADAMS; r._20;;
422560653;chapin_dfss_providers_2011_070212.csv;CHICAGO PUBLIC SCHOOLS ROBINSON, JACKIE R.;4225 S LAKE PARK AVE; r._20;0.8219437219200757;NO
The labels used in this example are YES and NO, but you can use any labels you like, and you can use more than two.
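Before feeding the labeled file to the Job, it can help to check that every suspect pair actually carries a label. The following is a minimal Python sketch, not part of the Talend Job itself. It assumes the column order seen in the sample above (record identifier, source file, name, address, pair identifier, score, label); these column meanings are inferred from the sample and the function name labels_by_pair is hypothetical.

```python
import csv
import io

# Two suspect pairs copied from the sample above. Only the second record of
# each pair carries a score and a label.
SAMPLE = """\
480060609;DFSS_AgencySiteLies_2012.csv;Catholic Charities of the Archdiocese of Chicago St. Joseph;4800 S. Paulina; st. joseph_1;;
480060609;purple_binder_early_childhood.csv;Catholic Charities Chicago - St. Joseph;4800 S Paulina Street; st. joseph_1;0.8058642705131237;YES
425760624;chapin_dfss_providers_2011_070212.csv;CHICAGO PUBLIC SCHOOLS GOLDBLATT, NATHAN R.;4257 W ADAMS; r._20;;
422560653;chapin_dfss_providers_2011_070212.csv;CHICAGO PUBLIC SCHOOLS ROBINSON, JACKIE R.;4225 S LAKE PARK AVE; r._20;0.8219437219200757;NO
"""

# Assumed column positions (inferred from the sample, not from Talend).
PAIR_ID_COLUMN = 4
LABEL_COLUMN = 6


def labels_by_pair(text):
    """Map each pair identifier to the label found on one of its records.

    Pairs whose records all have an empty label column map to None.
    """
    labels = {}
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        if len(row) <= LABEL_COLUMN:
            continue  # skip blank or malformed lines
        pair_id = row[PAIR_ID_COLUMN].strip()
        label = row[LABEL_COLUMN].strip()
        # Register the pair, then keep the first non-empty label seen for it.
        labels.setdefault(pair_id, None)
        if label and labels[pair_id] is None:
            labels[pair_id] = label
    return labels


if __name__ == "__main__":
    result = labels_by_pair(SAMPLE)
    unlabeled = [pair for pair, label in result.items() if label is None]
    print(result)
    print("Unlabeled pairs:", unlabeled)
```

Any pair listed as unlabeled would need a label added to its second record before the file is used.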
Procedure
- Drop the following components from the Palette onto the design workspace: tFileInputDelimited and tMatchModel.
- Connect the components together using the link.
- Check that you have defined the connection to the Spark cluster, as described in Computing suspect pairs and suspect sample from source data.