Skip to main content

Fuzzy Match FAQ

These topics are common questions related to the Fuzzy Match Tool and the related Fuzzy Match Edit Match Options.

How many columns should I configure in a single Fuzzy Match tool?

There is no standard answer to this question. Consider matching columns that should be different between rows and could denote the row as unique. For example, in a standard contact database, the name, address, and phone number should identify a unique person. Many people can have the same city and state, so these would be less meaningful.

It is important to understand the relationship between using multiple columns and how much importance, or weight, to give to each column being considered in the matching process. For example, Name might not be as important as Address and ZIP, so weighting Name less than Address and ZIP can result in more matches where the Address and ZIP are exact, but the Name has scored less than an exact match.

What is the difference between Purge Values and Merge Values?

Purge Values finds matches within each individual dataset, as well as matches between 2 datasets. Purge Values can be used on 1 dataset to remove duplicates from, or de-dupe, the database. This can be a prep-step before a 2-database merge is performed.

Merge Values compares rows from 2 different data sources (rows with different source IDs). Choosing merge only finds matches between 2 datasets.

Why do I need to de-dupe my database prior to sending it through Merge Values?

A database should be de-duped prior to using Merge Values because:

  • Merge Values doesn't detect duplicate rows within the same source.

  • The matching process is faster without duplicate rows. For example, Dataset 1 has 5 duplicates. Dataset 2 has 10. If a merge runs without purging those duplicates, the match will check 50 match pairs. If the duplicates are purged, the match checks 1 match pair.

Why do I need to have a unique ID per row?

The Fuzzy Match tool uses an identifier (ID) to label matches, either from 1 file to another or from 1 row to another, in a single file. The tool uses the ID to report which rows match.

The ID have to be unique for each row, including rows from different datasets, to make sure output from the tool is accurate. Follow these best practices for unique IDs:

  • Know the size of your datasets to better understand the necessary starting value for each Row ID column.

  • Add a Row ID tool to both dataset streams.

  • Set the Starting Value of different dataset streams several magnitudes off from each other to ensure all rows have a unique value assigned.

    Tip

    Example

    Assign 100000000 as the initial value for the Record ID tool for the primary file and 200000000 as the initial value for the customer file. Consistently using this practice allows you to easily identify the sources of the match rows.

The RowID1 and RowID2 column naming convention is confusing coming out of the Fuzzy Match tool. Is there a trick to keep these columns straight?

In Purge mode, the data in RowID1 and RowID2 are the row identifiers from your dataset.

In Merge mode, RowID1 and RowID2 correspond to matched IDs, 1 from each dataset. Setting the Row IDs at starting values of different magnitudes allows you to more easily recognize which dataset is being referenced.

RowID1 is always the "first" value in the matched pair if the 2 IDs are sorted alphanumerically.

The rows from my sources are split between the Row ID columns. Why aren't they in the same column?

Fuzzy Match match pair IDs are sorted alphanumerically by row. Numeric Row ID columns sort RowID1 to RowID2, smallest to largest respectively, but string Row IDs can sort in unexpected ways.

Take a scenario where row 101 matches with row 11. If the columns are stored as numbers, RowID1 would be 11 and RowID2 would be 101. If the columns are stored as strings, RowID1 would be 101 and RowID2 would be 11.

Switch to a numeric Row ID column, or verify that strings with prepended Row IDs have a standardized format between rows.

If I want to use city or state in my match configuration, which match style should I use?

In most address match scenarios, where the address database is consistently populated with data, city and state columns are not needed in matching. NameAddress, and Zip Code are more commonly used Match Style options. Examine your data to determine if city or state columns may be relevant.

Use Double Metaphone if:

  • The city and state columns are not abbreviated.

  • The columns may contain spelling errors.

Use Whole Field or Whole Field - Case Insensitive if:

  • The state column is abbreviated and requires and exact match. An exact match is typically required if moving on to a more granular process of matching.

If I have multiple address configurations in my database (i.e. some address with suite numbers, some with not, and some with suite numbers in an additional column), which match style should I use?

In many address match scenarios, suite columns are not needed in matching. NameAddress, and Zip Code are more commonly used Match Style options. Examine your data to determine if suite columns may be relevant.

Double Metaphone w/ Digits is the preferred match style for any address column, regardless of whether the address includes suite numbers. Also consider using the Strip Punctuation and Remove Units From US Address option under Pre-process.

If I parse out a Name column into multiple columns (i.e. First Name, Last Name, Middle Initial) will I get better match results?

In most cases, parsing a Name column into individual component columns is not necessary and should not result in a better match. Use the Generate Keys for Each Word option with the Soundex algorithm to generate name column keys. This ensures that the word order is not considered, so both "Cindy Smith” or "Smith, Cindy” are considered a match.

Parsing the Name column is advantageous when you want to put different weights on each value. For Rosey Smith to match R Smith, the last name is weighted at 80% and the first name is weighted at 20%.

What if the Name column contains Mr., Mrs., Miss, etc. Will this affect the match rate for this column?

Under Edit > Pre-process, use Strip Punctuation & Salutations to ignore these words while performing a match.

How do I apply custom match styles?

Designer Cloud automatically applies custom match styles in a workflow's XML when you select them. Unlike Designer Desktop, there's no need for you to select Save.