Skip to main content

Fuzzy Match Edit Match Options

Refer to this page for a detailed description of the custom match options in the Edit Match Options window. To access the Edit Match Options window, select Edit from the Match Options column in the Fuzzy Match tool Configuration window.

Match Style

Match style is a predetermined method of finding an appropriate match between rows of an input file. The individual match style choices are defined on the Fuzzy Match Tool page.

All predefined and custom match styles appear in the Match Style list. The subsequent Preprocess setting is based on the Match Style chosen. If you edit a predefined match style, the style value changes to Custom in the dropdown list. Note that custom match styles aren't saved and can only be used inside the workflow in which they are created. Changes for a custom match style automatically apply to the tool.

Preprocess

Preprocess describes a procedure that runs before Generate Keys and the Fuzzy Match function. The Preprocess should result in better matches. The Preprocess choices include...

  • None: No pre-process runs.

  • Strip Punctuation: Any punctuation characters within the specified column are ignored while the tool determines matches.

  • Strip Punctuation & Salutations: Any punctuation and titles like MsMr, and Mrs within the specified column are ignored while the tool determines matches.

  • Strip Punctuation & AND, OF & THE: Any punctuation characters and instances of the words andof, and the within the specified column, are ignored while the tool determines matches.

  • Strip Punctuation & Remove Units from US Addresses: Any punctuation characters, as well as any unit numbers within the specified column, are ignored while the tool determines matches.

Generate Keys

Generate Keys is the method by which a potential match is identified.

Designer Cloud reads through the specified column and assigns keys to the components of that column. Once all keys are generated, Designer Cloud compares the concatenated keys for every matched column. If the keys generated are equal for 2 rows, a potential match is identified and the pair proceeds to the next phase of the match process.

  • None: Keys for this column are considered when deciding which rows match.

  • Digits Only: Only rows with the same digits from left to right in the specified column are matched.

  • Digits Only - Reverse: Only rows with the same digits from right to left in the specified column are matched.

    Tip

    Example

    1-(800)555-1234 matches 800-555-1234.

    Non-digit characters are ignored and numbers are matched from last (4) to first (8 or 1). For this row to match, specify the Maximum Key Length as 10 to ignore the leading 1.

  • Double Metaphone: Double Metaphone is the preferred algorithm. It codes English words (and foreign words often heard in the English language) phonetically by reducing them to 12 consonant sounds. This reduces matching problems from spelling errors. Double Metaphone is the preferred method for matching based on sound. It returns 2 keys if a word has 2 possible pronunciations, like a foreign word that has been Anglicized. For more information, go to Double Metaphone.

  • Double Metaphone w/ Digits: Use the same Double Metaphone algorithm, but include digits as well. When there are digits in the string, digits in the first token are the key.

    Tip

    Example

    1234 5th St.

    1234 is the key.

  • Soundex: An algorithm to code surnames phonetically by reducing them to the 1st letter and up to 3 digits, where each digit is 1 of 6 consonant sounds. This reduces matching problems from different spellings. This algorithm was devised to code names recorded in US census records. The standard algorithm works best on European names. Variants have been devised for names from other cultures. For more information, go to Soundex.

    Leading Letter Replacements: Replace these leading letters and letter combinations prior to generating the match key:

    KNLeading Letters

    NReplacement

    AV

    AF

    AH

    A

    AW

    A

    CAAN

    TAAN

    DG

    G

    D

    G

    HA

    A

    KN

    K

    MAC

    MC

    M

    N

    NST

    NS

    PF

    F

    PH

    F

    Q

    G

    SCH

    SH

    Z

    S

  • Soundex w/ Digits: Use the same Soundex algorithm, but include digits as well. When there are digits in the string, digits in the 1st token are the key.

  • Whole Field (Case Insensitive): Only rows where the entire column matches are matched. Case is ignored.

  • Alphanumeric Only (Case Insensitive): Look only at alphanumeric characters to make a match. Case is ignored.

  • Address Number + Soundex: Removes the address number from a string and applies the Soundex algorithm to the remainder of the column. The Soundex code is then appended to the address number to create a unique key.

Generate Keys for Each Word

Check Generate Keys for Each Word to generate a separate key for each word. Note that "john smith" and "smith john" are able to line up as a potential match even though the words are out of order.

When checked, these options appear:

  • Don't Generate Keys for These Words: Specify or select words to exclude from the key assignment. This can reduce the processing time by limiting the number of potential matches.

  • Don't Generate Keys for Single-Letter Words: Check the box to exclude single-letter words from the key assignment. This can reduce the processing time by limiting the number of potential matches.

Ignore if Empty

Ignore an empty value of the specified match column. If the match column is empty, then no key is generated and the row is ignored

Maximum Key Length

Specify the maximum length of the key to consider for the match.

Match Function

The Match Function is a more granular process by which a match is identified and a score is applied. The function differs from keys, which must match exactly. Use Match Function to define the match function.

Note

Each Function has a type associated with it:

  • Word-based (Match Function begins with Words:) functions look at any words within the specified column, regardless of the order of the words.

  • Non-word-based functions match against the entire string as a whole and start with Character:.

  • For Words & Digits functions, all tokens that have digits in them must be on both sides to be considered a match. These would typically be used for addresses.

These are the Match Function available...

  • None - Key Match Only: Look only at the key-generation specifications.

  • Levenshtein Distance: The smallest number of insertions, deletions, and substitutions required to change 1 string or tree into another. When the Levenshtein Distance is selected, the match score is meaningfully lower due to differences. For more information, go to Levenshtein Distance. Levenshtein Distance options include...

    • Character: Levenshtein Distance

    • Character (No Spaces): Levenshtein Distance

    • Words: Levenshtein Distance

    • Words and Digits: Levenshtein Distance

  • Jaro Distance: A measure of similarity between 2 strings. The Jaro measure is the weighted sum of the percentage of matched characters and necessary transpositions. The Jaro Distance is more forgiving than the Levenshtein Distance with respect to differences in strings. For more information, go to Jaro-Winkler. Jaro Distance options include...

    • Character: Jaro Distance

    • Character (No Spaces): Jaro Distance

    • Words: Jaro Distance

    • Words and Digits: Jaro Distance

  • Best of Jaro & Levenshtein Distance: Both match types are analyzed and the score is taken. Best of Jaro and Levenshtein Distance options include...

    • Character: Best of Jaro & Levenshtein Distance

    • Character (No Spaces): Best of Jaro & Levenshtein Distance

    • Words: Best of Jaro & Levenshtein Distance

    • Words and Digits: Best of Jaro & Levenshtein Distance

Word-based Function Options

When you select a word-based function (function begins with "Words:"), more configuration options appear...

  • When Using Word Based Match, Also Use: You can specify an additional match method that produces an additional score, taking the best score, and eliminates the need for running 2 instances of a Fuzzy Match tool.

    • Character: Use the word-based match score in addition to a character match function. 2 scores are generated and the best match score is used to identify the match.

    • Character (No Spaces): Same as above, but spaces are ignored when generating the character-based match.

  • Word Frequency Statistics (Word Match Only): You can specify a Word Frequency table based on predefined statistics. When specified, the words that appear in the database carry less importance when they are present in the incoming data, and the match score is adjusted accordingly. Options include...

    • Name: Contains frequent words in a name column. The frequency inversely relates to how important those words are in the match score.

    • US Address: Contains frequent words in a US Address column. The frequency inversely relates to how important those words are in the match score.

    • US Company: Contains frequent words in a Company Name column. The frequency inversely relates to how important those words are in the match score.

      Tip

      Example

      Match "Albert Commette" to "Albert Commette MD."

      The Word Frequency Statistics table for "Name" includes the word "MD." When Word Frequency: Name is specified, the resulting match score is roughly 5 points higher than if Word Frequency: Name is not specified.

  • Abbreviation Table: Select a table from the dropdown to check against and further identify duplicates. Use this option on columns that contain either only the first name or both the first and last names.

  • Penalty: Set the penalty percentage applied when a match is made with data from the Abbreviation Table. The default value is 15%. A penalty is recommended because a nickname match is another potential source of error. The penalty percent is subtracted from the match score prior to comparison with the match threshold.

Other Match Function Options

  • Match Threshold: Set the allowable uncertainty percentage to return a match for a particular column. If the threshold for column 1 is 60% and the column only matches with 55% confidence, the row is ignored.

  • Match Weight: Apply importance to the column. This causes the column to be considered more or less strongly during a match.

    For example, if "Company Name" is twice as important as "Contact Name," the Match Weight for Company Name should be twice the value of the Match Weight for Contact Name. This weight is used when calculating the overall Match Score.

For additional information regarding Fuzzy Match use, go to Fuzzy Match FAQ.