Comment: Published by Scroll Versions from space DEV and version r0762

...

Artifacts:

When a cluster clean step is added to your recipe, the individual value changes that it records can amount to many megabytes of data. Instead of storing these objects within the recipe definition, they are stored as a set of artifacts in the artifact storage database and referenced from the recipe.

  • These artifacts exist outside the scope of the recipe file.
  • These artifacts must be stored in a database for the step to be editable and exportable.

    Info

    NOTE: If the artifact storage service is disabled, this feature is unusable.

  • When a flow is exported, an artifact.data file is included as part of the export. This file must be imported with the flow definition, or the cluster clean step in the imported flow is broken. For more information, see Export Flow.

Example - Multiple methods of clustering

...

When you standardize using a spelling-based algorithm, the following values are clustered:

Source Value          New Value
--------------------  ---------
Apple
apple
Åpple

Unclustered values
pear
pair
pare

After you select the cluster of values at the top, you can enter apple in the right context panel to replace that cluster of values with a single string.

In the table above, the unclustered values are dissimilar in spelling, but in English they sound the same (homophones). When you select the pronunciation-based algorithm, these values are clustered:

Source Value          New Value
--------------------  ---------
pear
pair
pare

Unclustered values
Apple                 apple
apple                 apple
Åpple                 apple
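
As a rough illustration of how a pronunciation-based key can group values that are spelled differently, the following sketch uses American Soundex. Soundex is a stand-in chosen for this example only; this section does not name the phonetic algorithm that the product uses, and the function name soundex is hypothetical.

```python
import unicodedata

# Soundex digit for each consonant group. Vowels, h, w, and y have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(value: str) -> str:
    """Return a 4-character American Soundex key (illustrative stand-in)."""
    # Strip accents and lowercase so that "Åpple" is treated like "apple".
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    code = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != prev:
            code += digit
        # h and w do not separate repeated digits; vowels do.
        if ch not in "hw":
            prev = digit
    # Pad with zeros or truncate to exactly four characters.
    return (code + "000")[:4]


for word in ("pear", "pair", "pare"):
    print(word, soundex(word))
```

Under this stand-in, pear, pair, and pare share one key, which is why a pronunciation-based method can place them in the same cluster even though a spelling-based method cannot.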

...

The six source values have been reduced to two final values through two different methods of clustering. See below for more information on the clustering algorithms.

Source Value          New Value
--------------------  ---------
pear                  pear
pair                  pear
pare                  pear

Apple                 apple
apple                 apple
Åpple                 apple
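
The combined result of the two clustering passes amounts to a lookup from source value to new value. A minimal sketch, assuming the mapping is held as a plain dictionary (the names value_mapping and apply_mapping are hypothetical and do not reflect how the product stores the mapping):

```python
# Source value -> new value, copied from the table above.
value_mapping = {
    "pear": "pear",
    "pair": "pear",
    "pare": "pear",
    "Apple": "apple",
    "apple": "apple",
    "Åpple": "apple",
}


def apply_mapping(values, mapping):
    """Replace each value with its mapped value; unmapped values pass through."""
    return [mapping.get(v, v) for v in values]


source_values = ["Apple", "apple", "Åpple", "pear", "pair", "pare"]
final_values = apply_mapping(source_values, value_mapping)
print(sorted(set(final_values)))
```

Values that have no entry in the mapping are left unchanged, which mirrors the behavior of unclustered values in the tables.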

...

Info

NOTE: These steps are applied to an internal representation of the data. Your dataset and recipe are not changed by this comparison. Changes are only applied if you choose to modify the values and add the mapping.

 

  1. Remove accents from characters, so that only ASCII characters remain.
  2. Change all characters to lowercase.
  3. Remove whitespace.
  4. Split the string on punctuation, any remaining whitespace, and control characters. Remaining characters are assembled into groups called tokens.
  5. Sort the tokens and remove any duplicates.
  6. Join the tokens back together.
  7. Compare all tokenized values in the column for purposes of clustering.
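
The steps above can be sketched in Python as follows. This is an approximation under stated assumptions, not the product's implementation: step 3 is read as trimming leading and trailing whitespace (step 4 still refers to remaining whitespace), tokens are joined with a single space, and the names fingerprint and cluster are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a comparison key for one value, following steps 1 to 6."""
    # Step 1: decompose accented characters, then drop anything non-ASCII.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Step 2: lowercase.
    text = text.lower()
    # Step 3 (assumption): trim leading and trailing whitespace only.
    text = text.strip()
    # Step 4: split on any run of characters other than ASCII letters or
    # digits, which covers punctuation, whitespace, and control characters.
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    # Step 5: sort the tokens and remove duplicates.
    tokens = sorted(set(tokens))
    # Step 6 (assumption): join with a single space.
    return " ".join(tokens)


def cluster(values):
    """Step 7: group values that share the same fingerprint key."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return dict(groups)


print(cluster(["Apple", "apple", "Åpple", "pear", "pair", "pare"]))
```

With the example values, Apple, apple, and Åpple share the key apple, while pear, pair, and pare each keep a distinct key, which matches the spelling-based result shown earlier.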

...