And Much More!
Standardizing values is a way of grouping similar values into a single, consistent format. Cluster Clean
gives users access to multiple algorithms for grouping values and easy-to-use tools for standardizing them to a single value.
The Cluster Clean menu presents two options: by string similarity and by pronunciation. String Similarity compares strings against a combination of all values and uses either the fingerprint or the n-gram fingerprint algorithm to cluster them. You can see this in the following example:
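To make the string similarity idea concrete, here is a minimal Python sketch of key-collision clustering with fingerprint and n-gram fingerprint keys. It follows the commonly described form of these algorithms and is not the product's implementation; the function names (fingerprint, ngram_fingerprint, cluster, standardize) and the sample values are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def _normalize(value):
    # Fold accented characters to ASCII, trim, lowercase, and drop punctuation.
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^\w\s]", "", folded.strip().lower())


def fingerprint(value):
    # Sort the unique whitespace-separated tokens so word order no longer matters.
    return " ".join(sorted(set(_normalize(value).split())))


def ngram_fingerprint(value, n=2):
    # Remove all whitespace, then sort the unique character n-grams.
    chars = re.sub(r"\s+", "", _normalize(value))
    grams = {chars[i:i + n] for i in range(len(chars) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key=fingerprint):
    # Group values whose keys collide; keep only groups with more than one distinct spelling.
    groups = defaultdict(list)
    for value in values:
        groups[key(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def standardize(values, key=fingerprint):
    # Replace every clustered value with the most frequent spelling in its cluster.
    mapping = {}
    for group in cluster(values, key):
        canonical = Counter(group).most_common(1)[0][0]
        mapping.update({value: canonical for value in group})
    return [mapping.get(value, value) for value in values]
```

Calling standardize(values) uses the token-based key; passing key=ngram_fingerprint instead also merges values that differ only in spacing, at the cost of a higher chance of grouping unrelated short strings.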
The Pronunciation algorithm uses a double metaphone algorithm to compare values across languages by pronunciation. You can see this in action below. Determining which clustering algorithm to use depends on the scenario, but the Cluster Clean feature will give you the flexibility to choose depending on the context you have.
Tip: You can mix-and-match algorithms. Some values may be standardized using spelling, while others are more sensibly standardized based on international pronunciation standards.
Below, some values are still highlighted from the string similarity example:
For more information, see Overview of Standardization.
The enhanced Selection Model makes for quicker and more intuitive interactions within the Transformer Grid. Selecting a column now gives users a more complete profile of the column. Additionally, users now have quicker access points to more detailed profiling information depending on the column’s data type. For instance, a date column will give users options to explore the distributions of values in terms of years, months, days of the week, etc. Excluding weekends, as an example, now only requires a few interactions with the profile:
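For readers who want to see what the weekend filter amounts to outside the product, here is a small Python sketch. It is an illustration only; the function name exclude_weekends and the sample dates are hypothetical and say nothing about how the profile performs the filter internally.

```python
from datetime import date


def exclude_weekends(days):
    # date.weekday() returns 0 for Monday through 6 for Sunday,
    # so values below 5 are weekdays.
    return [d for d in days if d.weekday() < 5]


sample = [date(2019, 1, 4), date(2019, 1, 5), date(2019, 1, 6), date(2019, 1, 7)]
weekdays_only = exclude_weekends(sample)
```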
Likewise, cleaning up issues in columns with multiple date formats can be quickly addressed by exploring and interacting with Patterns:
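The idea behind pattern-based cleanup can be sketched in a few lines of Python: profile the column by the shape of each value, then parse each recognized shape into one format. This is a simplified stand-in for the Patterns feature, not its implementation; the names shape, to_iso, and CANDIDATE_FORMATS, and the list of formats, are assumptions chosen for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical set of formats expected in the column.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def shape(value):
    # Reduce a value to its pattern: every digit becomes 9, every letter becomes A.
    digits_masked = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", digits_masked)


def to_iso(value):
    # Try each candidate format in order; return None when nothing matches.
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


column = ["2019-01-05", "01/07/2019", "2019-02-11", "09.03.2019", "n/a"]
pattern_counts = Counter(shape(v) for v in column)  # how many values share each pattern
cleaned = [to_iso(v) for v in column]               # one consistent format, None if unparsed
```

Values that come back as None are the ones that match no known pattern and still need attention. Note that ambiguous shapes such as 99/99/9999 are read as month/day/year here only because that is the format listed.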
The enhanced Selection Model enables similar interactions as those in the Columns View. You can now copy and paste columns and column values:
You can also perform multi-column selection in the Transformer Grid, which updates suggestions based on the context and works with the Toolbar, allowing for quick and easy multi-column transformations:
For more information, see Selection Details Panel.
The 6.0 Enterprise release also has enhancements to our Job Details page. This redesigned page now includes the following tabs:
Overview - A summary page of the job run
Output Destinations - Information on the output datasets, along with download and publishing options
Profile - Overview of profiling information like missing values, column distributions, etc.
Dependencies - An audit trail of the recipes and steps involved in the job run
Data sources - Information on the datasets used to create the job output
Parameters - An optional screen that lists any parameters used to create the data sources
For flows using parameters in the input, you will see the following information:
For more information, see Job Details Page.
With new metadata references such as $sourcerownumber, users can now reference the source file path and the source row number.
This gives users access to lineage at both the source and record level in their data, improving governance and insight into changes made to that data. See Source Metadata References.
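As an illustration of what record-level lineage means in practice, the following Python sketch attaches a source path and a source row number to each record as it is read, so both survive later filtering. It does not use the product's metadata references; the function with_lineage, the column names source_file_path and source_row_number, the sample path, and the choice to number data rows from 1 without counting the header are all assumptions.

```python
import csv
import io


def with_lineage(source_path, handle):
    # Yield each CSV record with two extra fields recording where it came from.
    reader = csv.DictReader(handle)
    for row_number, row in enumerate(reader, start=1):
        yield {**row, "source_file_path": source_path, "source_row_number": row_number}


# An in-memory file keeps the example self-contained.
sample = io.StringIO("id,name\n1,alpha\n2,beta\n3,gamma\n")
rows = list(with_lineage("data/input.csv", sample))

# After a filter, each surviving record can still be traced to its source row.
kept = [r for r in rows if r["name"] != "beta"]
```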
Publishing back to relational databases is now supported. Connections to Oracle, SQL Server, PostgreSQL or Teradata automatically support the ability to write your results back to the database.
The following connection types are natively supported for relational publishing:
- Postgres Data Type Conversions
- SQL Server Data Type Conversions