Visual profiling provides real-time, interactive visualizations of your dataset to assist in the discovery, cleansing, and transformation of your data. Visual representations are essential for interpreting large volumes of data, and the platform's profiling techniques present key statistical information in a dynamic, easy-to-consume format for faster transformation.
Visual profiles are available while you transform your data in the Transformer page, when you dig into the detail of individual columns, and after you execute your job at scale. Each of these interfaces has different usage patterns designed to accelerate and simplify data transformation for that specific area of the process.
Locate anomalies. Visual profiling surfaces missing or invalid data in individual columns. These values can then be selected and transformed as needed.
Identify distributions. In the data grid, you can review value distribution for each column in your dataset. When exploring the column details, you can also identify and select statistical outliers among your column data.
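As a rough illustration of how missing and invalid values in a column might be counted, the following sketch classifies each value in a small sample. The validity rule (`ZIP_PATTERN`) and the function names are assumptions made for this example and do not reflect the platform's actual type-validation logic.

```python
import re
from collections import Counter

# Hypothetical validity rule: a 5-digit U.S. ZIP code, optionally ZIP+4.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def classify(value):
    """Label one cell as 'missing', 'valid', or 'invalid'."""
    if value is None or value.strip() == "":
        return "missing"
    return "valid" if ZIP_PATTERN.match(value.strip()) else "invalid"

def quality_counts(values):
    """Count valid, invalid, and missing cells in a column sample."""
    counts = Counter(classify(v) for v in values)
    return {label: counts.get(label, 0) for label in ("valid", "invalid", "missing")}

sample = ["94105", "10001-0001", "", "ABCDE", None, "9410"]
print(quality_counts(sample))
```

Counts like these are the kind of summary that lets invalid or missing values be selected as a group and then transformed.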
In the following example, a dataset containing address information has been loaded in the Transformer page:
Example dataset
In this example, we are interested in exploring geographic information. From the column drop-down for the Zip column, you select Column Details.
Explore detail on demand. Generate visual profiles from the column drop-down.
When you explore the column details of the Zip column, you can see the following representation of the data:
Zip Code data type represented as a U.S. map
In this case, the values in your Zip column are recognized as being of Zipcode data type. The application then represents these values as a U.S. map, which quickly renders numeric data into a format that's much easier to read and analyze.
Type-specific visualizations. The profile of the column values is represented in a type-specific visualization to help you rapidly analyze and take action on some or all values in the column.
Wherever you can interact with data, visual profiling simplifies the process.
Customized visualizations. Each interface has been optimized for the scope of the data it is visualizing, whether the data is a single column, the entire sample of a dataset, or generated results.
In the Transformer page, the data grid is a tabular representation of a sample of your dataset. It is the primary interface through which you build your transformation recipes. Profiling tools:
Whenever a transform is selected or specified, a preview of its effects is displayed in the data grid, including any changes to the data quality bar and column histogram of affected columns. See Transform Preview.
For additional details on visual transformation, see Transform Basics.
Through the Transformer page, you can explore statistical details about individual columns, visually represented based on the column's data type. From the drop-down for any column, select Column Details.
In this interface, you can review the range of values in the column. Optionally, you can select one or more values from other columns to see the corresponding values in the current column. The visualizations for a column depend on its data type.
See Column Details Panel.
In the Column Details panel, you can review profiling of patterns detected in the values for the selected column. These patterns can be selected, which identifies the relevant values in the column that match the pattern. You can then use these selections as the basis for building transforms that apply to the matching values.
For more information, see Column Details Panel.
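To make the idea of pattern profiling concrete, here is a minimal sketch that reduces each value to a simple pattern, counts how often each pattern occurs, and then selects the values matching a chosen pattern. The pattern notation (`9` for a digit, `a` for a letter) and all function names are assumptions for illustration, not the notation or API used by the application.

```python
from collections import Counter

def to_pattern(value):
    """Map a string to a coarse pattern: digits -> '9', letters -> 'a'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)  # keep punctuation and spaces as-is
    return "".join(out)

def profile_patterns(values):
    """Return (pattern, count) pairs, most frequent first."""
    return Counter(to_pattern(v) for v in values).most_common()

def matching_values(values, pattern):
    """Select the values whose pattern equals the chosen one."""
    return [v for v in values if to_pattern(v) == pattern]

phones = ["555-1234", "555-9876", "5551234", "call me"]
print(profile_patterns(phones))
print(matching_values(phones, "9999999"))
```

Selecting the less common pattern (`9999999` here) isolates the values that a follow-up transform would need to standardize.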
After the application has successfully executed a job for which profiling is enabled, you can explore a visualization of the generated dataset in the Job Results page. See Job Results Page.
Decoupled from the user interface, the profiling engine performs the calculations required to power the visualizations before job execution and after the job results have been generated.
NOTE: When you choose to profile your results, you are creating two distinct tasks: 1) run your transform recipe against your source and 2) profile the results. Due to the computational complexity of generating the interactive results, a profiling task often takes longer to complete than a transformation task and is therefore an optional element of a job run.
Generally, visual profiles represented in the user interface, in places like column histograms and column details, are exact measurements against the current sample.
On generated results, visual profiles tend to favor approximations.
NOTE: The computational cost of generating exact visual profiling measurements on large datasets in interactive visual profiles severely impacts performance. Depending on the environment, you may choose to run profiling jobs on generated results as separate jobs. For more information on enabling this feature, see Profiling Options.
Below, you can review details on how metrics are calculated in visual profiling performed in different areas of the platform.
The UI leverages the Photon running environment when displaying visual profiles on sampled data.
NOTE: Profiles are executed on the currently sampled data. Results may vary when the full transformation job is executed.
Metric Type | Measurement |
---|---|
Frequency (top-k) | Exact |
Unique value counts | Exact |
Numerical histograms | Exact |
Simple statistics (mean, stdev, min, max) | Exact |
Quartiles | Exact |
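
As a sketch of what "exact" means for the metrics listed above, the following example computes each one directly over every value in a small in-memory sample, using only the Python standard library. The function name, the equal-width binning scheme, and the quartile interpolation method are assumptions chosen for this example; the Photon running environment may calculate them differently.

```python
import statistics
from collections import Counter

def exact_profile(values, k=3, bins=4):
    """Compute exact profile metrics over a non-empty numeric sample."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / bins or 1  # avoid zero width when all values are equal
    histogram = [0] * bins
    for v in ordered:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        histogram[idx] += 1
    return {
        "top_k": Counter(values).most_common(k),
        "unique": len(set(values)),
        "histogram": histogram,
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "min": lo,
        "max": hi,
        "quartiles": statistics.quantiles(ordered, n=4, method="inclusive"),
    }

sample = [1, 2, 2, 3, 3, 3, 4, 9]
profile = exact_profile(sample)
for name, value in profile.items():
    print(name, value)
```

Every metric here requires holding or sorting the full set of values, which is practical on a sample but becomes expensive on large generated results.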
Profiling jobs that are executed on the platform leverage the server-side version of the Photon running environment.
Metric Type | Measurement |
---|---|
Frequency (top-k) | Approximate |
Numerical histograms | Approximate |
Simple statistics (mean, stdev, min, max) | Exact |
Quartiles | Exact |
For profiling jobs, the Spark running environment is used for Spark transformation jobs and optionally for Photon jobs. For more information on enabling the execution of visual profiling on Spark for Photon jobs, see Profiling Options.
Metric Type | Measurement |
---|---|
Frequency (top-k) | Approximate |
Numerical histograms | Approximate |
Simple statistics (mean, stdev, min, max) | Exact |
Quartiles | Approximate |
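
To illustrate why an approximate frequency (top-k) measurement is cheaper than an exact one, the sketch below uses the Misra-Gries summary, a well-known streaming technique that keeps at most `k - 1` counters regardless of how many distinct values appear. This technique is swapped in purely for illustration: the source does not state which approximation algorithm the Photon or Spark running environments use.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts using at most k - 1 counters.

    Each reported count is a lower bound on the true count, off by at
    most len(stream) / k. Any item occurring more than len(stream) / k
    times is guaranteed to be present in the result.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("aabacadaebab")  # true counts: a=6, b=3, c=1, d=1, e=1
estimate = misra_gries(stream, k=3)
print(estimate)
```

The estimate undercounts (here `a` is reported below its true count of 6), but memory use stays fixed, which is the kind of trade-off that makes approximation attractive on large generated results.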