NOTE: When a new sample is generated, any
Initial: By default, the application loads the first N rows of the dataset as the initial sample when the Transformer page is opened. The number of rows depends on column count, data density, and other factors. If the dataset is small enough, the full dataset is used.
NOTE: By default, samples may be up to 10 MB in size. For datasets smaller than this limit, the entire dataset is loaded.
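As a rough illustration of how a size limit turns into "the first N rows", the following minimal sketch reads rows in order until a byte budget is reached. The function name `load_initial_sample` and the constant `MAX_SAMPLE_BYTES` are hypothetical and are not part of the application. The sketch assumes that the 10 MB limit is measured on the UTF-8 size of the raw rows, which the text above does not confirm.

```python
# Assumed interpretation of the default 10 MB sample limit.
MAX_SAMPLE_BYTES = 10 * 1024 * 1024


def load_initial_sample(lines, max_bytes=MAX_SAMPLE_BYTES):
    """Return the leading lines whose combined UTF-8 size fits within max_bytes."""
    sample = []
    used = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            # Budget exhausted: the sample is only the first N rows.
            break
        sample.append(line)
        used += size
    return sample


rows = ["id,name\n", "1,a\n", "2,b\n"]
small_budget = load_initial_sample(rows, max_bytes=12)  # only part of the data fits
full_dataset = load_initial_sample(rows)                # dataset is under the limit
```

Under this model, wide or dense rows consume the budget faster, so fewer rows fit in the sample, which is consistent with the row count depending on column count and data density.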
Click the link in the current sample card to see the list of all available samples.
Tip: To change the name of a sample, click its card in the list of all available samples. Then, click the Edit icon.
At the bottom of the Transformer page, you can review the number of rows, the number of columns, and the count of data types in the currently displayed sample.
NOTE: As you add transformation steps to your recipe, the values in the status bar change to reflect the current state of the loaded sample.
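The status bar values can be thought of as a summary that is recomputed over the current state of the sample. The sketch below is only an approximation: `summarize_sample` and `infer_type` are hypothetical helpers, and the type names and inference rules are assumptions, since the application's own type inference is not described here.

```python
from collections import Counter


def infer_type(values):
    """Very rough type inference over string values (assumed rules)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "String"

    def all_match(cast):
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_match(int):
        return "Integer"
    if all_match(float):
        return "Decimal"
    return "String"


def summarize_sample(rows):
    """Summarize a sample given as a list of dicts that share the same keys."""
    columns = list(rows[0].keys()) if rows else []
    type_counts = Counter(infer_type([r[c] for r in rows]) for c in columns)
    return {"rows": len(rows), "columns": len(columns), "types": dict(type_counts)}


sample = [
    {"id": "1", "price": "2.5", "name": "a"},
    {"id": "2", "price": "3", "name": "b"},
]
before = summarize_sample(sample)

# A hypothetical recipe step that drops a column changes the summary.
after = summarize_sample([{k: v for k, v in r.items() if k != "name"} for r in sample])
```

Here `before` reports three columns and `after` reports two, mirroring how the status bar follows the recipe steps applied to the loaded sample.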
- In the Samples panel, select the type of sample to create. For more information on sample types, see Overview of Sampling.
In the Collect new sample panel, specify the following parameters, some of which may not be required for your sampling method:
Choose a sampling method: Select or enter the type of sample. If you already selected a sampling method, this value is pre-populated for you.
Name: Enter a new name for the sample, as needed.
Tip: Naming your samples can assist in tracking them later. For example, you might choose to add a date stamp to the name to track when you captured the sample.
Scan Type: (Does not apply to all sampling methods) Types of scans:
Quick: Performs a random scan of the dataset to extract the appropriate number of rows for the sample.
Full: Gathers the sample from the entire dataset. Depending on the size of the dataset, this method can take a while.
Use latest data: When collecting a Full Scan sample from a JDBC source and performance ingest caching has been enabled, you can choose to override the cached data and gather all of your data from the original datasources.
NOTE: If the cached data has expired, the sample is always collected from the original datasources, even if this option is not selected.
Click more details to review the list of datasets whose cached data will be overridden.
Ingest caching applies to non-native relational (JDBC) datasources. For more information, see Configure JDBC Ingestion.
- Column or columns: (Stratified, Cluster-based) Name of the column from which to gather values to evaluate. (Anomaly-based) Specify the name or names of one or more columns containing the anomalies to include in your sample. Multiple columns can be specified as comma-separated values. A column range can be specified using the tilde (
Condition: (Filter-based, Stratified, Cluster-based, Anomaly-based) Filter the sample based on a specified condition. For example:
invoiceDate > 90
- Anomaly type: (Anomaly-based) Select the type of anomalous values to include in your sample: invalid, missing, or both types.
Variable overrides: If one or more variables are associated with your dataset, you can define the value overrides to be applied when the sample is executed.
You can use these overrides to sample data from different source files in your dataset with parameters.
- A variable can have an empty value.
For more information, see Overview of Parameterization.
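To make the Condition, Column, and Anomaly type parameters more concrete, the following sketch shows one plausible reading of filter-based, stratified, and anomaly-based sampling over in-memory rows. All function names and parameters (`filter_sample`, `stratified_sample`, `anomaly_sample`, `per_value`, `is_valid`) are hypothetical, and the application's actual algorithms may differ.

```python
import random
from collections import defaultdict


def filter_sample(rows, condition, limit):
    """Filter-based: keep up to `limit` rows for which condition(row) is true."""
    out = []
    for row in rows:
        if len(out) >= limit:
            break
        if condition(row):
            out.append(row)
    return out


def stratified_sample(rows, column, per_value, seed=0):
    """Stratified: draw up to `per_value` rows for each distinct value in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    rng = random.Random(seed)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_value, len(members))))
    return sample


def anomaly_sample(rows, columns, is_valid, anomaly_type="both"):
    """Anomaly-based: keep rows with missing and/or invalid values in `columns`."""

    def is_anomalous(value):
        missing = value is None or value == ""
        invalid = not missing and not is_valid(value)
        if anomaly_type == "missing":
            return missing
        if anomaly_type == "invalid":
            return invalid
        return missing or invalid

    return [r for r in rows if any(is_anomalous(r.get(c)) for c in columns)]


rows = [
    {"region": "east", "amount": "10"},
    {"region": "east", "amount": ""},
    {"region": "west", "amount": "abc"},
    {"region": "west", "amount": "95"},
]

# Condition analogous to a filter such as: amount > 90
filtered = filter_sample(
    rows, lambda r: r["amount"].isdigit() and int(r["amount"]) > 90, limit=10
)
per_region = stratified_sample(rows, "region", per_value=1)
missing_only = anomaly_sample(rows, ["amount"], str.isdigit, anomaly_type="missing")
both_types = anomaly_sample(rows, ["amount"], str.isdigit, anomaly_type="both")
```

In this sketch, validity is supplied by the caller through `is_valid`, because what counts as an invalid value depends on the column's data type.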
- To begin collecting the sample, click Collect.
- You can continue working while the sample is collected. When the sample is available, a status message is displayed in the Transformer page.
- You can click Load Sample in the Samples panel to begin using it.