As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background. You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore. For more information on creating samples, see Samples Panel.
Depending on the type of sample you select, it may be generated based on one of the following methods, in increasing order of time to create:
- on a specified set of rows (firstrows)
- on a quick scan across the dataset
  Tip: Quick scan samples are executed in the Photon running environment.
- on a full scan of the entire dataset
  Tip: Full scan samples are executed in the cluster running environment.
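The difference in cost between these three methods can be illustrated with a minimal Python sketch. This is not the product's implementation. The function names, the `max_chunks` cutoff used to stand in for a quick scan, and the use of reservoir sampling for the full scan are all assumptions chosen only to show why each method takes longer than the one before it.

```python
import itertools
import random


def first_rows_sample(rows, n):
    """Take the first n rows and stop reading."""
    return list(itertools.islice(rows, n))


def quick_scan_sample(chunks, n, max_chunks, seed=None):
    """Sample n rows after reading only the first max_chunks chunks.

    max_chunks is a hypothetical cutoff; the real quick scan logic
    is not described in this document.
    """
    rng = random.Random(seed)
    scanned = [row for chunk in itertools.islice(chunks, max_chunks) for row in chunk]
    return rng.sample(scanned, min(n, len(scanned)))


def full_scan_sample(rows, n, seed=None):
    """Reservoir sampling: one pass over every row, uniform sample of n rows."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            reservoir.append(row)
        else:
            # Keep the new row with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir


data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, 1000, 100)]

first = first_rows_sample(iter(data), 5)                          # reads 5 rows
quick = quick_scan_sample(iter(chunks), 5, max_chunks=3, seed=0)  # reads 300 rows
full = full_scan_sample(iter(data), 5, seed=0)                    # reads all 1000 rows
```

In this sketch the first-rows sample can only ever contain the start of the data, the quick scan can only draw from the portion it read, and the full scan can draw from any row at the price of reading all of them.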
When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:
- When a new sample is generated, any Sort transformations that have been applied previously must be re-applied. Depending on the type of output, sort order may not be preserved.
- Samples taken from a dataset with parameters are limited to a maximum of 50 files when executed on the Photon running environment. You can modify parameters as they apply to sampling jobs. See Samples Panel.
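As a rough illustration of the 50-file limit for datasets with parameters, the sketch below caps a list of matched file paths before sampling. The function name, the example paths, and the rule of keeping the first files in sorted order are assumptions; this document does not state which 50 files are chosen.

```python
MAX_SAMPLE_FILES = 50  # limit stated above for datasets with parameters


def files_for_sample(matched_paths, limit=MAX_SAMPLE_FILES):
    """Return at most `limit` matched files, in sorted order (assumed selection rule)."""
    return sorted(matched_paths)[:limit]


# Hypothetical parameterized dataset that matches 120 files.
paths = [f"sales/part-{i:03d}.csv" for i in range(120)]
selected = files_for_sample(paths)
```

Under this assumption, rows in the remaining 70 files can never appear in the sample, which is worth keeping in mind when a sample looks unrepresentative.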
With each step that is added to or modified in your recipe, the product [...]
Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate. You can apply quick scan or full scan to determine the scope of the sample.
Find specific values in one or more columns. For the matching set of values, a random sample is generated.
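The two sample types described above can be sketched in Python as follows. This is a conceptual illustration under stated assumptions, not the product's code: the function names are hypothetical, rows are modeled as dictionaries, and the filter handles a single column for brevity even though the description allows more than one.

```python
import random


def random_sample(rows, n, seed=None):
    """Pick up to n rows uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def filter_based_sample(rows, column, values, n, seed=None):
    """Keep rows whose `column` holds one of `values`, then sample randomly."""
    wanted = set(values)
    matching = [row for row in rows if row.get(column) in wanted]
    return random_sample(matching, n, seed)


# Hypothetical dataset: every fourth row has state "CA".
rows = [{"id": i, "state": "CA" if i % 4 == 0 else "NY"} for i in range(100)]
picked = filter_based_sample(rows, "state", ["CA"], n=10, seed=1)
```

Filtering before sampling means that every row in the result matches the requested values, so rare values that a plain random sample might miss are guaranteed to be represented.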