- The default sample is the initial sample.
- By default, each sample is 10 MB in size or the entire dataset if it's smaller.
- If the source data is larger than 10 MB, a random sample is automatically generated for you when the recipe is first loaded in the Transformer page.
- The initial sample is selected by default. When the automatic random sample has finished generation, it can be manually selected for display.
- If your source of data is a directory containing multiple files, the initial sample for the combined dataset is generated from the first set of rows in the first file listed in the directory.
  - If the matching file is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
- If you are wrangling a dataset with parameters, the initial sample loaded in the Transformer page is taken from the first matching dataset.
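The initial-sample rules above can be illustrated with a minimal sketch. This is not the product's implementation: the function name `initial_sample`, the reading of 10 MB as 10 × 1024 × 1024 bytes, and the use of alphabetical order for "first file listed" are all assumptions made for illustration.

```python
import os

# Assumption: "10 MB" is interpreted here as 10 * 1024 * 1024 bytes.
SAMPLE_LIMIT_BYTES = 10 * 1024 * 1024


def initial_sample(path, limit=SAMPLE_LIMIT_BYTES):
    """Return the leading rows of a file whose combined size fits in `limit`.

    If `path` is a directory, rows are read from the first file in it.
    Assumption: "first" means first in alphabetical order.
    """
    if os.path.isdir(path):
        names = sorted(
            name for name in os.listdir(path)
            if os.path.isfile(os.path.join(path, name))
        )
        if not names:
            return []
        path = os.path.join(path, names[0])

    rows, used = [], 0
    with open(path, "rb") as handle:
        for line in handle:
            # Stop before the row that would push the sample past the limit.
            if used + len(line) > limit:
                break
            rows.append(line)
            used += len(line)
    return rows
```

When the file is smaller than the limit, the loop reaches the end of the file and the whole dataset is returned, which matches the "entire dataset if it's smaller" rule.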
Additional samples can be generated from the context panel on the right side of the Transformer page. Sample jobs are independent job executions. When a sample job succeeds or fails, a notification is displayed for you.
As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background.
You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore.
NOTE: The Initial Data sample contains raw data from the source. Any generated sample is stored in JSON Lines format with additional metadata on the sample. These different storage formats can result in differences between initial and generated sample sizes.
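As a rough illustration of why the storage format affects sample size, the following sketch serializes the same rows as CSV and as JSON Lines and compares the byte counts. The rows and helper names are invented for illustration, and the sample metadata mentioned in the note is not modeled because its layout is not described here.

```python
import csv
import io
import json

# Hypothetical rows, used only to compare the two encodings.
rows = [
    {"id": "1", "city": "Oslo"},
    {"id": "2", "city": "Lima"},
]


def as_csv(records):
    """Serialize records as CSV with a single header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def as_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


csv_bytes = len(as_csv(rows).encode("utf-8"))
jsonl_bytes = len(as_jsonl(rows).encode("utf-8"))
print(f"CSV: {csv_bytes} bytes, JSON Lines: {jsonl_bytes} bytes")
```

JSON Lines repeats every column name on every row, so the same data generally occupies more bytes than a delimited file that names each column once in the header.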
For more information on creating samples, see Samples Panel.
Important notes on sampling
- Depending on the running environment, sampling jobs may incur costs. These costs may vary between the Photon running environment and your clustered running environments, depending on the type of sample and the cost of job execution.
- When sampling from compressed data, the data is uncompressed and then expanded. As a result, the sample size reflects the uncompressed data.
- Changes to preceding steps that alter the number of rows or columns in your dataset can invalidate the current sample, which means that the sample is no longer a valid representation of the state of the dataset in the recipe. In this case, the application automatically switches you back to the most recently collected sample that is currently valid. Details are below.
- Some advanced sampling options are available only with execution across a scan of the full dataset.
- Undo and redo actions do not change the sample state, even if the sample becomes invalid.
- Samples taken from a dataset with parameters are limited to a maximum of 50 files when executed on the Photon running environment. You can modify parameters as they apply to sampling jobs. See Samples Panel.
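The invalidation and fallback behavior described in the notes above can be pictured with a small model. This is a sketch under stated assumptions, not the product's logic: the `Sample` class, its fields, and both functions are hypothetical, recipe steps are assumed to be numbered from 1, and the initial sample is assumed to be collected at step 0.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    collected_at_step: int  # number of recipe steps applied when the sample was taken
    collected_order: int    # larger means collected more recently
    valid: bool = True


def invalidate_after_edit(samples, edited_step, changes_shape):
    """Mark samples invalid when an edited step at or before their collection
    point changes the number of rows or columns."""
    if not changes_shape:
        return
    for sample in samples:
        if sample.collected_at_step >= edited_step:
            sample.valid = False


def fallback_sample(samples):
    """Return the most recently collected sample that is still valid, if any."""
    valid = [sample for sample in samples if sample.valid]
    return max(valid, key=lambda sample: sample.collected_order) if valid else None


samples = [
    Sample("initial", collected_at_step=0, collected_order=0),
    Sample("random", collected_at_step=3, collected_order=1),
    Sample("filter-based", collected_at_step=5, collected_order=2),
]

# Editing step 4 so that it drops rows invalidates only the sample taken at step 5.
invalidate_after_edit(samples, edited_step=4, changes_shape=True)
print(fallback_sample(samples).name)
```

In this model the initial sample never becomes invalid, because it is collected before any recipe step is applied, so a fallback is always available.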
With each step that is added or modified to your recipe,
|D s product|
For more information on sample types, see Sample Types.