
...

As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background.

You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore. For more information on creating samples, see Samples Panel.

Sampling methods

Depending on the type of sample you select, it may be generated based on one of the following methods, in increasing order of time to create:

  1. on a specified set of rows (firstrows)
  2. on a quick scan across the dataset 

    Tip: Quick scan samples are executed in the Photon running environment.

  3. on a full scan of the entire dataset 

    Tip: Full scan samples are executed in the cluster running environment.

Sampling mechanics

When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:

...

  • When a new sample is generated, any Sort transformations that have been applied previously must be re-applied. Depending on the type of output, sort order may not be preserved.

...

  • Samples taken from a dataset with parameters are limited to a maximum of 50 files when executed on the Photon running environment. You can modify parameters as they apply to sampling jobs. See Samples Panel.

Sample Invalidation

With each step that is added to or modified in your recipe, the product checks whether the current sample is valid. Samples are valid based on the state of your flow and recipe at the step when the sample was collected. If you add steps before the step where the sample was created, the currently active sample can be invalidated. For example, if you change the source of data, then the sample in the Transformer page no longer applies, and a new sample must be displayed.
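The validity rule above can be illustrated with a minimal sketch. It is not the product's implementation: the function name and both parameters are hypothetical, and it simplifies "can be invalidated" to "is always invalidated" when an earlier step changes.

```python
def is_sample_valid(steps_before_sample: int, changed_step: int) -> bool:
    """Return True if a sample survives an edit to the recipe.

    steps_before_sample: number of recipe steps that had been applied
        when the sample was collected (hypothetical bookkeeping value).
    changed_step: 0-based position of the step that was added or modified.

    Assumption: any change before the collection point invalidates the
    sample; changes at or after that point leave it usable.
    """
    return changed_step >= steps_before_sample


# A sample collected after 3 steps is invalidated by an edit to step 1,
# but not by a new step appended at position 3 or later.
print(is_sample_valid(steps_before_sample=3, changed_step=1))
print(is_sample_valid(steps_before_sample=3, changed_step=3))
```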

...

Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate. You can apply quick scan or full scan to determine the scope of the sample.
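As a rough illustration of random row selection with a configurable scope, the sketch below uses reservoir sampling, a common single-pass technique. Whether the product samples this way is not stated here. The `scan_limit` parameter is a hypothetical stand-in for scope: `None` reads every row (comparable to a full scan), while a number stops after that many rows (an assumed simplification of a quick scan).

```python
import itertools
import random


def random_sample(rows, sample_size, scan_limit=None, seed=0):
    """Draw a uniform random sample of rows in one pass (reservoir sampling).

    scan_limit is a hypothetical scope control: None scans all rows,
    an integer scans only that many rows from the start.
    """
    rng = random.Random(seed)
    scoped = rows if scan_limit is None else itertools.islice(rows, scan_limit)
    reservoir = []
    for index, row in enumerate(scoped):
        if index < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = row
    return reservoir


full_scan = random_sample(range(1000), sample_size=10)
quick_scan = random_sample(range(1000), sample_size=10, scan_limit=100)
```

The full-scan call can return any of the 1000 rows, while the limited call only ever returns rows from the first 100, which is the trade-off between coverage and time to create.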

Filter-based samples

Find specific values in one or more columns. For the matching set of values, a random sample is generated.
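The two stages described above, matching first and then sampling, can be sketched as follows. This is an illustration under assumptions, not the product's code: the function name, the dictionary row format, and every parameter are hypothetical.

```python
import random


def filter_based_sample(rows, column, wanted_values, sample_size, seed=0):
    """Keep rows whose value in `column` is in `wanted_values`,
    then draw a random sample from the matching rows."""
    matching = [row for row in rows if row.get(column) in wanted_values]
    if len(matching) <= sample_size:
        # Fewer matches than requested: return all of them.
        return matching
    return random.Random(seed).sample(matching, sample_size)


# Hypothetical data: 100 rows alternating between two state codes.
rows = [{"id": i, "state": "CA" if i % 2 == 0 else "NY"} for i in range(100)]
sample = filter_based_sample(rows, "state", {"CA"}, sample_size=5)
```

Every row in `sample` has the state value `CA`, because the random draw happens only after the filter has been applied.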

...