To prevent overwhelming the client or significantly impacting performance, the product generates one or more samples of the data for display and manipulation in the client application. Since the product supports a variety of clients and use cases, you can change the size of samples, the scope of the sample, and the method by which the sample is created. This section provides background information on how the product manages dataset sampling.

How Sampling Works

NOTE: Generated samples are created by executing jobs on the applicable running environment. Quick Scan samples are executed in Photon. Full Scan samples are generated in the applicable running environment on the cluster. Each running environment has a proprietary method of calculating the available volume of data in memory, which is used for executing the sampling job that is launched in the running environment. As a result, the number of rows returned for the same sample type across different running environments can vary significantly.

Initial Data

When a dataset is first created, a background job begins to generate a sample using the first set of rows of the dataset. This initial data sample is usually very quick to generate, so that you can get to work right away on your transformations.

  • The default sample is the initial sample.
    • By default, each sample is 10 MB in size or the entire dataset if it's smaller.  If the recipe is a child recipe, then the Initial Data sample indicates the selected sample of the parent recipe.
  • If your source of data is a directory containing multiple files, the initial sample for the combined dataset is generated from the first set of rows in the first file listed in the directory.
    • The maximum number of files in a directory that can be read in the initial sample is limited by a parameter for performance reasons.
  • If you are wrangling a dataset with parameters, the initial sample loaded in the Transformer page is taken from the first matching dataset.
    • If the matching file is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
  • By default, each initial sample is either:
    • 10 MB in size
    • Limited by the maximum number of files
    • The entire dataset
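The size rule for the initial sample can be pictured with a minimal sketch. It assumes that rows are read in order and that reading stops once the byte limit would be exceeded; the function name `initial_sample` and the interpretation of 10 MB as 10 * 2**20 bytes are illustrative assumptions, not the product's implementation.

```python
from typing import Iterable, List

# Assumption: "10 MB" is treated as 10 * 2**20 bytes for this illustration.
DEFAULT_LIMIT_BYTES = 10 * 1024 * 1024


def initial_sample(rows: Iterable[str], limit_bytes: int = DEFAULT_LIMIT_BYTES) -> List[str]:
    """Collect leading rows until adding another row would exceed limit_bytes."""
    sample: List[str] = []
    used = 0
    for row in rows:
        size = len(row.encode("utf-8"))
        if used + size > limit_bytes:
            break  # limit reached: the sample is the first set of rows only
        sample.append(row)
        used += size
    return sample


rows = ["id,name", "1,alpha", "2,beta", "3,gamma"]
small = initial_sample(rows, limit_bytes=16)  # limit smaller than the dataset
whole = initial_sample(rows)                  # dataset smaller than the limit
```

With a limit below the dataset size, only the leading rows are kept; when the dataset fits within the limit, the sample is the entire dataset.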

    Generating samples

    Additional samples can be generated from the context panel on the right side of the Transformer page. Sample jobs are independent job executions. When a sample job succeeds or fails, a notification is displayed for you.

    As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background.

    You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore.

    ...

    For more information on creating samples, see Samples Panel.

    ...

    1. on a specified set of rows (firstrows)
    2. on a quick scan across the dataset

      Tip: By default, Quick Scan samples are executed in the Photon running environment. If Photon is not available or is disabled, the web application attempts to execute the Quick Scan sample on an available clustered running environment. If the clustered running environment is not available or doesn't support Quick Scan sampling, then the Quick Scan sample job fails.

    3. on a full scan of the entire dataset

      Tip: Full Scan samples are executed in the cluster running environment.
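The fallback order described for Quick Scan samples can be summarized as a small decision function. This is a sketch only: the function name, the boolean inputs, and the returned labels are hypothetical and do not correspond to any product API.

```python
def choose_quick_scan_environment(photon_available: bool,
                                  cluster_available: bool,
                                  cluster_supports_quick_scan: bool) -> str:
    """Return the environment a Quick Scan sample would run in, or raise on failure."""
    if photon_available:
        return "photon"  # default choice for Quick Scan samples
    if cluster_available and cluster_supports_quick_scan:
        return "cluster"  # fallback when Photon is unavailable or disabled
    # Neither environment can run the job, so the sample job fails.
    raise RuntimeError("Quick Scan sample job failed: no suitable running environment")


preferred = choose_quick_scan_environment(True, True, True)
fallback = choose_quick_scan_environment(False, True, True)
```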



    Sampling mechanics

    When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:

    ...

    NOTE: When a flow is shared, its samples are shared with other users. However, if those users do not have access to the underlying files that back a sample, they do not have access to the sample and must create their own.

    Sample storage

    When a sample is generated, it is stored in the default storage layer in the jobrun directory that is created for the user who initiated the sample. For more information, see Overview of Storage.

    Changing sample sizes

    If needed, you can change the size of samples that are loaded into the browser for your current recipe. You may need to reduce these sizes if you are experiencing performance problems or memory issues in the browser. For more information, see Change Recipe Sample Size.

    Important notes on sampling


    • A new sampling job is executed in Dataflow, which may incur costs.
    • Depending on the running environment, sampling jobs may incur costs. These costs may vary between Photon and your clustered running environments, depending on the type of sample and the cost of job execution.
    • If the source file is in Avro format, the Dataflow job samples from the entire file. As a result, additional processing costs may be incurred. This is a known issue.


    • When sampling from compressed data, the data is uncompressed and then expanded. As a result, the sample size reflects the uncompressed data.
    • Changes to preceding steps that alter the number of rows or columns in your dataset can invalidate the current sample, which means that the sample is no longer a valid representation of the state of the dataset in the recipe. In this case, the product automatically switches you back to the most recently collected sample that is currently valid. Details are below.

    ...

    NOTE: The product does not delete samples after they have been created. If you are concerned about data accumulation, you should configure periodic purges of the appropriate directories on the base storage layer. For more information, please contact your IT administrator.

    ...


    Choosing Samples

    After you have collected multiple samples of multiple types on your dataset, you can choose the proper sample to use for your current task, based on:

    ...

    • Some advanced sampling options are available only with execution across a scan of the full dataset.
    • Undo/redo do not change the sample state, even if the sample becomes invalid. 
    • Samples taken from a dataset with parameters are limited to a maximum of 50 files when executed on the Photon running environment. You can modify parameters as they apply to sampling jobs. See Samples Panel.
    • When a new sample is generated, any Sort transformations that have been applied previously must be re-applied. Depending on the type of output, sort order may not be preserved. See Sort Rows.

    Sample Invalidation

    With each step that is added to or modified in your recipe, the product checks to see if the current sample is valid. Samples are valid based on the state of your flow and recipe at the step when the sample was collected. If you add steps before the step where it was created, the currently active sample can be invalidated. For example, if you change the source of data, then the sample in the Transformer page no longer applies, and a new sample must be displayed.
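One way to reason about invalidation is to track how many recipe steps had been applied when each sample was collected. The sketch below makes a simplifying assumption that any edit at a given step invalidates every sample collected after that step, whereas the product may only invalidate on changes that alter rows or columns. The `Sample` class and `most_recent_valid_sample` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    name: str
    steps_applied: int    # number of recipe steps applied when the sample was collected
    collected_order: int  # larger means collected more recently


def most_recent_valid_sample(samples: List[Sample], edited_step: int) -> Optional[Sample]:
    """Pick the newest sample that an edit at edited_step (0-based) leaves valid.

    Assumption: a sample stays valid only if it was collected at or before
    the edited step, that is, steps_applied <= edited_step.
    """
    valid = [s for s in samples if s.steps_applied <= edited_step]
    return max(valid, key=lambda s: s.collected_order, default=None)


samples = [
    Sample("initial", steps_applied=0, collected_order=0),
    Sample("random", steps_applied=3, collected_order=1),
    Sample("anomaly", steps_applied=5, collected_order=2),
]
after_late_edit = most_recent_valid_sample(samples, edited_step=4)
after_early_edit = most_recent_valid_sample(samples, edited_step=1)
```

In this sketch, editing the fifth step (index 4) falls back to the "random" sample, while editing the second step (index 1) falls back to the initial sample.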

    ...

    Tip: You can annotate your recipe with comments, such as: sample: random and then create a new sample at that location.

    Sample Types

    The product currently supports the following sampling methods.

    Random samples

    Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate. You can apply quick scan or full scan to determine the scope of the sample.
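For intuition, a uniform random subset of rows can be drawn in a single pass with reservoir sampling. This is a generic technique shown as an illustration; it is not stated to be the method the product uses, and the function name and parameters are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random in one pass over rows."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)   # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = row  # replace an existing entry with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(1000), k=10, seed=42)
tiny = reservoir_sample(range(5), k=10)  # fewer rows than k: everything is kept
```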

    Filter-based samples

    Find specific values in one or more columns. For the matching set of values, a random sample is generated. For more information on sample types, see Sample Types.

    You must define your filter in the Filter textbox.
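Conceptually, a filter-based sample keeps only the rows that match the filter and then draws a random sample from those matches. The sketch below expresses the filter as a Python predicate; the function name, the row representation, and the `max_rows` parameter are illustrative assumptions rather than the product's filter syntax.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, str]


def filter_based_sample(rows: List[Row],
                        predicate: Callable[[Row], bool],
                        max_rows: int,
                        seed: int = 0) -> List[Row]:
    """Keep rows matching predicate, then randomly sample at most max_rows of them."""
    matches = [row for row in rows if predicate(row)]
    if len(matches) <= max_rows:
        return matches
    return random.Random(seed).sample(matches, max_rows)


rows = [{"id": str(i), "state": "CA" if i % 2 == 0 else "NY"} for i in range(20)]
ca_sample = filter_based_sample(rows, lambda r: r["state"] == "CA", max_rows=4)
```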

    Anomaly-based samples

    Find mismatched or missing data or both in one or more columns.

    You specify one or more columns and whether the anomaly is:

    1. mismatched
    2. missing
    3. either of the above

    Optionally, you can define an additional filter on other columns.
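The three anomaly choices can be illustrated with a small sketch. It assumes that "missing" means null or empty and that "mismatched" means a non-empty value that does not parse as the column's expected type (an integer here); these definitions and all function names are assumptions for illustration only.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def is_mismatched(value: Optional[str]) -> bool:
    """Assumption: the column is expected to hold integers."""
    if is_missing(value):
        return False
    try:
        int(value)
        return False
    except ValueError:
        return True


def anomaly_rows(rows: List[Row], column: str, kind: str) -> List[Row]:
    """Select rows whose value in column is 'mismatched', 'missing', or 'either'."""
    checks = {
        "mismatched": is_mismatched,
        "missing": is_missing,
        "either": lambda v: is_missing(v) or is_mismatched(v),
    }
    check = checks[kind]
    return [row for row in rows if check(row.get(column))]


rows = [{"age": "31"}, {"age": ""}, {"age": "n/a"}, {"age": None}]
missing = anomaly_rows(rows, "age", "missing")
mismatched = anomaly_rows(rows, "age", "mismatched")
either = anomaly_rows(rows, "age", "either")
```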

    Stratified samples

    Find all unique values within a column and create a sample that contains the unique values, up to the sample size limit. The distribution of the column values in the sample reflects the distribution of the column values in the dataset. Sampled values are sorted by frequency, relative to the specified column.

    Optionally, you can apply a filter to this sample type.

    Tip: Collecting samples containing all unique values can be useful if you are performing mapping transformations, such as values to columns. If your mapping contains too many unique values among your key-value pairs, you can try to delete all columns except the one containing key-value pairs in a step, collect the sample, add the mapping step, and then delete the step where all other columns are removed.
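A stratified sample of this kind can be sketched with proportional allocation: each unique value receives a share of the sample roughly equal to its share of the dataset, with at least one row per value while room remains, and values are emitted from most to least frequent. The allocation rule and all names below are assumptions for illustration, not the product's algorithm.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, str]


def stratified_sample(rows: List[Row], column: str, max_rows: int, seed: int = 0) -> List[Row]:
    """Sample rows so each value of column keeps roughly its share of the dataset."""
    rng = random.Random(seed)
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    total = len(rows)
    # Most frequent values first, so output is ordered by frequency.
    ordered = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    sample: List[Row] = []
    for _value, members in ordered:
        if len(sample) >= max_rows:
            break  # sample size limit reached; remaining values are left out
        quota = max(1, round(max_rows * len(members) / total))
        quota = min(quota, len(members), max_rows - len(sample))
        sample.extend(rng.sample(members, quota))
    return sample


rows = [{"k": "a"}] * 6 + [{"k": "b"}] * 3 + [{"k": "c"}]
full = stratified_sample(rows, "k", max_rows=10)
counts = Counter(row["k"] for row in full)
```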

    Cluster-based samples

    Cluster sampling collects contiguous rows in the dataset that correspond to a random selection from the unique values in a column. All rows corresponding to the selected unique values appear in the sample, up to the maximum sample size. This sampling is useful for time-series analysis and advanced aggregations.

    Optionally, you can apply an advanced filter to the column.
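Cluster sampling as described above can be sketched as follows: choose a random subset of the unique values in a column, then keep every row that carries one of those values, in dataset order, up to the maximum sample size. The function name and the `n_values` parameter (how many unique values to select) are hypothetical.

```python
import random
from typing import Dict, List

Row = Dict[str, str]


def cluster_sample(rows: List[Row], column: str, n_values: int,
                   max_rows: int, seed: int = 0) -> List[Row]:
    """Keep all rows for a random selection of unique values, capped at max_rows."""
    rng = random.Random(seed)
    unique_values = sorted({row[column] for row in rows})
    chosen = set(rng.sample(unique_values, min(n_values, len(unique_values))))
    # Preserve dataset order so rows for each selected value stay together as they appear.
    selected = [row for row in rows if row[column] in chosen]
    return selected[:max_rows]


rows = [{"device": f"d{d}", "t": str(t)} for d in range(1, 5) for t in range(3)]
sample = cluster_sample(rows, "device", n_values=2, max_rows=100)
devices = {row["device"] for row in sample}
```

Because every row for a selected value is kept, per-value sequences such as time series remain complete within the sample, unless the size cap truncates them.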