To prevent overwhelming the client or significantly impacting performance, the application works with a sample of your data rather than the full dataset.
How Sampling Works
When a dataset is first created, a background job begins to generate a sample using the first set of rows of the dataset. This initial data sample is usually very quick to generate, so that you can get to work right away on your transformations.
Additional samples can be generated:
- on a specified set of rows (firstrows)
- on a quick scan across the dataset
  - By default, Quick Scan samples are executed in the Photon running environment. If Photon is not available or is disabled, the application attempts to execute the Quick Scan sample on an available clustered running environment.
  - If the clustered running environment is not available or doesn't support Quick Scan sampling, then the Quick Scan sample job fails.
- on a full scan of the entire dataset
  - Full Scan samples are executed in the clustered running environment.
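The fallback order for Quick Scan samples described above can be sketched as a small decision function. This is only an illustration: the function name, its parameters, and the labels "default" and "cluster" are hypothetical and are not part of the product.

```python
from typing import Optional


def pick_quick_scan_environment(
    default_env_available: bool,
    cluster_available: bool,
    cluster_supports_quick_scan: bool,
) -> Optional[str]:
    """Return the environment a Quick Scan sample would run in.

    Returns None when the sample job would fail. All names here are
    hypothetical labels used for illustration only.
    """
    # Quick Scan samples run in the default environment when it is usable.
    if default_env_available:
        return "default"
    # Otherwise, fall back to a clustered running environment, if one
    # exists and supports Quick Scan sampling.
    if cluster_available and cluster_supports_quick_scan:
        return "cluster"
    # No usable environment: the Quick Scan sample job fails.
    return None
```

For example, `pick_quick_scan_environment(False, True, False)` returns `None`, which corresponds to the failure case in the last bullet.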
When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:
NOTE: When a flow is shared, its samples are shared with other users. However, if those users do not have access to the underlying files that back a sample, they do not have access to the sample and must create their own.
Changing sample sizes
If needed, you can change the size of samples that are loaded into the browser for your current recipe. You may need to reduce these sizes if you are experiencing performance problems or memory issues in the browser. For more information, see Change Recipe Sample Size.
Important notes on sampling
- Sampling jobs may incur costs. These costs may vary between the Photon running environment and your clustered running environments, depending on the type of sample and the cost of job execution.
- When sampling from compressed data, the data is uncompressed and then expanded. As a result, the sample size reflects the uncompressed data.
- Changes to preceding steps that alter the number of rows or columns in your dataset can invalidate the current sample, which means that the sample is no longer a valid representation of the state of the dataset in the recipe. In this case, the application automatically switches you back to the most recently collected sample that is currently valid. Details are below.
For more information, see Sample Jobs Page.
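The fallback to the most recently collected valid sample can be pictured with the following minimal sketch. It assumes a simplified model that is not taken from the product: each sample records how many recipe steps had been applied when it was collected, and a row- or column-altering change at an earlier step invalidates it. The `Sample` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    name: str
    collected_at: int     # hypothetical timestamp; larger means more recent
    steps_applied: int    # hypothetical: recipe steps applied when collected


def is_valid(sample: Sample, changed_step: int) -> bool:
    # Assumption: changing step index `changed_step` (0-based) invalidates
    # every sample that was collected after that step had been applied.
    return changed_step >= sample.steps_applied


def fallback_sample(samples: List[Sample], changed_step: int) -> Optional[Sample]:
    """Pick the most recently collected sample that is still valid."""
    valid = [s for s in samples if is_valid(s, changed_step)]
    return max(valid, key=lambda s: s.collected_at, default=None)
```

In this model, the initial sample has `steps_applied=0`, so it always remains available as a last resort.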
Cancel Sample Jobs
Generating a sample can consume significant time and system resources and, in some deployments, can incur costs. As needed, you can cancel a sample job that is in progress in either of the following ways:
- Locate the in-progress sampling job in the Samples panel. Click X.
- Click the Jobs icon in the left nav bar. Select Sample jobs. For more information, see Sample Jobs Page.
After you have collected multiple samples of multiple types on your dataset, you can choose the proper sample to use for your current task, based on:
Tip: You can annotate your recipe with comments, such as:
First rows samples
This sample is taken from the first set of rows in the transformed dataset based on the current cursor location in the recipe. The first N rows in the dataset are collected based on the recipe steps up to the configured sample size.
- This sample may span multiple datasets and files, depending on how the recipe is constructed.
- The first rows sample is different from the initial sample, which is gathered without reference to any recipe steps.
These samples are fast to generate and may load faster in the application than samples of other types.
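As a rough illustration of a first rows sample, the sketch below applies recipe steps lazily and keeps only the first N transformed rows. It makes simplifying assumptions: recipe steps are modeled as Python functions over row iterators, and the sample size is expressed as a row count. The function and step names are hypothetical.

```python
from itertools import islice


def first_rows_sample(rows, recipe_steps, sample_size):
    """Apply recipe steps lazily, then keep the first `sample_size` rows."""
    stream = iter(rows)
    for step in recipe_steps:
        # Each step wraps the previous stream; nothing is computed yet.
        stream = step(stream)
    # Pull only as many transformed rows as the sample needs.
    return list(islice(stream, sample_size))


def drop_empty_names(rows):
    # Hypothetical recipe step: remove rows whose "name" is empty.
    return (row for row in rows if row["name"])


def upper_names(rows):
    # Hypothetical recipe step: upper-case the "name" column.
    return ({**row, "name": row["name"].upper()} for row in rows)
```

Because the steps are evaluated lazily, reading stops as soon as enough rows have been collected, which is one reason this kind of sample can be quick to generate.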
Tip: If you have chained together multiple recipes, all steps in all linked recipes must be run to provide visual updates. If you are experiencing performance problems related to this kind of updating, you can select a recipe in the middle of the chain of recipes and switch it from the initial sample to a different sample. When the new sample is in use, the recipes from the preceding datasets do not need to be executed, which can improve performance.
Random samples
Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate. You can apply quick scan or full scan to determine the scope of the sample.
Filter-based samples
Find specific values in one or more columns. For the matching set of values, a random sample is generated.
You must define your filter in the Filter textbox.
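The filter-then-sample behavior can be sketched as follows. This is a minimal illustration, assuming rows are dictionaries, the filter is a Python predicate, and the sample size is a row count; the function name and the `seed` parameter are hypothetical.

```python
import random


def filter_based_sample(rows, predicate, sample_size, seed=None):
    """Keep rows that match the filter, then sample randomly among them."""
    matching = [row for row in rows if predicate(row)]
    # If the matches already fit in the sample, return all of them.
    if len(matching) <= sample_size:
        return matching
    # Otherwise draw a random subset without replacement.
    return random.Random(seed).sample(matching, sample_size)
```

For example, `filter_based_sample(rows, lambda r: r["state"] == "CA", 5)` returns at most five rows, all of which have `state` equal to `"CA"`.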
Anomaly-based samples
Find mismatched or missing data or both in one or more columns.
You specify one or more columns and whether the anomaly is:
- mismatched
- missing
- either of the above
Optionally, you can define an additional filter on other columns.
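The anomaly selection can be illustrated with a short sketch. It rests on assumptions that are not from the product: a value is treated as missing when it is `None` or an empty string, and as mismatched when it cannot be converted to the column's expected Python type. All names are hypothetical.

```python
def is_missing(value):
    return value is None or value == ""


def is_mismatched(value, expected_type):
    # Missing values are counted separately, not as mismatches.
    if is_missing(value):
        return False
    try:
        expected_type(value)
    except (TypeError, ValueError):
        return True
    return False


def anomaly_sample(rows, expected_types, kind, sample_size):
    """Collect rows with anomalies in the given columns.

    expected_types maps column name -> type, e.g. {"age": int}.
    kind is "missing", "mismatched", or "either".
    """
    if kind not in ("missing", "mismatched", "either"):
        raise ValueError(f"unknown anomaly kind: {kind!r}")

    def has_anomaly(row):
        for column, expected_type in expected_types.items():
            value = row.get(column)
            missing = is_missing(value)
            mismatched = is_mismatched(value, expected_type)
            if kind in ("missing", "either") and missing:
                return True
            if kind in ("mismatched", "either") and mismatched:
                return True
        return False

    # Keep matching rows in dataset order, up to the sample size.
    return [row for row in rows if has_anomaly(row)][:sample_size]
```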
Stratified samples
Find all unique values within a column and create a sample that contains the unique values, up to the sample size limit. The distribution of the column values in the sample reflects the distribution of the column values in the dataset. Sampled values are sorted by frequency, relative to the specified column.
Optionally, you can apply a filter to this sample.
Tip: Collecting samples containing all unique values can be useful if you are performing mapping transformations, such as values to columns. If your mapping contains too many unique values among your key-value pairs, you can try to delete all columns except the one containing key-value pairs in a step, collect the sample, add the mapping step, and then delete the step where all other columns are removed.
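One way to picture a sample that covers unique values while roughly preserving their distribution is proportional allocation with a minimum of one row per value. The sketch below is an assumption about how such a sample could be built, not a description of the product's algorithm; the function name is hypothetical and the sample size is a row count.

```python
from collections import defaultdict


def stratified_sample(rows, column, sample_size):
    """Sample rows so each unique value appears, most frequent first."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    total = sum(len(group) for group in groups.values())

    sample = []
    # Visit values from most to least frequent.
    by_frequency = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    for _value, group in by_frequency:
        remaining = sample_size - len(sample)
        if remaining <= 0:
            break  # sample size limit reached
        # Proportional share of the sample, but at least one row per value.
        quota = max(1, (len(group) * sample_size) // total)
        sample.extend(group[:min(quota, remaining)])
    return sample
```

With six rows of value `a`, three of `b`, and one of `c`, a sample size of 5 yields three `a` rows, one `b` row, and one `c` row.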
Cluster-based samples
Cluster sampling collects contiguous rows in the dataset that correspond to a random selection from the unique values in a column. All rows corresponding to the selected unique values appear in the sample, up to the maximum sample size. This sampling is useful for time-series analysis and advanced aggregations.
Optionally, you can apply an advanced filter to the column.
For more information on sample types, see Sample Types.
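As a minimal sketch of the cluster sampling idea, the function below picks unique values of a column at random and keeps every row for the picked values, in dataset order, until the sample size is reached. The function name, the `seed` parameter, and the row-count sample size are assumptions made for illustration.

```python
import random
from collections import Counter


def cluster_sample(rows, column, sample_size, seed=None):
    """Keep all rows for randomly chosen values of `column`."""
    rng = random.Random(seed)
    counts = Counter(row[column] for row in rows)
    values = list(counts)
    rng.shuffle(values)

    chosen, covered = set(), 0
    for value in values:
        if covered >= sample_size:
            break  # enough rows are already covered
        chosen.add(value)
        covered += counts[value]

    # Preserve dataset order and cap at the maximum sample size.
    return [row for row in rows if row[column] in chosen][:sample_size]
```

Because whole groups are kept together, every row for a chosen value (for example, all events for one device ID) is available for aggregation, except where the sample size cap truncates the last group.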