To avoid overwhelming the client or significantly impacting performance, the product generates one or more samples of the data for display and manipulation in the client application. Since the product supports a variety of clients and use cases, you can change the size of samples, the scope of the sample, and the method by which the sample is created. This section provides background information on how the product manages dataset sampling.
When a dataset is first created, a background job begins to generate a sample using the first set of rows of the dataset. This initial data sample is usually very quick to generate, so that you can get to work right away on your transformations.
If you are wrangling a dataset with parameters, the initial sample loaded in the Transformer page is taken from the first matching dataset. If the matching file is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
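A minimal sketch of a first-rows sample, assuming the source is a CSV stream. The function name `first_rows_sample` and the `max_rows` parameter are illustrative and are not part of the product. The point is that only the head of the file is read, which is why this kind of sample is quick to generate.

```python
import csv
import io
from itertools import islice


def first_rows_sample(lines, max_rows):
    """Return the header and up to max_rows data rows, reading no further."""
    reader = csv.reader(lines)
    header = next(reader, None)  # None if the source is empty
    rows = list(islice(reader, max_rows))  # stops after max_rows rows
    return header, rows


# Hypothetical four-row source; only the first two data rows are read.
source = io.StringIO("id,city\n1,Oslo\n2,Lima\n3,Pune\n4,Kyiv\n")
header, rows = first_rows_sample(source, max_rows=2)
print(header, rows)
```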
Additional samples can be generated from the context panel on the right side of the Transformer page. Sample jobs are independent job executions. When a sample job succeeds or fails, a notification is displayed for you.
As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background. You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore. For more information on creating samples, see Samples Panel.
Depending on the type of sample you select, it may be generated based on one of the following methods, in increasing order of time to create:
on a quick scan across the dataset
Tip: Quick scan samples are executed in the running environment.
on a full scan of the entire dataset
Tip: Full scan samples are executed in the cluster running environment.
When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:
NOTE: When a sample is executed from the Samples panel, it is launched based on the steps leading up to the current location in the recipe. For example, if your recipe includes joining in other datasets, those steps are executed, and the sample is generated with dependencies on these other datasets. As a result, if you change recipe steps that occur before the step where the sample was generated, you can invalidate your sample. More information is available below.
When your flow contains multiple datasets and flows, all of the preceding steps leading up to the currently selected step of the recipe are executed, which can mean:
NOTE: When a flow is shared, its samples are shared with other users. However, if those users do not have access to the underlying files that back a sample, they do not have access to the sample and must create their own.
When a sample is generated, it is stored in the default storage layer in the jobrun directory that is created for the user who initiated the sample. For more information, see Overview of Storage.
Any parameters that are associated with your dataset can be applied to sampling:
Variables: You can apply override values to the defaults for your dataset's variables at sample execution time. In this manner, you can draw your samples from specific source files within your dataset with parameters.
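The override behavior can be pictured as a merge of default values with values supplied at sample execution time. The sketch below is illustrative only: the `$name` path syntax, the `resolve_source` function, and the variable names are assumptions and do not reflect the product's actual parameter syntax.

```python
from string import Template


def resolve_source(path_template, defaults, overrides=None):
    """Fill a parameterized path, letting overrides win over defaults."""
    values = {**defaults, **(overrides or {})}
    return Template(path_template).substitute(values)


# Hypothetical dataset with two variables and their default values.
template = "sales/$region/$year.csv"
defaults = {"region": "emea", "year": "2023"}

default_path = resolve_source(template, defaults)
override_path = resolve_source(template, defaults, {"year": "2024"})
print(default_path, override_path)
```

With an override in place, the sample would be drawn from the file that the overridden path resolves to rather than from the default source.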
After you have created a sample, you cannot delete it through the application.
NOTE: The product does not delete samples after they have been created. If you are concerned about data accumulation, you should configure periodic purges of the appropriate directories on the base storage layer. For more information, please contact your IT administrator.
After you have collected multiple samples of multiple types on your dataset, you can choose the proper sample to use for your current task, based on:
Tip: You can begin work on an outdated yet still valid sample while you generate a new one based on the current recipe.
With each step that you add to or modify in your recipe, the product checks whether the current sample is still valid. Samples are valid based on the state of your flow and recipe at the step when the sample was collected. If you add steps before the step where the sample was created, the currently active sample can be invalidated. For example, if you change the source of data, then the sample in the Transformer page no longer applies, and a new sample must be displayed.
Tip: After you have completed a step that significantly changes the number of rows, columns, or both in your dataset, you may need to generate a new sample, factoring in any costs associated with running the job. Performance costs may be displayed in the Transformer page.
NOTE: If you modify a SQL statement for an imported dataset, any samples based on the old SQL statement are invalidated.
You can generate a new sample of the same type through the Samples panel. If no sample is valid, you must generate a new sample before you can open the dataset.
A sample that is invalidated is listed under the Unavailable tab. It cannot be selected for use. If subsequent steps make it valid again, it re-appears in the Available tab.
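One way to reason about sample validity is to treat a sample as tied to the recipe steps that precede the step where it was collected. The sketch below models this with a fingerprint of that prefix. It is a conceptual model under that assumption, not the product's actual mechanism; the step strings and function names are hypothetical.

```python
import hashlib


def prefix_fingerprint(steps, upto):
    """Hash the first `upto` recipe steps (steps are plain strings here)."""
    digest = hashlib.sha256()
    for step in steps[:upto]:
        digest.update(step.encode("utf-8"))
        digest.update(b"\x00")  # separator so step boundaries matter
    return digest.hexdigest()


def take_sample(steps, at_step):
    """Record where the sample was taken and what preceded it."""
    return {"at_step": at_step, "fingerprint": prefix_fingerprint(steps, at_step)}


def is_valid(sample, steps):
    """A sample stays valid while the steps before it are unchanged."""
    if sample["at_step"] > len(steps):
        return False
    return prefix_fingerprint(steps, sample["at_step"]) == sample["fingerprint"]


recipe = ["import orders", "join customers", "drop column tmp"]
sample = take_sample(recipe, at_step=2)

appended = recipe + ["derive total"]        # change after the sample step
changed = ["import orders_v2"] + recipe[1:]  # change before the sample step
print(is_valid(sample, appended), is_valid(sample, changed))
```

In this model, appending steps after the sample leaves it usable, changing an earlier step invalidates it, and reverting that change makes it valid again, which mirrors a sample moving between the Unavailable and Available tabs.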
All steps between the step in your current sample and the currently displayed step must be computed in the browser. As you build more complex recipes, it's a good idea to create samples at various steps in your recipe, particularly after you have executed a complex step. This type of sample checkpointing can improve overall performance.
For example, as soon as you load a new recipe, you should take a sample, which can speed up the process of loading.
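The benefit of checkpointing can be sketched as a count of the steps that remain to be computed in the browser. The example assumes that only samples collected at or before the currently displayed step are usable; the function names and step numbers are hypothetical.

```python
def pick_checkpoint(sample_steps, current_step):
    """Return the latest sample step at or before current_step, or None."""
    candidates = [step for step in sample_steps if step <= current_step]
    return max(candidates) if candidates else None


def steps_in_browser(sample_steps, current_step):
    """Count the steps between the chosen sample and the displayed step."""
    checkpoint = pick_checkpoint(sample_steps, current_step)
    if checkpoint is None:
        raise ValueError("no usable sample; generate a new one first")
    return current_step - checkpoint


# Initial sample only (step 0) versus checkpoints after steps 12 and 30.
without_checkpoints = steps_in_browser([0], current_step=34)
with_checkpoints = steps_in_browser([0, 12, 30], current_step=34)
print(without_checkpoints, with_checkpoints)
```

Under these assumptions, the more recent the checkpoint, the fewer steps have to be replayed each time the data grid is refreshed.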
Tip: You can annotate your recipe with comments, such as:
The product currently supports the following sampling methods.
Random selection of a subset of rows in the dataset. These samples are comparatively fast to generate. You can apply quick scan or full scan to determine the scope of the sample.
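As an illustration of random row selection, the sketch below uses reservoir sampling, a standard technique for drawing a fixed-size uniform sample in a single pass. This is not necessarily how the product implements random samples; the function name and parameters are assumptions.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Draw up to k rows uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < k:
            reservoir.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = row  # replace with probability k / (index + 1)
    return reservoir


picked = reservoir_sample(range(1000), k=10, seed=1)
print(len(picked))
```

Because the function consumes any iterable, passing it only part of a dataset would narrow the scope of the sample, which is loosely analogous to choosing a narrower scan.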
Find specific values in one or more columns. For the matching set of values, a random sample is generated.
You must define your filter in the Filter textbox.
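A filter-based sample can be sketched as two stages: keep the rows that match the filter, then draw a random sample from the matches. The predicate below stands in for the expression entered in the Filter textbox; its form and the function name are assumptions for illustration.

```python
import random


def filter_based_sample(rows, predicate, k, seed=None):
    """Keep rows matching predicate, then sample up to k of them."""
    matches = [row for row in rows if predicate(row)]
    if len(matches) <= k:
        return matches  # fewer matches than the limit: keep them all
    return random.Random(seed).sample(matches, k)


# Hypothetical rows; the filter keeps only orders from Peru.
orders = [{"id": n, "country": "PE" if n % 4 == 0 else "NO"} for n in range(40)]
peru = filter_based_sample(orders, lambda row: row["country"] == "PE", k=5, seed=7)
print(len(peru))
```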
Find mismatched or missing data or both in one or more columns.
You specify one or more columns and whether the anomaly is mismatched values, missing values, or both.
Optionally, you can define an additional filter on other columns.
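The sketch below shows one way to think about anomaly-based selection on a single column. It treats a value as missing when it is empty and as mismatched when it cannot be parsed as the expected type. These definitions, the `kind` values, and the function names are assumptions, not the product's validation rules.

```python
def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def is_mismatched(value, parse):
    """A present value is mismatched if the expected-type parser rejects it."""
    if is_missing(value):
        return False
    try:
        parse(value)
    except (TypeError, ValueError):
        return True
    return False


def anomaly_sample(rows, column, parse, kind="both", limit=100):
    """Collect up to limit rows whose column value is anomalous."""
    selected = []
    for row in rows:
        value = row.get(column)
        missing = is_missing(value)
        mismatched = is_mismatched(value, parse)
        wanted = {"missing": missing, "mismatched": mismatched,
                  "both": missing or mismatched}[kind]
        if wanted:
            selected.append(row)
            if len(selected) == limit:
                break
    return selected


people = [{"age": "31"}, {"age": ""}, {"age": "n/a"}, {"age": "40"}]
print(anomaly_sample(people, "age", int, kind="both"))
```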
Find all unique values within a column and create a sample that contains the unique values, up to the sample size limit. The distribution of the column values in the sample reflects the distribution of the column values in the dataset. Sampled values are sorted by frequency, relative to the specified column.
Optionally, you can apply a filter to this type of sample.
Tip: Collecting samples containing all unique values can be useful if you are performing mapping transformations, such as values to columns. If your mapping contains too many unique values among your key-value pairs, you can try to delete all columns except the one containing key-value pairs in a step, collect the sample, add the mapping step, and then delete the step where all other columns are removed.
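A sample that covers every unique value while roughly following the column's distribution can be sketched as follows. Each unique value first receives one row, and the remaining capacity is shared in proportion to frequency. The allocation rule and the function name are assumptions chosen for illustration, not the product's algorithm.

```python
from collections import defaultdict


def unique_value_sample(rows, column, limit):
    """Cover unique values of column, most frequent first, within limit."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)

    # Most frequent values first; drop the rarest if they cannot all fit.
    ordered = sorted(groups.values(), key=len, reverse=True)[:limit]
    total = sum(len(members) for members in ordered)
    spare = limit - len(ordered)  # capacity left after one row per value

    sample = []
    for members in ordered:
        extra = spare * len(members) // total  # proportional share, rounded down
        sample.extend(members[: 1 + extra])
    return sample


# Hypothetical column with values a (6 rows), b (3 rows), c (1 row).
data = [{"k": "a"}] * 6 + [{"k": "b"}] * 3 + [{"k": "c"}]
chosen = unique_value_sample(data, "k", limit=5)
print([row["k"] for row in chosen])
```

Rounding the proportional shares down keeps the result within the limit, at the cost of sometimes returning slightly fewer rows than the limit allows.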
Cluster sampling collects contiguous rows in the dataset that correspond to a random selection from the unique values in a column. All rows corresponding to the selected unique values appear in the sample, up to the maximum sample size. This sampling is useful for time-series analysis and advanced aggregations.
Optionally, you can apply an advanced filter to the column.
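Cluster sampling as described above can be sketched by picking unique values in random order and keeping every row that belongs to a picked value. The sketch assumes that a cluster is included only if it fits entirely within the remaining capacity; that rule, the function name, and the sample data are illustrative assumptions.

```python
import random
from collections import Counter


def cluster_sample(rows, column, max_rows, seed=None):
    """Keep all rows for randomly chosen values of column, within max_rows."""
    counts = Counter(row[column] for row in rows)
    values = list(counts)  # unique values in first-seen order
    random.Random(seed).shuffle(values)

    chosen = set()
    budget = max_rows
    for value in values:
        if counts[value] <= budget:  # include only clusters that fit whole
            chosen.add(value)
            budget -= counts[value]

    # Preserve dataset order so rows of a cluster stay in sequence.
    return [row for row in rows if row[column] in chosen]


# Hypothetical time series: three days with four readings each.
readings = [{"day": day, "hour": hour} for day in ("mon", "tue", "wed")
            for hour in range(4)]
clusters = cluster_sample(readings, "day", max_rows=8, seed=3)
print(sorted({row["day"] for row in clusters}))
```

Keeping whole clusters is what makes this kind of sample suitable for aggregations and time-series work, since each selected group is complete.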