How Sampling Works
NOTE: Generated samples are created by executing jobs in the applicable running environment. Quick Scan samples are executed in Trifacta Photon. Full Scan samples are generated in the applicable running environment on the cluster. Each running environment has its own method of calculating the volume of data that can be held in memory for the sampling job that is launched in it. As a result, the number of rows returned for the same sample type can vary significantly across running environments.
When a dataset is first created, a background job begins to generate a sample using the first set of rows of the dataset. This initial data sample is usually very quick to generate, so that you can get to work right away on your transformations.
- The default sample is the initial sample.
- If the recipe is a child recipe, then the Initial Data sample indicates the selected sample of the parent recipe.
- If your source of data is a directory containing multiple files, the initial sample for the combined dataset is generated from the first set of rows in the first file listed in the directory.
For performance reasons, the maximum number of files in a directory that can be read for the initial sample is limited by a configuration parameter.
- For more information, see Dataprep Project Settings Page.
If you are wrangling a dataset with parameters, the initial sample loaded in the Transformer page is taken from the first matching dataset.
If the matching file is a multi-sheet Excel file, the sample is taken from the first sheet in the file.
- By default, each initial sample is one of the following:
- 10 MB in size
- Limited by the maximum number of files
- The entire dataset
- To change the sample size, see Change Recipe Sample Size.
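To make the size cap concrete, here is a minimal sketch (not product code) of how a head sample might be capped at a byte budget such as 10 MB. The function name and logic are illustrative assumptions.

```python
# Illustrative sketch: collect the first rows of a file, stopping once a
# byte budget (e.g. ~10 MB) is exhausted. Not the product's implementation.
def head_sample(path, max_bytes=10 * 1024 * 1024):
    """Return the first rows of a text file, up to max_bytes of raw data."""
    rows, used = [], 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            used += len(line.encode("utf-8"))
            if rows and used > max_bytes:
                break  # budget exceeded; keep what we have
            rows.append(line.rstrip("\n"))
    return rows
```

If the whole file fits under the budget, the "sample" is simply the entire dataset, which matches the third default case above.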
Additional samples can be generated from the context panel on the right side of the Transformer page. Sample jobs are independent job executions. When a sample job succeeds or fails, a notification is displayed.
As you develop your recipe, you might need to take new samples of the data. For example, you might need to focus on the mismatched or invalid values that appear in a single column. Through the Transformer page, you can specify the type of sample that you wish to create and initiate the job to create the sample. This sampling job occurs in the background.
You can create a new sample at any time. When a sample is created, it is stored within your storage directory on the backend datastore.
NOTE: The Initial Data sample contains raw data from the source. Generated samples are stored in JSONLines format with additional metadata about the sample. These different storage formats can result in differences between initial and generated sample sizes.
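For readers unfamiliar with JSONLines, the sketch below shows the general shape of such a file: one JSON value per line. The metadata record shown is a hypothetical example; the actual metadata the product stores is not documented here.

```python
import json

# Hypothetical JSONLines sample file: each line is one JSON record.
# The "meta" record is an assumption for illustration only.
records = [
    {"meta": {"sampleType": "random", "sourceRows": 1000000}},
    {"row": {"id": 1, "city": "Austin"}},
    {"row": {"id": 2, "city": "Boston"}},
]
jsonl = "\n".join(json.dumps(r) for r in records)

# Reading it back: one json.loads call per line.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

Because each generated sample carries this extra structure and metadata, its on-disk size can differ from a raw initial sample of the same rows.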
For more information on creating samples, see Samples Panel.
Depending on the type of sample you select, it may be generated based on one of the following methods, in increasing order of time to create:
- on a specified set of rows (firstrows)
- on a quick scan across the dataset
  - By default, Quick Scan samples are executed on the Trifacta Photon running environment.
  - If Trifacta Photon is not available or is disabled, the Dataprep by Trifacta application attempts to execute the Quick Scan sample on an available clustered running environment.
  - If the clustered running environment is not available or does not support Quick Scan sampling, the Quick Scan sample job fails.
- on a full scan of the entire dataset
  - Full Scan samples are executed in the cluster running environment.
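The three methods above can be sketched conceptually, in increasing order of cost. This is an assumed illustration of the trade-off, not the product's algorithms.

```python
import random

# Conceptual sketch of the three collection methods, cheapest first.
def firstrows(rows, n):
    return rows[:n]  # no scan at all: just the head of the data

def quick_scan(rows, n, fraction=0.1):
    # scan only a leading slice of the data, then sample from that slice
    scanned = rows[: max(n, int(len(rows) * fraction))]
    return random.sample(scanned, min(n, len(scanned)))

def full_scan(rows, n):
    # scan everything, so the sample can represent the whole dataset
    return random.sample(rows, min(n, len(rows)))
```

The sketch makes the cost ordering visible: firstrows touches only the head, a quick scan touches part of the data, and a full scan must read every row before sampling.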
When a non-initial sample is executed for a single dataset-recipe combination, the following steps occur:
- All of the steps of the recipe are executed on the dataset on the backend cluster, up to the currently selected recipe step.
- The generated sample is executed on the current state of the dataset.
NOTE: When a sample is executed from the Samples panel, it is launched based on the steps leading up to the current location in the recipe. For example, if your recipe joins in other datasets, those steps are executed, and the sample is generated with dependencies on those datasets. As a result, if you change recipe steps that occur before the step where the sample was generated, you can invalidate your sample. More information is available below.
When your flow contains multiple datasets and flows, all of the preceding steps leading up to the currently selected step of the recipe are executed, which can mean:
- The number of datasets that must be accessed increases.
- The number of recipe steps that must be executed on the backend increases.
- The time to process the sampling job increases.
- When the sample is displayed in the Transformer page, all steps after the one from which it was generated are computed in the web browser. So, if you have a lengthy series of steps or complex operations after the step where you generated a sample, you can cause performance issues in the Transformer page, including the occasional browser crash. For better performance, try generating a new sample later in your flow.
- If you have added an expensive transformation step, such as a complex union or join, you can improve performance of the Transformer page by generating and using a new sample after the transformation step.
NOTE: When a flow is shared, its samples are shared with other users. However, if those users do not have access to the underlying files that back a sample, they do not have access to the sample and must create their own.
When a sample is generated, it is stored in the default storage layer in the jobrun directory, assigned to the user who initiated the sample. For more information, see Overview of Storage.
Changing sample sizes
If needed, you can change the size of the samples that are loaded into the browser for your current recipe. You may need to reduce these sizes if you are experiencing performance problems or memory issues in the browser. For more information, see Change Recipe Sample Size.
Important notes on sampling
- Depending on the running environment, sampling jobs may incur costs. These costs can vary between Trifacta Photon and your clustered running environments, depending on the type of sample and the cost of job execution.
- When sampling from compressed data, the data is decompressed before it is sampled. As a result, the sample size reflects the uncompressed data.
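The point about compressed sources can be demonstrated with a small, self-contained example: a repetitive payload compresses to a tiny fraction of its size, but sampling is measured against the full decompressed bytes.

```python
import gzip

# A highly repetitive ~1 MB payload compresses to a few KB, yet any rows
# sampled from it come from the full uncompressed data.
raw = b"city,population\n" + b"Springfield,30000\n" * 60000
compressed = gzip.compress(raw)
decompressed = gzip.decompress(compressed)
# len(compressed) is far smaller than len(raw), but the sample size is
# judged against len(decompressed) == len(raw).
```

This is why a sample taken from a small compressed file can be much larger than the file size on disk suggests.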
- Changes to preceding steps that alter the number of rows or columns in your dataset can invalidate the current sample, which means that the sample is no longer a valid representation of the state of the dataset in the recipe. In this case, Dataprep by Trifacta automatically switches you back to the most recently collected sample that is currently valid. Details are below.
Parameterization of samples
Any parameters that are associated with your dataset can be applied to sampling:
- Parameters: Subsequent samples generated from the Transformer page are sampled across all datasets matched by parameter values.
- Variables: You can apply override values to the defaults for your dataset's variables at sample execution time. In this manner, you can draw your samples from specific source files within your dataset with parameters.
After you have created a sample, you cannot delete it through the application.
NOTE: Dataprep by Trifacta does not delete samples after they have been created. If you are concerned about data accumulation, you should configure periodic purges of the appropriate directories on the base storage layer. For more information, please contact your IT administrator.
For more information, see Sample Jobs Page.
After you have collected multiple samples of multiple types on your dataset, you can choose the proper sample to use for your current task, based on:
- How well each sample represents the underlying dataset. Does the current sample reflect the likely statistics and outliers of the entire dataset at scale?
- How well each sample supports your next recipe step. If you're developing steps for managing bad data or outliers, for example, you may need to choose a different sample.
Tip: You can begin work on an outdated yet still valid sample while you generate a new one based on the current recipe.
- Some advanced sampling options are available only with execution across a scan of the full dataset.
- Undo/redo do not change the sample state, even if the sample becomes invalid.
- When a new sample is generated, any Sort transformations that have been applied previously must be re-applied. Depending on the type of output, sort order may not be preserved. See Sort Rows.
- Samples taken from a dataset with parameters are limited to a maximum of 50 files when executed on the Trifacta Photon running environment. You can modify parameters as they apply to sampling jobs. See Samples Panel.
With each step that is added to or modified in your recipe, Dataprep by Trifacta checks whether the current sample is still valid. A sample is valid based on the state of your flow and recipe at the step where the sample was collected. If you add or change steps before that step, the currently active sample can be invalidated. For example, if you change the source of data, the sample in the Transformer page no longer applies, and a new sample must be displayed.
Tip: After you have completed a step that significantly changes the number of rows, columns, or both in your dataset, you may need to generate a new sample, factoring in any costs associated with running the job. Performance costs may be displayed in the Transformer page.
NOTE: If you modify a SQL statement for an imported dataset, any samples based on the old SQL statement are invalidated.
When a sample is invalidated, the Transformer page reverts to displaying the most recently collected sample that is still valid. You can generate a new sample of the same type through the Samples panel. If no sample is valid, you must generate a new sample before you can open the dataset.
A sample that is invalidated is listed under the Unavailable tab. It cannot be selected for use. If subsequent steps make it valid again, it re-appears in the Available tab.
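The invalidation rule described above can be summarized as a small predicate. This is an assumed simplification for illustration: a sample taken after step k stays valid only while the steps up to and including k are unchanged.

```python
# Assumed sketch of the invalidation rule: a sample taken at step k is
# invalidated by an edit at any step <= k, and unaffected by edits after k.
def valid_samples(samples, edited_step):
    """samples maps sample name -> recipe step it was taken at.
    Returns the names of samples still valid after an edit at edited_step."""
    return [name for name, step in samples.items() if step < edited_step]
```

For instance, editing step 6 invalidates a sample taken at step 8 but leaves one taken at step 2 usable, matching the behavior of the Available and Unavailable tabs.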
The data displayed in the data grid is based on the upstream samples in use, after which all subsequent steps in each upstream recipe are performed in the browser. If you have a large number of steps, or complex steps, between the recipe locations of your samples in use and your current recipe location, you may experience performance slow-downs or crashes in the data grid. For more information on sampling best practices, see https://community.trifacta.com/s/article/Best-Practices-Managing-Samples-in-Complex-Flows.
All steps between the step in your current sample and the currently displayed step must be computed in the browser. As you build more complex recipes, it's a good idea to create samples at various steps in your recipe, particularly after you have executed a complex step. This type of sample checkpointing can improve overall performance.
For example, as soon as you load a new recipe, you should take a sample, which can speed up the loading process.
Tip: You can annotate your recipe with comments, such as `sample: random`, and then create a new sample at that location.
For more information on sample types, see Sample Types.