This section provides an overview of how jobs of various types are initiated, managed, and executed in the product. You can also review summaries of the available running environments for your product edition.


NOTE: During job execution of any kind, the product never modifies source data. All transformation is performed on requested elements of the data. If the data needs to be retained for any period of time during use or transformation, it is stored in the browser or in the base storage layer. After the data has been used for the intended purpose, it is removed from temporary storage.

When you build your recipe in the application, you can see in real time the effects of the transformations that you are creating. When you want to produce result sets from these transformations, you must run a job, which performs a separate set of execution steps on the data. Job execution is a separate process for the following reasons:

Other features of job execution:

Job Types

The following types of jobs can be executed as part of normal operations of the product. 

Job locations:

Transformation job types

Informally, a "job" is considered any action that is performed on a set of data. Commonly, jobs refer to the process of transforming source data into output results. However:

Job groups:

For transformation job types, the following terms apply:

The following diagram illustrates how these job types are related.

+ myJob jobGroup
  + Connect job
  + Request job
  + Ingest job
  + Transform job
  + Transfer job
  + Process job


Tip: You can have one or more of each of these job types as part of a single jobGroup.
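
As a rough illustration of this relationship, the following Python sketch models a jobGroup as a container for typed jobs. The class names, the `status` field, and the `count` helper are hypothetical and are not part of the product; the job type names are taken from this section.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobType(Enum):
    CONNECT = "connect"
    REQUEST = "request"
    INGEST = "ingest"
    CONVERT = "convert"
    TRANSFORM = "transform"
    PREPARE = "prepare"
    TRANSFER = "transfer"
    PROCESS = "process"


@dataclass
class Job:
    job_type: JobType
    status: str = "pending"  # hypothetical status value


@dataclass
class JobGroup:
    name: str
    jobs: list[Job] = field(default_factory=list)

    def add(self, job_type: JobType) -> Job:
        """Append a new job of the given type to this jobGroup."""
        job = Job(job_type)
        self.jobs.append(job)
        return job

    def count(self, job_type: JobType) -> int:
        """Return how many jobs of the given type the jobGroup holds."""
        return sum(1 for job in self.jobs if job.job_type is job_type)


# A single jobGroup can hold more than one job of the same type,
# for example one Ingest job per imported dataset.
group = JobGroup("myJob")
for job_type in (JobType.CONNECT, JobType.REQUEST, JobType.INGEST,
                 JobType.INGEST, JobType.TRANSFORM, JobType.TRANSFER,
                 JobType.PROCESS):
    group.add(job_type)
```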

Connect

A Connect job performs the steps necessary to connect the product to the datastore that contains source data. These jobs use the connection objects that are native to the platform or that you have created to make the connection to your imported datasets.

NOTE: Depending on the running environment, a Connect job may time out after a period of inactivity or failure to connect, and it may be retried one or more times before the job is marked as failed.
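
The retry behavior described in the note can be sketched as follows. This is a minimal illustration only: the function name, the attempt limit, and the delay are assumptions, and the actual timeout and retry settings depend on the running environment.

```python
import time


def connect_with_retry(connect, max_attempts=3, delay_seconds=0.0):
    """Call connect() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    # All attempts failed, so the job is marked as failed.
    raise RuntimeError(
        f"Connect job failed after {max_attempts} attempts"
    ) from last_error


# Stand-in for a datastore that times out twice before responding.
attempts = []


def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("datastore did not respond")
    return "connected"


result = connect_with_retry(flaky_connect)
```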

Request

A Request job sends a query or other request to the source datastore for the assets specified in the imported datasets.

Ingest

Requested data is brought from the external source to the execution layer, which is the temporary storage location as defined for the running environment. 

Convert

Some formats supported for import are not natively understood by the product. These formats must be converted to a format that the platform can quickly process. This process typically converts binary formats, such as XLS or PDF, into CSV files that are stored temporarily in the base storage layer for purposes of job execution. After the job has succeeded or failed, these converted files are removed.

Transform

After data has been requested and ingested (if needed), a Transform job converts the steps of a recipe into an intermediate scripted format called Common Dataflow Format (CDF). The CDF script is then passed to the appropriate running environment for transformation of the source data. Additional details are provided later.

Prepare

If the specified job is publishing results to a connection other than the base storage layer, the results are initially prepared on the base storage layer, after which they are written to the target datastore. 

This job type does not apply when the base storage layer is the final destination for the results.

Transfer

A Transfer job writes the results to the appropriate output location, as specified by the output objects referenced when the job was launched.

Process

When the transfer is complete, a Process job performs final cleanup, including removal of temp files such as intermediate results written to the base storage layer.

Other job types

Profiling

When you execute a transformation job, you can optionally choose to create a visual profile of the results of that job. Visual profiling is a separate job that sometimes takes longer to execute than the job itself, but a visual profile can be useful in highlighting characteristics of your data, including metrics and errors on individual columns. 

Visual profiles are available for review in the Job Details page. You can also download PDF or JSON versions of your visual profile. 

For more information on visual profiling, see Overview of Visual Profiling.

Sampling

When you are interacting with your source data to transform it through the browser, you are working on a sample of the data. As needed, you can take new samples of the data to provide different perspectives on it. Also, for longer and more complex flows, you should get in the habit of taking periodic samples, which can improve performance in the browser.

Through the Samples panel, you can launch a job to collect a new sample of your data. There are multiple types of sampling, which can be executed using one of the following methods:

For more information, see Overview of Sampling.

Basic Process for Transformation Jobs

A transformation job is run based on the outputs that you are trying to generate. For a selected output, the executed job runs the transformations for all of the recipes between the output and all of its imported datasets. For example, generation of a single output could require the transformation of five different recipes that use 13 different imported datasets.
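
To make the scope of a job concrete, the following sketch walks a flow upstream from one output and collects every recipe and imported dataset that feeds it. The dictionary layout and all node names are hypothetical; the product's internal representation of a flow is not described in this section.

```python
def upstream(flow, output):
    """Collect every recipe and imported dataset that feeds `output`."""
    recipes, datasets = set(), set()
    seen = set()
    stack = list(flow[output]["inputs"])
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a dataset or recipe can feed several recipes
        seen.add(node)
        kind = flow[node]["kind"]
        if kind == "recipe":
            recipes.add(node)
            stack.extend(flow[node]["inputs"])
        elif kind == "dataset":
            datasets.add(node)
    return recipes, datasets


# Hypothetical flow: one output, three recipes, two imported datasets.
flow = {
    "out": {"kind": "output", "inputs": ["r3"]},
    "r3": {"kind": "recipe", "inputs": ["r1", "r2"]},
    "r1": {"kind": "recipe", "inputs": ["ds_a"]},
    "r2": {"kind": "recipe", "inputs": ["ds_a", "ds_b"]},
    "ds_a": {"kind": "dataset", "inputs": []},
    "ds_b": {"kind": "dataset", "inputs": []},
}
recipes, datasets = upstream(flow, "out")
```

Running the job for `out` in this example would require transforming all three recipes against both datasets.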

Job preparation

When you initiate a job through the application, the following steps occur:

  1. A jobGroup is created in the database. It consists of the specification of one or more jobs, as described above.
  2. The recipe whose output is being executed is requested from the application. This recipe is expanded from its storage format and is then stored temporarily in the database for reference.
  3. The application verifies access to data sources and output locations.
  4. A job execution graph (flow chart) is created for the various jobs required to complete execution of the jobGroup.
    1. This graph includes jobs for ingest, transformation, conversion, and other steps, as described above.
  5. The graph is sent to the batch job runner service in the platform. This service manages the submission, tracking, and completion of all jobs to supported running environments.
  6. The batch job runner requests that the application return a Common Dataflow Format (CDF) version of the expanded recipe.
    1. CDF is a domain-specific language for data transformation that runs anywhere that supports Python execution.
    2. The recipe is compiled into CDF format at execution time. This CDF script is delivered to the running environment for execution.
    3. CDF scripts are internal to the platform and are not accessible to users of the platform.
  7. Depending on the running environment, additional modifications to the CDF script may be made before the job is submitted.
  8. The batch job runner places the job in a queue for submission to the running environment.
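
The job execution graph and queue from the steps above can be approximated with a dependency graph and a topological sort. This is a sketch under assumptions: the dependency chain shown is one plausible ordering of the job types described earlier, and the standard-library `graphlib` module stands in for whatever the batch job runner uses internally.

```python
from collections import deque
from graphlib import TopologicalSorter

# Each key is a job; its value is the set of jobs that must finish first.
# This linear chain is a hypothetical example of an execution graph.
execution_graph = {
    "request": {"connect"},
    "ingest": {"request"},
    "convert": {"ingest"},
    "transform": {"convert"},
    "transfer": {"transform"},
    "process": {"transfer"},
}

# Resolve an order in which every job runs after its prerequisites.
order = list(TopologicalSorter(execution_graph).static_order())

# Place the jobs in a first-in, first-out queue for submission.
queue = deque(order)
first_job = queue.popleft()
```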

Job execution

When the job is ready to be pulled from the queue, the following tasks are completed: 

  1. The job definition, CDF script, and associated resources are submitted to the resource coordinating process of the running environment. 
    1. This coordinator is the batch job runner for local jobs or a dedicated service on remote running environments. 
    2. For example, for EMR execution, which is a remote running environment, the job is submitted to the YARN service, which manages the delegation of work tasks to the various nodes in the cluster.
    3. In the resource coordinator, jobs from the product are labeled as Trifacta Transformer or Trifacta Profiler (for profiling jobs).
  2. Periodically, batch job runner polls the running environment for status on job execution. 
    1. This status information is stored and updated in the Jobs database.
  3. The application queries the Jobs database for updated information.
    1. These updates are stored in the application so that internal services can access and present them.
    2. Updates can appear in the Flow View page and in the Jobs and Job Details pages, so that you can track progress.
  4. During execution, the resource manager arranges for the delivery of data and CDF script objects to nodes of the cluster.
    1. On these individual nodes, portions of the data are processed through the CDF script. 
    2. The results of this processing are messaged back to the resource manager.
    3. When all of the nodes have reported back that the job processing has been completed, results are written to the location or locations as defined in the output object that was selected during job execution. 
  5. The batch job runner updates any available job logs as needed based on the results of the job execution. These logs may be available through the application.
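
The polling step in this sequence can be sketched as a simple loop that queries the running environment and mirrors each status into a store. The function name, the status values, and the dictionary used in place of the Jobs database are all hypothetical.

```python
import time

TERMINAL_STATES = {"Complete", "Failed"}  # hypothetical status names


def poll_job(get_status, jobs_db, job_id, interval_seconds=0.0, max_polls=100):
    """Poll the running environment and mirror each status into jobs_db."""
    for _ in range(max_polls):
        status = get_status(job_id)
        jobs_db[job_id] = status  # stored so that other services can read it
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


# Stand-in for a running environment that reports progress over time.
reported = iter(["Pending", "InProgress", "InProgress", "Complete"])
jobs_db = {}
final_status = poll_job(lambda job_id: next(reported), jobs_db, "job-1")
```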

Job monitoring

Transformation jobs: After a transformation job has been launched, you can monitor the state of the job as it passes through separate stages in the process. 

Sample jobs: In-progress sampling jobs can be tracked through the following locations:

Plan runs: When you have launched jobs as part of a plan run, you can track progress through the Plan Runs page.

See Plan Runs Page.

Job cleanup

After the results have been written, the following tasks are completed:

  1. Applicable job logs are updated and written to the appropriate location. 
  2. The expanded recipe stored in the database is removed.
  3. Any temporary files written to the base storage layer are removed.
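
The removal of temporary files can be illustrated with a `try`/`finally` pattern, which runs the cleanup whether the job succeeds or fails. In this sketch a local temporary directory stands in for the base storage layer, and the function names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_job_with_cleanup(job):
    """Run job(temp_dir), then remove temp_dir even if the job raises."""
    temp_dir = Path(tempfile.mkdtemp(prefix="jobgroup-"))
    try:
        return job(temp_dir), temp_dir
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)


def sample_job(temp_dir):
    # Write an intermediate result, then read it back and count its lines.
    intermediate = temp_dir / "intermediate.csv"
    intermediate.write_text("id,value\n1,a\n")
    return intermediate.read_text().count("\n")


line_count, used_dir = run_job_with_cleanup(sample_job)
```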

Scheduled jobs

You can also schedule the execution of jobs within your flows. This process works as follows:

  1. In Flow View, you define the outputs that you wish to deliver when the flow is executed according to a schedule. These outputs are different objects than the outputs you create from your recipes, but you can define them to write to the same locations.
  2. You specify the schedule for when the job is to be executed. Date and time information, as well as frequency of execution, can be defined within the flow.

When the specified time is reached, the job is queued for execution, as described above. For more information, see Overview of Automator.
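
As a simplified illustration of how a start time and a frequency determine when a job is queued, the sketch below computes the next run for a fixed interval. The function and its parameters are hypothetical; the product's scheduler may support richer schedule definitions than a single interval.

```python
from datetime import datetime, timedelta


def next_run(start, every, now):
    """Return the first scheduled run time strictly after `now`."""
    if now < start:
        return start
    # Whole intervals already elapsed since the schedule started.
    elapsed_periods = (now - start) // every
    return start + (elapsed_periods + 1) * every


# A daily schedule that starts at 02:00 on 1 January 2024.
start = datetime(2024, 1, 1, 2, 0)
upcoming = next_run(start, timedelta(days=1), datetime(2024, 1, 3, 10, 0))
```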

Job Execution Performance

Job execution is a resource-intensive and multi-layered process that transforms data of potentially limitless size. The following factors can affect performance in the application and during job execution:

Running Environments