This page shows you how to get started using the for
.
Before you begin, you may need to set up your project and your account. For more information, see Getting Started with Dataprep by Trifacta .
If you are not logged in to the , please do the following.
In this quickstart, you complete the following steps. These steps mirror the basic workflow of any wrangling project:
In the , you build your projects in a container object called a flow. To your flow, you add datasets and create other objects such as recipes and outputs to construct the end-to-end pipeline of data into and out of the product.
Steps:
Fill in a flow name and description. Since this quick start uses 2016 data from the United States Federal Elections Commission 2016, you may wish to name it, "FEC-2016", with a description that refers to this data.
Create Flow dialog |
The flow is created and displayed in Flow View:
Flow View page |
Through Flow View, you can import datasets to begin your work. An imported dataset is a reference to a file, table, or view that you wish to transform.
NOTE: |
Steps:
Under Choose a file or folder, click the Pencil icon. Then, insert
gs://dataprep-samples/us-fec
in the GCS text box. Click Go.
Import data from GCS |
cn-2016.txt
file in the directory. In the right panel, enter the name Candidate Master 2016
.Now, select the itcont-2016.txt
dataset. In the right panel, enter the name Campaign Contributions 2016
.
Two FEC datasets selected for import |
To wrangle the contents of an imported dataset, you can create a recipe. A recipe is a series of transformation steps that are executed in sequence on the imported datasets that are connected to the recipe. In this section, you begin the process of building transformation steps in a recipe.
Steps:
On the FEC 2016 Flow page, select the Plus icon next to the Candidate Master 2016 dataset. Then, select Add new Recipe.
Add a new recipe for the Candidate Master 2016 imported dataset |
A new recipe icon appears. In the right panel, click Edit Recipe.
Edit a recipe |
The Transformer page opens, where you can explore a sample of your data and build your recipe by applying transformation steps to it.
NOTE: The current sample is displayed in the upper left corner. In the example below, the sample is listed as Initial Data. The full dataset is small enough to be displayed in the browser. For larger datasets in which the full dataset is too large for display, this value is Initial Sample, which consists of the first set of lines from the imported dataset. The number of lines depends on the width of the dataset. |
A recipe is created in the Recipe panel of the Transformer page. If the Recipe pane is not displayed on the right side of the page, click the Recipe icon at the top-right of the Grid view page.
Recipe icon |
In the Candidate Master 2016 dataset, column5
(Date/Time) contains values for years. Select the years 2016 and 2017 in the histogram by dragging across them.
Select values from a column histogram to prompt suggestions |
In the right panel, you should see a suggestion card titled "Keep rows where value is in selected range." Click Add.
Suggestion card |
The following recipe step is added to the recipe:
In the column6 (State)
header, hover over and click the mismatched (red) bar to select the mismatched rows.
Click red bar to selected values in the column that do not match the column's data type. |
In the Status bar at the bottom of the page, select the Show only affected checkbox. Notice that some the red highlighted (mismatched) items have the value "US" in column6 and "P" in column7. These are presidential candidates. The mismatch occurs because column6 is marked as a "State" column (indicated by the flag icon), but it also includes non-state (such as "US") values.
Click Show only affected checkbox in the Status bar. |
To correct the mismatch, click the X in the right panel to cancel the transformation. The column must be re-typed as a column of String data type. Click the flag icon above column6 and select "String".
Change column data type |
Now, let's filter on just the presidential candidates. In the histogram for column7, click the "P" bin.
Select only "P" values. |
In the right panel, you should see a suggestion card titled "Keep rows where column7 is 'P’:. Click Add .
Choose to keep rows that contain "P" as the value in column7. |
Now, you can transform the other imported dataset and join it into the first one.
Steps:
In the right panel, click Edit Recipe to open a sample of the contributions dataset in the Transformer page.
Create a new recipe for the Campaign Contributions 2016 dataset |
In the Transformer page, open the Recipe panel. You can add a new step to the recipe that removes extra delimiters from the contributions dataset. Copy and paste the following Wrangle language command in the Search box:
replacepatterns col: * with: '' on: `{start}"|"{end}` global: true |
Tip: You can copy and paste the raw |
The Transformation Builder parses the Wrangle command and populates the fields for the Replace text or patterns transformation. Click Add to add the transformation to the recipe.
Copy and paste transformation steps |
Joins: Let's join this dataset to the first one. Add another new step to the recipe, then type "Join" in the Search box.
Search for "Join" transformation. |
In the Join window, select the Recipes in current flow tab. Select Candidate Master 2016-2 (the Candidate Master dataset with the Recipe steps added above).
Join window |
In the Join window, you specify the keys and conditions of the join transformation. infers some of these details for you. In this case, the join type (inner) is correct, but the join keys are not. In the panel on the right side, hover over them and click the Pencil icon.
Specify join keys |
For the join keys, please specify the following keys: Current = column 2
and Joined-in = column 11
. Click Save and Continue.
Click Next. In the Join - Output Columns window, select the checkbox immediately under the "All (36)" label which adds all columns of both datasets to the joined dataset:
Select columns to include |
Create a summary: Add the following step to the recipe to generate a useful summary by aggregating, averaging and counting the contributions in column 16 and grouping on the candidates by IDs, names, and party affiliation in columns 2, 9, 8 respectively. Click New Step in the Recipe panel. Then, copy the following step and paste it into the Search box:
pivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column9,column8 |
The above statement is rendered as the following transformation:
A sample of the joined and aggregated data is displayed, representing a summary table of US presidential candidates and their 2016 campaign contribution metrics.
Transformer page after dataset has been joined in |
You can make the data easier to interpret by pasting in the following renaming and rounding steps to the recipe.
rename type: manual mapping: [column9,'Candidate_Name'], |
set col: Average_Contribution_Sum value: round(Average_Contribution_Sum) |
Joined datasets |
As needed, you can generate a different kind of sample of your data, which helps to locate outliers and to verify that your transformation steps apply to all rows in the dataset.
NOTE: The example FEC dataset displays the Initial Data sample, so gathering a new sample is less important in this case. For larger datasets, you may need to gather new examples at various stages during recipe development. |
Steps:
To view more data, mouse over the "Initial Sample" link at the top-left of the page.
Sample drop-down |
Click Collect a new sample.
In the Samples panel, select a random, quick sample, then click Collect.
Collect new random sample |
New sample loaded in the Transformer page |
You can now run a job to apply your changes across the entire joined dataset. In the Transformer page, click Run Job.
In the Run Job page:
JSON
from the drop-down.Depending on how your project is configured, you may be required to select a running environment where your job is executed. The following running environments may be available for selection:
Tip: The |
Tracking progress: In Flow View, you can see the progress of the job in the right panel. To explore details, click the Job Id link.
Profile: When the job completes successfully, click the Profile tab to see the visual profile of your job results.
Export Results: Click the Output Destinations tab. The output files are listed. From a file's context menu, you can click View on Google Cloud Storage. You can download from there.