Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r089

...

This page shows you how to get started using the 

D s webapp
rtrue
 for 
D s product
.

Before you begin

Before you begin, you may need to set up your project and your 

D s product
 account. For more information, see  Getting Started with Dataprep by Trifacta .

Login

If you are not logged in to the 

D s webapp
, please do the following.

  1. Navigate to the Google Cloud Console. See https://console.cloud.google.com/dataprep.
  2. In the Google Cloud Console, select the project where 
    D s product
     has been enabled. 
  3. In the left nav bar, select Dataprep.

Basic Flow

In this quickstart, you complete the following steps. These steps mirror the basic workflow of any wrangling project:

  1. Create a flow. A flow is a container for holding assets that you create or import in 
    D s product
    .
  2. Import datasets. This quickstart uses Federal Election Commission data from 2016. You import these publicly available datasets into your flow.
  3. Wrangle your data. In the 
    D s webapp
    , you create transformations and see them applied immediately to a sample of your data on-screen.
  4. Gather a new sample. During the wrangling process, you may choose to collect a new sample from your data, which can provide different perspectives of your wrangling.
  5. Run a job. After you have completed your wrangling steps in the 
    D s webapp
    , you can run a job to perform these transformations across all of the datasets in your flow.
  6. Download results. When the job completes, the generated results can be downloaded for use. 

Create a flow

In the 

D s webapp
, you build your projects in a container object called a flow. To your flow, you add datasets and create other objects such as recipes and outputs to construct the end-to-end pipeline of data into and out of the product.

...

  1. From the Home page, click the Flows icon    in the left nav bar. 
  2. In the Flows page, click Create. Select Create Flow
  3. Fill in a flow name and description. Since this quick start uses 2016 data from the United States Federal Elections Commission 2016, you may wish to name it, "FEC-2016", with a description that refers to this data.

    D caption
    Create Flow dialog
  4. Click Create.
  5. The flow is created and displayed in Flow View: 

    D caption
    Flow View page
  6. In Flow View, you can begin importing and building the objects of your flow.

Import datasets

Through Flow View, you can import datasets to begin your work. An imported dataset is a reference to a file, table, or view that you wish to transform.

...

  1. From Flow View, click Add Datasets to open the Add Datasets to Flow page. 
  2. Click Import Datasets
  3. In the Import Data page, select GCS in the left panel. 
  4. Under Choose a file or folder, click the Pencil icon. Then, insert  gs://dataprep-samples/us-fec in the GCS text box. Click Go.

    D caption
    Import data from GCS
  5. Select the cn-2016.txt file in the directory. In the right panel, enter the name Candidate Master 2016.
  6. Now, select the itcont-2016.txt dataset. In the right panel, enter the name Campaign Contributions 2016

    D caption
    Two FEC datasets selected for import
  7. After both datasets are listed and renamed in the right panel, click Continue to add the datasets to the flow.


Wrangle the Candidate file

To wrangle the contents of an imported dataset, you can create a recipe. A recipe is a series of transformation steps that are executed in sequence on the imported datasets that are connected to the recipe. In this section, you begin the process of building transformation steps in a recipe.

...

  1. On the FEC 2016 Flow page, select the Plus icon next to the Candidate Master 2016 dataset. Then, select Add new Recipe.

    D caption
    Add a new recipe for the Candidate Master 2016 imported dataset
  2. A new recipe icon appears. In the right panel, click Edit Recipe.

    D caption
    Edit a recipe
  3. The Transformer page opens, where you can explore a sample of your data and build your recipe by applying transformation steps to it.

    Info

    NOTE: The current sample is displayed in the upper left corner. In the example below, the sample is listed as Initial Data. The full dataset is small enough to be displayed in the browser. For larger datasets in which the full dataset is too large for display, this value is Initial Sample, which consists of the first set of lines from the imported dataset. The number of lines depends on the width of the dataset.

  4. A recipe is created in the Recipe panel of the Transformer page. If the Recipe pane is not displayed on the right side of the page, click the Recipe icon at the top-right of the Grid view page.

    D caption
    Recipe icon
  5. In the Candidate Master 2016 dataset, column5 (Date/Time) contains values for years. Select the years 2016 and 2017 in the histogram by dragging across them.

    D caption
    Select values from a column histogram to prompt suggestions
  6. In the right panel, you should see a suggestion card titled "Keep rows where value is in selected range." Click Add.

    D caption
    Suggestion card
  7. The following recipe step is added to the recipe:

    D trans
    RawWrangletrue
    p03Value(date(2016, 1, 1) <= column5) && (column5 < date(2017, 1, 1))
    Typestep
    WrangleTextfilter type: custom rowType: single row: (date(2016, 1, 1) <= column5) && (column5 < date(2017, 1, 1)) action: Keep
    p01NameCondition
    p01ValueCustom formula
    p02NameType of formula
    p02ValueCustom single
    p03NameCondition
    p04ValueKeep matching rows
    p04NameAction
    SearchTermFilter rows

  8. In the column6 (State) header, hover over and click the mismatched (red) bar to select the mismatched rows.

    D caption
    Click red bar to selected values in the column that do not match the column's data type. 
  9. In the Status bar at the bottom of the page, select the Show only affected checkbox. Notice that some the red highlighted (mismatched) items have the value "US" in column6 and "P" in column7. These are presidential candidates. The mismatch occurs because column6 is marked as a "State" column (indicated by the flag icon), but it also includes non-state (such as "US") values.

    D caption
    Click Show only affected checkbox in the Status bar.
  10. To correct the mismatch, click the X in the right panel to cancel the transformation. The column must be re-typed as a column of String data type. Click the flag icon above column6 and select "String".

    D caption
    Change column data type
  11. The column's data type is changed to String. String data type matches with any non-empty value in a column, which means that the mismatch is removed. The data quality bar is now completely green. 
  12. Now, let's filter on just the presidential candidates. In the histogram for column7, click the "P" bin.

    D caption
    Select only "P" values.
  13. In the right panel, you should see a suggestion card titled "Keep rows where column7 is 'P’:. Click  Add .

    D caption
    Choose to keep rows that contain "P" as the value in column7.

Wrangle the Contributions file and join it in

Now, you can transform the other imported dataset and join it into the first one. 

...

  1. In the Transformer page, click FEC 2016 in the upper left corner to return to the Flow View page. 
  2. In Flow View, select Campaign Contributions 2016. Then, click the Plus icon next to it and select Add new Recipe
  3. In the right panel, click Edit Recipe to open a sample of the contributions dataset in the Transformer page.

    D caption
    Create a new recipe for the Campaign Contributions 2016 dataset
  4. In the Transformer page, open the Recipe panel. You can add a new step to the recipe that removes extra delimiters from the contributions dataset. Copy and paste the following Wrangle language command in the Search box:

    D code

    replacepatterns col: * with: '' on: `{start}"|"{end}` global: true

    Tip

    Tip: You can copy and paste the raw

    D s lang
    step into the Search box to add a recipe step.

  5. The Transformation Builder parses the Wrangle command and populates the fields for the Replace text or patterns transformation. Click Add to add the transformation to the recipe.

    D caption
    Copy and paste transformation steps
  6. Joins: Let's join this dataset to the first one. Add another new step to the recipe, then type "Join" in the Search box.

    D caption
    Search for "Join" transformation.
  7. Click the Join datasets link to open the Joins page.
  8. In the Join window, select the Recipes in current flow tab. Select Candidate Master 2016-2 (the Candidate Master dataset with the Recipe steps added above).

    D caption
    Join window
  9. Click Accept.
  10. In the Join window, you specify the keys and conditions of the join transformation. 

    D s product
     infers some of these details for you. In this case, the join type (inner) is correct, but the join keys are not. In the panel on the right side, hover over them and click the Pencil icon. 

    D caption
    Specify join keys
  11. For the join keys, please specify the following keys: Current = column 2 and Joined-in = column 11. Click Save and Continue

  12. Click Next. In the Join - Output Columns window, select the checkbox immediately under the "All (36)" label which adds all columns of both datasets to the joined dataset:

    D caption
    Select columns to include
  13. Click Review. If all looks good, click Add to Recipe. In the Transformer page, the join transformation has been applied.
  14. Create a summary: Add the following step to the recipe to generate a useful summary by aggregating, averaging and counting the contributions in column 16 and grouping on the candidates by IDs, names, and party affiliation in columns 2, 9, 8 respectively. Click New Step in the Recipe panel. Then, copy the following step and paste it into the Search box:

    D code

    pivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column9,column8

  15. The above statement is rendered as the following transformation:

    D trans
    RawWrangletrue
    p03Valuesum(column16),average(column16),countif(column16 > 0)
    Typestep
    WrangleTextpivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column9,column8
    p01NameColumn labels
    p01Valuenone
    p02NameRow labels
    p02Valuecolumn2,column9,column8
    p03NameValues
    SearchTermPivot columns
     

  16. A sample of the joined and aggregated data is displayed, representing a summary table of US presidential candidates and their 2016 campaign contribution metrics.

    D caption
    Transformer page after dataset has been joined in
  17. You can make the data easier to interpret by pasting in the following renaming and rounding steps to the recipe.

    D code

    rename type: manual mapping: [column9,'Candidate_Name'],
    [column2,'Candidate_ID'],[column8,'Party_Affiliation'],
    [sum_column16,'Total_Contribution_Sum'],
    [average_column16,'Average_Contribution_Sum'],
    [countif,'Number_of_Contributions']

    D code

    set col: Average_Contribution_Sum value: round(Average_Contribution_Sum)

D caption
Joined datasets


Gather new samples

As needed, you can generate a different kind of sample of your data, which helps to locate outliers and to verify that your transformation steps apply to all rows in the dataset.

...

D caption
New sample loaded in the Transformer page


Run a job

You can now run a job to apply your changes across the entire joined dataset. In the Transformer page, click Run Job.

...