Page tree

 

Support | BlogContact Us | 844.332.2821

 

Contents:

This documentation applies to Trifacta Wrangler. Download this free product.
Registered users of this product or Trifacta Wrangler Enterprise should login to Product Docs through the application.

Contents:


In the Join page, you can join your current dataset with another dataset based upon information that is common to both datasets. For example, you could join together two sets of regional sales data based upon the product identifiers that they both use to track sales. In the Transformer page, enter join in the Choose a transformation textbox in the Transform Builder.

  • join is a standard operation for merging the data from two different datasets. Some information on joins is provided below.
  • You cannot perform joins on columns of Object or Array data type.
  • A join operation is different from a union operation. In a union operation, identical or similar columns are included from multiple datasets in a generated dataset. For more information, see Union Page

Tip: You should perform your join operation as late as possible in your recipe steps. If your joined dataset has not been completely transformed, subsequent steps might impact the data in the dataset to which it was joined. If needed, you can modify your join after its creation. Details are below.


NOTE: Unnest, union, or join transforms may significantly increase the number of rows or columns in your dataset. To prevent overloading the browser's memory, the application may apply a limit function to the results to artificially limit the number of rows displayed in your sample. You can generate a new sample if desired. This limitation is not applied during the job execution.

Before You Begin

  • Review your record counts. Before you join your datasets, you should review your record counts and the uniqueness of your keys, which should provide an idea of the number of records you may see in the output. Note that the number of output records depends on the type of join and the matches between join keys.
  • Review your join key values. If there are variations in the values in your join keys, you may end up with duplicate records in your joined dataset. Look for mismatched or missing values in your join keys, and correct if possible.
  • Review the granularity of your data. If you bring together data at a lower fidelity than the source, you can end up with record matches that are not actually matching data. For example, if your timestamps are down-sampled from milliseconds to seconds as part of the join, you may have "matching" timestamps in seconds that were not matches at the millisecond level in the source data.

Step 1 - Select Datasets to Join

In the following image, the columns of the current dataset (Point of sale data from region 1) are displayed in the left column. On the right side, you can select the dataset (in this case, a reference dataset on products) that you would like to join with the current dataset. 

Figure: Join Page

  • Select the dataset. To select the columns (keys) to base on the join and the type of join, click Preview Selected Dataset.
  • In a join operation, the key corresponds to an identifier that can be used for matching records between datasets. For example, in a list of products or sales data, the key is likely to be some form of internal system identifier, which is consistently used across all related datasets.

Step 2 - Specify Join Parameters

In the preview of the new dataset, you can see how the two datasets have been joined. By default, Trifacta® Wrangler attempts to specify the keys to use for the join and the type of join to perform.

  • Before you accept the join, you should review the selected join keys and the type of join. 
  • As needed, you can change the dataset with which you are joining the first dataset. Click the Datasets tab, and then click Change Dataset

Specify Join Keys

To review and modify the columns used as the keys for the join, click the Join Keys tab.

The following image displays the preview of the currently configured join. It has determined that the item number (Item_Nbr) field of Region 1 data and the item number (ITEM_NBR)  field of the product dataset should be used as the keys for performing the join.  

Figure: Review join keys and join type

Specify Other Join Parameters

  1. Review the list of available columns, which are displayed for both datasets. After the currently selected keys, fields from the source dataset are listed, followed by the fields of the secondary set.
  2. To make changes to the two join keys, click the Edit link.
  3. In the left part of the panel, select the field from the source dataset to use as its key. In the right, select the field from the target dataset.
    1. You may select any of the Suggested Join Keys to populate.
    2. To add a new field to use as the second join key, click the Plus icon.
  4. You can also modify the join type at this time. 
  5. Click Edit to modify matching parameters. From the available options, you can loosen the conditions on how key values are matched in the join. 
  6. Click Save to update the join keys.

Fuzzy Joins

By default, matches between join keys are performed on a strict, case-sensitive matching between key values in the selected columns. In some cases, it may be useful to loosen the conditions under which matches are found.

You can select from the following options to perform more fuzzy joins. These options are applied to the join key columns in both datasets to attempt to find matches. After the join is executed, no data in either column is changed based on these selections.

OptionDescription
Ignore CaseIgnore case differences between the join key values for matching purposes.
Ignore Special CharactersIgnore all characters that are not alphanumeric, accented Latin characters, or whitespace, prior to testing for a match.
Ignore WhitespaceIgnore all whitespace characters, including spaces, tabs, carriage returns, and newlines.

Auto-updating Joins

After you have joined in a dataset, subsequent changes to that dataset are automatically reflected in the dataset to which it was joined. 

Tip: If you must freeze the data in the dataset that you are joining in, you should create a copy of the dataset as a snapshot and join in the copy. See Dataset Details Page.

To join in the copy, edit the join transform in the target dataset and change the dataset that is being joined. See Fix Dependency Issues.

Join Types

The following types of joins are supported. For example, the following tables contains information about employees and departments.

Employee table:

NameDepartmentIDRole
Dave Smith001Product Marketing Manager
Julie Jones002Software Engineer
Scott Tanner001Director of Demand Gen
Ted Connors002Software Engineer
Margaret Lane001VP of Marketing
Mary Martin004Receptionist

Department table:

NameDepartmentID
Marketing001
Engineering002
Accounting003

In the above example, DepartmentID is the key to use in both tables for any joins.

Inner Join

An inner join requires that key values exist in both tables for the records to appear in the results table. Records appear in the merge only if there are matches in both tables for the key values.

  • If you want to include rows containing non-matching values, you must use some form of an outer join. See below.

For the preceding example tables, an inner join on the DepartmentID table produces the following result table:

Employee.NameEmployee.DepartmentIDEmployee.RoleDepartment.NameDepartment.DepartmentID
Dave Smith001Product Marketing ManagerMarketing001
Julie Jones002Software EngineerEngineering002
Scott Tanner001Director of Demand GenMarketing001
Ted Connors002Software EngineerEngineering002
Margaret Lane001VP of MarketingMarketing001

Notes:

  • All fields are included in the merged result set. Fields from the first dataset are listed first.
  • The row for Mary Martin is excluded, since there is no reference in the Department table for her department identifier. The row for Accounting is excluded, since there is no reference in the Employee table for the department identifier.
    • To include these rows, you either need to augment the data or perform a form of an outer join.
  • A null value in one table does not match a null value in another table. So, rows with null values in a join key are never included in an inner join. These values should be fixed.

    Tip: An inner join can be used to eliminate rows with null values in their key fields.

Left Outer Join

A left outer join (or left join) does not require that there be matching records for each value in the key value of the source (left) table. Each row in the left table appears in the results, regardless of whether there are matches in the right table.

For the preceding example tables, a left join on the DepartmentID table produces the following result table:

Employee.NameEmployee.DepartmentIDEmployee.RoleDepartment.NameDepartment.DepartmentID
Dave Smith001Product Marketing ManagerMarketing001
Julie Jones002Software EngineerEngineering002
Scott Tanner001Director of Demand GenMarketing001
Ted Connors002Software EngineerEngineering002
Margaret Lane001VP of MarketingMarketing001
Mary Martin004ReceptionistNULLNULL

Notes:

  • In this left join, the Mary Martin row has been added to the result, since her record in the Employee table does contain an entry for the DepartmentID. However, since there are no corresponding values in the Department table, the corresponding fields in the result table are NULL values.

Right Outer Join

A right outer join (or right join) is the reverse of a left join. A right join does not require that there be matching records for each value in the key value of the secondary (right) table. Each row in the right table appears in the results, regardless of whether there are matches in the left table.

For the preceding example tables, a right join on the DepartmentID table produces the following result table:

Employee.NameEmployee.DepartmentIDEmployee.RoleDepartment.NameDepartment.DepartmentID
Dave Smith001Product Marketing ManagerMarketing001
Julie Jones002Software EngineerEngineering002
Scott Tanner001Director of Demand GenMarketing001
Ted Connors002Software EngineerEngineering002
Margaret Lane001VP of MarketingMarketing001
NULLNULLNULLAccounting003

Notes:

  • In this right join, the Accounting entry is added. However, since there is no entry in the Employee table for the DepartmentID value, those fields are NULL values in the result set.

Full Outer Join

full outer join combines the effects of a left join and a right join. If there is a match between the key values, a row is written in the result.

  • If there is no match for a key value that appears in either table, a single record is written to the result, with NULL values inserted for the fields from the other table. 
Employee.NameEmployee.DepartmentIDEmployee.RoleDepartment.NameDepartment.DepartmentID
Dave Smith001Product Marketing ManagerMarketing001
Julie Jones002Software EngineerEngineering002
Scott Tanner001Director of Demand GenMarketing001
Ted Connors002Software EngineerEngineering002
Margaret Lane001VP of MarketingMarketing001
Mary Martin004ReceptionistNULLNULL
NULLNULLNULLAccounting003

Notes:

  • Any duplicated rows between joining from left-to-right and from right-to-left are removed from the results.

Joins Together

The following diagram summarizes the relationships between the types of supported joins. In each venn diagram, the area of intersection is the set of records that contain shared key values.

Figure: Join Types

After you have created a join, you can modify it through the Recipe panel. For more information, see Edit a Join.

Your Rating: Results: PatheticBadOKGoodOutstanding! 11 rates

This page has no comments.