In the Join page, you can join your current dataset with another dataset or recipe based upon information that is common to both datasets. For example, you could join together two sets of regional sales data based upon the product identifiers that they both use to track sales. In the Search panel, enter join datasets
or select from the toolbar.
Tip: Depending on the types of operations you need to perform, you may need to perform joins earlier or later in your recipe. For more information, see Optimize Job Processing. |
For the current recipe loaded in the Transformer page, you can select any imported dataset, reference dataset, or recipe to join with the currently loaded data.
Join Object Type | Description | |
---|---|---|
imported dataset | Join is performed on the stored version of the data for any imported dataset to which you have access. If this data is overwritten, the data in the join is changed. | |
reference dataset | Join is performed on a reference to a recipe in another flow. If the recipe changes the schema of the data, then the join is affected by those changes.
| |
recipe | You can join to any recipe in your current flow.
|
In the following image, the columns of the current dataset (Point of sale data from region 1) are displayed in the left column. On the right side, you can select the imported dataset, reference dataset, or recipe that you'd like to join with the current dataset. In this case, the REF_PROD.txt
file has been selected, meaning that the join is performed on this imported dataset:
Join Page |
In the preview of the new dataset, you can see how the two datasets have been joined. By default, attempts to specify the keys to use for the join and the type of join to perform.
To review and modify the columns used as the keys for the join, click the Join Keys tab.
The following image displays the preview of the currently configured join. It has determined that the item number (Item_Nbr
) field of Region 1 data and the item number (ITEM_NBR
) field from REF_PROD
should be used as the keys for performing the join.
Review join keys and join type |
To save changes, click OK.
To remove the two columns as join keys, click the Trash icon.
To add more join keys, click Add.
NOTE: Be careful applying multiple join keys. Depending on the join type, this type of join can greatly expand the size of the generated data. |
A cross join combines each row of the first data set with each row of the second dataset, where every combination is represented in the output. As a result, the number of total rows in the join are:
Rows(DatasetA) * Rows(DatasetB) |
NOTE: Depending on the size of your datasets, a cross join can greatly expand the size of the output, which may increase costs in some environments. |
To specify a cross join, delete all join keys.
By default, matches between join keys are performed on a strict, case-sensitive matching between key values in the selected columns. In some cases, it may be useful to loosen the conditions under which matches are found.
You can select from the following options to perform more fuzzy joins. These options are applied to the join key columns in both sources to attempt to find matches. After the join is executed, no data in either column is changed based on these selections.
Option | Description |
---|---|
Ignore Case | Ignore case differences between the join key values for matching purposes. |
Ignore Special Characters | Ignore all characters that are not alphanumeric, accented Latin characters, or whitespace, prior to testing for a match. |
Ignore Whitespace | Ignore all whitespace characters, including spaces, tabs, carriage returns, and newlines. |
After you have joined in another set of data, subsequent changes to that data are automatically reflected in the output of the join.
NOTE: After you add your join to the recipe, if the data grid is empty, then the keys that you specified in the join may not have a match in the currently selected sample. You should revisit the keys used in your join. If the join still generates an empty grid on the new sample, you should collect a new sample. See Samples Panel. |
Tip: If you must freeze the data in the dataset that you are joining in, you should create a copy of the dataset as a snapshot and join in the copy. See Dataset Details Page. To join in the copy, edit the join and change the source that is being joined. See Fix Dependency Issues. |
In the Join Column Settings dialog, you can apply prefixes to column names in the joined dataset, which can be helpful for tracking the source of a column in complex datasets. For example, you may wish to prepend each column from a dataset called, salesRegion01
with the prefix: sR01
.
The following types of joins are supported. For example, the following tables contains information about employees and departments.
Employee table:
Name | DepartmentID | Role |
---|---|---|
Dave Smith | 001 | Product Marketing Manager |
Julie Jones | 002 | Software Engineer |
Scott Tanner | 001 | Director of Demand Gen |
Ted Connors | 002 | Software Engineer |
Margaret Lane | 001 | VP of Marketing |
Mary Martin | 004 | Receptionist |
Department table:
Name | DepartmentID |
---|---|
Marketing | 001 |
Engineering | 002 |
Accounting | 003 |
In the above example, DepartmentID
is the key to use in both tables for any joins.
An inner join requires that key values exist in both tables for the records to appear in the results table. Records appear in the merge only if there are matches in both tables for the key values.
For the preceding example tables, an inner join on the DepartmentID
table produces the following result table:
Employee.Name | Employee.DepartmentID | Employee.Role | Department.Name | Department.DepartmentID |
---|---|---|---|---|
Dave Smith | 001 | Product Marketing Manager | Marketing | 001 |
Julie Jones | 002 | Software Engineer | Engineering | 002 |
Scott Tanner | 001 | Director of Demand Gen | Marketing | 001 |
Ted Connors | 002 | Software Engineer | Engineering | 002 |
Margaret Lane | 001 | VP of Marketing | Marketing | 001 |
Notes:
A null value in one table does not match a null value in another table. So, rows with null values in a join key are never included in an inner join. These values should be fixed.
Tip: An inner join can be used to eliminate rows with null values in their key fields. |
A left outer join (or left join) does not require that there be matching records for each value in the key value of the source (left) table. Each row in the left table appears in the results, regardless of whether there are matches in the right table.
For the preceding example tables, a left join on the DepartmentID
table produces the following result table:
Employee.Name | Employee.DepartmentID | Employee.Role | Department.Name | Department.DepartmentID |
---|---|---|---|---|
Dave Smith | 001 | Product Marketing Manager | Marketing | 001 |
Julie Jones | 002 | Software Engineer | Engineering | 002 |
Scott Tanner | 001 | Director of Demand Gen | Marketing | 001 |
Ted Connors | 002 | Software Engineer | Engineering | 002 |
Margaret Lane | 001 | VP of Marketing | Marketing | 001 |
Mary Martin | 004 | Receptionist | NULL | NULL |
Notes:
DepartmentID
. However, since there are no corresponding values in the Department table, the corresponding fields in the result table are NULL
values.A right outer join (or right join) is the reverse of a left join. A right join does not require that there be matching records for each value in the key value of the secondary (right) table. Each row in the right table appears in the results, regardless of whether there are matches in the left table.
For the preceding example tables, a right join on the DepartmentID
table produces the following result table:
Employee.Name | Employee.DepartmentID | Employee.Role | Department.Name | Department.DepartmentID |
---|---|---|---|---|
Dave Smith | 001 | Product Marketing Manager | Marketing | 001 |
Julie Jones | 002 | Software Engineer | Engineering | 002 |
Scott Tanner | 001 | Director of Demand Gen | Marketing | 001 |
Ted Connors | 002 | Software Engineer | Engineering | 002 |
Margaret Lane | 001 | VP of Marketing | Marketing | 001 |
NULL | NULL | NULL | Accounting | 003 |
Notes:
DepartmentID
value, those fields are NULL
values in the result set.A full outer join combines the effects of a left join and a right join. If there is a match between the key values, a row is written in the result.
NULL
values inserted for the fields from the other table. Employee.Name | Employee.DepartmentID | Employee.Role | Department.Name | Department.DepartmentID |
---|---|---|---|---|
Dave Smith | 001 | Product Marketing Manager | Marketing | 001 |
Julie Jones | 002 | Software Engineer | Engineering | 002 |
Scott Tanner | 001 | Director of Demand Gen | Marketing | 001 |
Ted Connors | 002 | Software Engineer | Engineering | 002 |
Margaret Lane | 001 | VP of Marketing | Marketing | 001 |
Mary Martin | 004 | Receptionist | NULL | NULL |
NULL | NULL | NULL | Accounting | 003 |
Notes:
The following diagram summarizes the relationships between the types of supported joins. In each venn diagram, the area of intersection is the set of records that contain shared key values.
Join Types |
After you have created a join, you can modify it through the Recipe panel. For more information, see Edit a Join.