Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Published by Scroll Versions from space DEV and version r081

D toc

Excerpt

In the Join

...

window of the 

D s webapp
rtrue
, you can join your current dataset with another dataset or recipe based upon information that is common to both datasets.

For example, you could join together two sets of regional sales data based upon the product identifiers that they both use to track sales. In the Search panel, enter join datasets or select the Join icon from the toolbar.

  • join is a standard operation for merging the data from two different datasets. Some information on joins is provided belowFor more information, see Join Types.
  • You cannot perform joins on columns of Object or Array data type.
  • A join operation is different from a union operation. In a union operation, data from one or more datasets is appended to the current dataset, assuming that the columns are identical or very similar. For more information, see Union Page
Tip

Tip: Depending on the types of operations you need to perform, you may need to perform joins earlier or later in your recipe. For more information, see Optimize Job Processing.

D s limitforbrowser

Before You Begin

  • Review your record counts. Before you specify the join, you should review your record counts and the uniqueness of your keys, which should provide an idea of the number of records you may see in the output. Note that the number of output records depends on the type of join and the matches between join keys.
  • Review your join key values. If there are variations in the values in your join keys, you may end up with duplicate records in your joined dataset. Look for mismatched or missing values in your join keys, and correct if possible.
  • Review the granularity of your data. If you bring together data at a lower fidelity than the source, you can end up with record matches that are not actually matching data. For example, if your timestamps are down-sampled from milliseconds to seconds as part of the join, you may have "matching" timestamps in seconds that were not matches at the millisecond level in the source data.

 

 

Step 1 - Select

...

For the current recipe loaded in the Transformer page, you can select any imported dataset, reference dataset, or recipe to join with the currently loaded data.

 

...

Join is performed on the stored version of the data for any imported dataset to which you have access. If this data is overwritten, the data in the join is changed.

...

Join is performed on a reference to a recipe in another flow. If the recipe changes the schema of the data, then the join is affected by those changes.

Dataset or Recipe

In the Search panel, enter join datasets. Then, select the dataset or recipe that you wish to join with your current dataset.

Image Added


D caption
Select dataset or recipe to join

You can use the Data tab to preview the data in the selected object.

Info

NOTE: You must

...

You can join to any recipe in your current flow.

Tip

Tip: Where possible, join to a recipe instead of an imported dataset. With a recipe, you can switch to a random sample and then to use that in the join.

In the following image, the columns of the current dataset (Point of sale data from region 1) are displayed in the left column. On the right side, you can select the imported dataset, reference dataset, or recipe that you'd like to join with the current dataset. In this case, the REF_PROD.txt file has been selected, meaning that the join is performed on this imported dataset: 

Image Removed

D caption
typefigure
Join Page
  • Select the object to join in. To select the columns (keys) to base on the join and the type of join, click Preview.

Step 2 - Specify Join Parameters

In the preview of the new dataset, you can see how the two datasets have been joined. By default, 

D s product
rtrue
 attempts to specify the keys to use for the join and the type of join to perform.

  • In a join operation, a key corresponds to a column that can be used for matching records between datasets. For example, in a list of products or sales data, the key column is likely to be some form of internal system identifier, which is consistently used across all related datasets.
  • Before you accept the join, you should review the selected join keys and the type of join.
  • You may also need to specify the columns to include in the joined dataset. 
  • As needed, you can change the object with which you are joining the first dataset. Click the Data tab, and then click Change Data

Specify Join Keys

To review and modify the columns used as the keys for the join, click the Join Keys tab.

...

have read access to the object to join it to your dataset.

  • Use the Search bar to locate specific objects.
  • Click Accept.

Step 2 - Select Join Conditions

In the next step, you specify the type of join and one or more join keys (columns).

Image Added

D caption
Specify join type and join keys

Dataset samples:

Mouse over the Dataset samples indicator to see the current samples from the datasets that are part of the join. For more information, see Samples Panel.

Join type:

From the drop-down, select the type of join to apply. For more information, see Join Types.

Join keys:

In the above image, the platform has determined that the item number (Item_Nbr) field of Region 1 data and the item number (ITEM_NBR)  field from REF_PROD should be used as the keys for performing the join.  

Image Removed

D caption
typefigure
Review join keys and join type

Specify Other Join Parameters

...

  • To make changes to the two join keys, mouse over the specified keys:
    • To remove the two columns as join keys, click the

...

    • X icon.

...

To save changes, click OK.

    • To edit the keys to use and other key options, click the Pencil icon. See below.
    • To add more join keys, click Add

      Info

      NOTE: Be careful applying multiple join keys. Depending on the join type, this type of join can greatly expand the size of the generated data.

...

  1. After the currently selected keys, fields from the source 1 are listed, followed by the fields of source 2.
  2. To modify the settings for these columns, click Edit Settings. For more information, see Join Column Settings below.
  3. Use the checkboxes to select the columns to include in the joined dataset.

...

Cross Joins

cross join combines each row of the first data set with each row of the second dataset, where every combination is represented in the output. As a result, the number of total rows in the join are:

Code Block
Rows(DatasetA) * Rows(DatasetB)
Info

NOTE: Depending on the size of your datasets, a cross join can greatly expand the size of the output, which may increase costs in some environments.

To specify a cross join, delete all join keys.

...

Edit keys:

By default, matches between join keys are performed on a strict, case-sensitive matching between key values in the selected columns. In some cases, it may be useful to loosen the conditions under which matches are found.You can select from the The following options to perform more fuzzy joins. These options are applied to the join key columns in both sources to attempt to find matches. After the join is executed, no data in either column is changed based on these selections.


Ignore Case
OptionDescription
fuzzy match

Use a fuzzy matching algorithm for key value matching.

Tip

Tip: Use this option to perform fuzzy join matching of primary keys between datasets.


Info

NOTE: Fuzzy joins can only be applied to String data types. Other data types cannot be fuzzy-matched using the algorithm.

Fuzzy matching uses the doublemetaphone algorithm for matching strings (keys). Both primary encodings of each key value must match. See DOUBLEMETAPHONEEQUALS Function.

Ignore caseIgnore case differences between the join key values for matching purposes.
Ignore Special Charactersspecial charactersIgnore all characters that are not alphanumeric, accented Latin characters, or whitespace, prior to testing for a match.
Ignore WhitespacewhitespaceIgnore all whitespace characters, including spaces, tabs, carriage returns, and newlines.

...

Summary:

You can use these metrics to identify the likelihood of accurate matching between the join keys and the row count generated in the output.

Click Next.

Step 3 - Select Output Columns

From the selected datasets, you can specify the columns to include in the output.

Image Added

 

D caption
Select output columns

Select columns:

Review the list of available columns, which are displayed for both sources. 

  • Use the search panel to search for specific columns.
  • To include all columns:
    • Click the All, Current, or Join-In tabs.
    • Click the checkbox at the top of the list.

Advanced options

Name prefixes

You can apply prefixes to column names in the joined dataset, which can be helpful for tracking the source of a column in complex datasets. For example, you may wish to prepend each column from a dataset called, salesRegion01 with the prefix: sR01.

  • Name prefix for columns in Current data: Enter a prefix to apply to the names of columns sourced from your current dataset that appear in the joined output.
  • Name prefix for columns in Joined_in data: Enter a prefix to apply to the names of columns sourced from the joined-in dataset that appear in the joined output.

Dynamically updating Joins

After you have joined in another set of data, subsequent changes to that data are can be automatically reflected in the output of the join

  • Include all columns from Current data: Dynamic updates always include the latest data from your current dataset.
  • Include all columns from Joined-In data: Dynamic updates always include the latest data from the dataset that you are joining in.
Info

NOTE: After you add your join to the recipe, if the data grid is empty, then the keys that you specified in the join may not have a match in the currently selected sample. You should revisit the keys used in your join. If the join still generates an empty grid on the new current sample, you should collect a new sample. See Samples Panel.

...

Tip

Tip: If you must freeze the data in the dataset that you are joining in, you should create a copy of the dataset as a snapshot and join in the copy. See Dataset Details Page.

To join in the copy, edit the join and change the source that is being joined. See Fix Dependency Issues.

Join Column Settings

In the Join Column Settings dialog, you can apply prefixes to column names in the joined dataset, which can be helpful for tracking the source of a column in complex datasets. For example, you may wish to prepend each column from a dataset called, salesRegion01 with the prefix: sR01.

  • Click the checkbox to dynamically update your selected columns in the Join page to always include all columns from the dataset. 
  • To save your changes, click OK.

Join Types

The following types of joins are supported. For example, the following tables contains information about employees and departments.

Employee table:

...

Department table:

...

In the above example, DepartmentID is the key to use in both tables for any joins.

Inner Join

An inner join requires that key values exist in both tables for the records to appear in the results table. Records appear in the merge only if there are matches in both tables for the key values.

  • If you want to include rows containing non-matching values, you must use some form of an outer join. See below.

For the preceding example tables, an inner join on the DepartmentID table produces the following result table:

...

Notes:

  • All fields are included in the merged result set. Fields from the first dataset are listed first.
  • The row for Mary Martin is excluded, since there is no reference in the Department table for her department identifier. The row for Accounting is excluded, since there is no reference in the Employee table for the department identifier.
    • To include these rows, you either need to augment the data or perform a form of an outer join.
  • A null value in one table does not match a null value in another table. So, rows with null values in a join key are never included in an inner join. These values should be fixed.

    Tip

    Tip: An inner join can be used to eliminate rows with null values in their key fields.

Left Outer Join

A left outer join (or left join) does not require that there be matching records for each value in the key value of the source (left) table. Each row in the left table appears in the results, regardless of whether there are matches in the right table.

For the preceding example tables, a left join on the DepartmentID table produces the following result table:

...

Notes:

  • In this left join, the Mary Martin row has been added to the result, since her record in the Employee table does contain an entry for the DepartmentID. However, since there are no corresponding values in the Department table, the corresponding fields in the result table are NULL values.

Right Outer Join

A right outer join (or right join) is the reverse of a left join. A right join does not require that there be matching records for each value in the key value of the secondary (right) table. Each row in the right table appears in the results, regardless of whether there are matches in the left table.

For the preceding example tables, a right join on the DepartmentID table produces the following result table:

...

Notes:

  • In this right join, the Accounting entry is added. However, since there is no entry in the Employee table for the DepartmentID value, those fields are NULL values in the result set.

Full Outer Join

full outer join combines the effects of a left join and a right join. If there is a match between the key values, a row is written in the result.

  • If there is no match for a key value that appears in either table, a single record is written to the result, with NULL values inserted for the fields from the other table. 

...

Notes:

  • Any duplicated rows between joining from left-to-right and from right-to-left are removed from the results.

Joins Together

The following diagram summarizes the relationships between the types of supported joins. In each venn diagram, the area of intersection is the set of records that contain shared key values.

Image Removed

D caption
typefigure
Join Types

After you have created a join, you can modify it through the Recipe panel. For more information, see Edit a JoinClick Save and Continue.

After you have selected your columns and any advanced settings, click Review.

Step 4 - Review Join

Review the join that you have specified. To modify any aspect of it, click the appropriate Edit link.

Image Added

D caption
Review join

To add the specified join to your recipe, click Add to Recipe.