Terminology applicable to the product.
NOTE: This list is not comprehensive.
These terms apply to the objects that you import, create, and generate in the product.
An integration between the product and a datastore, through which data is read from and optionally written to the store. A connection can be read-only or read-write, depending on the type. Some connections are provided by default.
Other connections can be created through the Connections page.
An imported dataset that has been created with parameterized references, typically used to collect multiple assets stored in similar locations or filenames containing identical structures. For example, if you stored orders in individual files for each week in a single directory, you could create a dataset with parameters to capture all of those files in a single object, even if more files are added at a later time.
The path to the asset or assets is specified with one or more of the following types of parameters: Datetime, pattern, regular expression, wildcard, or variable.
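For illustration only, here is a minimal Python sketch (not the product's implementation) of how a wildcard and a Datetime-style pattern can collect several weekly order files into one logical dataset. The directory and filenames are hypothetical.

```python
import fnmatch
import re

# Hypothetical weekly order files stored in a single directory.
files = [
    "orders/orders-2023-01-02.csv",
    "orders/orders-2023-01-09.csv",
    "orders/orders-2023-01-16.csv",
    "orders/summary.txt",
]

# Wildcard parameter: any file matching the pattern is included.
wildcard_matches = fnmatch.filter(files, "orders/orders-*.csv")

# Datetime-style parameter expressed as a regular expression:
# four-digit year, two-digit month, two-digit day.
datetime_pattern = re.compile(r"orders/orders-\d{4}-\d{2}-\d{2}\.csv$")
datetime_matches = [f for f in files if datetime_pattern.search(f)]

print(wildcard_matches)   # three weekly files; summary.txt is excluded
print(datetime_matches)   # same three files, matched by their date structure
```

New files added to the directory later that match the same pattern would be picked up the next time the dataset is read.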
A dataset that is created by applying a custom SELECT statement to a relational datasource. You can use custom SQL statements to change the scope of your imported dataset, instead of importing a single, entire table.
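As a rough analogy only (not the product's import mechanism), the following Python sketch uses the standard-library sqlite3 module to show how a custom SELECT narrows what is read from a relational source instead of pulling the entire table. The table and column names are hypothetical.

```python
import sqlite3

# Build a small in-memory table to stand in for a relational datasource.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "west", 20.0), (2, "east", 35.5), (3, "west", 12.25)],
)

# Importing the whole table would be equivalent to: SELECT * FROM orders
# A custom SELECT changes the scope to only the rows and columns needed.
custom_sql = "SELECT id, total FROM orders WHERE region = 'west'"
rows = conn.execute(custom_sql).fetchall()
print(rows)  # [(1, 20.0), (3, 12.25)]
```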
A mechanism for versioning the publication of your flows. In Deployment Manager, packages are imported as new releases assigned to a deployment. Within a deployment, you can choose which release is active, which controls the version of your flows that is published. See Overview of Deployment Manager.
A container for holding a set of related imported datasets, recipes, and output objects. Flows are managed in the Flow View page.
A reference to an object that contains data to be wrangled in the product. An imported dataset is created when you specify the file(s) or table(s) that you wish to read through a connection.
A job is the set of processing tasks that applies each step of your recipe, in order, across the entire dataset to generate the desired set of results.
A macro is a sequence of one or more reusable recipe steps. Macros can be configured to accept parameterized inputs, so that their functionality can be tailored to the recipe in which they are referenced.
Associated with a recipe, an output is a user-defined set of files or tables, formats, and locations where results are written after a job run on the recipe has completed.
An output may contain one or more destinations, each of which defines a file type, filename, and location where the results of the output are written.
You can create variable or timestamp parameters that can be applied to parts of the file or table paths of your outputs. Variable values can be specified at the time of job execution.
A flow imported into a Production instance of the platform. A package contains a JSON-based definition of the flow in a ZIP file. On import, rules must be created to modify any mappings to connections or paths to datasets that may have changed from those of the platform instance from which the package was exported. See Overview of Deployment Manager.
An object that captures a pattern of values that define some part of the path to the set of files or tables to include in a single imported dataset.
The delivery of a set of results generated by the platform to another system.
A sequence of steps that transforms one or more datasets into a desired output. Recipes are built in the Transformer page using a sample of the dataset or datasets. When a job is executed, the steps of the recipe are applied in the listed order to the imported dataset or datasets to generate the output.
A pointer to the output of a recipe. A reference can be used in other flows, so that those flows get the latest version of the output from the referenced recipe.
A reference that has been imported into another flow.
A specific instance of a release that has been imported into the Deployment Manager. See Overview of Deployment Manager.
A set of generated files or tables containing the results of processing a selected recipe, its datasets, and all upstream dependencies.
Optionally, you can create a profile of your generated results. This profile is available through the Job Details page and may assist in analyzing or troubleshooting issues with your dataset. See Overview of Visual Profiling.
When you review and interact with your data in the data grid, you are seeing the current state of your recipe applied to a sample of the dataset. If the entire dataset is smaller than the defined limit, you are interacting with the entire dataset.
You can create new samples using one of several supported sampling techniques. See Overview of Sampling.
You can associate a single schedule with a flow. A schedule is a combination of one or more trigger times and the one or more scheduled destinations that are generated when the trigger is hit. A schedule must have at least one trigger and at least one scheduled destination in order to work.
When a schedule's trigger is fired, each recipe that has a scheduled destination associated with it is queued for execution. When the job completes, the outputs specified in the scheduled destination are generated. A recipe may have only one scheduled destination, and a scheduled destination may have multiple outputs (publishing actions) associated with it.
A set of columns, their order, and their formats to which you are attempting to wrangle your dataset. You can assign a target to your recipe, and its schema can be superimposed on the columns in the data grid, allowing you to make simple selections to transform your dataset to match the column names, order, and formats of the target. See Overview of RapidTarget.
A trigger is a periodic time associated with a schedule. When a trigger's time occurs, all recipes in the flow with scheduled destinations are queued for execution. A schedule can have multiple triggers.
A replacement for the parts of a file path to data that change with each refresh. A variable can be overwritten as needed at job runtime.
These terms apply to the application, a web-based interface for interacting with your datasets, flows, and recipes.
Create or modify scheduled executions of your flow.
Feature that enables automated execution of flows according to user-defined schedules. See Overview of Automator.
Browse the columns of your dataset, then select one or more columns and perform operations on them. See Column Browser Panel.
Examine details and profile of the data in the selected column. See Column Details Panel.
Perform transformation operations on the selected column from a list of menu options, including changing the column data type. See Column Menus.
At the top of the column, review the counts of values in the column. Select one or more values in the column through the histogram. See Column Histograms.
Create or edit connections to external storage. See Connections Page.
In the Transformer page, the data grid displays a sample of the dataset at the currently selected step in the recipe. Make selections in the dataset to prompt suggestions for transformations to add to your recipe. See Data Grid Panel.
Review color-coded counts of valid, missing, and mismatched values in your column based on the column's data type. Select a color bar to be prompted with suggestions for transformations on the relevant rows. See Data Quality Bars.
Change the data type for the column from the icon to the left of the column header. See Column Menus.
Deploy Production versions of your flows through the Deployment Manager. See Overview of Deployment Manager.
Examine details about your dataset, including source of data and other information. See Dataset Details Page.
Create, manage, and export your flows. See Flows Page.
Build your flow objects, including recipes, outputs, and references. See Flow View Page.
Landing page after login. See Home Page.
Import data from a valid connection as an imported dataset. See Import Data Page.
Manage your imported datasets and reference objects. See Library Page.
Review the list of jobs that you have launched. View status, explore job details, and export results. See Jobs Page.
Review the details of your job, including an optional profile of the resulting data. See Job Details Page.
Publish results to an external system. See Publishing Dialog.
Feature that enables matching of columns and data types of your dataset with a pre-defined target schema.
Add, edit, and remove steps from your current recipe. Apply changes and see updates immediately in the data grid sample.
Configure job, visual profiling, and job outputs before launching. See Run Job Page.
Review, create, and delete samples for the current recipe.
Search for transformations to build as the next step in your recipe. See Search Panel.
Review and modify settings. See Settings Page.
Share your flow or send a copy of it to other users.
Based on selections you make in the data grid, you can review profiling information and a set of suggested transformations to add to your recipe. See Selection Details Panel.
Select from common transformations in a toolbar across the top of the data grid. See Transformer Toolbar.
Review and customize transformation steps. See Transform Builder.
Review sampled data, explore suggestions and previews, and build transformation steps. See Transformer Page.
Review and modify settings applicable to your user account. See User Profile Page.
Review and toggle the visibility of the columns in your dataset. See Visible Columns Panel.
These terms pertain to building recipes in the Transformer page.
An input to a function. See Wrangle Language.
Several functions can be used to group values in a column into bins, which can assist in preparing your data for downstream use. See Prepare Data for Machine Processing.
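As an illustration of the binning concept (independent of any specific product function), this Python sketch groups numeric values into labeled bins; the cut points and labels are arbitrary examples.

```python
from bisect import bisect_right

# Group numeric values into labeled bins to prepare them for downstream use.
values = [3, 17, 25, 42, 58, 77, 91]
edges = [25, 50, 75]                        # example cut points
labels = ["0-24", "25-49", "50-74", "75+"]  # one label per resulting bin

binned = [(v, labels[bisect_right(edges, v)]) for v in values]
print(binned)  # e.g. (3, '0-24'), (25, '25-49'), (91, '75+')
```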
A data type is the set of constraints on expected values in a column. When you specify the data type for a column, you provide a means for the platform to identify the values in the column that do not match the selected type, which assists in wrangling the mismatched values. See Supported Data Types.
Data types can be selected from the column menus. See Column Menus.
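To make the idea of mismatched values concrete, here is a small Python sketch (not the product's validation logic) that flags values in a column that cannot be parsed as a declared Integer type; the sample values are made up.

```python
# Values in a column declared as Integer; some do not match the type.
column = ["12", "7", "unknown", "42.5", "", "19"]

def is_integer(value):
    """Return True if the value can be parsed as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False

mismatched = [v for v in column if v != "" and not is_integer(v)]
missing = [v for v in column if v == ""]
print(mismatched)  # ['unknown', '42.5']
print(missing)     # ['']
```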
An input to a recipe that is not the primary datasource for the recipe. For example, if your recipe includes a join step, the dataset that is joined into your recipe is an upstream dependency. Recipe steps and changes outside of the application can create dependency errors, in which an upstream object can no longer be found and the reference to it cannot be resolved. These issues must be fixed prior to successful execution of a job. For more information, see Fix Dependency Issues.
A dictionary is an external file that can be used to define the accepted values for a custom data type. You can create custom data types using an enumerated list of accepted values or by regular expression. See Create Custom Data Types.
A file's encoding defines the set of characters that are in use in the file. There are many different encoding systems in use around the world. To represent the English language, which uses a 26-character alphabet, UTF-8 is sufficient. However, to represent Asian character sets, which may contain thousands of characters, a different and broader set of characters is required. See Supported File Encoding Types.
When a file is imported, the product assumes that the file is in the default encoding type. As needed, you can change the encoding type that is used to import the file. See Change File Encoding.
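The effect of declaring the wrong encoding can be shown with a short Python sketch; the sample text is arbitrary, and this is not how the product performs the conversion.

```python
# The same bytes decode differently depending on the declared encoding.
text = "café"
raw = text.encode("utf-8")        # b'caf\xc3\xa9'

print(raw.decode("utf-8"))        # 'café'  -- correct encoding
print(raw.decode("latin-1"))      # 'cafÃ©' -- wrong encoding garbles the value
```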
A function is an action that is applied to a set of values as part of a transformation step. A function can take zero or more parameters as inputs, yielding a single output of a specific data type. For a list of supported functions, see Language Index.
When a file-based dataset is imported, the product attempts to detect the format and structure of the data and then applies a set of initial parsing steps to transform the data for display in tabular form in the data grid. These steps may vary depending on the file format. See Initial Parsing Steps.
These steps do not appear in the recipe. As needed, you can disable the detection of structure on import. When disabled, these steps are added as the first steps of the recipe, where you can edit or remove them as needed. See Remove Initial Structure.
This database concept can be applied to datasets. In a join, two datasets are merged into one, based on a set of key columns. Values in these columns that match across the datasets are used to determine the values from each dataset to include in the joined dataset. See Join Types.
Joins are created as steps in your recipe. See Join Panel.
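As an analogy only (the product performs joins through recipe steps, not pandas), this Python sketch shows an inner join of two small, made-up datasets on a key column.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3], "total": [20.0, 35.5, 12.25]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Bo"]})

# Inner join: only rows whose key values match in both datasets are kept.
joined = pd.merge(orders, customers, on="customer_id", how="inner")
print(joined)  # customer_id 3 is dropped because it has no match
```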
A retrieval of a row of values from another dataset based on common values in columns in each dataset. A lookup is useful for bringing in reference information based on values in one of the columns of your dataset. See Lookup Wizard.
Values in a column that do not conform to the range or format of expected values for the column's data type.
Cell values in the dataset that are empty.
A multi-dataset (MDS) operation refers to any step in your recipe that uses two or more datasets. Joins and unions are examples of multi-dataset operations.
An expression that is inside another expression. Example:
POWER(ABS(colA),colB)
The product supports the use of nested expressions in your recipe steps. See Wrangle Language.
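The same nesting can be expressed in Python, where the inner expression is evaluated first and its result becomes an argument to the outer function; the column values here are hypothetical.

```python
# Equivalent of POWER(ABS(colA), colB): evaluate ABS first, then POWER.
col_a = -3
col_b = 2

result = pow(abs(col_a), col_b)  # abs(-3) -> 3, then 3 ** 2 -> 9
print(result)
```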
A value that does not exist in the dataset. See Manage Null Values.
A single character that represents an arithmetic function or comparison. For example, the plus sign (+) represents the add function.
Operator Category | Description
---|---
Logical Operators | and, or, and not operators
Numeric Operators | Add, subtract, multiply, and divide
Comparison Operators | Compare two values with greater-than, equals, not-equals, and less-than operators
Ternary Operators | Use ternary operators to create if/then/else logic in your transforms
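For comparison, the following Python snippet shows the same four operator categories; the column names and threshold are made up for illustration.

```python
score = 85
region = "west"

arithmetic = score + 5                   # numeric operator
passed = score >= 70                     # comparison operator
eligible = passed and region == "west"   # logical operator
label = "pass" if passed else "fail"     # ternary-style if/then/else logic

print(arithmetic, passed, eligible, label)
```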
In statistics, an outlier is a value that is unusually far above or below the mean. In the product, a value is considered an outlier if it is more than 4 standard deviations from the mean.
You can review outliers for column values. See Column Statistics Reference.
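A minimal Python sketch of the 4-standard-deviation rule described above; the sample values are arbitrary, and the product computes its own statistics. A larger sample is used here because, in a handful of rows, a single extreme value inflates the standard deviation so much that nothing can sit 4 deviations out.

```python
import statistics

# Twenty typical values plus one extreme value.
values = [10] * 10 + [11] * 10 + [100]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# Flag values more than 4 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 4 * stdev]
print(outliers)  # [100]
```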
An input to a transform. See Wrangle Language.
A pattern is an object that describes a sub-string within a value. Patterns can be described using regular expressions, a common standard, or a proprietary simplification of regular expressions. See Text Matching.
Patterns are widely used in the product for identifying and extracting values from data, validating data types, and supporting pattern-based suggestions.
Regular expressions are a powerful yet complex method of describing patterns of values for matching purposes. See Text Matching.
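As a generic example of pattern matching with regular expressions (using Python's standard re module rather than anything product-specific), this snippet extracts the domain portion of email addresses; the sample values are made up.

```python
import re

emails = ["ana@example.com", "bo@test.org", "not-an-email"]

# A simple (deliberately loose) pattern for the domain part of an email address.
domain_pattern = re.compile(r"@([\w.-]+)$")

domains = []
for value in emails:
    match = domain_pattern.search(value)
    domains.append(match.group(1) if match else None)

print(domains)  # ['example.com', 'test.org', None]
```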
The row number for a record as it appeared in the original dataset. Source row number information can be obtained with the SOURCEROWNUMBER function. This function may return a null value if multi-dataset operations, such as union and join, have been performed on the dataset. See SOURCEROWNUMBER Function.
A source metadata reference is a programmatic reference to some aspect of the source file for your dataset. Using these programmatic references, you can write source information for your original datasource into your dataset for future reference. For more information, see Source Metadata References.
String collation refers to a method of comparing strings based on a set of rules. The product includes functions to perform string collation-based comparisons.
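To illustrate what collation-sensitive comparison means in general (using Python's locale module, not the product's functions), the following sketch sorts strings under the current locale's rules; results depend on the locale configured on the machine.

```python
import locale
from functools import cmp_to_key

# Use the system's default locale rules for comparison.
locale.setlocale(locale.LC_COLLATE, "")

words = ["zebra", "apple", "Éclair", "eagle"]

# Plain sort compares raw code points; collation-aware sort uses locale rules,
# so accented characters can be ordered near their unaccented counterparts.
print(sorted(words))
print(sorted(words, key=cmp_to_key(locale.strcoll)))
```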
A transformation is the unit of action in a recipe step. A transformation applies one or more actions on a set of rows or columns. Transformations are specified in the Transformer page through the Transform Builder. See Transform Builder.
For a list of available transformations, see Transformation Reference.
A transform is an action that is applied to rows or columns of your dataset. A transform can take zero or more parameters as inputs. A parameter may contain a reference to a column, a literal value, or a function.
NOTE: Transforms are not available directly through the application interface.
For a list of supported transforms, see Language Index.
These patterns are a simplification of regular expressions: custom selectors for patterns in your data that provide a simpler and more readable alternative to regular expressions. See Text Matching.
A union combines two or more datasets such that the rows of the second and later datasets are appended to the end of the first dataset. In a union operation, the columns must be matched up, or the results are a ragged dataset.
Unions are created as steps in your recipe. See Union Page.
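Continuing the pandas analogy used for joins (again, not the product's engine), this sketch appends the rows of a second dataset to the first; note how a column missing from one input produces gaps, i.e. a ragged result.

```python
import pandas as pd

week1 = pd.DataFrame({"order_id": [1, 2], "total": [20.0, 35.5]})
week2 = pd.DataFrame(
    {"order_id": [3, 4], "total": [12.25, 8.0], "region": ["west", "east"]}
)

# Union: rows of week2 are appended after the rows of week1.
# Columns are matched by name; 'region' is missing from week1, so those
# cells are filled with NaN -- the "ragged" result mentioned above.
combined = pd.concat([week1, week2], ignore_index=True)
print(combined)
```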
An informal term for the process of data preparation. The term was coined by the company's co-founders.
These terms apply to administration of your workspace and the underlying platform.
A page in the application where administrators can configure platform users, settings, and other configuration options. See Admin Settings Page.
A page in the application where provisioned users can manage their deployments for a Production instance. Users must have the Deployment role in their account, or the entire instance must be configured as a Production instance. See Deployment Manager Page.
These terms apply to the underlying platform.
A data serialization format for Hadoop. For more information, see Supported File Formats.
Short for Application Programming Interface, the platform APIs permit developers programmatic access to platform actions from outside of the application interface. For more information, see API Reference.
A platform service for queuing and managing the execution of jobs through external running environments. For more information, see Configure Batch Job Runner.
A data serialization format for Hadoop. For more information, see Supported File Formats .
The application can be accessed through a supported version of Google Chrome. For more information, see Desktop Requirements.
Short for Command Line Interface, the CLI enables a number of platform tasks to be launched from the command line outside of the application interface. Developers can manage jobs, users, and connections through the CLI.
NOTE: The CLI will be deprecated in a future release. Please use the APIs instead. For more information, see Changes to the Command Line Interface.
For more information, see Command Line Interface.
Time-based job scheduling format. The platform supports a modified form of cron. For more information, see cron Schedule Syntax Reference.
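As a generic illustration of cron-style syntax (the platform's modified form may differ; see the reference above), this Python snippet labels the fields of a standard five-field cron expression.

```python
# Standard five-field cron expression: minute hour day-of-month month day-of-week.
schedule = "30 2 * * 1"   # 02:30 every Monday

fields = dict(
    zip(["minute", "hour", "day_of_month", "month", "day_of_week"],
        schedule.split())
)
print(fields)
```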
A platform service for managing connections and interactions with relational storage. For more information, see Configure Data Service.
A file format for compression and decompression. For more information, see Supported File Formats .
The process by which relational datasources can be retrieved from their origin and transferred to the backend datastore of the platform, which improves performance in sampling and job execution. For more information, see Configure JDBC Ingestion.
JavaScript Object Notation (JSON) is a human-readable format for transmitting data objects. For more information, see Supported File Formats.
Microsoft Excel workbooks and worksheets can be used as imported datasets in the platform. For more information, see Import Excel Data .
An open-source relational database management system. MySQL can host the platform databases. For more information, see System Requirements.
The process by which computer systems use data as inputs for algorithms and statistical models to make decisions and perform tasks.
The process by which actions in the platform can be applied and scheduled in production environments.
An in-browser client for managing the sampling and transformation of data within the web application. For more information, see Configure Photon Client.
An open-source relational database management system. PostgreSQL can host the platform databases. For more information, see System Requirements.
Predictive transformation serves as the foundation of the design principles for how users interact with their data. For more information, see Overview of Predictive Transformation.
When a job is executed against a dataset, users can optionally choose to generate a visual profile of the results, which is processed as a separate job after the transformation job has completed. For more information, see Run Job Page.
One of several environments where transformation, profiling, and sampling jobs can be executed. The platform integrates with these environments and manages the queuing and monitoring of the jobs asynchronously, minimizing performance impacts on the platform. For more information, see Running Environment Options.
Users can optionally share flows and connections with other users. For more information, see Overview of Sharing.
Short for Single Sign-On, SSO enables users to access multiple systems within the enterprise domain through one set of credentials. The platform can integrate with multiple types of SSO. For more information, see Configure SSO for AD-LDAP.
A fast compression and decompression format. For more information, see Supported File Formats .
A native format for the Tableau data visualization platform. The platform can generate results in TDE format. For more information, see Supported File Formats.
The process by which a recipe is applied across the entire dataset to generate results at the specified output locations. For more information, see Run Job Page.
The primary configuration file of the platform. This file is stored in JSON format on the platform node.
NOTE: Administrators should perform platform configuration operations through the Admin Settings page, where possible. See Admin Settings Page.
For more information, see Platform Configuration Methods.
Short for user-defined function, a UDF is an externally developed function that can be used in your recipes to apply custom transformation logic. Building UDFs requires developer skills. For more information, see User-Defined Functions.
A platform service that can be optionally invoked to generate visual profiles of generated results for display in the application. For more information, see Overview of Visual Profiling.
Here are a few terms that are specific to Hadoop and Hadoop-based clusters.
A Hadoop-based platform for storing large volumes of data and performing analytics on them. For more information, see Supported Deployment Scenarios for Cloudera.
With respect to the platform, a cluster is a remote collection of nodes for processing platform jobs and returning results. The platform supports integration with multiple types of clusters for job processing.
An open-source framework of utilities for managing analytics and data processing jobs across a network of many nodes in a cluster. Hadoop is scalable and extensible and well-suited for processing very large data volumes.
Short for Hadoop Distributed File System, HDFS is a backend datastore for Hadoop-based clusters. Files are stored in large blocks distributed across many nodes of the cluster. Applications and users can interact with the files through a virtual file browser. For more information, see Using HDFS.
High availability refers to a general concept of automated redundancy and failover to backup servers when a primary server is down. The platform can integrate with high availability functions of a Hadoop-based cluster. For more information, see Enable Integration with Cluster High Availability.
Hortonworks is the maker of the Hortonworks Data Platform (HDP), with which the platform can integrate for job execution and data storage. For more information, see Supported Deployment Scenarios for Hortonworks.
One of two supported communications protocols between the platform and HDFS, HttpFS utilizes HTTP protocol and is required in some deployments. For more information, see Enable HttpFS.
Kerberos provides secure protocols for authentication across a variety of platforms. For more information, see Configure for Kerberos Integration.
Short for Key Management System, KMS for Hadoop clusters is supported by the platform. For more information, see Configure for KMS.
Authorization service for Hadoop clusters. An Apache product supported by Hortonworks. For more information, see Configure for KMS for Ranger.
Authorization service for Hadoop clusters. An Apache product supported by Cloudera. For more information, see Configure for KMS for Sentry.
WebHDFS is the default protocol for communicating between the platform and HDFS. For more information, see Prepare Hadoop for Integration with the Platform.
Hadoop resource manager.
These terms apply to Amazon Web Services, where the product can be hosted.
Short for Amazon Web Services, AWS is a cloud-based platform for developing and deploying applications. For more information, see Configure for AWS.
Elastic Compute Cloud (Amazon EC2) is a web-based service for running applications in the Amazon Web Services (AWS) public cloud. The product can be deployed on an EC2 instance.
Short for Elastic MapReduce, EMR is a Hadoop-based platform purpose-built for managing large datasets on AWS. See Configure for EMR.
An identity and access management (IAM) role defines a set of permissions for making AWS requests. Trusted entities, such as IAM users, applications, or AWS services, assume roles. The product can use IAM roles to enable access to EC2-based resources controlled by the enterprise. For more information, see Configure for EC2 Role-Based Authentication.
Amazon Relational Database Service (RDS) is a relational database management system available in the AWS cloud. The databases required by the product can be installed on Amazon RDS. See Install Databases on Amazon RDS.
A hosted data warehouse solution available through AWS. The product can connect to Redshift databases. See Using Redshift.
Simple Storage Service (S3) is an online storage service provided by AWS. The product can use S3 as the backend storage system or integrate with it as secondary storage. See Using S3.
Metastore for Hive datasets, which can be used as a source of imported datasets. See Enable AWS Glue Access.
These terms apply to Microsoft Azure, where the product can be hosted, and its available datastores and services.
Azure Data Lake Store (ADLS) is a scalable big data repository built on top of HDInsight. See Using ADLS.
Microsoft Azure is a cloud computing service for building, managing, and deploying applications. See Configure for Azure.
Spark-based analytics running environment built specifically for Microsoft Azure. See Configure for Azure Databricks.
An open-source Hadoop-based platform for storage and analytics in the Microsoft Azure platform. See Configure for HDInsight.
Windows Azure Storage Blob (WASB) is an abstraction layer on top of HDFS for storage across multiple clusters. See Using WASB.
Unix time (a.k.a. POSIX time or Epoch time) is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
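A quick Python check of the definition above:

```python
from datetime import datetime, timezone

# Unix time 0 corresponds to 1970-01-01 00:00:00 UTC.
print(datetime.fromtimestamp(0, tz=timezone.utc))   # 1970-01-01 00:00:00+00:00

# Convert an arbitrary Unix timestamp (seconds since the epoch) to UTC.
print(datetime.fromtimestamp(1_600_000_000, tz=timezone.utc))  # 2020-09-13 12:26:40+00:00
```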