This release includes a number of key bug fixes and updates.
|TD-31581||Editing joins in reconvergent flows fails with an error message.|
|TD-31509||Undo not persisted back to server after sample has been collected and loaded.|
|TD-31399||Join "select-all" performance is slower and can cause browser to hang.|
|TD-31327||Unable to save dataset sourced from multi-line custom SQL on dataset with parameters.|
|TD-31305||Copying a flow invalidates the samples in the new copy. Copying or moving a node within a flow invalidates the node's samples.|
|TD-31165||Job results are incorrect when a sample is collected and then the last transform step is undone.|
The following security-related fixes were completed in this release.
In Apache Log4j 2.x before 2.8.2, when using the TCP socket server or UDP socket server to receive serialized log events from another application, a specially crafted binary payload can be sent that, when deserialized, can execute arbitrary code.
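The advisory above concerns Java serialization in Log4j's socket servers. As an illustrative analogy only (not the Log4j code path), the same class of bug can be sketched in Python with pickle, whose wire format can encode "call this callable with these arguments":

```python
import pickle

# Analogy only -- models why deserializing untrusted bytes is dangerous.
# A pickle stream can instruct the loader to invoke an arbitrary callable.
class Payload:
    def __reduce__(self):
        # On load, pickle calls print(...) -- a stand-in for any code.
        return (print, ("code ran during deserialization",))

blob = pickle.dumps(Payload())   # the bytes an attacker would send
result = pickle.loads(blob)      # prints the message; no Payload is rebuilt
```

The fix in both ecosystems is the same: never deserialize data from an untrusted source with a format that can encode object construction.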
|TD-32712||Upgrade Apache portable runtime to latest version to address security vulnerability.|
|TD-32711||Upgrade Python version to address security vulnerability.|
Multiple integer overflows in libgfortran might allow remote attackers to execute arbitrary code or cause a denial of service (Fortran application crash) via vectors related to array allocation.
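As a hedged sketch of this bug class (not libgfortran's actual code): an allocation-size computation done in fixed-width arithmetic can wrap around, so a huge element count yields a tiny buffer that later writes overrun.

```python
# Sketch only: models a 32-bit overflow in an allocation-size computation,
# the class of bug described above. Python integers do not overflow, so the
# 32-bit wraparound is applied explicitly with a mask.
def alloc_size_32(count: int, elem_size: int) -> int:
    return (count * elem_size) & 0xFFFFFFFF  # truncate to 32 bits

print(alloc_size_32(10, 8))           # 80 -- the expected size for small inputs
print(alloc_size_32(0x40000000, 8))   # 0 -- the product wrapped around
```

A zero- or tiny-sized allocation paired with the original huge element count is what lets subsequent array writes corrupt memory.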
Hawk before 3.1.3 and 4.x before 4.1.1 allow remote attackers to cause a denial of service (CPU consumption or partial outage) via a long (1) header or (2) URI that is matched against an improper regular expression. Upgrade the version of less to address this security vulnerability.
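The "improper regular expression" failure mode can be sketched as follows (an illustrative pattern, not Hawk's actual expression): nested quantifiers backtrack exponentially on inputs that almost match, so a long crafted header consumes CPU.

```python
import re

# Illustrative only -- not Hawk's regex. (a+)+ nests quantifiers, so a
# near-matching input forces exponential backtracking; each additional 'a'
# roughly doubles the work. The input is kept short so this terminates.
redos_prone = re.compile(r"^(a+)+$")
near_miss = "a" * 18 + "b"
result = redos_prone.match(near_miss)
print(result)  # None -- but only after extensive backtracking
```

With a few dozen more characters the same match would take minutes, which is the CPU-consumption denial of service described above.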
Spring Security (Spring Security 4.1.x before 4.1.5, 4.2.x before 4.2.4, and 5.0.x before 5.0.1; and Spring Framework 4.3.x before 4.3.14 and 5.0.x before 5.0.3) does not consider URL path parameters when processing security constraints. By adding a URL path parameter with special encodings, an attacker may be able to bypass a security constraint. The root cause of this issue is a lack of clarity regarding the handling of path parameters in the Servlet Specification. Some Servlet containers include path parameters in the value returned for getPathInfo() and some do not. Spring Security uses the value returned by getPathInfo() as part of the process of mapping requests to security constraints. In this particular attack, different character encodings used in path parameters allows secured Spring MVC static resource URLs to be bypassed.
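The bypass class described above can be modeled with a minimal sketch (illustrative code only, not Spring Security's implementation): a constraint check that matches the raw request path disagrees with a container that strips path parameters before resolving the resource.

```python
# Illustrative sketch of the bug class above -- not Spring's actual code.
# The security layer matches the raw path; the container strips ';name=value'
# path parameters before serving, so the two see different paths.

def is_secured(raw_path: str) -> bool:
    """Naive constraint check against the raw path, parameters included."""
    return raw_path.startswith("/admin/")

def resolve(raw_path: str) -> str:
    """Container-style resolution: drop path parameters from each segment."""
    return "/".join(seg.split(";", 1)[0] for seg in raw_path.split("/"))

attack = "/admin;x=1/users"
print(is_secured(attack))   # False -- ';x=1' defeats the prefix match
print(resolve(attack))      # '/admin/users' -- yet this is what gets served
```

Matching security constraints against the same normalized path the container uses to resolve resources closes the gap.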
Apache POI in versions prior to release 3.15 allows remote attackers to cause a denial of service (CPU consumption) via a specially crafted OOXML file, aka an XML Entity Expansion (XEE) attack.
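The XML Entity Expansion pattern referenced above (often called "billion laughs") can be sketched by building the payload as a plain string; it is deliberately never parsed here, since feeding it to a vulnerable parser would pin the CPU.

```python
# Sketch of the XML Entity Expansion ("billion laughs") pattern. Each entity
# expands to `fanout` copies of the previous one, so total expansion grows as
# fanout**depth even though the document itself stays tiny.
depth, fanout = 5, 10
entities = ['<!ENTITY e0 "lol">']
for i in range(1, depth + 1):
    entities.append('<!ENTITY e%d "%s">' % (i, ("&e%d;" % (i - 1)) * fanout))
doctype = "<!DOCTYPE bomb [%s]>" % "".join(entities)

# A naive parser expanding &e5; produces fanout**depth copies of "lol":
print(fanout ** depth)  # 100000 here; billions at depth 9
```

Parsers mitigate this by capping total entity expansion or disabling DTD processing for untrusted input.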
If a user of Commons-Email (typically an application programmer) passes unvalidated input as the so-called "Bounce Address", and that input contains line-breaks, then the email details (recipients, contents, etc.) might be manipulated. Mitigation: Users should upgrade to Commons-Email 1.5. You can mitigate this vulnerability for older versions of Commons Email by stripping line-breaks from data that will be passed to Email.setBounceAddress(String).
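The suggested mitigation for older versions, stripping line breaks before the value reaches Email.setBounceAddress(String), can be sketched generically (Python is used here for illustration; the real API is Java):

```python
# Generic sketch of the mitigation above: remove CR and LF so a crafted
# bounce address cannot inject additional SMTP headers (Bcc, etc.).
def strip_crlf(value: str) -> str:
    return value.replace("\r", "").replace("\n", "")

crafted = "user@example.com\r\nBcc: victim@example.com"
print(strip_crlf(crafted))  # line break removed; the header injection fails
```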
Apache Commons FileUpload before 1.3.3 is subject to a DiskFileItem file manipulation remote code execution vulnerability.
|TD-31627||Transformer Page - Tools||Prefixes added to column names in the Join page are not propagated to subsequent recipe steps that already existed.|
Transformation job on wide dataset fails on Spark 2.2 and earlier due to exceeding Java JVM limit. For details, see https://issues.apache.org/jira/browse/SPARK-18016.
The following issues are sourced from third-party vendors and are impacting the platform.
NOTE: For additional details and the latest status, please contact the third-party vendor listed below.
|External Ticket Number||3rd Party Vendor|
|OPSAPS-39589||Cloudera|
Publishing to Cloudera Navigator:
Within the CDH 5.x product line, Cloudera Navigator only supports Spark 1.x. The platform requires Spark 2.1 or later.
When Spark 2.x jobs are published to Cloudera Navigator, Navigator is unable to detect them, so they are never added to Navigator.
Release 5.0 of the platform delivers major enhancements to the Transformer page and workspace, starting with the new Home page. Key management capabilities simplify the completion of your projects and the management of scheduled job executions. This major release also supports broader connectivity and integration.
Improving user adoption:
The new workspace features a more intuitive design to assist in building your wrangling workflows with a minimum of navigation. From the new Home page, you can quickly access common tasks, such as creating new datasets or flows, monitoring jobs, or revisiting recent work.
Tip: Check out the new onboarding tour, which provides an end-to-end walkthrough of the data wrangling process. It is available to all users on first login after upgrading to the new release.
Significant improvements have been delivered to the core transformation experience. In the Transformer page, you can now search across dozens of pre-populated transformations and functions, which can be modified in the familiar Transform Builder. Use the new Transformer toolbar to build pre-designed transformations from the menu interface.
New for Release 5.0, target matching allows you to import a representation of the final target schema, against which you can compare your work in the Transformer page. Easy-to-understand visual tags show you mismatches between your current recipe and the target you have imported. Click these tags to insert steps that align your columns with their counterparts in the target.
For multi-dataset operations, the new Auto Align feature in the Union tool improves matching capabilities between datasets, and various enhancements to the Join tool improve the experience.
Over 20 new Wrangle functions deliver new Excel-like capabilities to wrangling.
Previously a beta feature, relational connectivity is now generally available, which broadens access to more diverse data. Out-of-the-box, the platform now supports more relational connections with others available through custom configuration. From the Run Jobs page, you can now publish directly to Amazon Redshift.
Build dynamic datasets with variables and parameters. Through parameters, you can apply rules to match multiple files through one platform object, a dataset with parameters. Rules can contain regular expressions, patterns, wildcards, dates, and variables, which can be overridden during runtime job execution through the UI or API. Variables can also be applied to custom SQL datasets.
Using these parameterized datasets allows schedules to pick up new data on each execution run and enables users to pass variable values through the API or UI to select different data to apply to the job.
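The matching behavior described above can be sketched as follows. This is a hypothetical illustration: the file names, the date pattern, and the runtime variable are invented for the example, not the product's actual rule syntax.

```python
import re

# Hypothetical sketch of a parameterized path rule: one pattern with a date
# parameter matches many files through a single dataset-with-parameters.
rule = re.compile(r"^sales_(?P<day>\d{4}-\d{2}-\d{2})\.csv$")
files = ["sales_2018-03-01.csv", "sales_2018-03-02.csv", "readme.txt"]

# The one rule matches every dated file, so new drops are picked up:
matched = [f for f in files if rule.match(f)]

# A variable override at job execution narrows the match to one day:
day_override = "2018-03-02"  # e.g. supplied through the UI or API
selected = [f for f in matched if rule.match(f)["day"] == day_override]
print(matched, selected)
```

A scheduled run without the override would process whatever dated files have arrived; passing the override selects a specific slice of the data.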
Release 5.0 delivers broader and enhanced integration with Microsoft Azure. With a few clicks in the Azure Marketplace, you can deploy the platform into a new or existing HDI cluster. Your deployment can seamlessly integrate with either ADLS or WASB and can be configured to connect to Microsoft SQL Data Warehouse. As needed, integrate with Azure Active Directory for single-sign on simplicity.
Here's what's new in Release 5.0.
Support for CDH 5.14.
NOTE: Support for CDH 5.11 has been deprecated. See End of Life and Deprecated Features.
Support for Spark 2.2.
NOTE: By default, the platform is configured to use Spark 2.1.0. Depending on your environment, you may be required to change the configuration to Spark 2.2, particularly if you are integrating with an EMR cluster. For more information, see Configure for Spark.
The new Home page and left nav bar allow for more streamlined access to recent flows and jobs, as well as learning resources. See Home Page.
Tip: Try the tutorial available from the Home page. See Home Page.
Run Job Page:
NOTE: If you are upgrading an instance that was integrated with an EMR cluster, the EMR cluster ID must be applied to the platform. See Admin Settings Page.
NOTE: If you are integrating with an EMR cluster, EMR 5.7 is no longer supported. Please create an EMR 5.11 cluster instead. See End of Life and Deprecated Features.
|TD-28930||Delete other columns causes column lineage to be lost and reorders columns.|
|TD-28573||Photon running environment executes column splits for fixed length columns using byte length, instead of character length. In particular, this issue affects columns containing special characters.|
|TD-27784||Ubuntu 16 install for Azure: supervisord complains about "missing" Python packages.|
|TD-26069||Photon evaluates |
When creating Tableau Server connections, the Test Connection button is missing.
Copying a flow invalidates the samples in the new copy. Copying or moving a node within a flow invalidates the node's samples.
|TD-31252||Transformer Page - Tools||Assigning a target schema through the Column Browser does not refresh the page.|
Job results are incorrect when a sample is collected and then the last transform step is undone.
Matching file path patterns in a large directory can be very slow, especially if using multiple patterns in a single dataset with parameters.
When creating a new dataset from the Export Results window from a CSV dataset with Snappy compression, the resulting dataset is empty when loaded in the Transformer page.
|TD-30820||Compilation/Execution||Some string comparison functions process leading spaces differently when executed on the Photon or the Spark running environment.|
|TD-30717||Connectivity||No validation is performed for Redshift or SQL DW connections or permissions prior to job execution. Jobs are queued and then fail.|
Spark job run on ADLS cluster fails when Snappy compression is applied to the output.
|TD-30342||Connectivity||No data validation is performed during publication to Redshift or SQL DW.|
Redshift: No support via CLI or API for:
Pre-import previews of Bigint values from Hive or Redshift are incorrect.
In a reference dataset, a UDF from the source dataset is not executed if the new recipe contains a join or union step.
When the platform is restarted or reaches an HA failover state, any running jobs are stuck in the In Progress state indefinitely.