Comment: Published by Scroll Versions from space DEV and version r0822

...

Info

NOTE: If you are installing custom UDFs and the Trifacta node does not have an Internet connection, you should download the Java UDF SDK in an Internet-accessible location, build your custom UDF JAR there, and then upload the JAR to the Trifacta node.

Overview

Each UDF takes one or more inputs and produces a single output value (map only).

Inputs and outputs must be one of the following types:

  • Bool
  • String
  • Long
  • Double
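As a rough illustration of the type constraint above, the following sketch checks whether a value is one of the four listed types. The mapping of Bool to java.lang.Boolean, and the names SupportedTypes and isSupported, are assumptions made for this example; they are not part of the Java UDF SDK.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Java UDF SDK.
final class SupportedTypes {
  // Assumed Java counterparts of Bool, String, Long, and Double.
  private static final List<Class<?>> SUPPORTED =
      Arrays.<Class<?>>asList(Boolean.class, String.class, Long.class, Double.class);

  private SupportedTypes() {}

  // Returns true if the value is an instance of one of the listed types.
  // A null value is reported as unsupported; the UDF decides how to treat nulls.
  static boolean isSupported(Object value) {
    if (value == null) {
      return false;
    }
    for (Class<?> type : SUPPORTED) {
      if (type.isInstance(value)) {
        return true;
      }
    }
    return false;
  }
}
```

Note that under this check an Integer or Float value is rejected, so a UDF working with such values would need to convert them to Long or Double first.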
Tip

Tip: In most cases, a user-defined function requires an input value. If your UDF does not require one, you must create a dummy input as part of your UDF definition.
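The dummy-input approach might look like the following sketch. The TrifactaUDF interface shown here is a local stand-in reconstructed from the method signatures used in the AdderUDF example on this page, so that the sketch compiles without the SDK. ConstantUDF and its return value are hypothetical.

```java
import java.io.IOException;
import java.util.List;

// Local stand-in for the SDK interface, reconstructed from the AdderUDF example.
// Remove it when compiling against the real Java UDF SDK.
interface TrifactaUDF<T> {
  void init(List<Object> initArgs);
  T exec(List<Object> input);
  @SuppressWarnings("rawtypes")
  Class[] inputSchema();
  void finish() throws IOException;
}

// Hypothetical UDF that needs no real input.
// It declares one dummy String input and never reads it.
class ConstantUDF implements TrifactaUDF<String> {
  @Override
  public void init(List<Object> initArgs) {
    // No variables to set.
  }

  @Override
  public String exec(List<Object> input) {
    // The dummy input is intentionally ignored.
    return "constant";
  }

  @Override
  @SuppressWarnings("rawtypes")
  public Class[] inputSchema() {
    // One dummy input column of a supported type.
    return new Class[]{String.class};
  }

  @Override
  public void finish() throws IOException {
    // Nothing to clean up.
  }
}
```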

Known Limitations

  • In the Trifacta application, previews are not available for user-defined functions.
  • Retaining state information across calls to the exec method is unstable. More information is provided below.

    Info

    NOTE: When a recipe containing a user-defined function is applied to text data, any null characters cause records to be truncated by the running environment during Trifacta Photon job execution. In these cases, please execute the job in the Spark running environment.

...

  1. init method: Used for setting private variables in the UDF. This method may be a no-op function if no variables must be set. See Example - Concatenate strings below.

    Tip

    Tip: In this method, perform your data validation on the input parameters, including count, data type, and other constraints.

    Info

    NOTE: The init method must be specified, but it can be empty if there are no input parameters.

  2. exec method: Contains the functionality of the UDF. The output of the exec method must be one of the supported types. It must also match the generic type parameter of the interface. In the following example, a class that implements TrifactaUDF<String> returns a String. This method is run on each record.

    Tip

    Tip: In this method, you should check the number of input columns.

    Warning

    Keeping state that varies across calls to the exec method can lead to unexpected behavior. One-time initialization, such as initializing the regex compiler, is safe, but do not allow state information to mutate across calls to exec. This is a known issue.

  3. inputSchema method: The inputSchema method describes the schema of the list on which the exec method is acting. The classes in the schema must be supported. In practice, use only the input and output types described earlier.
  4. finish method: The finish method is run at the end of the UDF. Typically, it is a no-op.

    Info

    NOTE: If you are executing your UDF on the Spark running environment, the finish method is not invoked at the end of the UDF. Instead, it is invoked as part of the shutdown of the Java VM. As a result, the finish method may never be invoked in situations like a JVM crash.
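The four methods and the warning about state can be sketched together as follows. This is a minimal illustration, not an SDK example: the TrifactaUDF interface is a local stand-in reconstructed from the AdderUDF example on this page, RegexMatchUDF is a hypothetical class, and whether the platform reports an exception thrown from init is an assumption not confirmed by this page.

```java
import java.io.IOException;
import java.util.List;
import java.util.regex.Pattern;

// Local stand-in for the SDK interface, reconstructed from the AdderUDF example.
// Remove it when compiling against the real Java UDF SDK.
interface TrifactaUDF<T> {
  void init(List<Object> initArgs);
  T exec(List<Object> input);
  @SuppressWarnings("rawtypes")
  Class[] inputSchema();
  void finish() throws IOException;
}

// Hypothetical UDF: returns true if the input string contains a match for a regex.
class RegexMatchUDF implements TrifactaUDF<Boolean> {
  // Set once in init and only read in exec, so no state mutates across calls.
  private Pattern _pattern;

  @Override
  public void init(List<Object> initArgs) {
    // Validate the count and data type of the init parameters.
    if (initArgs == null || initArgs.size() != 1 || !(initArgs.get(0) instanceof String)) {
      throw new IllegalArgumentException("RegexMatchUDF takes exactly one String init argument");
    }
    // One-time initialization of the regex compiler.
    _pattern = Pattern.compile((String) initArgs.get(0));
  }

  @Override
  public Boolean exec(List<Object> input) {
    // Check the number of input columns and reject null or non-String values.
    if (input == null || input.size() != 1 || !(input.get(0) instanceof String)) {
      return null;
    }
    return _pattern.matcher((String) input.get(0)).find();
  }

  @Override
  @SuppressWarnings("rawtypes")
  public Class[] inputSchema() {
    return new Class[]{String.class};
  }

  @Override
  public void finish() throws IOException {
    // No-op: nothing to release.
  }
}
```

By contrast, a field such as a running counter that exec increments on every record would be state that mutates across calls, which is the pattern the warning above advises against.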

...

  • The first line indicates that the function is part of the com.trifacta.trifactaudfs package.
  • The defined UDF class implements TrifactaUDF, which is the base interface for UDFs.
    • It is parameterized with the return type of the UDF (a Java String in this case). 
    • The input into the function is a list with input parameters in the order they are passed to the function within the Trifacta platform. See Running Your UDF below.
  • The UDF checks the input data for null values, and if any nulls are detected, returns a null. 
  • The inputSchema describes the input list passed into the exec method. 
    • An error is thrown if the type of the data that is passed into the UDF does not match the schema.
    • The UDF must handle improper data. See Error Handling below.
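To make the relationship between inputSchema and the exec input list concrete, the following sketch compares an input list against a schema array. SchemaCheck and matches are hypothetical names, and this is only an illustration of the idea; how the platform itself performs the type check is not described on this page.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the Java UDF SDK.
final class SchemaCheck {
  private SchemaCheck() {}

  // Returns true if the input has one value per schema entry and every
  // non-null value is an instance of the class at the same position.
  @SuppressWarnings("rawtypes")
  static boolean matches(Class[] schema, List<Object> input) {
    if (schema == null || input == null || schema.length != input.size()) {
      return false;
    }
    for (int i = 0; i < schema.length; i++) {
      Object value = input.get(i);
      // Nulls pass the check so that the UDF can decide to return null for them.
      if (value != null && !schema[i].isInstance(value)) {
        return false;
      }
    }
    return true;
  }
}
```

A UDF could call a check like this at the top of exec and return null when it fails, rather than letting a ClassCastException escape.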

Example - Add by constant

...

  • The init method consumes a list of objects, each of which can be used to set a variable in the UDF. The input into the init function is a list with parameters in the order they are passed to the function within the Trifacta platform. See Running Your UDF below.
Example UDF: AdderUDF
package com.trifacta.trifactaudfs;
import java.io.IOException;
import java.util.List;

/**
 * Example UDF. Adds a constant amount to a Long (integer) column.
 */
public class AdderUDF implements TrifactaUDF<Long> {
  private Long _addAmount;
  @Override
  public void init(List<Object> initArgs) {
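    if (initArgs == null || initArgs.size() != 1) {
      // Fail fast: continuing without the constant would only fail later in exec.
      throw new IllegalArgumentException("AdderUDF takes in exactly one init argument");
    }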
    Long addAmount = (Long) initArgs.get(0);
    _addAmount = addAmount;
  }
  @Override
  public Long exec(List<Object> input) {
    if (input == null) {
      return null;
    }
    if (input.size() != 1 || input.get(0) == null) {
      // Wrong column count or a null value: return null instead of throwing.
      return null;
    }
    return (Long) input.get(0) + _addAmount;
  }
  @SuppressWarnings("rawtypes")
  public Class[] inputSchema() {
    return new Class[]{Long.class};
  }
  @Override
  public void finish() throws IOException {
  }
}

...

Info

NOTE: Custom UDFs should be compiled to one or more JAR files. Avoid using the example JAR filename, which can be overwritten on upgrade.

 

JDK version mismatches

To avoid an Unsupported major.minor version error during execution, the JDK version used to compile the UDF JAR file should be less than or equal to the JDK version on the Hadoop cluster.
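One way to check which JDK level a compiled class targets is to read the version fields in the class file header. The sketch below is a hypothetical helper (ClassFileVersion is not part of the SDK) that relies only on the standard class file layout: a 4-byte magic number 0xCAFEBABE, then a 2-byte minor version and a 2-byte major version.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of the Java UDF SDK.
final class ClassFileVersion {
  private ClassFileVersion() {}

  // Returns the class file major version from the first 8 bytes of a .class file.
  static int majorVersion(byte[] classBytes) {
    if (classBytes == null || classBytes.length < 8) {
      throw new IllegalArgumentException("Too short to be a Java class file");
    }
    ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
    if (buf.getInt() != 0xCAFEBABE) {
      throw new IllegalArgumentException("Not a Java class file");
    }
    buf.getShort();                 // minor version, ignored here
    return buf.getShort() & 0xFFFF; // major version as an unsigned value
  }

  // Maps a major version to the Java release number (for Java 5 and later).
  static int javaRelease(int majorVersion) {
    return majorVersion - 44;
  }
}
```

For example, a class compiled for Java 8 has major version 52. If the value computed for a class extracted from your UDF JAR is higher than the release running on the cluster, the class was compiled for a newer JDK than the cluster provides.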

...