In a Hadoop environment, secure impersonation enables the platform and its users to act as the signed-in user when performing actions on Hadoop. When enabled, you can leverage the permissions infrastructure in your Hadoop cluster to control privacy level, collaboration, and data sharing for your user base. Jobs and job outputs of platform users are owned by the Hadoop principal specified for each user, instead of the Hadoop user [hadoop.user]. The [hadoop.user] user is required even in secure impersonation mode.

Complete the following steps to enable secure impersonation.

NOTE: Secure impersonation requires Kerberos to be applied to the Hadoop cluster. However, you can use Kerberos without enabling secure impersonation, if desired. See Configure for Kerberos Integration.


Users and groups for secure impersonation

On the Hadoop cluster, the platform requires a common Unix or LDAP group containing the [hadoop.user] user and all Hadoop principals used by platform users.

NOTE: In UNIX environments, usernames and group names are case-sensitive. Be sure to use the case-sensitive names for users and groups in your Hadoop configuration and platform configuration.


NOTE: If the HDFS user has restrictions on its use, it is not suitable for use with secure impersonation. Instead, enable HttpFS and use a separate HttpFS-specific user account. For more information, see Configure for Hadoop.

Assuming this group is named [hadoop.group]:

  • Create a Unix or LDAP group [hadoop.group].
  • Make user [hadoop.user] a member of [hadoop.group].
  • Verify that all user principals that use the platform are also members of the [hadoop.group] group.
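
One way to check the group membership described above is to ask Hadoop which groups it resolves for a user. The sketch below assumes that `hdfs groups <user>` prints a line of the form `user : group1 group2`; the user name `alice` and the group name `trifactausers` are hypothetical placeholders, not values from this document. The comparison is case-sensitive, in line with the note on case-sensitive names.

```shell
# Return success if a "user : g1 g2 ..." line lists the wanted group.
# The expected line format is an assumption about `hdfs groups` output.
line_has_group() {
    wanted=$1
    line=$2
    list=${line#*:}              # drop the "user :" prefix
    for g in $list; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Example with a literal line; on a cluster node you might instead pass
# "$(hdfs groups alice)" as the second argument.
if line_has_group trifactausers 'alice : staff trifactausers'; then
    echo "alice is a member of trifactausers"
fi
```

Because the check works on the text that Hadoop reports, it reflects the group mapping the cluster actually uses, which can differ from the local operating system when LDAP group mapping is configured.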

Hadoop configuration for secure impersonation

In your Hadoop cluster configuration, you must configure the [hadoop.user] user as a secure impersonation proxy user for the platform.

NOTE: The following addition must be made to your Hadoop cluster configuration file. This file must then be copied to the platform along with the other required cluster configuration files. See Configure for Hadoop.

 

In core-site.xml on the Hadoop cluster, add the following configuration, replacing the values for [hadoop.user] and [hadoop.group] with the values appropriate for your environment:

  <!-- Trifacta secure impersonation -->
  <property>
    <name>hadoop.proxyuser.[hadoop.user].hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.[hadoop.user].groups</name>
    <value>[hadoop.group]</value>
  </property>
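
After editing core-site.xml, it can help to confirm that the proxy-user keys resolve as expected. The following is a minimal sketch: the user name `trifacta` is a hypothetical stand-in for your [hadoop.user] value, and `hdfs getconf -confKey` reads the configuration files on the node where it runs, so it does not prove that running services have reloaded the change.

```shell
# Print the two proxy-user property names for a given Hadoop user.
proxyuser_keys() {
    user=$1
    echo "hadoop.proxyuser.$user.hosts"
    echo "hadoop.proxyuser.$user.groups"
}

# Look up each key only when the hdfs client is available on this node.
# "trifacta" is an assumed user name; substitute your own.
if command -v hdfs >/dev/null 2>&1; then
    for key in $(proxyuser_keys trifacta); do
        printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
    done
fi
```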

HDFS Directories

Verify that the shared upload and job results directories are owned by and writable by the [hadoop.group] group.

For more information on HDFS directories and their permissions, see Prepare Hadoop for Integration with the Platform.

Stricter directory permissions in an impersonated environment

By default, the directories and sub-directories of the locations for uploaded data and job results are set to 730 in an environment with secure impersonation enabled. This configuration allows impersonated users to do the following: 

NOTE: Stricter permissions sets can adversely affect users' ability to access shared flows.

 

  • The 7 user-permission implies that individual users have full permissions over their own directories.

    • Individual users can read only data in their own upload directory below /trifacta/uploads.

  • The 3 group-permission is used because the top-level directory is owned by the [hadoop.user] user. Each impersonated user in the group requires write and execute permissions to create and manage their own directory. This permission set implies that the group has write and execute permissions over a user's upload directory.

  • Without access-level controls, these permissions are inherited from the parent directory and have the following implications:

    • Since impersonated group users have execute permissions, they can traverse any directory in this area. Listing the contents of a directory additionally requires read permission.

    • Since impersonated group users have write permissions, they can theoretically write to any other user's upload directory, although this directory is not configurable.

Within the upload area, each user of the platform is assigned an individual directory. For simplicity, the permissions on these directories are automatically applied to the sub-directories. In an impersonated environment, an individual directory is owned by the Hadoop principal for the user, so if two or more users share the same Hadoop principal, they have theoretical access to each other's directories. This simple scheme can be replaced by a more secure method using access-level controls.
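
To make the octal notation above concrete, the following sketch decodes a three-digit mode into the familiar owner/group/other string. The function names are illustrative only. Each digit is the sum of read (4), write (2), and execute (1).

```shell
# Convert one octal digit (0-7) into its rwx triplet.
digit_to_rwx() {
    case $1 in
        0) echo '---' ;; 1) echo '--x' ;; 2) echo '-w-' ;; 3) echo '-wx' ;;
        4) echo 'r--' ;; 5) echo 'r-x' ;; 6) echo 'rw-' ;; 7) echo 'rwx' ;;
        *) echo "invalid digit: $1" >&2; return 1 ;;
    esac
}

# Convert a three-digit octal mode into owner, group, and other triplets.
mode_to_rwx() {
    mode=$1
    u=${mode%??}      # first digit: owner
    rest=${mode#?}
    g=${rest%?}       # second digit: group
    o=${mode#??}      # third digit: others
    printf '%s%s%s\n' "$(digit_to_rwx "$u")" "$(digit_to_rwx "$g")" "$(digit_to_rwx "$o")"
}

mode_to_rwx 730   # prints rwx-wx---
mode_to_rwx 700   # prints rwx------
```

Read this way, 730 gives the owner full access, gives the group write and execute but not read, and gives others nothing.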

NOTE: To enable these stricter permissions, access-level controls must be enabled on your Hadoop cluster. For more information, please see the documentation for your Hadoop distribution.

If access-level controls are enabled for your impersonated environment, you can apply stricter permissions on these sub-directories for additional security.

Steps:

The following steps apply 730 permissions to the top-level directory and 700 to all user sub-directories. With these stricter permissions on sub-directories, no one other than the user, including the [hadoop.user] user, can access the user's sub-directory.

  1. The /trifacta/uploads value is the default value for upload location in HDFS.

    1. In an individual deployment, the directory setting is defined in platform configuration. Locate the value for hdfs.pathsConfig.fileUpload.

      NOTE: The following steps can also be applied to the directory where job results are written. By default, this directory is /trifacta/queryResults. For more secure controls over job results, you should also retrieve the value for hdfs.pathsConfig.batchResults.

       

    2. Replace the values in the following steps with the value from your configuration.

  2. Before you begin, you should consider resetting all access-level controls on the upload directories and sub-directories:

    hdfs dfs -setfacl -b /trifacta/uploads


  3. The following command sets the default ACL entry for the owning group to no permissions on the uploads directory and its sub-directories. As a result, an individual group member's permissions are not automatically shared with the group:

    hdfs dfs -setfacl -R -m default:group::--- /trifacta/uploads


  4. The following command is required to enable all users to access dictionaries:

    hdfs dfs -setfacl -R -m default:group::rwx /trifacta/uploads/0


  5. The following step sets the permissions at the top level to 730:

    hdfs dfs -chmod 730 /trifacta/uploads


  6. Sub-directory permissions are a combination of these permissions and any relevant access-level controls.

  7. Apply to queryResults directory: Repeat the above steps for the /trifacta/queryResults directory as needed.

  8. ACL for Hive: If you need to apply access controls to Hive, you can use the following:

    hdfs dfs -setfacl -R -m default:user:hive:rwx /trifacta/queryResults
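
Since the same sequence applies to both the uploads and the job results directories, the steps can be sketched as a small helper that prints the commands for a given directory instead of running them. This is an illustration under assumptions: the function name is hypothetical, the dictionary-specific step for /trifacta/uploads/0 is left out because it applies only to the uploads area, and a final `hdfs dfs -getfacl` is added so that you can inspect the result.

```shell
# Print (not run) the stricter-permission commands for one HDFS directory.
# Review the output before piping it to a shell on a cluster node.
strict_permission_commands() {
    dir=$1
    echo "hdfs dfs -setfacl -b $dir"                            # reset existing ACLs
    echo "hdfs dfs -setfacl -R -m default:group::--- $dir"      # no default group access
    echo "hdfs dfs -chmod 730 $dir"                             # top-level mode
    echo "hdfs dfs -getfacl $dir"                               # show the resulting ACL
}

strict_permission_commands /trifacta/uploads
strict_permission_commands /trifacta/queryResults
```

Printing the commands first keeps the change reviewable, which matters here because resetting ACLs on a shared directory affects every user.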


User directories for YARN

For YARN deployments, each Hadoop user must have a home directory for which the user has write permissions. This directory must be located in the following location within HDFS:

/user/<username>

where:

  • <username> is the Hadoop principal to use.

NOTE: For jobs executed on the default running environment, user output directories must be created with the same permissions as you want for the transform and sampling jobs executed on the server. Users may be able to see the output directories of other users, but output job files are created with the user umask setting (hdfs.permissions.userUMask), as defined in platform configuration.

Example for Hadoop principal myUser:

hdfs dfs -mkdir /user/myUser
hdfs dfs -chown myUser /user/myUser

Optional:

hdfs dfs -chmod -R 700 /user/myUser
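
When several Hadoop principals need home directories, the example above can be generalized. The sketch below only prints the commands so that they can be reviewed first; the function name and the principal names in the usage line are hypothetical.

```shell
# Print the commands that create a YARN home directory for each principal.
home_dir_commands() {
    for principal in "$@"; do
        echo "hdfs dfs -mkdir -p /user/$principal"
        echo "hdfs dfs -chown $principal /user/$principal"
    done
}

# Example: two hypothetical principals.
home_dir_commands myUser otherUser
```

The optional `chmod -R 700` step is omitted here on purpose, since whether to lock down home directories depends on your cluster's policies.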

Platform configuration for secure impersonation

Set the following parameter to true:

"hadoopImpersonation" : true,

If you have enabled the Spark running environment for job execution, you must enable the following parameter as well:

"spark-job-service.sparkImpersonationOn" : true,

For more information, see Configure Spark Running Environment.

Umask permissions

Under secure impersonation, the platform uses two separate umask permission sets. If secure impersonation is not enabled, the platform uses the systemUmask for all operations.

NOTE: Umask settings are three-digit codes for defining the bit switches for read, write, and execute permissions for users, groups, and others (in that order) for a file or directory. These settings are inverse settings. For example, the umask value of 077 enables read, write, and execute permissions for users and disables all permissions for groups and others. For more information, see https://en.wikipedia.org/wiki/Umask.
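
Because umask values are inverse settings, it can be easier to see the resulting mode directly. The sketch below assumes the standard convention in which directories start from 777 and files from 666 before the mask is removed; the function name is illustrative.

```shell
# Compute the mode that results from applying a umask.
# Usage: apply_umask <umask> <dir|file>
apply_umask() {
    mask=$1
    kind=$2
    # Directories start from 777 and regular files from 666.
    if [ "$kind" = "dir" ]; then base=0777; else base=0666; fi
    # Clear every bit that is set in the mask.
    printf '%03o\n' $(( $base & ~0$mask ))
}

apply_umask 027 dir    # prints 750
apply_umask 027 file   # prints 640
apply_umask 077 dir    # prints 700
```

With 027, the group keeps read and execute access on directories, which is what allows shared resources to be read. With 077, only the owner retains access.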


Name | Property | Description
userUmask | hdfs.permissions.userUMask | Controls the output permissions of files and directories that are created by impersonated users. These permissions define private permission settings for individual users.
systemUmask | hdfs.permissions.systemUmask | Controls the output permissions of files and directories that are created by the platform. These permissions also control resources for the admin user and resources that are shared between platform users.

Notes:

  • In a secure impersonation environment, systemUmask should be defined as 027 (the default value), which enables read access to shared resources for all users in the [hadoop.group] group.
  • For greater security, you can set userUmask to 077, which locks down individual user directories under /trifacta/queryResults. However, secure impersonation requires more permissions on systemUmask to enable sharing of resources.
  • The permission settings for the admin user are controlled by systemUmask.

Provisioning impersonated users

NOTE: A newly created user in the platform cannot log in unless provisioned by a platform administrator, even if self-registration is enabled. The administrator must apply the Hadoop principal to the account.

 

To provision users as an administrator, log in and open the Admin Settings page from the drop-down menu at the top right. Locate the Users section and click Edit Users. If the Secure Hadoop Impersonation flag is on and Kerberos is enabled, you should see a Hadoop Principal column. From here, you can assign each user a Hadoop principal. Multiple users can share the same Hadoop principal, but each platform user must have a Hadoop principal assigned.

Figure: Editing users