Decision Tree Tool

One Tool Example

The Decision Tree tool has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

The Decision Tree tool creates a set of if-then split rules to optimize model creation criteria based on decision tree learning methods. Rule formation is based on the target field type.

If the target field is a member of a category set, a classification tree is constructed.
If the target field is a continuous variable, a regression tree is constructed.

Use the Decision Tree tool when the target field is predicted using one or more variable fields, as in a classification problem or a continuous-target regression problem.
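The rule for choosing between a classification tree and a regression tree can be sketched in a few lines. This is only an illustration: `tree_type` is a hypothetical helper, and it assumes that a target made up entirely of numbers is continuous and that anything else is categorical. It is not how Designer inspects field types.

```python
from numbers import Real


def tree_type(target_values):
    """Return the kind of tree implied by the target values (hypothetical helper)."""
    values = list(target_values)
    if not values:
        raise ValueError("target field has no values")
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False behave like category labels, not measurements.
    is_continuous = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )
    return "regression" if is_continuous else "classification"


print(tree_type(["churn", "stay", "stay"]))  # classification
print(tree_type([12.5, 8.0, 19.25]))         # regression
```

In practice a numeric field can also hold category codes, which is why the tool lets you override the automatic choice with an explicit model type.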

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Visit Download and Use Predictive Tools.

Connect an Input

The Decision Tree tool requires an input with...

A target field of interest
2 or more predictor fields

The packages used in model estimation vary based on the input data stream.

An Alteryx data stream uses the open-source R rpart function.
An XDF metadata stream, coming from either an XDF Input tool or an XDF Output tool, uses the RevoScaleR rxDTree function.
Data from a SQL Server in-database data stream uses the rxBTrees function.
The Microsoft Machine Learning Server installation leverages the RevoScaleR rxBTrees function for data in SQL Server or Teradata databases. This requires the local machine and the server to be configured with Microsoft Machine Learning Server, which allows processing on the database server and results in a significant performance improvement.

RevoScaleR Capabilities

Compared to the open-source R functions, the RevoScaleR-based function can analyze much larger datasets. However, the RevoScaleR-based function must create an XDF file, which increases the overhead cost. It also uses an algorithm that makes more passes over the data, which increases runtime, and it cannot create some model diagnostic outputs.

Configure the Tool for Standard Processing

These options are required to generate a decision tree.

Model name: A name for the model that can be referenced by other tools. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case-sensitive.
Select the target field: The data field to be predicted, also known as a response or dependent variable.
Select the predictor variables: The data fields used to influence the value of the target variable, also known as features or independent variables. A minimum of two predictor fields is required, but there is no upper limit on the number of predictor fields selected. The target variable itself should not be used in calculating the target value, so the target field should not be included with the predictor fields. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
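The naming and field-selection rules for these required options can be checked mechanically before a workflow runs. The sketch below is a hypothetical pre-flight check, not part of Designer: the function name `check_setup` and its messages are made up for illustration, and it assumes "letters" means ASCII letters.

```python
import re

# Assumption: a letter first, then only letters, digits, "." and "_".
MODEL_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def check_setup(model_name, target, predictors):
    """Return a list of problems found in a hypothetical tool configuration."""
    problems = []
    if not MODEL_NAME.fullmatch(model_name):
        problems.append(
            "model name must start with a letter and use only "
            "letters, numbers, '.' and '_'"
        )
    if len(set(predictors)) < 2:
        problems.append("select at least two predictor fields")
    if target in predictors:
        problems.append("the target field must not be a predictor")
    return problems


print(check_setup("Tree_1.v2", "Churn", ["Age", "Plan"]))  # []
print(len(check_setup("1tree", "Churn", ["Churn"])))       # 3
```

Because R is case-sensitive, `Tree_1` and `tree_1` would be two different model names, so the check deliberately does not normalize case.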

Select Customize to adjust additional settings.

Customize the Model

Model Tab

The options that change how the model evaluates data and is built.

Choose algorithm: Select the rpart function or the C5.0 function. Subsequent options differ depending on which algorithm you choose.

rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
- Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
  - Model Type: The type of model used to predict the target variable.
    Auto: The model type is automatically selected based on the target variable type.
    Classification: The model predicts a discrete text value of a category or group.
    Regression: The model predicts continuous numeric values.
  - Use sampling weights in model estimation: An option that allows you to select a field that weights the importance placed on each record when creating a model estimation.
    If a field is used as both a predictor and a sampling weight, the weight variable appears in the output with Right_ prepended to its name.
- Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns.
  - The splitting criteria to use: Select the way the model evaluates when a tree should be split. The splitting criteria when using a Regression model is always Least Squares.
    Gini coefficient: The Gini impurity is used.
    Information index
  - Use surrogates to: Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
    Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
    Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
    Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
  - Select best surrogate split using: Select the criteria for choosing the best variable to split on from a set of possible variables.
    Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
    Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
- HyperParameters: Controls for the model's prior distribution. Adjust processing based on the prior distribution.
  - The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, then no further splits are allowed.
  - The allowed minimum number of records in a terminal node: Set the number of records that can be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
  - The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The number defaults to 10, but other common values are 5 and 20. A higher number of folds gives more accuracy to the tree but may take longer to process. When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree. In cross validation, N - 1 of the folds are used to create a model, and the other fold is used as a sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting.
  - The maximum allowed depth of any node in the final tree: Set the number of levels of branches allowed from the root node to the most distant node from the root to limit the overall size of the tree.
  - The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. By default, the value is calculated based on the minimum number of records needed to allow for a split.
    XDF Metadata Stream Only
    This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal interval binning process to reduce the computational complexity.
  - Set complexity parameter: A value that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, the parameter is determined based on cross-validation.
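The two classification splitting criteria can be illustrated with their textbook definitions. The sketch below is not rpart's implementation. It assumes the information index is the entropy-based measure, and it scores a candidate split as the parent's impurity minus the size-weighted impurity of the two child nodes. All function names are made up for the example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(split_gain(parent, left, right, gini))     # 0.5
print(split_gain(parent, left, right, entropy))  # 1.0
```

Both measures reward the same perfect split here. They tend to differ only in how strongly they penalize nodes that are nearly, but not completely, pure.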
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
- Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
  - Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules. Select Threshold number of bands to group rules into to set the number of bands that rules are grouped into; the number you set is the band threshold.
- Detailed Options: Controls for the model's splits and features.
  - Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
  - Use predictor winnowing (i.e. feature selection): Select to simplify the model by attempting to exclude non-useful predictors.
  - Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
  - Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
  - Use stopping method for boosting: Select to evaluate if boosting iterations are becoming ineffective and, if so, stop boosting.
- Numerical Hyperparameters: Controls for the model's prior distribution that are based on a numeric value.
  - Select number of boosting iterations: Select 1 to use a single model.
  - Select confidence factor: This is the analog of rpart’s complexity parameter.
  - Select number of samples that must be in at least 2 splits: A larger number gives a smaller, more simplified, tree.
  - Percent of data held from training for model evaluation: Select the portion of the data to hold out of training. Use the default value 0 to use all of the data to train the model. Select a larger value to hold that percent of data out of training and use it to evaluate model accuracy.
  - Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
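The interaction between the holdout percentage and the random seed can be sketched as follows. This is a generic illustration rather than what C5.0 does internally: `holdout_split` is a hypothetical helper, and rounding the holdout count to the nearest record is an assumption.

```python
import random


def holdout_split(records, percent_held, seed):
    """Shuffle with a fixed seed, then hold out `percent_held` percent of records."""
    if not 0 <= percent_held < 100:
        raise ValueError("percent_held must be in [0, 100)")
    if seed <= 0:
        raise ValueError("seed must be a positive integer")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_held = round(len(shuffled) * percent_held / 100)
    return shuffled[n_held:], shuffled[:n_held]  # (training, held out)


train, held = holdout_split(range(10), percent_held=20, seed=1)
print(len(train), len(held))  # 8 2
```

With the default of 0 percent, every record goes to training and nothing is held out. Reusing the same seed reproduces the same partition.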

Cross-validation Tab

Cross-validation: A validation method that makes efficient use of the available information.

Select Use cross-validation to determine estimates of model quality to perform cross-validation to obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the R output, and others are displayed in the I output.

Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.
Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the results are averaged across all the trials. A higher number of trials results in more robust estimates of model quality, but fewer trials make the tool run faster.
Random seed value: A value that determines the sequence of draws for random sampling. This causes the same records within the data to be chosen, although the method of selection is random and not data-dependent. Use Select value of random seed for cross-validation to select the value of the seed. The value must be a positive integer.
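The way folds, trials, and the seed fit together can be shown with a small repeated k-fold sketch. It is an assumption-laden illustration, not the tool's code: `cv_folds` is a hypothetical generator, and assigning every k-th shuffled record to a fold is just one simple way to form folds of near-equal size.

```python
import random


def cv_folds(n_records, n_folds, n_trials, seed):
    """Yield (trial, fold, train_idx, test_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)  # one seeded generator keeps every trial reproducible
    for trial in range(n_trials):
        order = list(range(n_records))
        rng.shuffle(order)  # folds are drawn differently in each trial
        for fold in range(n_folds):
            test_idx = order[fold::n_folds]  # every n_folds-th shuffled record
            held = set(test_idx)
            train_idx = [i for i in order if i not in held]
            yield trial, fold, train_idx, test_idx


splits = list(cv_folds(n_records=10, n_folds=5, n_trials=3, seed=1))
print(len(splits))  # 15
```

The model is fit once per fold per trial (15 fits here), which is why raising either setting makes the tool slower while giving steadier quality estimates.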

Plots Tab

Select and configure what graphs appear in the output report.

Display static report: Select to display a summary report of the model from the R output anchor. Selected by default.
Tree Plot: A graph of decision tree variables and branches. Use the Display tree plot toggle to include a graph of decision tree variables and branches in the model report output.
- Uniform branch distances: Select to display the tree branches with uniform length. Otherwise, branch lengths are proportional to the relative importance of a split in predicting the target.
- Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts if the number of records is displayed. Select Proportions if the percentage of total records is displayed.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Plot resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
Prune Plot: A simplified graph of the decision tree.
Use a prune plot in the report
- Display prune plot: Click to include a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
- Plot resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.
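The effect of plot size and resolution on the rendered image is simple arithmetic: pixels equal inches times dots per inch. The sketch below works through it with a hypothetical helper, `plot_pixels`. The 5.5-inch dimensions are example values, not documented defaults.

```python
CM_PER_INCH = 2.54


def plot_pixels(width, height, unit="inches", scale=1):
    """Convert a plot size to pixels at 96, 192, or 288 dpi (scale 1, 2, or 3)."""
    if scale not in (1, 2, 3):
        raise ValueError("scale must be 1, 2, or 3")
    if unit == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif unit != "inches":
        raise ValueError("unit must be 'inches' or 'centimeters'")
    dpi = 96 * scale
    return round(width * dpi), round(height * dpi)


print(plot_pixels(5.5, 5.5, scale=1))  # (528, 528)
print(plot_pixels(5.5, 5.5, scale=3))  # (1584, 1584)
```

Going from 1x to 3x multiplies the pixel count by nine, which is the source of the larger file size noted above.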

Configure the Tool for In-Database Processing

The Decision Tree tool supports Microsoft SQL Server 2016 in-database processing. Visit In-Database Overview for more information about in-database support and tools.

When a Decision Tree tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, select "Choose Tool Version", and select a version. Visit Predictive Analytics for more information about predictive in-database support.

Required Parameters Tab

Model name: Each model needs a name so that it can be identified later.
- A specific model name: Enter the model name you wish to use for the model. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed. Also, R is case-sensitive.
- Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
Select the target variable: Select the field from the data stream that you want to predict.
Select the predictor fields: Select the fields from the data stream that you believe "cause" changes in the value of the target variable. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Use sampling weights in model estimation (Optional): Select the check box, and then select a weight field from the input data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.

Model Customization Tab

Model type: Select what type of model is going to be used.
- Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
  - Gini coefficient
  - Entropy-based Information index
- Regression: A model to predict a continuous numeric target.
The minimum number of records needed to allow for a split: If along a set of branches of a tree there are fewer records than the selected minimum number, then no further splits are allowed.
Complexity parameter: This parameter controls how splits are carried out (in other words, the number of branches in the tree). The value should be less than 1, and the smaller the value, the more branches there are in the final tree. A value of "Auto" or omitting a value results in the "best" complexity parameter being selected based on cross-validation.
The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.
Surrogate use: This group of options controls how records with missing data in the predictor variables at a particular split are addressed. The first choice is to omit (remove) a record with a missing value of the variable used in the split. The second is to use "surrogate" splits, in which the direction a record will be sent is based on alternative splits on one or more other variables with nearly the same results. The third choice is to send the observation in the majority direction at the split.
- Omit an observation with a missing value for the primary split rule
- Use surrogates to split records missing the candidate variable
- If all surrogates are missing, send the observation in the majority direction
- The total number of correct classifications for a potential candidate variable
- The percentage correct calculated over the non-missing values of a candidate variable
The number of folds to use in the cross validation to prune the tree: When the tree is pruned through the use of a complexity parameter, cross validation is used to determine how many splits, and therefore branches, are in the tree. N - 1 of the folds are used to create a model, and the Nth fold is used as a sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting. You can change the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
The maximum allowed depth of any node in the final tree: This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the most distant node from the root.
The maximum number of bins to use for each numeric variable: The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal interval binning process to reduce the computational complexity. The "Default" choice uses a formula based on the minimum number of records needed to allow for a split, but you can also set the number of bins manually. This option only applies in cases where the input into the tool is an XDF metadata stream.
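Equal interval binning itself is easy to picture with a short sketch. The code below illustrates the general technique only. It does not reproduce rxDTree's exact rule or its default bin-count formula, neither of which is given here, and `equal_interval_bins` is a made-up name.

```python
def equal_interval_bins(values, n_bins):
    """Assign each value to one of `n_bins` equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant field falls into a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_interval_bins([0, 1, 2, 5, 9, 10], n_bins=5))  # [0, 0, 1, 2, 4, 4]
```

After binning, a split search only has to consider the boundaries between bins rather than every distinct value, which is where the reduction in computation comes from.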

Graphics Options Tab

Tree plot: This set of options controls a number of options associated with plotting a decision tree.
- Leaf summary: The first choice under this option is the nature of the leaf summary. This option controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
  - Counts
  - Proportions
- Uniform branch distances: The second option is whether uniform branch distances should be used. This option controls whether the length of the drawn tree branches reflect the relative importance of a split in predicting the target or are of uniform length in the tree plot.
Plot size: Set the dimensions of the output tree plot.
- Inches: Set the Width and Height of the plot.
- Centimeters: Set the Width and Height of the plot.
- Plot resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Pruning Plot: Select to include a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
  - Width: Set the width of the graph using the unit selected in Plot size.
  - Height: Set the height of the graph using the unit selected in Plot size.
- Plot resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi); 2x (192 dpi); or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.

View the Output

Connect a Browse tool to each output anchor to view results.

O anchor: Displays the model name and the size of the object in the Results window.
R anchor: Displays a summary report of the model that includes a summary and plots.
I (Interactive): Displays an interactive dashboard of supporting visuals that allows you to zoom, hover, and click.

Expected Behavior: Plot Precision

When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.
