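Decision Tree Tool
One Tool Example
Decision Tree has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.
The Decision Tree tool creates a set of if-then split rules to optimize model creation criteria based on decision tree learning methods. Rule formation depends on the target field type.
If the target field is a member of a category set, a classification tree is constructed.
If the target field is a continuous variable, a regression tree is constructed.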
Use the Decision Tree tool when the target field is predicted using one or more variable fields, like a classification or continuous target regression problem.
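This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Visit Download and Use Predictive Tools.
Connect an Input
The Decision Tree tool requires an input with...
A target field
2 or more predictor fields
The packages used in model estimation vary based on the input data stream.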
An Alteryx data stream uses the open-source R rpart function.
An XDF metadata stream, coming from either an XDF Input tool or an XDF Output tool, uses the RevoScaleR rxDTree function.
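Data from a SQL Server in-database data stream uses the rxBTrees function.
The Microsoft Machine Learning Server installation leverages the RevoScaleR rxBTrees function to process data in SQL Server or Teradata databases. This requires the local machine and the server to be configured with Microsoft Machine Learning Server, which allows processing on the database server and results in a significant performance improvement.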
RevoScaleR Capabilities
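Compared to the open-source R functions, the RevoScaleR-based functions can analyze much larger datasets. However, the RevoScaleR-based functions must create an XDF file, which adds overhead, use an algorithm that makes more passes over the data, which increases runtime, and cannot create some of the model diagnostic outputs.
Configure the Tool for Standard Processing
These options are required to generate a decision tree.
Model name: A name for the model that other tools can reference. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case-sensitive.
Select target variable: The data field to be predicted, also known as a response or dependent variable.
Select predictor variables: The data fields that influence the value of the target variable, also known as features or independent variables. A minimum of two predictor fields is required, but there is no upper limit on the number selected. The target variable itself should not be used to calculate the target value, so the target field should not be included among the predictor fields. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.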
Select Customize to adjust additional settings.
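The model-name rule described above can be checked before a workflow runs. The following is a minimal sketch, not Designer's own validation: the function name is hypothetical, and restricting letters to ASCII is an assumption.

```python
import re

# Assumed pattern: the first character is a letter, and every later
# character is a letter, digit, period, or underscore.
# Limiting "letter" to ASCII A-Z/a-z is an assumption of this sketch.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name: str) -> bool:
    """Return True if name follows the naming rule described above."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


# Print the verdict for a few sample names.
for candidate in ["Tree_Model.1", "1st_model", "my-model"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case-sensitive, a check like this should not normalize case: "TreeModel" and "treemodel" would refer to different models.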
Customize the Model
Model Tab
These options change how the model evaluates data and how it is built.
Choose algorithm: Select the rpart function or the C5.0 function. Subsequent options differ depending on which algorithm you choose.
rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
Model Type: The type of model used to predict the target variable.
Auto: The model type is automatically selected based on the target variable type.
Classification: The model predicts a discrete text value of a category or group.
Regression: The model predicts continuous numeric values.
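Use sampling weights in model estimation: An option that lets you select a field that weights each record when the model estimate is created.
If a field is used as both a predictor and a sampling weight, the output weight variable field name is prepended with Right_.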
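The Auto model type described above picks classification or regression from the target variable type. The sketch below shows one plausible rule under stated assumptions: numeric targets lead to regression and everything else to classification. The function name and the treatment of booleans as categories are hypothetical and may not match how the tool decides.

```python
from numbers import Real


def infer_model_type(target_values):
    """Guess the model type from target values (hypothetical rule)."""
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("target has no non-missing values")
    # bool is a subclass of int in Python, so exclude it explicitly and
    # treat True/False targets as categories.
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "regression"
    return "classification"


print(infer_model_type([12.5, 3, 7.25]))
print(infer_model_type(["yes", "no", "yes"]))
```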
Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns.
The splitting criteria to use: Select the way the model evaluates when a tree should be split.
The splitting criterion for a Regression model is always Least Squares.
Gini coefficient
The Gini impurity is used.
Information index
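To make the two classification criteria concrete, here is a small sketch that computes Gini impurity and an entropy-based information measure for a set of class labels, plus the record-weighted impurity of a candidate split. It illustrates the standard formulas only. The function names are hypothetical, and the rpart implementation may differ in detail.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) per class."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity):
    """Impurity of a split, weighting each side by its share of records."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


mixed = ["a", "a", "b", "b"]
print(gini_impurity(mixed), entropy(mixed))
```

A split is more attractive when it lowers the weighted impurity relative to the parent node. Separating the four records above into ["a", "a"] and ["b", "b"] brings both measures down to zero.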
Use surrogates to: Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
Select best surrogate split using: Select the criteria for choosing the best variable to split on from a set of possible variables.
Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
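The surrogate options above can be pictured as a fallback chain for a record whose primary split value is missing. The sketch below models only the majority-direction behavior, under simplifying assumptions: every split is a numeric threshold where smaller values go left, and surrogates are tried in order. All names are hypothetical, and this is not how rpart stores or evaluates surrogates.

```python
def route_record(record, primary, surrogates, majority_direction):
    """Return "left" or "right" for one record at a split.

    primary and each surrogate are (field, threshold) pairs. A value
    below the threshold goes left (an assumption of this sketch).
    """
    # Try the primary rule first, then each surrogate in order.
    for field, threshold in [primary, *surrogates]:
        value = record.get(field)
        if value is not None:
            return "left" if value < threshold else "right"
    # Primary and all surrogates are missing: follow the majority.
    return majority_direction


record = {"age": None, "income": 30}
print(route_record(record, ("age", 40), [("income", 50)], "right"))
```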
HyperParameters: Controls for the model's prior distribution. Adjust processing based on the prior distribution.
The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, then no further splits are allowed.
The allowed minimum number of records in a terminal node: Set the number of records that can be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The number defaults to 10, but other common values are 5 and 20. A higher number of folds gives more accuracy to the tree but may take longer to process. When the tree is pruned by using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree. In cross validation, N - 1 of the folds are used to create a model, and the other fold is used as a sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting.
The maximum allowed depth of any node in the final tree: Set the number of levels of branches allowed from the root node to the most distant node from the root to limit the overall size of the tree.
The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. By default, the value is calculated based on the minimum number of records needed to allow for a split.
XDF Metadata Stream Only
This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal interval binning process to reduce computational complexity.
Set complexity parameter: A value that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, the parameter is determined based on cross-validation.
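The equal interval binning mentioned for XDF metadata streams can be illustrated with a short sketch. It assigns each numeric value to one of a fixed number of equally wide bins between the minimum and maximum. This shows the general technique only, with hypothetical names. It is not the rxDTree implementation.

```python
def equal_interval_bins(values, n_bins):
    """Map each value to a bin index from 0 to n_bins - 1."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_interval_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5))
```

With fewer bins, the tree has fewer candidate thresholds to evaluate per numeric variable, which is the source of the reduced computation.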
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules. Select Threshold number of bands to group rules into to select a number of bands to group rules into, where the number set is the band threshold.
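As a rough picture of what a set of if-then rules looks like, the sketch below flattens a small tree into one rule per leaf by collecting the conditions along each path. The nested-dictionary tree format and all names are hypothetical. C5.0 builds its rule-based models with its own procedure, so treat this only as an illustration of the output shape.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a nested-dict tree into (conditions, prediction) rules."""
    if "prediction" in node:  # leaf node
        return [(list(conditions), node["prediction"])]
    field, threshold = node["field"], node["threshold"]
    left = tree_to_rules(node["left"], conditions + (f"{field} < {threshold}",))
    right = tree_to_rules(node["right"], conditions + (f"{field} >= {threshold}",))
    return left + right


# Hypothetical two-level tree used only for illustration.
tree = {
    "field": "age", "threshold": 30,
    "left": {"prediction": "no"},
    "right": {
        "field": "income", "threshold": 50000,
        "left": {"prediction": "no"},
        "right": {"prediction": "yes"},
    },
}

for conds, prediction in tree_to_rules(tree):
    print("IF " + " AND ".join(conds) + " THEN " + prediction)
```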
Detailed Options: Controls for the model's splits and features.
Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
Use predictor winnowing (i.e. feature selection): Select to simplify the model by attempting to exclude non-useful predictors.
Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
Use stopping method for boosting: Select to evaluate if boosting iterations are becoming ineffective and, if so, stop boosting.
Numerical Hyperparameters: Controls for the model's prior distribution that are based on a numeric value.
Select number of boosting iterations: Select 1 to use a single model.
Select confidence factor: This is the analog of rpart’s complexity parameter.
Select number of samples that must be in at least 2 splits: A larger number gives a smaller, more simplified, tree.
Percent of data held from training for model evaluation: Select the portion of the data to hold out from training. Use the default value 0 to use all of the data to train the model. Select a larger value to hold out that percent of the data from training and use it to evaluate model accuracy.
Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
Cross-validation Tab
Cross-validation: A validation method that makes efficient use of the available information.
Select Use cross-validation to determine estimates of model quality to perform cross-validation to obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the R output, and others are displayed in the I output.
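Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. A higher number of folds results in more robust estimates of model quality, but fewer folds make the tool run faster.
Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the results are averaged across all the trials. A higher number results in more robust estimates of model quality, but a lower number makes the tool run faster.
Random seed: A value that determines the sequence of draws for random sampling. The same seed causes the same records in the data to be chosen, although the selection method is random and does not depend on the data. Use Select value of random seed for cross-validation to select the value of the seed. The value must be a positive integer.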
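The interaction of folds, trials, and the random seed can be sketched as follows. Each trial shuffles the record indices with a seeded generator, splits them into folds, and holds out one fold at a time while the remaining folds form the training set. The function names and the choice to derive each trial's seed as seed + trial are assumptions of this sketch, not the tool's documented behavior.

```python
import random


def k_fold_indices(n_records, n_folds, seed):
    """Shuffle record indices with a fixed seed and deal them into folds."""
    if not 2 <= n_folds <= n_records:
        raise ValueError("n_folds must be between 2 and n_records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validation_splits(n_records, n_folds, n_trials, seed):
    """Yield (trial, train_indices, holdout_indices) for every fold."""
    for trial in range(n_trials):
        # Assumption: vary the seed per trial so folds differ by trial.
        folds = k_fold_indices(n_records, n_folds, seed + trial)
        for holdout in folds:
            held = set(holdout)
            train = [i for i in range(n_records) if i not in held]
            yield trial, train, holdout


splits = list(cross_validation_splits(10, 5, 2, seed=1))
print(len(splits))
```

Running the generator twice with the same seed yields the same splits, which is the reproducibility the random seed option provides. Model quality metrics would then be averaged over all folds and trials.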
Plots Tab
Select and configure what graphs appear in the output report.
Display static report: Select to display a summary report of the model from the R output anchor. Selected by default.
Tree Plot: A graph of decision tree variables and branches. Use the Display tree plot toggle to include a graph of decision tree variables and branches in the model report output.
Uniform branch distances: Select to display the tree branches with uniform length. Deselect to display branch lengths proportional to the relative importance of a split in predicting the target.
Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts if the number of records is displayed. Select Proportions if the percentage of total records is displayed.
Plot size: Select if the graph is displayed in Inches or Centimeters.
Width: Set the width of the graph using the unit selected in Plot size.
Height: Set the height of the graph using the unit selected in Plot size.
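Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
Lower resolution creates a smaller file and is best for viewing on a monitor.
Higher resolution creates a larger file with better print quality.
Base font size (points): Select the size of the font in the graph.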
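The plot size and resolution settings together determine how large the rendered image is. As a rough sketch, pixel dimensions follow from size in inches multiplied by dots per inch, with centimeters converted at 2.54 cm per inch. The function and the option strings are hypothetical, and whether Designer rounds the same way is an assumption.

```python
# Resolution choices listed above, mapped to dots per inch.
DPI_BY_RESOLUTION = {"1x": 96, "2x": 192, "3x": 288}
CM_PER_INCH = 2.54


def plot_size_in_pixels(width, height, unit, resolution):
    """Convert a plot size in inches or centimeters to whole pixels."""
    if unit == "inches":
        width_in, height_in = width, height
    elif unit == "centimeters":
        width_in, height_in = width / CM_PER_INCH, height / CM_PER_INCH
    else:
        raise ValueError("unit must be 'inches' or 'centimeters'")
    dpi = DPI_BY_RESOLUTION[resolution]
    return round(width_in * dpi), round(height_in * dpi)


print(plot_size_in_pixels(5.5, 5.5, "inches", "1x"))
```

Doubling the resolution doubles each pixel dimension, so the pixel count grows about fourfold, which is why higher resolutions produce larger files.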
Prune Plot: A simplified graph of the decision tree.
Use a prune plot in the report
Display prune plot: Click to include a simplified graph of the decision tree in the model report output.
Plot size: Select if the graph is displayed in Inches or Centimeters.
Width: Set the width of the graph using the unit selected in Plot size.
Height: Set the height of the graph using the unit selected in Plot size.
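Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
Base font size (points): Select the size of the font in the graph.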
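Configure the Tool for In-Database Processing
The Decision Tree tool supports Microsoft SQL Server 2016 in-database processing. Visit In-Database Overview for more information about in-database support and tools.
When a Decision Tree tool is placed on the canvas with another In-Database tool, the tool automatically changes to the In-Database version. To change the version of the tool, right-click the tool, point to Choose Tool Version, and select a different version of the tool. Visit Predictive Analytics for more information about predictive in-database support.
Required Parameters Tab
Model name: Each model needs a name so that it can be identified later.
A specific model name: Enter the model name you want to use for the model. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case-sensitive.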
Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
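Select the target variable: Select the field from the data stream that you want to predict.
Select the predictor variables: Select the fields from the data stream that you believe "cause" changes in the value of the target variable. Columns containing unique identifiers, such as surrogate primary keys and natural primary keys, should not be used in statistical analyses. They have no predictive value and can cause runtime exceptions.
Use sampling weights in model estimation (Optional): Select the checkbox, then select a weight field from the input data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.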
Model Customization Tab
Model type: Select what type of model is going to be used.
Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
Gini coefficient
Entropy-based Information index
Regression: A model to predict a continuous numeric target.
The minimum number of records needed to allow for a split: If along a set of branches of a tree there are fewer records than the selected minimum number, then no further splits are allowed.
Complexity parameter: This parameter controls how splits are carried out (in other words, the number of branches in the tree). The value should be less than 1, and the smaller the value, the more branches in the final tree. A value of "Auto", or omitting the value, results in the "best" complexity parameter being selected based on cross-validation.
The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.
Surrogate use: This group of options controls how records with missing data in the predictor variables at a particular split are addressed. The first choice is to omit (remove) a record with a missing value of the variable used in the split. The second is to use "surrogate" splits, in which the direction a record will be sent is based on alternative splits on one or more other variables with nearly the same results. The third choice is to send the observation in the majority direction at the split.
Omit observations with missing value for primary split rule
Use surrogates in order to split records missing the candidate variable
If all surrogates are missing, send the observation in the majority direction
Total number of correct classifications for a potential candidate variable
Percent correct calculated over the non-missing values of a candidate variable
The number of folds to use in the cross validation to prune the tree: When the tree is pruned through the use of a complexity parameter, cross validation is used to determine how many splits, and thus branches, are in the tree. N - 1 of the folds are used to create a model, and the Nth fold is used as a sample to determine the number of branches that best fits the holdout fold in order to avoid overfitting. You can change the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
The maximum allowed depth of any node in the final tree: This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the most distant node from the root.
The maximum number of bins to use for each numeric variable: The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables via an equal interval binning process to reduce computational complexity. The "Default" choice uses a formula based on the minimum number of records needed to allow for a split, but you can also set the value manually. This option only applies in cases where the input into the tool is an XDF metadata stream.
Plots Tab
Tree plot: This set of options controls a number of options associated with plotting a decision tree.
Leaf summary: The first choice under this option is the nature of the leaf summary. This option controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
Counts
Proportions
Uniform branch distances: The second option is whether uniform branch distances should be used. This option controls whether the length of the drawn tree branches reflect the relative importance of a split in predicting the target or are of uniform length in the tree plot.
Plot size: Set the dimensions of the output tree plot.
Inches: Set the Width and Height of the plot.
Centimeters: Set the Width and Height of the plot.
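Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
Lower resolution creates a smaller file and is best for viewing on a monitor.
Higher resolution creates a larger file with better print quality.
Base font size (points): The font size in points.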
Pruning Plot: Select to include a simplified graph of the decision tree in the model report output.
Plot size: Select if the graph is displayed in Inches or Centimeters.
Width: Set the width of the graph using the unit selected in Plot size.
Height: Set the height of the graph using the unit selected in Plot size.
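Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
Lower resolution creates a smaller file and is best for viewing on a monitor.
Higher resolution creates a larger file with better print quality.
Base font size (points): Select the size of the font in the graph.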
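View the Output
Connect a Browse tool to each output anchor to view results.
O anchor: Displays the model name and the size of the object in the Results window.
R anchor: Displays a summary report of the model that includes a summary and plots.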
I (Interactive): Displays an interactive dashboard of supporting visuals that allows you to zoom, hover, and click.
Expected Behavior: Plot Precision
When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.