Decision Tree Tool

One Tool Example

The Decision Tree tool has a One Tool Example. Visit Sample Workflows to learn how to access this and many other examples directly in Alteryx Designer.

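The Decision Tree tool creates a set of if-then split rules that optimize the model creation criteria based on decision tree learning methods. How the rules are formed depends on the type of the target field.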

If the target field is a member of a category set, a classification tree is constructed.
If the target field is a continuous variable, a regression tree is constructed.

Use the Decision Tree tool when the target field is predicted using one or more predictor fields, as in a classification problem or a continuous-target regression problem.
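The choice between a classification tree and a regression tree can be sketched as follows. This is only an illustration: the function name `choose_tree_type` and the rule "every non-missing value is numeric" are assumptions made for the example, not the tool's actual type detection, which depends on the field type in the data stream.

```python
from numbers import Number


def choose_tree_type(target_values):
    """Return 'regression' for an all-numeric target, else 'classification'.

    Hypothetical helper: the real tool decides from the field type.
    """
    non_missing = [v for v in target_values if v is not None]
    if not non_missing:
        raise ValueError("target field has no non-missing values")
    # bool is a subclass of int in Python, so treat it as categorical here.
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in non_missing
    )
    return "regression" if is_numeric else "classification"


print(choose_tree_type(["yes", "no", "yes"]))  # classification
print(choose_tree_type([12.5, 8.0, 30.1]))     # regression
```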

This tool uses the R tool. Go to Options > Download Predictive Tools and sign in to the Alteryx Downloads and Licenses portal to install R and the packages used by the R tool. Visit Download and Use Predictive Tools.

Connect an Input

The Decision Tree tool requires an input with the following:

A target field of interest
One or more predictor fields

The packages used in model estimation vary based on the input data stream.

An Alteryx data stream uses the open-source R rpart function.
An XDF metadata stream, coming from either an XDF Input tool or an XDF Output tool, uses the RevoScaleR rxDTree function.
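The data from a SQL Server in-database data stream uses the rxBTrees function.
Installing Microsoft Machine Learning Server lets you use the RevoScaleR rxBTrees function on data in your SQL Server or Teradata databases. This requires that both the local machine and the server are configured with Microsoft Machine Learning Server. Processing can then take place on the database server, which significantly improves performance.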

RevoScaleR Capabilities

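Compared to the open-source R functions, the RevoScaleR-based functions can analyze much larger datasets. However, the RevoScaleR-based functions have trade-offs: they must create an XDF file, which increases overhead; they use an algorithm that makes more passes over the data, which increases runtime; and they cannot create some of the model diagnostic outputs.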

Configure the Tool for Standard Processing

These options are required to generate a decision tree model.

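Model name: A name for the model that other tools can reference. The model name or prefix must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). R is case-sensitive.
Select target variable: The data field to be predicted, also known as a response or dependent variable.
Select predictor variables: The data fields used to influence the value of the target variable, also known as features or independent variables. At least one predictor field is required, but there is no upper limit on the number of predictor fields selected. The target variable itself should not be used to calculate the target value, so do not include the target field among the predictor fields. Do not use columns that contain unique identifiers, such as surrogate primary keys or natural primary keys, in statistical analyses. These columns have no predictive value and can cause runtime exceptions.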
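The model name rule above can be checked with a regular expression. The sketch below is illustrative only: `is_valid_model_name` is a hypothetical helper, and it assumes "letters" means ASCII letters, which the text does not state.

```python
import re

# Starts with a letter, then any mix of letters, digits, "." and "_".
# Assumption: "letters" is read as ASCII A-Z / a-z.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")


def is_valid_model_name(name):
    """Return True if the name satisfies the naming rule described above."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Tree_1.v2", "1tree", "my tree"]:
    print(candidate, is_valid_model_name(candidate))
```

Because R is case-sensitive, `Tree_1` and `tree_1` would be treated as two different model names.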

Select Customize to adjust additional settings.

Customize the Model

Model Tab

The options that change how the model evaluates data and is built.

Select the rpart function or the C5.0 function. Subsequent options differ depending on which algorithm you choose.

rpart: An algorithm based on the work of Breiman, Friedman, Olshen, and Stone; considered the standard. Use rpart if you are creating a regression model or if you need a pruning plot.
- Model Type and Sampling Weights: Controls for the type of model based on the target variable and the handling of sampling weights.
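  - The type of model used to predict the target variable.
    By default, the model type is selected automatically based on the target variable type.
    A classification model predicts a discrete text value of a category or group.
    A regression model predicts continuous numeric values.
  - Use sampling weights in model estimation: An option that lets you select a field that weights the importance of each record when creating a model estimation.
    If a field is used as both a predictor and a sample weight, the output weight variable field is prepended with "Right_".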
- Splitting Criteria and Surrogates: Controls for how the model determines a split and how surrogates are used in assessing data patterns. The splitting criteria to use: Select the way the model evaluates when a tree should be split.
  - The splitting criteria when using a Regression model is always Least Squares.
    Gini coefficient: The Gini impurity is used.
    Information index: An entropy-based information measure is used.
  - Select the method for using surrogates in the splitting process. Surrogates are variables related to the primary variable that are used to determine the split outcome for a record with missing information.
    Omit observations with missing value for primary split rule: The record missing the candidate variable is not considered in determining the split.
    Split records missing the candidate variable: All records missing the candidate variable are distributed evenly on the split.
    Send observation in majority direction if all surrogates are missing: All records missing the candidate variable are pushed to the side of the split that contains more records.
  - Select the criteria for choosing the best variable to split on from a set of possible variables.
    Number of correct classifications for a candidate variable: Chooses the variable to split on based on the total number of records that are correctly classified.
    Percentage of correct classifications for a candidate variable: Chooses the variable to split on based on the percentage of records that are correctly classified.
- HyperParameters: Controls for the model's prior distribution. Adjust processing based on the prior distribution.
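  - The minimum number of records needed to allow for a split: Set the number of records that must exist before a split occurs. If there are fewer records than the minimum number, no further splits are allowed.
  - The allowed minimum number of records in a terminal node: Set the minimum number of records that must be in a terminal node. A lower number increases the potential number of final terminal nodes at the end of the tree.
  - The number of folds to use in the cross-validation to prune the tree: Set the number of groups (N) the data should be divided into when testing the model. The default is 10, but other common values are 5 and 20. A higher number of folds makes the tree more accurate but can take longer to run. When the tree is pruned using a complexity parameter, cross-validation determines how many splits, or branches, are in the tree. In cross-validation, N - 1 of the folds are used to create a model, and the remaining fold is used as a sample to determine the number of branches that best fits the holdout fold, in order to avoid overfitting.
  - Select the number of levels of branches allowed from the root node to the node farthest from the root, to limit the overall size of the tree.
  - The maximum number of bins to use for each numeric variable: Enter the number of bins to use for each variable. The default value uses a formula based on the minimum number of records needed to allow for a split.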
    XDF Metadata Stream Only
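    This option only applies when the input into the tool is an XDF metadata stream. The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables through an equal-interval binning process to reduce computational complexity.
  - A value (the complexity parameter) that controls the size of the decision tree. A smaller value results in more branches in the tree, and a larger value results in fewer branches. If a complexity parameter is not selected, it is determined automatically based on cross-validation.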
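The size-limiting options above (minimum records to allow a split, minimum records in a terminal node, and maximum depth) act as stopping rules while the tree grows. The sketch below shows one way such rules combine. The function `may_split`, its parameter names, and the default values are placeholders chosen for the example, not the tool's defaults or the rpart implementation.

```python
def may_split(node_size, left_size, right_size, depth,
              min_split=20, min_bucket=7, max_depth=20):
    """Decide whether a candidate split is allowed under the size limits.

    node_size:  records in the node being considered for a split
    left_size / right_size: records each child would receive
    depth:      depth of the node (the root node has depth 0)
    The default limits are illustrative placeholders only.
    """
    if node_size < min_split:
        return False  # too few records to attempt any split
    if depth >= max_depth:
        return False  # children would exceed the allowed depth
    # Every resulting terminal node must keep at least min_bucket records.
    return left_size >= min_bucket and right_size >= min_bucket


print(may_split(100, 60, 40, depth=2))  # True
print(may_split(100, 96, 4, depth=2))   # False: right child is too small
```

Lowering `min_split` or `min_bucket` lets the tree grow more terminal nodes, which matches the behavior described in the options above.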
C5.0: An algorithm based on the work of Quinlan; use C5.0 if your data is sorted into one of a small number of mutually exclusive classes. Properties that may be relevant to the class assignment are provided, although some data may have unknown or non-applicable values.
- Structural Options: Controls for the model's structure. By default, the model is structured as a decision tree.
  - Decompose tree into rule-based model: Change the structure of the output algorithm from a decision tree into a collection of unordered, simple if-then rules. Select Threshold number of bands to group rules into to set the number of bands that rules are grouped into, where the number set is the band threshold.
- Detailed Options: Controls for the model's splits and features.
  - Model should evaluate groups of discrete predictors for splits: Group categorical predictor variables together. Select to reduce overfitting when there are important discrete attributes that have more than four or five values.
  - Use predictor winnowing (that is, feature selection): Select to simplify the model by excluding non-useful predictors.
  - Prune tree: Select to simplify the tree to reduce overfitting by removing tree splits.
  - Evaluate advanced splits in the data: Select to perform evaluations with secondary variables to confirm what branch is the most accurate prediction.
  - Select to evaluate whether boosting iterations have become ineffective and, if so, stop boosting.
- Numerical Hyperparameters: Controls for the model's prior distribution that are based on a numeric value.
  - Select number of boosting iterations: Select 1 to use a single model.
  - This option is the analog of the rpart complexity parameter.
  - Select number of samples that must be in at least 2 splits: A larger number gives a smaller, more simplified, tree.
  - Percent of data held from training for model evaluation: Select the portion of the data that is held back from training. Use the default value of 0 to train the model on all of the data. Select a larger value to hold that percentage of the data back from training and use it to evaluate model accuracy.
  - Select random seed for algorithm: Select the value of the seed. The value must be a positive integer.
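The last two options (a holdout percentage and a random seed) can be illustrated with a small sketch. This is not how C5.0 samples internally; `holdout_split` is a hypothetical helper that only shows why a fixed positive integer seed makes a random holdout reproducible.

```python
import random


def holdout_split(records, holdout_percent, seed):
    """Split records into (training, holdout) using a seeded shuffle.

    holdout_percent=0 keeps every record for training.
    """
    if not 0 <= holdout_percent < 100:
        raise ValueError("holdout_percent must be in [0, 100)")
    if not isinstance(seed, int) or seed <= 0:
        raise ValueError("seed must be a positive integer")
    rng = random.Random(seed)  # same seed -> same shuffle -> same split
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_holdout = round(len(records) * holdout_percent / 100)
    holdout_idx = set(indices[:n_holdout])
    training = [r for i, r in enumerate(records) if i not in holdout_idx]
    holdout = [r for i, r in enumerate(records) if i in holdout_idx]
    return training, holdout


train, held = holdout_split(list(range(10)), holdout_percent=20, seed=1)
print(len(train), len(held))  # 8 2
```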

Cross-validation Tab

Cross-validation: A validation method that makes efficient use of the available information.

Select Use cross-validation to determine estimates of model quality to perform cross-validation and obtain various model quality metrics and graphs. Some metrics and graphs are displayed in the static R output, and others are displayed in the interactive I output.

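Number of cross-validation folds: The number of subsamples the data is divided into for validation or training. A higher number of folds gives a more robust estimate of model quality, but fewer folds make the tool run faster.
Number of cross-validation trials: The number of times the cross-validation procedure is repeated. The folds are selected differently in each trial, and the overall results are averaged across all trials. A higher number of trials gives a more robust estimate of model quality, but fewer trials make the tool run faster.
Value of random seed: A value that determines the sequence of draws for random sampling. With the same seed, the same records within the data are chosen, although the method of selection is random and does not depend on the data. Use Select value of random seed for cross-validation to select the value of the seed. The value must be a positive integer.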
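The interaction of folds, trials, and the seed can be sketched in a few lines. This is a generic repeated k-fold scheme written for illustration, not the tool's R implementation; `repeated_kfold` and `average_quality` are hypothetical names.

```python
import random


def repeated_kfold(n_records, n_folds, n_trials, seed):
    """Yield (trial, fold, train_indices, holdout_indices) tuples.

    Each trial reshuffles the records, so folds differ between trials,
    while a fixed seed keeps the whole sequence reproducible.
    """
    rng = random.Random(seed)
    for trial in range(n_trials):
        indices = list(range(n_records))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds roughly equal groups.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, holdout in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield trial, k, train, holdout


def average_quality(scores):
    """Average a quality metric over every fold of every trial."""
    return sum(scores) / len(scores)


splits = list(repeated_kfold(n_records=10, n_folds=5, n_trials=2, seed=42))
print(len(splits))  # 10 model fits: 5 folds x 2 trials
```

The number of model fits grows as folds times trials, which is why larger values make the tool slower.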

Plots Tab

Select and configure what graphs appear in the output report.

Select to display a summary report of the model from the R output anchor. This option is selected by default.
Tree Plot: A graph of decision tree variables and branches. Use the Display tree plot toggle to include a graph of decision tree variables and branches in the model report output.
- Select whether tree branches are displayed with a uniform length or with a length proportional to the relative importance of the split in predicting the target.
- Leaf summary: Determine what is displayed on the final leaf nodes in the tree plot. Select Counts to display the number of records. Select Proportions to display the percentage of total records.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
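- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.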
Prune Plot: A simplified graph of the decision tree.
Use a prune plot in the report
- Select to display a simplified graph of the decision tree in the model report output.
- Plot size: Select if the graph is displayed in Inches or Centimeters.
- Width: Set the width of the graph using the unit selected in Plot size.
- Height: Set the height of the graph using the unit selected in Plot size.
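- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi). Lower resolution creates a smaller file and is best for viewing on a monitor. Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.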
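The plot size and resolution settings together determine how many pixels a plot contains, which is why higher resolutions produce larger files. The arithmetic can be sketched as follows. `plot_pixels` is a hypothetical helper, and whether the rendered image uses exactly these pixel counts is an assumption.

```python
BASE_DPI = 96        # the 1x resolution listed above
CM_PER_INCH = 2.54


def plot_pixels(width, height, unit="inches", resolution_multiplier=1):
    """Convert a plot size and a 1x/2x/3x resolution to pixel dimensions."""
    if resolution_multiplier not in (1, 2, 3):
        raise ValueError("resolution_multiplier must be 1, 2, or 3")
    if unit == "centimeters":
        width, height = width / CM_PER_INCH, height / CM_PER_INCH
    elif unit != "inches":
        raise ValueError("unit must be 'inches' or 'centimeters'")
    dpi = BASE_DPI * resolution_multiplier
    return round(width * dpi), round(height * dpi)


print(plot_pixels(5.5, 5.5))                       # (528, 528)
print(plot_pixels(5, 4, resolution_multiplier=3))  # (1440, 1152)
```

At 3x, each dimension triples, so the pixel count is nine times that of the 1x plot.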

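Configure the Tool for In-Database Processing

The Decision Tree tool supports in-database processing for Microsoft SQL Server 2016 and Teradata. Visit In-Database Overview for more information about in-database support and tools.

When a Decision Tree tool is placed on the canvas with another In-DB tool, the tool automatically changes to the In-DB version. To change the version of the tool, right-click the tool, point to Choose Tool Version, and select a different version of the tool. Visit Predictive Analytics for more information about predictive in-database support.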

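Required Parameters Tab

Model name: Each model needs a name so that it can be identified later.
- A specific model name: Enter the model name you want to use for the model. Model names must start with a letter and may contain letters, numbers, and the special characters period (".") and underscore ("_"). No other special characters are allowed, and R is case-sensitive.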
- Automatically generate a model name: Designer automatically generates a model name that meets the required parameters.
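Select the target variable: Select the field from the data stream that you want to predict.
Select the predictor variables: Select the fields from the data stream that you believe "cause" changes in the value of the target variable. Do not use columns that contain unique identifiers, such as surrogate primary keys or natural primary keys, in statistical analyses. These columns have no predictive value and can cause runtime exceptions.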
Use sampling weights in model estimation (Optional): Select to choose a field from the input data stream to use for sampling weight.
Select the check box and then select a weight field from the data stream to estimate a model that uses sampling weights. If a field is used as both a predictor and the weight variable, the weight variable appears in the model call in the output with the string "Right_" prepended to it.

Model Customization Tab

Model type: Select what type of model is going to be used.
- Classification: A model to predict a categorical target. If using a classification model, also select the splitting criteria.
  - Gini coefficient
  - Entropy-based Information index
- Regression: A model to predict a continuous numeric target.
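The two classification splitting criteria can be compared with a short sketch. It uses the textbook definitions of Gini impurity and entropy. The function names are chosen for the example, and the underlying R functions may add refinements, such as priors or case weights, that are not shown here.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """-sum(p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity decrease achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = ["yes", "yes", "no", "no"]
print(gini_impurity(parent))  # 0.5
print(entropy(parent))        # 1.0
print(split_gain(parent, ["yes", "yes"], ["no", "no"], gini_impurity))  # 0.5
```

Under either criterion, the candidate split with the largest impurity decrease is preferred. The two measures usually rank splits similarly but not always identically.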
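If there are fewer records than the selected minimum number along a set of branches of the tree, no further splits are allowed.
This parameter (the complexity parameter) controls how splits are carried out, that is, the number of branches in the tree. The value must be less than 1, and the smaller the value, the more branches in the final tree. A value of "auto", or omitting the value, results in the "best" complexity parameter being selected based on cross-validation.
The allowed minimum number of records in a terminal node: The smallest number of records that must be contained in a terminal node. Decreasing this number increases the potential number of final terminal nodes.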
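This group of options controls how records with missing data in the predictor variable at a particular split are handled. The first choice is to omit (remove) a record that has a missing value for the variable used in the split. The second is to use "surrogate" splits, where the direction a record is sent is based on alternative splits on one or more other variables that produce nearly the same results. The third choice is to send the observation in the majority direction at the split.
- Omit observations with missing value for primary split rule
- Use surrogates in order to split records missing the candidate variable
- If all surrogates are missing, send observation in the majority direction
- The total number of correct classifications for a potential candidate variable
- The percent correct calculated over the non-missing values of a candidate variable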
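The three ways of handling a record whose primary split variable is missing can be sketched as a routing function. This is a simplified illustration, not the rpart implementation: it assumes numeric "less than threshold goes left" splits, and `route_record`, the strategy names, and the record layout are hypothetical.

```python
def route_record(record, primary, surrogates, majority_direction, strategy):
    """Return 'left', 'right', or None (the record is not sent further).

    primary:    (variable, threshold) for the primary split rule
    surrogates: ordered list of (variable, threshold) fallback rules
    strategy:   'omit', 'surrogate', or 'majority' (hypothetical names)
    """
    variable, threshold = primary
    value = record.get(variable)
    if value is not None:
        return "left" if value < threshold else "right"
    if strategy == "omit":
        return None  # a missing primary value drops the record at this split
    # Try each surrogate in order until one has a non-missing value.
    for s_variable, s_threshold in surrogates:
        s_value = record.get(s_variable)
        if s_value is not None:
            return "left" if s_value < s_threshold else "right"
    # Every surrogate is missing as well.
    return majority_direction if strategy == "majority" else None


record = {"age": None, "income": 42000}
print(route_record(record, ("age", 30), [("income", 50000)], "right", "surrogate"))
# left
```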
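The number of folds to use in the cross validation to prune the tree: When the tree is pruned through the use of a complexity parameter, cross validation is used to determine how many splits, and thus branches, are in the tree. In cross validation, N - 1 of the folds are used to create a model, and the Nth fold is used as a sample to determine the number of branches that best fits the holdout fold, in order to avoid overfitting. You can change the number of groups (N) into which the data is divided. The default is 10, but other common values are 5 and 20.
This option limits the overall size of the tree by indicating how many levels are allowed from the root node to the node farthest from the root.
The RevoScaleR function (rxDTree) that implements the scalable decision tree handles numeric variables through an equal-interval binning process to reduce computational complexity. The choices are "Default", which uses a formula based on the minimum number of records needed to allow for a split, or a value that you set manually. This option only applies when the input into the tool is an XDF metadata stream.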
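Equal-interval binning replaces each numeric value with the index of a fixed-width interval, so the tree only has to consider bin boundaries as split points. The sketch below shows the general technique. It is not the rxDTree implementation, and `equal_interval_bins` is a hypothetical helper.

```python
def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)  # a constant column falls into one bin
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


print(equal_interval_bins([0, 2.5, 5, 7.5, 10], n_bins=4))  # [0, 1, 2, 3, 3]
```

With fewer bins, fewer candidate split points need to be evaluated, at the cost of coarser splits.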

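Graphics Options Tab

This set of options controls a number of settings related to plotting the decision tree.
- The first choice under this option is the nature of the leaf summary. This option controls whether counts or proportions are printed in the final leaf nodes in the tree plot.
  - Counts
  - Proportions
- The second choice is whether to use uniform branch distances. This option controls whether the length of the drawn tree branches reflects the relative importance of a split in predicting the target, or whether the branches in the tree plot have a uniform length.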
Plot size: Set the dimensions of the output tree plot.
- Inches: Set the Width and Height of the plot.
- Centimeters: Set the Width and Height of the plot.
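- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): The font size in points.
Select to display a simplified graph of the decision tree in the model report output.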
- Plot size: Select if the graph is displayed in Inches or Centimeters.
  - Width: Set the width of the graph using the unit selected in Plot size.
  - Height: Set the height of the graph using the unit selected in Plot size.
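- Graph resolution: Select the resolution of the graph in dots per inch: 1x (96 dpi), 2x (192 dpi), or 3x (288 dpi).
  - Lower resolution creates a smaller file and is best for viewing on a monitor.
  - Higher resolution creates a larger file with better print quality.
- Base font size (points): Select the size of the font in the graph.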

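View the Output

Connect a Browse tool to each output anchor to view results.

O anchor: Displays the model name and the size of the object in the Results window.
R anchor: Displays a summary report of the model that includes a summary and plots.
I anchor: Displays an interactive dashboard of supporting visuals that you can zoom, hover over, and click.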

Expected Behavior: Plot Precision

When using the Decision Tree tool for standard processing, the Interactive output shows greater precision with numeric values than the Report output.