どうも，接点QB です．

KDD 2017の論文を漁っていたら，「Interpretable Predictions」というタイトルに目を惹かれて表題の論文を読んだので紹介したいと思います．機械学習（教師あり）の手法は「予測・分類」を目的としていて，「説明性」に関してはあまり重視されない印象があります．しかし，実際の問題を解決しようとする際には説明性を求められるケースが多いと思います．たとえば，「この人は商品Aを購入しないグループに属する事はわかりました．じゃあ，どうすればこの人に商品Aを買わせることができますか？」といったケースがあります．

今回紹介する論文で提案されている手法は，上記のような例でも対応することが出来ます．

論文の概要

問題設定と前提条件は，

2値分類
使用するのは決定木ベースのアンサンブル分類器

です．そして，論文の目的は

negativeに分類された $x$ を変換して，positiveに分類される $x^{'}$ を作成する

ことです．

Notation

特徴量ベクトル: $x \in X \subset R^{n}$ ． $E [x] = 0, V [x] = 1$ とする．
ラベル: $x$ に対して $y \in {- 1, + 1} = Y$ が対応する．
特徴量ベクトルとラベルの対応を表す写像: $f : X \to Y$ ．分類器はこの写像を推定したもの $\hat{f}$ ．
アンサンブル学習器: 弱学習器（決定木） $T_{1}, \dots, T_{K}$ のアンサンブル学習器を $T$ とする．

手法

まず注意してほしいのは，本稿で紹介する手法は「分類器を作成すること」ではなく「作成した分類器を使って特徴量ベクトルをより良いものに変換すること」です．なので，分類器 $T$ は既に学習されているとします．

$x$ がtrue negativeすなわち $f (x) = \hat{f} (x) = - 1$ を満たすとします．このとき，取り組むタスク $x^{'} \in X such that \hat{f} (x) = + 1$ を作成することです．さらに言えば，変換されたベクトルの中でも何かしらの意味で最良のベクトルを選択するという問題に落とし込むと，求めたいベクトルは一意に定まります：

$\begin{aligned} (1) & x^{'} = \underset{x^{*}}{a r g m i n} {δ (x, x^{'}) | \hat{f} (x) = - 1 and \hat{f} (x^{*}) = + 1} . \end{aligned}$

(1)での $δ$ は，「何かしらの基準」を測るためのコスト関数です．ユークリッド距離でもコサイン類似度でも，目的によって使い分けるのが良いと思います．

Positive and Negative Paths

決定木の根から葉ノードへの経路は，一つの特徴量 $x_{i}$ に関する不等号条件の積み重ねによって形成されます．そして，葉ノードでnegativeかpositiveを分類します．

決定木 $T_{k}$ において $x$ をpositiveと分類する $j$ 番目の経路を $p_{k, j}^{+}$ で表します．簡単のため， $p_{k, j}$ は次のように表せると仮定します： $p_{k, j} = {(x_{1} \begin{matrix} \geq \\ \leq \end{matrix} θ_{1}), \dots, (x_{n} \begin{matrix} \geq \\ \leq \end{matrix} θ_{n})} .$ さらに， $P_{k}^{+} = ⋃_{j \in T_{k}} p_{k, j}$ とします．negativeの場合の $P_{k}^{-}$ も同様に定義します．

さて，このパスを指定することで入力空間から特定の領域（ $p_{k, j}^{+}$ を指定すればpositiveに分類される領域）を切り出す事ができます．たとえば， $x = [x_{1}, x_{2}]$ という特徴量ベクトルを分類する事を考えます．このとき，１つ目のノードの条件が $x_{1} \leq θ_{1}$ ，２つ目のノードでの条件が $x_{2} \leq θ_{2}$ だとすると，これらの条件から $X_{p o s i} = {(x_{1}, x_{2}) | x_{1} \leq θ_{1}, x_{2} \leq θ_{2}}$ という領域がpositiveと分類される $x$ を含む領域である事がわかります．図1で緑になっている領域が $X_{p o s i}$ に対応します．

f:id:setten-QB:20171021204449p:plain

図1

特徴量ベクトルの構成方法

[Tolomei et al., 2016] で提案されている手法は， $x$ から近くて， $\hat{f} (x^{'}) = 1$ となるような $x^{'}$ を発見する方法です．つまり(1)の $x^{'}$ を見つけるという問題になるわけですね．その具体的な方法を説明していきます．まずは $p_{k, j}^{+}$ の $ε$ -satisfactory instance $\begin{matrix} (2) & x_{j (ε)}^{+} [i] = {\begin{cases} θ_{i} - ε & x_{i} \leq θ_{i} \\ θ_{i} + ε & x_{i} > θ_{i} \end{cases} \end{matrix}$ を定義します． (2) において， $x_{j (ε)}^{+} [i]$ は $x_{j (ε)}^{+}$ の第 $i$ 要素です．また，決定木 $T_{k}$ における $x_{j (ε)}^{+}$ の集合を $Γ_{k} = ⋃_{j \in P_{k}^{+}} x_{j (ε)}^{+}$ で定義します．注意してほしいのは， $x_{j (ε)}^{+} \in Γ_{k}$ は必ずしも $\hat{f} (x_{j (ε)}^{+}) = + 1$ とはならないことです．あくまで ${\hat{h}}_{k} (x_{j (ε)}^{+}) = + 1$ ということしか言えません（ $\hat{h}$ は決定木 $k$ での予測ラベル）．

そして， $Γ = ⋃_{k = 1}^{K} Γ_{k}$ （i.e. すべての決定木（弱学習器）での $ε$ -satisfactory instance の集合）に属し $\hat{f} (x_{j (ε)}^{+}) = + 1$ を満たす $x_{j (ε)}^{+}$ の中から， $δ (x, x_{j (ε)}^{+})$ を最小にするものを $x^{'}$ とします．つまり， $x^{'} = \underset{\begin{matrix} x_{j (ε)}^{+} \in Γ \\ \hat{f} (x_{j (ε)}^{+}) = + 1 \end{matrix}}{a r g m i n} δ (x, x_{j (ε)}^{+})$ によって， $x^{'}$ を決定します．

$x_{j (ε)}^{+}$ を決定するために探索する範囲は，決定木の数 $K$ とpositiveパスの本数の $| P_{k}^{+} |$ の積で書けるので $O (K | P_{k}^{+} |)$ になります．

実装 ―Pythonでやってみる―

擬似コードは論文に記載されているので，ここではPythonで本手法を使ってみたいとおもいます．とりあえずはirisデータでやってみます．これまでの説明は2クラスの分類問題で説明してきましたが，簡単に3クラス問題に拡張できるので，3クラス分類でやりたいと思います．

まずはデータを読み込んで，Random Forestで分類器を作ります．

import numpy as np
import pandas as pd
import scipy.stats
import copy
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier


iris = datasets.load_iris()
x = iris.data
mean_x = x.mean(axis=0)
std_x = x.std(axis=1)
x = scipy.stats.zcore(x, axis=0)
y = iris.target

rfc = RandomForestClassifier()
rfc.fit(x, y)

次に，必要な関数やパラメータを決めます．

epsilon = 0.1
aim_label = 2
labels = [0, 1, 2]

def cost_func(x, y):
    return np.linalg.norm(x - y)

def value_to_label(value_list):
    """ tree_.valueからlabelを出力する関数 """
    for i in value_list:
        if i != 0:
            label = np.where(value_list==i)[0][0]
        else:
            continue
    return label

ここからが紹介した手法のメインになります．今回は適当にx[0]を選んで（ラベル $0$ ），ラベルを $2$ にすることを目指します．

x_in = x[0]
y_in = y[0]

for estimator in rfc:
    if (rfc.predict(x_in) == estimator.predict(x_in)
        and estimator.predict(x_in) == y_in):
        children_left = estimator.tree_.children_left
        children_right = estimator.tree_.children_right
        feature = estimator.tree_.feature
        threshold = estimator.tree_.threshold
        # 葉ノードのID
        leaf_nodes = np.where(estimator.tree_.feature == -2)[0]
        tmp = estimator.tree_.value[leaf_nodes].reshape(
            len(leaf_nodes), len(labels))
        leaf_labels = [value_to_label(i) for i in tmp]
        positive_leaf_nodes = leaf_nodes[np.where(np.array(leaf_labels) == 2)]
        paths = {}  # パス情報
        for leaf_node in positive_leaf_nodes:
            """
            各葉ノードに対してパスを検索する
            """
            child_node = leaf_node
            parent_node = -100  # initialize
            parent_lefts = []  # 左側親ノード
            parent_rights = []  # 右側親ノード
            while (parent_node != 0):
                if (np.where(children_left == child_node)[0].shape == (0, )):
                    parent_left = -1  # 左側親ノードが存在しない場合は-1
                    parent_right = np.where(
                        children_right == child_node)[0][0]
                    parent_node = parent_right
                elif (
                    np.where(children_right == child_node)[0].shape == (0, )):
                    parent_right = -1  # 右側親ノードが存在しない場合は-1
                    parent_left = np.where(children_left == child_node)[0][0]
                    parent_node = parent_left
                parent_lefts.append(parent_left)
                parent_rights.append(parent_right)
                """ 次のステップへの処理 """
                child_node = parent_node
            paths[leaf_node] = (parent_lefts, parent_rights)

        path_info = {}
        for i in paths:
            # print("Leaf {0}".format(i))
            node_ids = []  # node ids used in the current node
            # inequality symbols used in the current node
            inequality_symbols = []
            thresholds = []  # thretholds used in the current node
            features = []  # features used in the current node
            parents_left, parents_right = paths[i]
            for idx in range(len(parents_left)):
                if (parents_left[idx] != -1):
                    """ the child node is the left child of the parent """
                    node_id = parents_left[idx]  # node id
                    node_ids.append(node_id)
                    inequality_symbols.append(0)
                    thresholds.append(threshold[node_id])
                    features.append(feature[node_id])
                elif (parents_right[idx] != -1):
                    """ the child node is the right child of the parent """
                    node_id = parents_right[idx]
                    node_ids.append(node_id)
                    inequality_symbols.append(1)
                    thresholds.append(threshold[node_id])
                    features.append(feature[node_id])
                path_info[i] = {'node_id': node_ids,
                                'inequality_symbol': inequality_symbols,
                                'threshold': thresholds,
                                'feature': features}
        """
        make epsilon satisfactory instance
        """
        for path in path_info:
            es_instance = copy.deepcopy(x_in)
            for i in range(len(path_info[path]['feature'])):
                """
                loop by node number (this is not equal to the node index)
                """
                # feature index
                feature_idx = path_info[path]['feature'][i]
                # threshold used in the current node
                threshold_value = path_info[path]['threshold'][i]

                if x_in[:, feature_idx] <= threshold_value:
                    es_instance[:, feature_idx] = threshold_value - epsilon
                elif x_in[:, feature_idx] > threshold_value:
                    es_instance[:, feature_idx] = threshold_value + epsilon
                else:
                    print('Something is wrong.')
            if estimator.predict(es_instance) == aim_label:
                if cost_func(x_in, es_instance) < delta_mini:
                    x_out = es_instance
                    delta_mini = cost_func(x_in, es_instance)
x_out = std_x * x_out + mean_x  # transform output to original distribution

これでx_outに $x_{j (ε)}^{+}$ が入っています．

見ていただいたら分かるように，ラベル２を出力する葉ノードとそこへ至るパスを取り出すのに結構色々と書いています． scikit-learnでもっと楽にやる方法があったらぜひ教えていただきたいです．

まとめ

よく「ブラックボックス」と言われる機械学習ですが，今回紹介した手法を使うと（どうすればpositiveなラベル方向へ改善出来るのかという事がわかるという意味で）解釈性を上げる事ができます．また，Decision Treeのアンサンブル手法であるRandom ForestやGradient Boostingは非常に手軽に試せる手法で，かつ比較的高精度な分類器を用意に作ることが出来ます．そのような分類器において解釈性の向上が得られるということは，実際のデータマイニング応用にとても役立つと思われます．

setten-QBのブログ

【論文紹介】Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking【KDD 2017】

論文の概要

Notation

手法

Positive and Negative Paths

特徴量ベクトルの構成方法

実装 ―Pythonでやってみる―

まとめ