(cache) xgboost package のR とpython の違い

python と xgboost で検索をかけられている方も多く見受けられるので、R とほぼ重複した内容になりますが、記事にまとめておきます。python のxgboost のインストール方法はgithub を参考にされると良いと思います。<a href="https://github.com/dmlc/xgboost/tree/master/python-package">dmlc/xgboost</a>github.com

R とpython のxgboost を使う際に感じる違い
R の利点

視覚化(visualization) が強い

自動化が簡単

early stopping が簡単に使える

python の利点

ハイパーパラメータのチューニングに hyperopt package が使用できる

現状として、R のpackage を使う方がメリットが大きいと思います。

　まず、R の方から見ていきます。python でも主要な機能は実装されていますが、変数重要度を求めたときの視覚化が未実装で(計画はあるみたいです)、変数との対応も分かりにくいです。
実際にiris data で変数重要度を求めると、R の場合には以前のブログ
Xgboost のR における具体例 (クラス分類) - puyokwの日記
に記載したように、どの変数がどのくらい貢献しているのかが分かりやすく、print でもplot でも表示されます。python では、

import numpy as np
import xgboost as xgb
import pandas as pd
from sklearn import datasets
iris=datasets.load_iris()
# python は0 base, R は1 base
# 偶数番目をトレーニングデータ
trainX=iris.data[1::2,:] 
trainY=iris.target[1::2]
# 奇数番目を検証用データ 
testX=iris.data[2::2,:] 
testY=iris.target[2::2]

# DMatrix 型に変換
trainX=pd.DataFrame(trainX)
trainX.columns=iris.feature_names
dtrain = xgb.DMatrix(trainX.as_matrix(),label=trainY.tolist())
testX=pd.DataFrame(testX)
testX.columns=iris.feature_names
dtest=xgb.DMatrix(testX.as_matrix())

np.random.seed(131) # シードを固定

# パラメータの設定
params={'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'eta': 0.3,
        'max_depth': 6, # 木の深さの最大値 [default=6]
        'min_child_weight': 1, # [default=1]
        'subsample': 1, # [default=1]
        'colsample_bytree': 1, # [default=1]
        'num_class': 3
        }
# トレーニング
bst=xgb.train(params,dtrain,num_boost_round=18)
# 変数重要度を求める
imp=bst.get_fscore()
print(imp)
# 以下のものが出力される
{'f0': 15, 'f1': 7, 'f2': 57, 'f3': 28}

これは、f(列番号)となっています。。
一応、算出はできるもののR の方が使い勝手がいいように感じます。

R のもう一つの利点の自動化は、python でも交差検定できますが、

cv=xgb.cv(params,dtrain,num_boost_round=40,nfold=10)
# 出力
(中略)
tree prunning end, 1 roots, 0 extra nodes, 0 pruned nodes ,max_depth=0
tree prunning end, 1 roots, 4 extra nodes, 0 pruned nodes ,max_depth=2
tree prunning end, 1 roots, 2 extra nodes, 0 pruned nodes ,max_depth=1
tree prunning end, 1 roots, 0 extra nodes, 0 pruned nodes ,max_depth=0
tree prunning end, 1 roots, 2 extra nodes, 0 pruned nodes ,max_depth=1
tree prunning end, 1 roots, 2 extra nodes, 0 pruned nodes ,max_depth=1
[39] cv-test-mlogloss:0.150059+0.225514 cv-train-mlogloss:0.038407+0.003179

print(cv)
# 出力
'[0] cv-test-mlogloss:0.761828+0.039446 cv-train-mlogloss:0.752364+0.004754',
(中略)
'[39] cv-test-mlogloss:0.150059+0.225514 cv-train-mlogloss:0.038407+0.003179'

ただの文字列なので、これを処理するのは面倒に感じます。ちなみに、とりあえず、予測までしてみると、

bst=xgb.train(params,dtrain,18)
pred=bst.predict(dtest)
pred=pd.DataFrame(pred)

print confusion_matrix(testY, pred.idxmax(axis=1))

# [[25  0  0]
#  [ 0 23  2]
#  [ 0  0 25]]

となります。
R では以前の記事では紹介していなかったのですが簡単に最小値を探せます。(最後の2,3 行以外は以前の記事とほぼ同じコードです)

dim(iris) # 行数:150, 列数:5
odd.n<-2*(1:75)-1
iris.train<-iris[odd.n,] # 奇数を訓練データ
iris.test<-iris[-odd.n,] # 偶数を検証データ

library(xgboost)
y <- iris.train[,5] # 目的変数
y <- as.integer(y)-1 #xgboost で既定されいるクラスは 0 base

train.x<-iris.train[,1:4]
x <- rbind(train.x,iris.test[,-5]) # xgboost を使うときのため
x <- as.matrix(x)

trind <- 1:length(y) # 先程定義したx の中の訓練データを指すのに使う
teind <- (nrow(train.x)+1):nrow(x) # 先程定義したx の中の検証用データを指すのに使う
# パラメータ設定
set.seed(131) # 固定シードで試す
param <- list("objective" = "multi:softprob", # 多クラスの分類で各クラスに所属する確率を求める
              "eval_metric" = "mlogloss", # 損失関数の設定
              "eta" = 0.3, # default= 0.3
              "max_depth" = 6, # 木の深さの最大値 [default=6]
              "min_child_weight" = 1, # [default=1]
              "subsample" = 1, # [default=1]
              "colsample_bytree" = 1, # [default=1]
              "num_class" = 3 # class がいくつ存在するのか
              )
# 最適な木の数を探す
cv.nround <- 100 # 大きめに設定する
bst.cv <- xgb.cv(param=param, data = x[trind,], label = y,  nfold = 10, nrounds=cv.nround)
min(bst.cv$test.mlogloss.mean)
# [1] 0.108445
which.min(bst.cv$test.mlogloss.mean)
# [1] 27

のようにして、mlogloss であれば、最小値とそのときのnround の値は取得できます。この値を用いて、検証データに適用すればOK です。逆に、最小値が先ほどnrounds の設定した値とほぼ同じであれば、nrounds の数を大きくしてもう一度交差検定するプログラムを書けば自動化できます。パラメータのチューニングも同様の方針でできます。
early stopping は以下のように簡単に使えます。

set.seed(131) # 再現性の確保のため再度
bst.cv <- xgb.cv(param=param, data = x[trind,], label = y,  nfold = 10, nrounds=cv.nround, early.stop.round=10)
# 出力は以下のようになります
# 一部省略
# [23]    train-mlogloss:0.032116+0.002245        test-mlogloss:0.109858+0.128204
# [24]    train-mlogloss:0.031495+0.002149        test-mlogloss:0.109487+0.128595
# [25]    train-mlogloss:0.030918+0.002035        test-mlogloss:0.108704+0.127878
# [26]    train-mlogloss:0.030378+0.001937        test-mlogloss:0.108445+0.127594
# [27]    train-mlogloss:0.029885+0.001852        test-mlogloss:0.108808+0.127736
# [28]    train-mlogloss:0.029433+0.001771        test-mlogloss:0.110306+0.129390
# [29]    train-mlogloss:0.029008+0.001705        test-mlogloss:0.110110+0.129282
# [30]    train-mlogloss:0.028601+0.001637        test-mlogloss:0.110021+0.129223
# [31]    train-mlogloss:0.028222+0.001576        test-mlogloss:0.109943+0.129480
# [32]    train-mlogloss:0.027860+0.001519        test-mlogloss:0.110860+0.130982
# [33]    train-mlogloss:0.027510+0.001469        test-mlogloss:0.110752+0.131017
# [34]    train-mlogloss:0.027191+0.001417        test-mlogloss:0.111216+0.131821
# [35]    train-mlogloss:0.026877+0.001363        test-mlogloss:0.112265+0.133605
# [36]    train-mlogloss:0.026584+0.001322        test-mlogloss:0.112999+0.135183
# Stopping. Best iteration: 27
# 予測する
nround <- which.min(bst.cv$test.mlogloss.mean)
bst <- xgboost(param=param,data=x[trind,],label=y,nrounds=nround)
pred <- predict(bst,x[teind,])
pred <- matrix(pred,3,length(pred)/3) # 今回は3クラスあるので
pred <- t(pred)
table(iris.test[,5],max.col(pred)+1)
#              2  3  4
#  setosa     25  0  0
#  versicolor  0 23  2
#  virginica   0  0 25

直近 early.stop.round 個の間test-mlogloss がすべて上昇していればそこで打ち切ります。python でearly stopping を用いるのは、今のところ面倒なので、hyperopt を用いたやり方を紹介します。kaggle の過去のコンペでその使い方が紹介されていました。
Kaggle-stuff/hyperopt_xgboost.py at master · bamine/Kaggle-stuff · GitHub
こちらはそれを参考にしたコードです。

import numpy as np
import xgboost as xgb
import pandas as pd
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split
from sklearn.metrics import log_loss
from sklearn import preprocessing
from hyperopt import hp
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
iris=datasets.load_iris()
trainX=iris.data[0::2,:] 
trainY=iris.target[0::2]
testX=iris.data[1::2,:]
testY=iris.target[1::2]

trainX=pd.DataFrame(trainX)
trainX.columns=iris.feature_names
dtrain = xgb.DMatrix(trainX.as_matrix(),label=trainY.tolist())

testX=pd.DataFrame(testX)
testX.columns=iris.feature_names
dtest=xgb.DMatrix(testX.as_matrix())

np.random.seed(131)


def score(params):
    print "Training with params : "
    print params
    num_round = int(params['n_estimators'])
    del params['n_estimators']
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_test, label=y_test)
    model = xgb.train(params, dtrain, num_round)
    predictions = model.predict(dvalid).reshape((X_test.shape[0], 3))
    score = log_loss(y_test, predictions)
    print "\tScore {0}\n\n".format(score)
    return {'loss': score, 'status': STATUS_OK}

def optimize(trials):
    space = {
             'n_estimators' : hp.quniform('n_estimators', 10, 200, 1),
             'eta' : hp.quniform('eta', 0.1, 0.5, 0.05),
             'max_depth' : hp.quniform('max_depth', 1, 10, 1),
             'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
             'subsample' : hp.quniform('subsample', 0.5, 1, 0.05),
             'gamma' : hp.quniform('gamma', 0, 1, 0.05),
             'colsample_bytree' : hp.quniform('colsample_bytree', 0.5, 1, 0.05),
             'num_class' : 3,
             'eval_metric': 'mlogloss',
             'objective': 'multi:softprob',
             'silent' : 1
             }
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=250)
    print best

X_train, X_test, y_train, y_test = train_test_split( trainX, trainY, test_size=0.5, random_state=1234)
trials = Trials()
optimize(trials)

params={'objective': 'multi:softprob',
        'eval_metric': 'mlogloss',
        'eta': 0.15,
        'max_depth': 5,
        'gamma': 0.15,
        'min_child_weight': 1,
        'subsample': 0.95,
        'colsample_bytree': 1.0,
        'num_class': 3
        }

bst=xgb.train(params,dtrain,189)
pred=bst.predict(dtest)
pred=pd.DataFrame(pred)

print confusion_matrix(testY, pred.idxmax(axis=1))
# [[25  0  0]
#  [ 0 23  2]
#  [ 0  0 25]]

　結果は、特に変わりませんでしたが、hyperopt を使ってパラメータのチューニングを行うことが、python でxgboost を使う魅力の一つになると思います。