Introduction to Machine Learning Theory for Software Engineers: NumPy / pandas Tutorial and Sample Code Walkthrough

1. Introduction to Machine Learning Theory for Software Engineers (ITエンジニアのための機械学習理論入門): NumPy / pandas Tutorial and Sample Code Walkthrough
Etsuji Nakai (中井悦司), Senior Solution Architect and Cloud Evangelist, Red Hat K.K.
ver1.0 2015/10/25
Copyright (C) 2015 National Institute of Informatics, All rights reserved.

3. Open source tools for data analysis
■ R (http://www.r-project.org/)
  - The tool that needs no introduction.
■ Enthought Canopy (https://www.enthought.com/products/canopy/)
  - A Python data analysis toolset that includes the following tools:
    ● NumPy : library for vectors and matrices
    ● SciPy : library for scientific computing
    ● matplotlib : library for drawing graphs
    ● pandas : provides data frames similar to those of R
    ● IPython : interactive environment
    ● scikit-learn : library for machine learning
■ Anaconda (https://www.continuum.io/)
  - A toolset similar to Canopy (scikit-learn is also available free of charge).
The book "Introduction to Machine Learning Theory" provides sample code that implements the algorithms directly with these tools.

4. About NumPy, pandas, and matplotlib
■ This document mainly covers the following libraries.
  - NumPy : provides vector and matrix operations, plus the main mathematical functions and random number generation.
  - pandas : provides data frames similar to those of R (a data structure whose rows and columns carry labels, like a spreadsheet).
  - matplotlib : draws graphs.
■ The following book is a useful reference for the details.
  - Python for Data Analysis (Wes McKinney)
  - The Japanese edition is titled 「Pythonによるデータ分析入門」.

5. Using IPython
■ In this document, all operations are performed from IPython (an interactive environment for Python).
■ From the IPython shell, "!<command>" runs an OS command.
  - "ls", "cd", "cat" and similar commands also work without the "!".
  - To edit a script, "!vi <filename>" starts the vi editor.
  - You may also edit files in an editor running in another window. To use a GUI editor on the RHEL6/CentOS6 desktop, open the folders "Home" → "ml4se" → "scripts" from the desktop, right-click a file, and choose "Open with gedit".
■ The sample code is located under "ml4se/scripts".

    $ ipython
    Python 2.7.6 | 64-bit | (default, Sep 15 2014, 17:36:10)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.3.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: ls ml4se
    LICENSE  README.md  config_centos.sh  config_mac.sh  config_win.bat  scripts/

    In [2]: cd ~/ml4se/scripts
    /home/user01/ml4se/scripts

    In [3]: !vi 01-square_error.py

6. Using IPython
■ Run a script with the "%run" command.

    In [4]: %run 01-square_error.py
    Table of the coefficients
            M=0        M=1         M=3             M=9
    0 -0.012133   0.737922    0.005026        0.021570
    1       NaN  -1.500112    9.633393     -121.926645
    2       NaN        NaN  -28.282723     2897.187668
    3       NaN        NaN   18.422900   -25036.071571
    4       NaN        NaN         NaN   110826.637881
    5       NaN        NaN         NaN  -282565.729927
    6       NaN        NaN         NaN   431648.816158
    7       NaN        NaN         NaN  -390194.283125
    8       NaN        NaN         NaN   192486.163220
    9       NaN        NaN         NaN   -39940.969290

■ File names can be completed with the [Tab] key, and the [↑] and [↓] keys recall earlier commands from the history.
■ To paste the clipboard contents and run them, use the "%paste" command.

    In [5]: %paste

■ To quit IPython, type "exit".

    In [5]: exit

7. Using IPython
■ This environment is assumed to have the following startup script (commands that run automatically when IPython starts).
  - It imports the numpy, matplotlib.pyplot, and pandas libraries so that they can be referenced by the short names np, plt, and pd, respectively.
  - The classes pandas.DataFrame and pandas.Series can be referenced as DataFrame and Series, without the library name.
  Note: The sample code always includes these lines at the top.

  ~/.ipython/profile_default/startup/00-setup.py

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas import Series, DataFrame

9. Numerical computation with IPython
■ Entering an expression at the IPython prompt displays its result.
  - Use "**" for exponentiation. The previous result can be referenced with "_".

    In [1]: 2*(1+3)
    Out[1]: 8
    In [2]: 2**10
    Out[2]: 1024
    In [3]: _ * 2
    Out[3]: 2048

  - In the Python 2.7 environment used here, a number without a decimal part is treated as an integer. To compute with real (floating-point) values, write the decimal part explicitly or convert the number with float().
  - To be safe, get into the habit of appending ".0" to numbers that should be treated as real values. When real and integer types are mixed, the computation is carried out in the real type.

    In [1]: 1/2
    Out[1]: 0            <- computed as integers
    In [2]: 1.0/2
    Out[2]: 0.5
    In [3]: float(1)/2
    Out[3]: 0.5
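
The integer-division result above is specific to Python 2. As a small sketch for readers running Python 3 (an assumption; the slides use Python 2.7.6), "/" always performs true division and "//" performs floor division. The variable names are illustrative only.

```python
# Under Python 3, "/" is true division and "//" is floor division.
true_quotient = 1 / 2       # 0.5 under Python 3 (0 under Python 2)
floor_quotient = 1 // 2     # 0 under both Python 2 and Python 3
converted = float(1) / 2    # 0.5 under both
```

Under Python 2, placing "from __future__ import division" at the top of a script gives "/" the same true-division behaviour.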

10. Using the mathematical functions provided by NumPy
■ The functions and constants provided by NumPy are available.
  - In this environment, they are referenced through the short name "np".

    In [7]: np.pi
    Out[7]: 3.141592653589793
    In [8]: np.e
    Out[8]: 2.718281828459045
    In [9]: np.sin(np.pi/4)
    Out[9]: 0.70710678118654746
    In [10]: np.sqrt(2)
    Out[10]: 1.4142135623730951

  - When a list (or an array) is passed to a NumPy function, the function returns an array holding the result for each element. (The difference between a list and an array is explained later.)
  - This property is needed later when drawing graphs of functions. When you define your own functions, try to give them the same property (passing a list returns an array).

    In [1]: np.sqrt([0,1,2,3])
    Out[1]: array([ 0.        ,  1.        ,  1.41421356,  1.73205081])

11. Scatter plots and line graphs
■ This slide displays a scatter plot and a line graph with matplotlib.
  - For a scatter plot, prepare a "list of x coordinates" and a "list of y coordinates" for the data and pass them to plt.scatter(). Note that you do not pass a "list of (x, y) coordinates".

    In [1]: data_x = [0.0, -0.95, -0.59, 0.59, 0.95]
    In [2]: data_y = [1.0, 0.31, -0.81, -0.81, 0.31]
    In [3]: plt.scatter(data_x,data_y)
    Out[3]: <matplotlib.collections.PathCollection at 0x3e58d90>

  - For a line graph, prepare a "list of x coordinates" and a "list of y coordinates" in the same way and pass them to plt.plot().

    In [1]: data_x = [0,1,2,3,4,5]
    In [2]: data_y = [0,1,4,9,16,25]
    In [3]: plt.plot(data_x,data_y)
    Out[3]: [<matplotlib.lines.Line2D at 0x4f40e50>]

  - How to improve the appearance of the graphs is explained later.

12. Scatter plots and line graphs
  - To draw a smooth graph of a function, prepare a sufficiently fine-grained "list of x coordinates" and compute the corresponding "list of y coordinates".
  - np.linspace() is convenient for generating the "list of x coordinates". The computation of the "list of y coordinates" (data_y) relies on the property that passing a list (array) to the function yields an array.

    In [1]: data_x = np.linspace(0,1,101)      <- generates 101 real values that divide [0,1] into 100 intervals
    In [2]: data_y = np.sin(2.0*np.pi*data_x)
    In [3]: plt.plot(data_x,data_y)
    Out[3]: [<matplotlib.lines.Line2D at 0x4c71890>]

  - np.arange() can be used instead of np.linspace().

    In [1]: data_x = np.arange(0,1.01,0.01)
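
As a quick check of what np.linspace() returns, the following sketch inspects the endpoints and the spacing of the generated points. The variable names first, last, and steps are illustrative only.

```python
import numpy as np

data_x = np.linspace(0, 1, 101)          # 101 points, both endpoints included
data_y = np.sin(2.0 * np.pi * data_x)    # element-wise sine, same shape as data_x

first, last = data_x[0], data_x[-1]      # the two endpoints of the interval
steps = np.diff(data_x)                  # spacing between neighbouring points
```

np.linspace() takes the number of points and includes the end point, whereas np.arange() takes a step and excludes the end point. With a non-integer step the length of an np.arange() result can be sensitive to floating-point rounding, so np.linspace() is often the safer choice for plotting grids.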

14. Matrix and vector computation
■ Matrices and vectors are represented by NumPy array objects.
  - Passing a two-dimensional list to np.array() returns the corresponding array object.
  - Operations such as the matrix product and the matrix inverse, which are not available for ordinary two-dimensional lists, are provided. The matrix product and the inverse are computed with np.dot() and np.linalg.inv(), respectively.
  - The transpose is obtained with the T attribute.

    In [1]: t = np.pi / 3
    In [2]: m = np.array([[np.cos(t),-np.sin(t)],[np.sin(t),np.cos(t)]])
    In [3]: m
    Out[3]:
    array([[ 0.5      , -0.8660254],
           [ 0.8660254,  0.5      ]])
    In [4]: np.dot(m, m)
    Out[4]:
    array([[-0.5      , -0.8660254],
           [ 0.8660254, -0.5      ]])
    In [5]: np.linalg.inv(m)
    Out[5]:
    array([[ 0.5      ,  0.8660254],
           [-0.8660254,  0.5      ]])
    In [6]: m.T
    Out[6]:
    array([[ 0.5      ,  0.8660254],
           [-0.8660254,  0.5      ]])

  Note: As the last two outputs illustrate, the inverse of a rotation matrix equals its transpose: $M^{-1} = M^{\mathrm{T}}$.
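
The properties visible in the outputs above can be checked numerically. This is a minimal sketch using np.allclose() for a tolerance-based comparison; the names m2, inverse_equals_transpose, and double_rotation_ok are illustrative only.

```python
import numpy as np

t = np.pi / 3
m = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

# For a rotation matrix, the inverse should match the transpose.
inverse = np.linalg.inv(m)
inverse_equals_transpose = np.allclose(inverse, m.T)

# Applying the rotation twice should match a rotation by the angle 2t.
m2 = np.array([[np.cos(2 * t), -np.sin(2 * t)],
               [np.sin(2 * t),  np.cos(2 * t)]])
double_rotation_ok = np.allclose(np.dot(m, m), m2)
```

Floating-point results are compared with np.allclose() rather than "==", because the computed inverse may differ from the transpose in the last few digits.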

15. Matrix and vector computation
  - Defining a vector as an $N \times 1$ matrix makes it possible to compute its product with a matrix, as well as inner and outer products.

    In [7]: x = np.array([[1],[0]])
    In [7]: x
    Out[7]:
    array([[1],
           [0]])
    In [8]: n = np.dot(m,x)
    In [9]: n
    Out[9]:
    array([[ 0.5      ],
           [ 0.8660254]])

  - The inner product and the outer product of vectors are computed as follows.

    In [1]: a = np.array([[-1],[0],[1]])
    In [2]: b = np.array([[2],[3],[5]])
    In [3]: np.dot(a.T, b)
    Out[3]: array([[3]])
    In [4]: np.dot(a, b.T)
    Out[4]:
    array([[-2, -3, -5],
           [ 0,  0,  0],
           [ 2,  3,  5]])
    In [5]: np.dot(a.T, b)[0][0]       <- extracting the scalar by indexing the element
    Out[5]: 3

  Note: The rules that apply when a vector is defined as a one-dimensional list are explained later.

16. Broadcast rule
■ Applying a scalar operation to an array applies the operation to each element. This is called the broadcast rule.
  - Scalar multiplication of a matrix or a vector follows naturally from the broadcast rule.

    In [1]: m = np.array([[1,2],[3,4]])
    In [2]: m
    Out[2]:
    array([[1, 2],
           [3, 4]])
    In [3]: 2*m
    Out[3]:
    array([[2, 4],
           [6, 8]])
    In [4]: m*2
    Out[4]:
    array([[2, 4],
           [6, 8]])

  - The following operations are unnatural as mathematics, but the broadcast rule applies to them as well.

    In [6]: m**2
    Out[6]:
    array([[ 1,  4],
           [ 9, 16]])
    In [7]: m+10
    Out[7]:
    array([[11, 12],
           [13, 14]])

  Note: The following computation gives different results for a list and for an array. Multiplying a list repeats it, whereas multiplying an array multiplies each element.

    In [1]: [1,2,3] * 2
    Out[1]: [1, 2, 3, 1, 2, 3]
    In [2]: np.array([1,2,3]) * 2
    Out[2]: array([2, 4, 6])

17. Broadcast rule
  - With the broadcast rule, it is easy to write a function that returns an array for a list or array argument.

    In [1]: def square(x):
       ....:     if isinstance(x, list):
       ....:         x = np.array(x)
       ....:     return x**2
       ....:
    In [2]: square(3)
    Out[2]: 9
    In [3]: square([1,2,3])
    Out[3]: array([1, 4, 9])
    In [4]: square(np.array([1,2,3]))
    Out[4]: array([1, 4, 9])

  - The example above converts a list to an array, but this step can be omitted when the argument is known to always be an array.

    In [1]: def square(x):
       ...:     return x**2
       ...:
    In [2]: square(np.array([1,2,3]))
    Out[2]: array([1, 4, 9])
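
As a possible variation (not the slide's own implementation), the explicit isinstance() check can be replaced by np.asarray(), which converts lists, tuples, and scalars to arrays and passes existing arrays through unchanged.

```python
import numpy as np

def square(x):
    # np.asarray converts lists, tuples and scalars to an array;
    # an existing array is passed through without being copied.
    return np.asarray(x) ** 2

from_scalar = square(3)
from_list = square([1, 2, 3])
from_array = square(np.array([1, 2, 3]))
```

One difference from the slide's version: a scalar argument comes back as a NumPy scalar rather than a plain Python int, which usually makes no practical difference in numerical code.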

18. Broadcast rule
■ An arithmetic operation between two arrays of the same size is applied between the corresponding elements.
  - Matrix addition and subtraction are computed naturally.

    In [5]: a = np.array([[10,20],[30,40]])
    In [6]: b = np.array([[1,2],[3,4]])
    In [7]: a
    Out[7]:
    array([[10, 20],
           [30, 40]])
    In [8]: b
    Out[8]:
    array([[1, 2],
           [3, 4]])
    In [9]: a+b
    Out[9]:
    array([[11, 22],
           [33, 44]])
    In [10]: a-b
    Out[10]:
    array([[ 9, 18],
           [27, 36]])

  - Operations such as the following are also possible.

    In [11]: a**b
    Out[11]:
    array([[     10,     400],
           [  27000, 2560000]])

  Note: The broadcast rule also applies, according to fixed rules, to operations between arrays of different sizes. The results can be hard to follow intuitively, so it is better to avoid such operations where possible.
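
To illustrate why operations between arrays of different sizes can be surprising, here is a minimal sketch with a (2, 2) array combined with a one-dimensional array and with a (2, 1) array. The arrays row and col are illustrative only.

```python
import numpy as np

a = np.array([[10, 20],
              [30, 40]])           # shape (2, 2)
row = np.array([1, 2])             # shape (2,)
col = np.array([[1], [2]])         # shape (2, 1)

# A 1-D array is lined up with the last axis, so it is added to every row.
a_plus_row = a + row               # [[11, 22], [31, 42]]
# A (2, 1) array is stretched along the columns, so it is added to every column.
a_plus_col = a + col               # [[11, 21], [32, 42]]
```

The same two numbers produce different results depending only on the shape of the smaller array, which is the kind of behaviour the note above warns about.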

19. Creating and reshaping array objects
■ The following are common patterns for creating and reshaping array objects.
  - np.zeros() and np.ones() return an array whose elements are all 0 or all 1. Pass a tuple (rows, columns) giving the size of the matrix.

    In [1]: np.zeros((3,3))
    Out[1]:
    array([[ 0.,  0.,  0.],
           [ 0.,  0.,  0.],
           [ 0.,  0.,  0.]])
    In [2]: np.ones((2,3))
    Out[2]:
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])

  - The reshape() method changes the number of rows and columns of an existing array object. The current size is available through the shape attribute.

    In [1]: a = np.array([1,2,3,4,5,6])
    In [2]: a
    Out[2]: array([1, 2, 3, 4, 5, 6])
    In [3]: b = a.reshape((2,3))
    In [4]: b
    Out[4]:
    array([[1, 2, 3],
           [4, 5, 6]])
    In [8]: c = b.reshape((3,2))
    In [9]: c
    Out[9]:
    array([[1, 2],
           [3, 4],
           [5, 6]])
    In [10]: b.shape
    Out[10]: (2, 3)
    In [11]: c.shape
    Out[11]: (3, 2)

20. Creating and reshaping array objects
  - np.vstack() and np.hstack() join two arrays vertically and horizontally, respectively.

    In [1]: a = np.ones(9).reshape((3,3))
    In [2]: b = a*2
    In [3]: a
    Out[3]:
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])
    In [4]: b
    Out[4]:
    array([[ 2.,  2.,  2.],
           [ 2.,  2.,  2.],
           [ 2.,  2.,  2.]])
    In [5]: np.vstack((a,b))
    Out[5]:
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.],
           [ 2.,  2.,  2.],
           [ 2.,  2.,  2.],
           [ 2.,  2.,  2.]])
    In [6]: np.hstack((a,b))
    Out[6]:
    array([[ 1.,  1.,  1.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  2.,  2.,  2.],
           [ 1.,  1.,  1.,  2.,  2.,  2.]])

21. Creating and reshaping array objects
  - reshape() converts a one-dimensional list into a vector represented as a two-dimensional array.

    In [1]: x = [1,2,3,4]
    In [2]: np.array(x).reshape(len(x),1)
    Out[2]:
    array([[1],
           [2],
           [3],
           [4]])

  Note: The same conversion can also be done as follows.

    In [3]: np.array([x])
    Out[3]: array([[1, 2, 3, 4]])
    In [4]: np.array([x]).T
    Out[4]:
    array([[1],
           [2],
           [3],
           [4]])

  - An array holding an arithmetic sequence is generated with np.arange(). np.arange(x, y, s) generates the sequence from x to y with common difference s. Note that the end point y is not included in the sequence.

    In [1]: np.arange(0, 1, 0.1)
    Out[1]: array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])

22. np.dot() with one-dimensional arrays
■ When a one-dimensional array is passed to np.dot(), it is interpreted as a column vector or a row vector depending on the context.
  - Two one-dimensional arrays give the inner product.

    In [4]: a
    Out[4]: array([-1,  0,  1])
    In [5]: b
    Out[5]: array([1, 2, 3])
    In [6]: np.dot(a,b)
    Out[6]: 2
    In [7]: np.dot(b,a)
    Out[7]: 2

  - A two-dimensional array and a one-dimensional array give the matrix product.

    In [8]: m
    Out[8]:
    array([[1, 1, 1],
           [2, 2, 2],
           [3, 3, 3]])
    In [9]: np.dot(m,a)
    Out[9]: array([0, 0, 0])
    In [10]: np.dot(a,m)
    Out[10]: array([2, 2, 2])

  Note: For other combinations, the result may not match your intuition, so it is better not to rely on them.
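
The session above starts after a, b, and m have been defined. The following self-contained sketch recreates them from the displayed outputs (so the definitions are an inference, not part of the original session) and repeats the three products.

```python
import numpy as np

# Definitions inferred from the outputs shown in the session.
a = np.array([-1, 0, 1])
b = np.array([1, 2, 3])
m = np.array([[1, 1, 1],
              [2, 2, 2],
              [3, 3, 3]])

inner = np.dot(a, b)        # 1-D with 1-D: inner product, a scalar
m_times_a = np.dot(m, a)    # 2-D with 1-D: a is treated as a column vector
a_times_m = np.dot(a, m)    # 1-D with 2-D: a is treated as a row vector
```

In both matrix-vector cases the result is again a one-dimensional array of shape (3,), not a 3×1 or 1×3 matrix.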

24. Drawing random numbers from a uniform distribution
■ rand() and randint(), both provided by the numpy.random module, return uniform random numbers in a specified range. Many random numbers can be generated at once.
  - rand() generates the requested number of real-valued random numbers in the range $0 \le x < 1$ (1 is not included) and returns them as an array of the specified size.

    In [1]: from numpy.random import rand, randint
    In [2]: rand()
    Out[2]: 0.19644145945572267
    In [3]: rand(5)
    Out[3]: array([ 0.64765831,  0.41288461,  0.21530768,  0.3225688 ,  0.55119995])
    In [4]: rand(2,3)
    Out[4]:
    array([[ 0.63193604,  0.48647432,  0.06980617],
           [ 0.32513886,  0.29350987,  0.58432974]])

  - Similarly, randint() generates the requested number of integer random numbers in the specified range. The following example generates random numbers in the range 1 to 6 (note that 7 is not included).

    In [5]: randint(1,7)
    Out[5]: 1
    In [6]: randint(1,7,10)
    Out[6]: array([2, 4, 3, 6, 4, 6, 1, 6, 2, 3])
    In [7]: randint(1,7,(3,5))
    Out[7]:
    array([[3, 1, 3, 1, 5],
           [2, 5, 5, 1, 1],
           [1, 3, 6, 5, 2]])

25. Drawing random numbers from a normal distribution
■ normal() in the numpy.random module returns random numbers drawn from a normal distribution.
  - Specify loc (mean), scale (standard deviation), and size (size of the array) as follows. As in the last example, the parameter names can be omitted.

    In [1]: from numpy.random import normal
    In [2]: normal(loc=0,scale=3,size=10)
    Out[2]:
    array([-0.45405421,  1.03407066, -6.06638636, -1.47014096,  2.4127684 ,
           -0.60084586, -2.20008908, -2.49174201, -5.53419474, -2.99053036])
    In [3]: normal(loc=0,scale=3,size=(3,2))
    Out[3]:
    array([[ 4.94332685, -2.65134418],
           [-5.97073959, -0.94864428],
           [-1.04192588,  2.29266043]])
    In [4]: normal(0,3,(3,2))
    Out[4]:
    array([[ 4.55288412, -3.28343893],
           [ 2.981074  , -2.05497678],
           [ 0.08205987,  0.77338863]])

  - The following generates 1000 random numbers and displays their histogram.

    In [5]: val = normal(10,3,size=1000)
    In [6]: plt.hist(val, bins=20)

26. Drawing random numbers from a normal distribution
■ multivariate_normal() in the numpy.random module returns random numbers drawn from a multivariate normal distribution.
  - Specify mean (mean), cov (variance-covariance matrix), and size (size of the array) as follows. The parameter names can be omitted.

    In [1]: from numpy.random import multivariate_normal
    In [2]: c = np.array([[5,3],[3,5]])
    In [3]: c
    Out[3]:
    array([[5, 3],
           [3, 5]])
    In [4]: multivariate_normal(mean=[50,10],cov=c,size=4)
    Out[4]:
    array([[ 50.27035138,   8.29749294],
           [ 48.73954203,   9.87872197],
           [ 49.26177223,  10.40957016],
           [ 53.91936612,  13.33800988]])

  - The following generates 200 random numbers and displays them as a scatter plot.

    In [5]: vals = multivariate_normal([50,10],c,200)
    In [6]: data_x = [x for (x,y) in vals]
    In [7]: data_y = [y for (x,y) in vals]
    In [8]: plt.scatter(data_x, data_y)
    Out[8]: <matplotlib.collections.PathCollection at 0x58f4790>
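
As an alternative to the list comprehensions, the x and y coordinates can be taken directly as columns of the returned array. This is a small sketch assuming the covariance matrix [[5, 3], [3, 5]] shown in the session output.

```python
import numpy as np
from numpy.random import multivariate_normal

c = np.array([[5, 3],
              [3, 5]])                          # variance-covariance matrix
vals = multivariate_normal([50, 10], c, 200)    # array of shape (200, 2)

data_x = vals[:, 0]    # first column: x coordinates
data_y = vals[:, 1]    # second column: y coordinates
```

Both data_x and data_y are one-dimensional arrays of length 200 and can be passed to plt.scatter() in the same way as the lists.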

27. Generating the same random numbers every time
■ Setting the random seed with np.random.seed() makes it possible to generate the same random numbers each time.
  - Specify a 32-bit integer as the seed value.
  - In the session below, randint refers to numpy.random.randint, imported beforehand.

    In [1]: np.random.seed(10)
    In [2]: np.random.randint(1,10,10)
    Out[2]: array([5, 1, 2, 1, 2, 9, 1, 9, 7, 5])
    In [3]: randint(1,10,10)
    Out[3]: array([4, 1, 5, 7, 9, 2, 9, 5, 2, 4])
    In [4]: np.random.seed(10)
    In [5]: randint(1,10,10)
    Out[5]: array([5, 1, 2, 1, 2, 9, 1, 9, 7, 5])
    In [6]: randint(1,10,10)
    Out[6]: array([4, 1, 5, 7, 9, 2, 9, 5, 2, 4])
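
The reproducibility shown in the session can be verified programmatically. This sketch resets the seed and compares two runs; the variable names are illustrative only, and the concrete values drawn are deliberately not asserted.

```python
import numpy as np

np.random.seed(10)
first_run = np.random.randint(1, 10, 10)

np.random.seed(10)                      # reset to the same seed
second_run = np.random.randint(1, 10, 10)

# With the same seed, the two sequences should be identical.
same_sequence = np.array_equal(first_run, second_run)
```

Fixing the seed is useful when results need to be reproduced exactly, for example when comparing two versions of the same script.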

29. Drawing multiple graphs
■ The following is the standard pattern for dividing the drawing window and arranging several graphs in it.
  - fig.add_subplot(y, x, c) returns an object representing a drawing position (a subplot), to which methods such as scatter() and plot() are applied. Positions are numbered from "1" at the top left, proceeding to the right first and then down to the next row.

    fig = plt.figure()                    <- get the drawing window object
    subplot = fig.add_subplot(2,3,1)      <- 1st position in a window divided into 2 rows x 3 columns
    subplot.plot(data1_x, data1_y)
    subplot = fig.add_subplot(2,3,2)      <- 2nd position in a window divided into 2 rows x 3 columns
    subplot.scatter(data2_x, data2_y)
    ...
    fig.show()                            <- display the drawing window

    Layout of the positions:
      1 2 3
      4 5 6

  - The following example sets the title, the x-axis and y-axis ranges, and the x-axis and y-axis labels of a subplot.

    subplot.set_title('ROC graph')
    subplot.set_xlim([0, 1])
    subplot.set_ylim([0, 1])
    subplot.set_xlabel("False positive rate")
    subplot.set_ylabel("True positive rate")
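
The numbering of the drawing positions can be confirmed by inspecting where each subplot is placed. The sketch below uses the off-screen "Agg" backend (an assumption, so that it runs without a display) and compares positions in figure coordinates; the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")          # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig = plt.figure()
# Create all six subplots of a 2-row x 3-column layout, numbered 1 to 6.
axes = [fig.add_subplot(2, 3, i) for i in range(1, 7)]

# Position of subplots 1, 2 and 4 in figure coordinates (origin at the bottom left).
p1, p2, p4 = (axes[i].get_position() for i in (0, 1, 3))
second_is_right_of_first = p2.x0 > p1.x0 and abs(p2.y0 - p1.y0) < 1e-9
fourth_is_below_first = p4.y0 < p1.y0 and abs(p4.x0 - p1.x0) < 1e-9
```

Subplot 2 sits to the right of subplot 1 on the same row, and subplot 4 starts the second row directly below subplot 1.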

30. Dice simulation
■ The following code simulates the result of rolling two dice 100 times.

    # -*- coding: utf-8 -*-
    #
    # Dice simulation
    #

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas import Series, DataFrame
    from numpy.random import randint

    # Main
    if __name__ == '__main__':
        fig = plt.figure()
        # Results of rolling two dice 100 times
        dices = randint(1,7,(100, 2))
        # Sum of the two dice
        total = np.sum(dices, axis=1)
        # Number of doublets for each face
        doublets = [0,0,0,0,0,0]
        for (x, y) in dices:
            if x == y:
                doublets[x-1] += 1
        # Number of occurrences of each combination of faces
        counts = np.zeros((6,6))
        for (x, y) in dices:
            counts[y-1, x-1] += 1

        subplot = fig.add_subplot(1,3,1)
        subplot.set_title('Sum of 2dices')
        subplot.set_xlabel('Total')
        subplot.set_ylabel('Count')
        subplot.set_xlim(1,13)
        subplot.hist(total, bins=11, range=(2,13), align='left', label='Sum')

        subplot = fig.add_subplot(1,3,2)
        subplot.set_title('Doublets counts')
        subplot.set_xlabel('Number')
        subplot.set_ylabel('Count')
        subplot.set_xlim(0.5, 6.5)
        subplot.bar(range(1,7), doublets, align='center')

        subplot = fig.add_subplot(1,3,3)
        subplot.set_title('Pair counts')
        subplot.set_xlabel('Dice1')
        subplot.set_ylabel('Dice2')
        subplot.imshow(counts, origin='lower', extent=(0.5,6.5,0.5,6.5),
                       interpolation='nearest')
        fig.show()

31. Dice simulation
■ Running the dice simulation code displays the following statistics as graphs.
  - The number of occurrences of each sum of the two dice
  - The number of doublets for each face
  - A comparison of the counts for each combination of faces. This kind of graph is called a "heat map": the larger the value, the "hotter" the color.

32. Dice simulation
■ The following shows concrete examples of the values computed in the code.

    In [1]: from numpy.random import randint
    In [2]: %paste
    # Results of rolling two dice 100 times
    dices = randint(1,7,(100, 2))
    # Sum of the two dice
    total = np.sum(dices, axis=1)
    # Number of doublets for each face
    doublets = [0,0,0,0,0,0]
    for (x, y) in dices:
        if x == y:
            doublets[x-1] += 1
    # Number of occurrences of each combination of faces
    counts = np.zeros((6,6))
    for (x, y) in dices:
        counts[y-1, x-1] += 1
    ## -- End pasted text --

    In [3]: dices
    Out[3]:
    array([[2, 2],
           [5, 4],
           [2, 2],
           [4, 2],
           [2, 6],
           ...(omitted)...
           [4, 4],
           [5, 3]])
    In [4]: total
    Out[4]:
    array([ 4,  9,  4,  6,  8,  6,  8,  9,  5, 10,  3, 10,  9,  2,  8,  6,  6,
           12,  6,  3,  6,  7,  8,  8,  6,  4,  7, 10,  3, 10,  4,  7,  7,  7,
            9,  5,  5,  4,  9,  5,  9,  7, 10,  8,  7,  8,  7,  6,  5,  5,  4,
            5,  5, 11,  9,  7,  3,  9,  4,  9,  8,  3,  3,  5, 11,  5,  8,  3,
            9,  7,  6,  7,  8,  7,  4,  8,  6, 10, 11,  9, 11,  6,  8,  8,  5,
            6,  5,  4,  5,  8,  8,  7,  8,  8,  8, 11,  4,  8,  8,  8])
    In [5]: doublets
    Out[5]: [1, 4, 3, 3, 1, 1]
    In [6]: counts
    Out[6]:
    array([[  1.,   4.,   2.,   4.,   2.,   3.],
           [  3.,   4.,   3.,   2.,   2.,   2.],
           [  4.,   2.,   3.,   3.,   3.,   3.],
           [  4.,   3.,   0.,   3.,   4.,   4.],
           [  2.,   2.,   3.,   3.,   1.,   3.],
           [  3.,  10.,   1.,   1.,   2.,   1.]])

  - dices : a 100 x 2 array of random numbers from 1 to 6.
  - total : the sum of dices along each row (axis=1), that is, the sum of the two dice for each roll.
  - doublets : the number of doublets, counted per face.
  - counts : the number of occurrences of each combination of faces.
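
The two counting loops can also be written without explicit Python loops. The following is a sketch of one possible vectorized version using np.bincount() and np.add.at(); it is an alternative formulation, not the code used in the book's sample scripts.

```python
import numpy as np
from numpy.random import randint

dices = randint(1, 7, (100, 2))
x, y = dices[:, 0], dices[:, 1]

total = dices.sum(axis=1)                        # row-wise sum, one value per roll
# Faces of the rolls where both dice match, counted per face 1..6.
doublets = np.bincount(x[x == y], minlength=7)[1:]
# np.add.at accumulates correctly even when the same cell appears repeatedly.
counts = np.zeros((6, 6))
np.add.at(counts, (y - 1, x - 1), 1)
```

A plain "counts[y-1, x-1] += 1" with index arrays would count a repeated combination only once, which is why np.add.at() is used here.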

33. Generating random data
■ The following code generates random data under the conditions below and displays it in a graph.
  - For each of 10 evenly spaced points x in the range $0 \le x \le 1$, it generates a value y by adding normally distributed noise with mean 0 and standard deviation 0.3 to the value of the sine function $\sin(2\pi x)$.

    # -*- coding: utf-8 -*-
    #
    # Data generation with random numbers
    #

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas import Series, DataFrame
    from numpy.random import normal

    # Data generation (straightforward implementation)
    def generate_data01(n):
        data_x = []
        data_y = []
        for i in range(n):
            x = float(i) / float(n-1)
            y = np.sin(2 * np.pi * x) + normal(0, 0.3)
            data_x.append(x)
            data_y.append(y)
        return data_x, data_y

    # Data generation (implementation using the broadcast rule)
    def generate_data02(n):
        data_x = np.linspace(0,1,n)
        data_y = np.sin(2 * np.pi * data_x) + normal(0, 0.3, n)
        return data_x, data_y

    # Main
    if __name__ == '__main__':
        fig = plt.figure()
        data_x, data_y = generate_data01(10)
        # data_x, data_y = generate_data02(10)

        subplot = fig.add_subplot(1,1,1)
        subplot.set_xlabel('Observation point')
        subplot.set_ylabel('Value')
        subplot.set_xlim(-0.05,1.05)

        # Display the generated data
        subplot.scatter(data_x, data_y, marker='o', color='blue',
                        label='Observed value')

        # Display the curve of the trigonometric function
        linex = np.linspace(0,1,100)
        liney = np.sin(2 * np.pi * linex)
        subplot.plot(linex, liney, linestyle='--', color='green',
                     label='Theoretical curve')

        subplot.legend(loc=1)
        fig.show()

34. Generating random data
■ Running the data generation code displays the generated points together with the sine curve.
  - Applying several drawing methods to the same subplot object overlays the graphs.
  - Labels attached to the graphs with the label option are displayed as a legend by the legend() method. For the position of the legend (the loc option), see the following.
    ● http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend

    subplot = fig.add_subplot(1,1,1)
    subplot.set_xlabel('Observation point')
    subplot.set_ylabel('Value')
    subplot.set_xlim(-0.05,1.05)

    # Display the generated data
    subplot.scatter(data_x, data_y, marker='o', color='blue',
                    label='Observed value')

    # Display the curve of the trigonometric function
    linex = np.linspace(0,1,100)
    liney = np.sin(2 * np.pi * linex)
    subplot.plot(linex, liney, linestyle='--', color='green',
                 label='Theoretical curve')

    subplot.legend(loc=1)
    fig.show()

36. DataFrame and Series
■ pandas provides DataFrame and Series, objects that correspond to the data frame and the atomic vector of R.
  - A DataFrame is two-dimensional data with labels attached to its rows and columns. It can be manipulated like an Excel sheet. One column represents one data item, and one row represents one record.
  - A Series is a particular row or column taken out of a DataFrame.
■ The following are examples of DataFrames. Here the row labels (index) are sequential numbers, and the column labels (column) are names indicating the kind of data.
  - cities : a DataFrame summarizing the humidity and temperature of each city
  - diceroll : a DataFrame summarizing the results of rolling two dice

    In [10]: cities
    Out[10]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9

    In [11]: diceroll
    Out[11]:
       dice1  dice2
    0      6      6
    1      2      6
    2      2      5
    3      1      4
    4      2      5

37. How to create a DataFrame
■ There are several patterns for creating a DataFrame:
  - Read the data from a csv file.
  - Prepare a Series object for the data of each column and combine them into a DataFrame.
  - Collect the data in an array and convert it to a DataFrame.
  - Create an empty DataFrame with only the columns defined, then add data one row at a time.
  - Create an empty DataFrame without even the columns, then add columns to it.

38. Converting an array to a DataFrame
■ The following example converts a two-dimensional array to a DataFrame.
  - The columns option specifies the name of each column.

    In [1]: dices = randint(1,7,(5,2))
    In [2]: dices
    Out[2]:
    array([[6, 6],
           [2, 6],
           [2, 5],
           [1, 4],
           [2, 5]])
    In [3]: diceroll = DataFrame(dices, columns=['dice1','dice2'])
    In [4]: diceroll
    Out[4]:
       dice1  dice2
    0      6      6
    1      2      6
    2      2      5
    3      1      4
    4      2      5

39. Combining Series objects
■ The following example creates a DataFrame from Series objects.
  - First, create a Series object for the data of each column. The name option gives a Series object a name for its data.
  - np.nan is dummy data representing a missing value.

    In [1]: city = Series(['Tokyo','Osaka','Nagoya','Okinawa'], name='City')
    In [2]: temp = Series([25.0,28.2,27.3,30.9], name='Temperature')
    In [3]: humid = Series([44,42,np.nan,62], name='Humidity')
    In [4]: city
    Out[4]:
    0      Tokyo
    1      Osaka
    2     Nagoya
    3    Okinawa
    Name: City, dtype: object
    In [5]: temp
    Out[5]:
    0    25.0
    1    28.2
    2    27.3
    3    30.9
    Name: Temperature, dtype: float64
    In [6]: humid
    Out[6]:
    0    44
    1    42
    2   NaN
    3    62
    Name: Humidity, dtype: float64

40. Combining Series objects
  - Create the DataFrame by passing a dictionary that maps each column name to the corresponding Series object.

    In [7]: cities = DataFrame({'City':city, 'Temperature':temp, 'Humidity':humid})
    In [8]: cities
    Out[8]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9

  - A DataFrame can also be created from lists instead of Series objects.

    In [11]: data = {'City':['Tokyo','Osaka','Nagoya','Okinawa'],
       ....:         'Temperature':[25.0,28.2,27.3,30.9],
       ....:         'Humidity':[44,42,np.nan,62]}
    In [12]: cities = DataFrame(data)
    In [13]: cities
    Out[13]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9

41. Adding rows to an empty DataFrame
■ The following example adds a row to an empty DataFrame.
  - First, create a DataFrame with only the column names specified.
  - Prepare the data to add as a Series object. Use the index option to give its elements names that match the column names.
  - Add the Series object with the append() method of the DataFrame. (Specify ignore_index=True when adding a Series object.)

    In [1]: diceroll = DataFrame(columns=['dice1','dice2'])
    In [2]: diceroll
    Out[2]:
    Empty DataFrame
    Columns: [dice1, dice2]
    Index: []
    In [3]: oneroll = Series(randint(1,7,2), index=['dice1','dice2'])
    In [4]: oneroll
    Out[4]:
    dice1    5
    dice2    6
    dtype: int64
    In [5]: diceroll = diceroll.append(oneroll, ignore_index=True)
    In [6]: diceroll
    Out[6]:
       dice1  dice2
    0      5      6

42. Adding rows to an empty DataFrame
■ The following example simulates the result of rolling two dice 1000 times.
  - The describe() method of a DataFrame shows its basic statistics.

    In [1]: diceroll = DataFrame(columns=['dice1','dice2'])
    In [2]: for i in range(1000):
       ....:     diceroll = diceroll.append(
       ....:         Series(randint(1,7,2), index=['dice1','dice2']),
       ....:         ignore_index = True)
       ....:
    In [3]: diceroll.describe()
    Out[3]:
                 dice1        dice2
    count  1000.000000  1000.000000
    mean      3.501000     3.510000
    std       1.699673     1.691378
    min       1.000000     1.000000
    25%       2.000000     2.000000
    50%       3.000000     3.000000
    75%       5.000000     5.000000
    max       6.000000     6.000000
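
The session above reflects the pandas version of the time. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on a recent installation a different pattern is needed. The following sketch (one possible approach, not the book's code) collects the rows in a plain list and builds the DataFrame once at the end.

```python
import pandas as pd
from numpy.random import randint

# Collect each roll as a dict, then build the DataFrame once at the end.
rolls = []
for i in range(1000):
    d1, d2 = randint(1, 7, 2)
    rolls.append({'dice1': d1, 'dice2': d2})
diceroll = pd.DataFrame(rolls, columns=['dice1', 'dice2'])

summary = diceroll.describe()    # basic statistics, as in the session above
```

Building the DataFrame in one step also avoids copying the whole table on every added row, which is what row-by-row appending does.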

43. Joining DataFrames
■ The append() method of a DataFrame can also be used to join two DataFrames.
  - With ignore_index=True, the index is reassigned as consecutive numbers. Without it, the index of each original DataFrame is preserved.

    In [1]: diceroll1 = DataFrame(randint(1,7,(5,2)),
       ....:                       columns=['dice1','dice2'])
    In [2]: diceroll2 = DataFrame(randint(1,7,(3,2)),
       ....:                       columns=['dice1','dice2'])
    In [3]: diceroll1
    Out[3]:
       dice1  dice2
    0      5      3
    1      5      2
    2      5      2
    3      6      4
    4      4      1
    In [4]: diceroll2
    Out[4]:
       dice1  dice2
    0      1      4
    1      3      4
    2      5      6
    In [5]: diceroll1.append(diceroll2)
    Out[5]:
       dice1  dice2
    0      5      3
    1      5      2
    2      5      2
    3      6      4
    4      4      1
    0      1      4
    1      3      4
    2      5      6
    In [6]: diceroll1.append(diceroll2, ignore_index=True)       <- renumbers the index consecutively
    Out[6]:
       dice1  dice2
    0      5      3
    1      5      2
    2      5      2
    3      6      4
    4      4      1
    5      1      4
    6      3      4
    7      5      6
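
On recent pandas versions, where DataFrame.append() is no longer available (it was removed in pandas 2.0), the same two results can be obtained with pd.concat(). A minimal sketch, with illustrative variable names:

```python
import pandas as pd
from numpy.random import randint

diceroll1 = pd.DataFrame(randint(1, 7, (5, 2)), columns=['dice1', 'dice2'])
diceroll2 = pd.DataFrame(randint(1, 7, (3, 2)), columns=['dice1', 'dice2'])

# Original indexes are kept: 0..4 followed by 0..2.
kept_index = pd.concat([diceroll1, diceroll2])
# Index is renumbered consecutively: 0..7.
renumbered = pd.concat([diceroll1, diceroll2], ignore_index=True)
```

pd.concat() accepts a list, so more than two DataFrames can be joined in a single call.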

44. Adding columns to a DataFrame
■ The following examples add columns to a DataFrame.
  - Specify the column name in array index notation to add the column.

    In [1]: diceroll = DataFrame()
    In [2]: diceroll['dice1'] = randint(1,7,5)
    In [3]: diceroll
    Out[3]:
       dice1
    0      1
    1      1
    2      3
    3      6
    4      6
    In [4]: diceroll['dice2'] = randint(1,7,5)
    In [5]: diceroll
    Out[5]:
       dice1  dice2
    0      1      2
    1      1      3
    2      3      5
    3      6      3
    4      6      1

  - The pd.concat() function can join several Series as columns, or add a Series as a column to an existing DataFrame. Specify axis=1 to join in the column direction.

    In [1]: dice1 = Series(randint(1,7,5),name='dice1')
    In [2]: dice2 = Series(randint(1,7,5),name='dice2')
    In [3]: diceroll = pd.concat([dice1, dice2], axis=1)
    In [4]: diceroll
    Out[4]:
       dice1  dice2
    0      2      6
    1      6      2
    2      2      6
    3      3      4
    4      5      6
    In [5]: dice3 = Series(randint(1,7,5),name='dice3')
    In [6]: diceroll = pd.concat([diceroll, dice3], axis=1)
    In [7]: diceroll
    Out[7]:
       dice1  dice2  dice3
    0      2      6      1
    1      6      2      1
    2      2      6      2
    3      3      4      4
    4      5      6      5

46. Extracting columns
■ A particular column of a data frame can be extracted as a Series.
  - There are two ways: specifying the column name as an array index, or specifying the column name as an attribute.
  Note: Specifying the column name as an attribute can be slightly confusing.

    In [3]: diceroll
    Out[3]:
       dice1  dice2
    0      5      3
    1      5      2
    2      5      2
    3      6      4
    4      4      1
    In [4]: diceroll['dice1']          <- column name specified as an array index
    Out[4]:
    0    5
    1    5
    2    5
    3    6
    4    4
    Name: dice1, dtype: int64
    In [5]: diceroll.dice1             <- column name specified as an attribute
    Out[5]:
    0    5
    1    5
    2    5
    3    6
    4    4
    Name: dice1, dtype: int64

47. Extracting columns
■ Several columns can also be extracted as a DataFrame.
  - Specify a list of column names as the array index. Note that the notation is cities[['City', 'Humidity']], not cities['City', 'Humidity'].

    In [3]: cities
    Out[3]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9
    In [4]: cities[['City', 'Humidity']]
    Out[4]:
          City  Humidity
    0    Tokyo        44
    1    Osaka        42
    2   Nagoya       NaN
    3  Okinawa        62

  - To extract a single column as a DataFrame rather than a Series, do the following.

    In [5]: cities[['City']]           <- extracted as a DataFrame
    Out[5]:
          City
    0    Tokyo
    1    Osaka
    2   Nagoya
    3  Okinawa
    In [6]: cities['City']             <- extracted as a Series
    Out[6]:
    0      Tokyo
    1      Osaka
    2     Nagoya
    3    Okinawa
    Name: City, dtype: object

48. Extracting rows
■ To extract rows, specify them with array "slice notation".
  - The slice [a:b] selects the rows from position a up to position b-1. The end position b is not included.

    In [4]: cities[0:2]
    Out[4]:
        City  Humidity  Temperature
    0  Tokyo        44         25.0
    1  Osaka        42         28.2
    In [5]: cities[2:3]
    Out[5]:
         City  Humidity  Temperature
    2  Nagoya       NaN         27.3
    In [6]: cities[1:]
    Out[6]:
          City  Humidity  Temperature
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9

  - Rows that satisfy a particular condition can also be extracted as follows.

    In [7]: cities[cities['Temperature']>28]
    Out[7]:
          City  Humidity  Temperature
    1    Osaka        42         28.2
    3  Okinawa        62         30.9

49. Extracting by row and column
■ To extract data by specifying both rows and columns, use the ix field.
  - Specify the rows in slice notation and the columns as a list of column names. As the output shows, the row slice here is interpreted by index label, so the end label 3 is included.

    In [3]: cities
    Out[3]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9
    In [4]: cities.ix[1:3, ['City','Humidity']]
    Out[4]:
          City  Humidity
    1    Osaka        42
    2   Nagoya       NaN
    3  Okinawa        62
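
The ix field belongs to the pandas version used in these slides; it was deprecated in pandas 0.20 and removed in pandas 1.0. On recent versions the label-based loc and the position-based iloc cover the same ground. The following sketch rebuilds the cities DataFrame from the values shown above and contrasts the two.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({'City': ['Tokyo', 'Osaka', 'Nagoya', 'Okinawa'],
                       'Humidity': [44, 42, np.nan, 62],
                       'Temperature': [25.0, 28.2, 27.3, 30.9]})

# Label-based: the slice 1:3 includes the end label 3.
by_label = cities.loc[1:3, ['City', 'Humidity']]
# Position-based: the slice 1:3 excludes position 3, like ordinary Python slices.
by_position = cities.iloc[1:3, [0, 1]]
```

Choosing loc or iloc explicitly removes the ambiguity between labels and positions that made ix hard to reason about.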

50. Row-by-row iteration
■ To process a DataFrame row by row, use the iterrows() method.
  - It returns, in order, the index of each row together with a Series object representing that row.

    In [3]: cities
    Out[3]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9
    In [4]: for index, line in cities.iterrows():
       ....:     print 'Index:', index
       ....:     print line, '\n'
       ....:
    Index: 0
    City           Tokyo
    Humidity          44
    Temperature       25
    Name: 0, dtype: object

    Index: 1
    City           Osaka
    Humidity          42
    Temperature     28.2
    Name: 1, dtype: object

    Index: 2
    City           Nagoya
    Humidity          NaN
    Temperature      27.3
    Name: 2, dtype: object

    ...(remainder omitted)...

51. Modifying data extracted from a DataFrame
■ Treat objects extracted from a DataFrame by the methods described so far as read-only, and do not perform operations that change their values.
  - Depending on the extraction method, the object may or may not refer to the original DataFrame, so the scope of a change becomes unclear.
  - To change the values of the extracted data, copy the object explicitly with the copy() method. The original DataFrame is then left unchanged.

    In [4]: humidity = cities['Humidity'].copy()
    In [5]: humidity[2] = 50
    In [6]: humidity
    Out[6]:
    0    44
    1    42
    2    50
    3    62
    Name: Humidity, dtype: float64
    In [7]: cities                     <- the original DataFrame has not been changed
    Out[7]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9

52. Modifying data extracted from a DataFrame
  - To change a particular element of a DataFrame, specify the element with the loc indexer and assign to it.

    In [4]: cities.loc[2,'Humidity'] = 50
    In [5]: cities
    Out[5]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya        50         27.3
    3  Okinawa        62         30.9

  - The following example sets every Temperature greater than 30 to 30.

    In [6]: for index, line in cities.iterrows():
       ....:     if line['Temperature'] > 30:
       ....:         cities.loc[index, 'Temperature'] = 30
       ....:
    In [7]: cities
    Out[7]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya        50         27.3
    3  Okinawa        62         30.0
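
The loop over iterrows() can also be replaced by a single assignment, because loc accepts a boolean condition as the row selector. A sketch under the assumption that the cities DataFrame holds the values shown on the slides:

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({'City': ['Tokyo', 'Osaka', 'Nagoya', 'Okinawa'],
                       'Humidity': [44, 42, np.nan, 62],
                       'Temperature': [25.0, 28.2, 27.3, 30.9]})

# Change a single element, addressed by row label and column name.
cities.loc[2, 'Humidity'] = 50
# Change every row that satisfies the condition in one assignment.
cities.loc[cities['Temperature'] > 30, 'Temperature'] = 30
```

The condition cities['Temperature'] > 30 produces a boolean Series, and loc assigns only to the rows where it is True.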

54. Converting a DataFrame/Series to an array
■ This slide shows how to convert a DataFrame or Series to an array.
  - Use the as_matrix() method.
  Note: When a DataFrame or Series is passed to a function that takes an array as its argument, it is converted to an array automatically.

    In [3]: cities
    Out[3]:
          City  Humidity  Temperature
    0    Tokyo        44         25.0
    1    Osaka        42         28.2
    2   Nagoya       NaN         27.3
    3  Okinawa        62         30.9
    In [4]: cities.as_matrix()
    Out[4]:
    array([['Tokyo', 44.0, 25.0],
           ['Osaka', 42.0, 28.2],
           ['Nagoya', nan, 27.3],
           ['Okinawa', 62.0, 30.9]], dtype=object)
    In [12]: cities['City']
    Out[12]:
    0      Tokyo
    1      Osaka
    2     Nagoya
    3    Okinawa
    Name: City, dtype: object
    In [13]: cities['City'].as_matrix()
    Out[13]: array(['Tokyo', 'Osaka', 'Nagoya', 'Okinawa'], dtype=object)
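
as_matrix() is specific to older pandas releases; it was deprecated in pandas 0.23 and removed in pandas 1.0. On recent versions, to_numpy() (available since pandas 0.24) plays the same role. A sketch, again assuming the cities values shown on the slides:

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({'City': ['Tokyo', 'Osaka', 'Nagoya', 'Okinawa'],
                       'Humidity': [44, 42, np.nan, 62],
                       'Temperature': [25.0, 28.2, 27.3, 30.9]})

matrix = cities.to_numpy()           # 2-D array; object dtype because column types are mixed
names = cities['City'].to_numpy()    # 1-D array built from a single column
```

Because the columns mix strings and numbers, the resulting two-dimensional array has dtype object, as in the as_matrix() output above.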

55. Shuffling rows
■ Shuffle a deck of playing cards stored in a DataFrame.
  - First, define a DataFrame that holds the cards.

    In [1]: face = ['king','queen','jack','ten','nine','eight',
       ...:         'seven','six','five','four','three','two','ace']

    In [2]: suit = ['spades', 'clubs', 'diamonds', 'hearts']

    In [3]: value = range(13,0,-1)

    In [4]: deck = DataFrame({
       ...:     'face': np.tile(face,4),
       ...:     'suit': np.repeat(suit,13),
       ...:     'value': np.tile(value,4)})

    In [5]: deck.head()
    Out[5]:
        face    suit  value
    0   king  spades     13
    1  queen  spades     12
    2   jack  spades     11
    3    ten  spades     10
    4   nine  spades      9

  (head() is a method that returns only the first few rows.)

  - The permutation() function returns the elements of an array in random order. Below, it shuffles the index taken from the index attribute.

    In [5]: np.random.permutation(deck.index)
    Out[5]:
    array([48, 33,  6, 49, 10, 28, 41, 18, 32, 36, 19, 14, 25, 46, 30, 51,  2,
           31, 12,  5, 42,  4,  9, 40, 43, 13, 16, 35,  8,  1, 50, 20, 17, 22,
           24, 11, 26, 47, 37, 27, 45, 29,  0,  3, 44, 34, 38, 39, 15, 21,  7,
           23])

56. Shuffling rows
  - The reindex() method rearranges the rows of a DataFrame to follow the index it is given. Passing the shuffled index from the previous step shuffles the rows.

    In [6]: deck = deck.reindex(np.random.permutation(deck.index))

    In [7]: deck.head()
    Out[7]:
         face      suit  value
    12    ace    spades      1
    23  three     clubs      3
    8    five    spades      5
    34   five  diamonds      5
    14  queen     clubs     12

  - In addition, reset_index() renumbers the index sequentially.
  ※ If "drop=True" is not specified, the old index is saved in a column named "index".

    In [8]: deck = deck.reset_index(drop=True)

    In [9]: deck.head()
    Out[9]:
        face      suit  value
    0    ace    spades      1
    1  three     clubs      3
    2   five    spades      5
    3   five  diamonds      5
    4  queen     clubs     12
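The two shuffling steps can be combined and made reproducible as in the following sketch. The fixed seed and the name `shuffled` are assumptions added for illustration; `DataFrame.sample(frac=1)` is shown as a possible one-call alternative and is not used in the slides.

```python
import numpy as np
import pandas as pd

face = ['king', 'queen', 'jack', 'ten', 'nine', 'eight',
        'seven', 'six', 'five', 'four', 'three', 'two', 'ace']
suit = ['spades', 'clubs', 'diamonds', 'hearts']
value = list(range(13, 0, -1))

deck = pd.DataFrame({
    'face': np.tile(face, 4),     # 13 faces repeated for each suit
    'suit': np.repeat(suit, 13),  # each suit repeated 13 times
    'value': np.tile(value, 4),
})

# Fixed seed (an assumption) so that the shuffle is repeatable.
rng = np.random.RandomState(0)

# Reorder the rows by a permuted index, then renumber the index from 0.
shuffled = deck.reindex(rng.permutation(deck.index)).reset_index(drop=True)

# Possible alternative: sample every row without replacement.
sampled = deck.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.head())
```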

57. Drawing graphs from a DataFrame
■ A DataFrame object can draw a graph of its own data.
  - Calling the plot() method as shown below draws the data of every column together in one graph.
  ※ For details of the plot() method, see http://pandas.pydata.org/pandas-docs/version/0.17.0/visualization.html

    In [1]: %paste
    result = DataFrame()
    for c in range(3):
        y = 0
        t = []
        for delta in np.random.normal(loc=0.0, scale=1.0, size=100):
            y += delta
            t.append(y)
        result['Trial %d' % c] = t
    ## -- End pasted text --

    In [2]: result.head()
    Out[2]:
        Trial 0   Trial 1   Trial 2
    0 -0.928318 -0.269304  1.242675
    1 -1.992230 -0.456286  2.970072
    2 -1.190998 -1.587571  4.004387
    3 -0.913663 -0.756372  4.437244
    4 -1.839408 -1.554711  4.547231

    In [3]: result.plot(title='Random walk')
    Out[3]: <matplotlib.axes._subplots.AxesSubplot at 0x5b5f090>
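The random-walk table can also be built without the inner loop, because each column is the running sum of its normal steps. The following is a sketch of that idea using np.cumsum; the fixed seed is an assumption, and the plotting call is left out so the snippet runs without a display.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed (an assumption) for repeatability

# 100 steps for each of 3 trials, drawn from N(0, 1).
steps = rng.normal(loc=0.0, scale=1.0, size=(100, 3))

# The position after each step is the cumulative sum down each column.
result = pd.DataFrame(np.cumsum(steps, axis=0),
                      columns=['Trial %d' % c for c in range(3)])

print(result.head())
```

Calling `result.plot(title='Random walk')` on this DataFrame would draw the three trials as in the slide.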

59. Running the sample code
■ The following slides walk through excerpts of the main parts of the sample code "02-square_error.py".
  - Here is an example run of the sample code.

    In [1]: %run 02-square_error.py
    Table of the coefficients
            M=0       M=1        M=3            M=9
    0  0.025208  0.643729   0.032449       0.081758
    1       NaN -1.237042   8.563735     -68.909851
    2       NaN       NaN -25.531220    1875.284034
    3       NaN       NaN  16.898752  -17910.001013
    4       NaN       NaN        NaN   85747.741008
    5       NaN       NaN        NaN -232533.965586
    6       NaN       NaN        NaN  373482.794914
    7       NaN       NaN        NaN -352470.799590
    8       NaN       NaN        NaN  180840.258689
    9       NaN       NaN        NaN  -38962.669319

60. Creating the sample data
■ Generate a dataset at equally spaced points on the interval 0 ≤ x ≤ 1.
  - The data is prepared as a DataFrame whose columns are the (x, y) coordinates.

    In [1]: %paste
    def create_dataset(num):
        dataset = DataFrame(columns=['x','y'])
        for i in range(num):
            x = float(i)/float(num-1)
            y = np.sin(2*np.pi*x) + normal(scale=0.3)
            dataset = dataset.append(Series([x,y], index=['x','y']),
                                     ignore_index=True)
        return dataset
    ## -- End pasted text --

    In [2]: dataset = create_dataset(10)

    In [3]: dataset
    Out[3]:
              x         y
    0  0.000000 -0.176290
    1  0.111111  0.406402
    2  0.222222  0.592576
    3  0.333333  1.175387
    4  0.444444  0.652480
    5  0.555556 -0.477052
    6  0.666667 -1.123978
    7  0.777778 -0.720408
    8  0.888889 -0.417265
    9  1.000000 -0.253635

    In [4]: dataset.plot(kind='scatter', x='x', y='y',
       ...:              xlim=[-0.1,1.1], ylim=[-1.5,1.5])
    Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x4c23790>
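DataFrame.append(), used in create_dataset() above, was removed in pandas 2.0. On a recent installation the same dataset could be built from whole arrays, as in this sketch. The `rng` parameter and the use of np.linspace are assumptions added here; they are not part of the sample code.

```python
import numpy as np
import pandas as pd

def create_dataset(num, rng=None):
    """Return num points of sin(2*pi*x) plus Gaussian noise on 0 <= x <= 1."""
    if rng is None:
        rng = np.random.RandomState()
    x = np.linspace(0.0, 1.0, num)  # num equally spaced points, endpoints included
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=num)
    return pd.DataFrame({'x': x, 'y': y})

dataset = create_dataset(10, np.random.RandomState(0))
print(dataset)
```

Building the columns first and constructing the DataFrame once also avoids copying the table on every row.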

61. Determining the coefficients
■ Compute the coefficients of the polynomial with the least-squares formula.
  - The following function returns the fitted polynomial f(x) and the coefficients w (as f and ws).

    def resolve(dataset, m):
        t = dataset.y
        phi = DataFrame()
        for i in range(0,m+1):
            p = dataset.x**i
            p.name="x**%d" % i
            phi = pd.concat([phi,p], axis=1)
        tmp = np.linalg.inv(np.dot(phi.T, phi))
        ws = np.dot(np.dot(tmp, phi.T), t)

        def f(x):
            y = 0
            for i, w in enumerate(ws):
                y += w * (x ** i)
            return y

        return (f, ws)

62. Determining the coefficients
  - The matrix Φ (phi in the code) is built by adding columns to a DataFrame one at a time.

    In [5]: m = 3

    In [6]: %paste
    t = dataset.y
    phi = DataFrame()
    for i in range(0,m+1):
        p = dataset.x**i
        p.name="x**%d" % i
        phi = pd.concat([phi,p], axis=1)
    tmp = np.linalg.inv(np.dot(phi.T, phi))
    ws = np.dot(np.dot(tmp, phi.T), t)
    ## -- End pasted text --

    In [7]: t
    Out[7]:
    0   -0.176290
    1    0.406402
    2    0.592576
    3    1.175387
    4    0.652480
    5   -0.477052
    6   -1.123978
    7   -0.720408
    8   -0.417265
    9   -0.253635
    Name: y, dtype: float64

    In [8]: phi
    Out[8]:
       x**0      x**1      x**2      x**3
    0     1  0.000000  0.000000  0.000000
    1     1  0.111111  0.012346  0.001372
    2     1  0.222222  0.049383  0.010974
    3     1  0.333333  0.111111  0.037037
    4     1  0.444444  0.197531  0.087791
    5     1  0.555556  0.308642  0.171468
    6     1  0.666667  0.444444  0.296296
    7     1  0.777778  0.604938  0.470508
    8     1  0.888889  0.790123  0.702332
    9     1  1.000000  1.000000  1.000000

    In [9]: ws
    Out[9]: array([ -0.29207698,  11.0815224 , -30.56215955,  19.6937635 ])
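The computation in resolve() corresponds to the least-squares solution w = (ΦᵀΦ)⁻¹ Φᵀ t, where column i of Φ holds x**i. The sketch below is a NumPy-only variant for checking that formula on noise-free data. Using np.vander to build Φ and np.linalg.solve in place of an explicit inverse are substitutions made here, not the sample code's approach, and the signature taking x and t separately is hypothetical.

```python
import numpy as np

def resolve(x, t, m):
    """Fit a degree-m polynomial to (x, t) by solving the normal equations."""
    # Column i of phi is x**i, for i = 0..m.
    phi = np.vander(x, m + 1, increasing=True)
    # Solve (phi^T phi) w = phi^T t rather than forming the inverse.
    ws = np.linalg.solve(phi.T @ phi, phi.T @ t)

    def f(x_new):
        return sum(w * x_new ** i for i, w in enumerate(ws))

    return f, ws

# Noise-free quadratic, so the fit should recover the coefficients.
x = np.linspace(0.0, 1.0, 10)
t = 1.0 + 2.0 * x - 3.0 * x ** 2
f, ws = resolve(x, t, 2)
print(ws)
```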

63. Computing the root mean square error
■ Compute the root mean square (RMS) error of the fitted polynomial.
  - The following function uses the iterrows() method to loop over each element of the dataset.

    def rms_error(dataset, f):
        err = 0
        for index, line in dataset.iterrows():
            x, y = line.x, line.y
            err += 0.5 * (y - f(x))**2
        return np.sqrt(2 * err / len(dataset))

  - The following code prepares a training set and a test set and computes the RMS error of each. The results for each degree M = 0, ..., 9 are collected into a DataFrame.

    df = DataFrame(columns=['Training set','Test set'])
    for m in range(0,10):   # degree of the polynomial
        f, ws = resolve(train_set, m)
        train_error = rms_error(train_set, f)
        test_error = rms_error(test_set, f)
        df = df.append(
                Series([train_error, test_error],
                       index=['Training set','Test set']),
                ignore_index=True)

64. Computing the root mean square error
  - Here is an example of the DataFrame that collects the actual RMS errors.

    In [16]: train_set = create_dataset(10)

    In [17]: test_set = create_dataset(10)

    In [28]: %paste
    df = DataFrame(columns=['Training set','Test set'])
    for m in range(0,10):   # degree of the polynomial
        f, ws = resolve(train_set, m)
        train_error = rms_error(train_set, f)
        test_error = rms_error(test_set, f)
        df = df.append(
                Series([train_error, test_error],
                       index=['Training set','Test set']),
                ignore_index=True)
    ## -- End pasted text --

    In [19]: df
    Out[19]:
       Training set  Test set
    0      0.687219  0.732829
    1      0.555685  0.607513
    2      0.553308  0.607038
    3      0.209720  0.375495
    4      0.209711  0.376193
    5      0.163047  0.350020
    6      0.159958  0.346817
    7      0.046236  0.415637
    8      0.028049  0.415522
    9      0.000047  0.399039

    In [20]: df.plot(title='RMS Error', style=['-','--'], ylim=(0,0.9))
    Out[20]: <matplotlib.axes._subplots.AxesSubplot at 0x531a650>
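Since sqrt(2 * Σ 0.5 (y - f(x))² / N) equals the square root of the mean squared residual, rms_error() can be written without the loop. The sketch below shows that form and collects the per-degree results in a list before building the table once. Note the assumptions: np.polyfit stands in for the slides' resolve() function, only degrees 0 to 3 are fitted to keep the example short, and the seed and dataset construction are illustrative.

```python
import numpy as np
import pandas as pd

def rms_error(dataset, f):
    """Root mean square error of f on the (x, y) points in dataset."""
    residual = dataset['y'].to_numpy() - f(dataset['x'].to_numpy())
    return float(np.sqrt(np.mean(residual ** 2)))

rng = np.random.RandomState(0)  # fixed seed (an assumption)
x = np.linspace(0.0, 1.0, 10)
train_set = pd.DataFrame({'x': x, 'y': np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)})
test_set = pd.DataFrame({'x': x, 'y': np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)})

rows = []
for m in range(4):  # degree of the polynomial
    # np.polyfit is used here in place of resolve() from the slides.
    f = np.poly1d(np.polyfit(train_set['x'], train_set['y'], m))
    rows.append([rms_error(train_set, f), rms_error(test_set, f)])

df = pd.DataFrame(rows, columns=['Training set', 'Test set'])
print(df)
```

Because each higher degree contains the lower-degree polynomials as special cases, the training error cannot increase with the degree, while the test error may.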

66. Running the sample code
■ The following slides walk through excerpts of the main parts of the sample code "04-perceptron.py".
  - Here is an example run of the sample code.

    In [1]: %run 04-perceptron.py

67. Creating the sample data
■ Prepare a dataset whose points carry the attribute t = ±1.
  - The following function collects the (x, y, t) values into a DataFrame with one column for each.

    N1 = 20         # number of points in class t=+1
    Mu1 = [15,10]   # center of class t=+1
    N2 = 30         # number of points in class t=-1
    Mu2 = [0,0]     # center of class t=-1

    # Prepare the dataset {x_n,y_n,type_n}
    def prepare_dataset(variance):
        cov1 = np.array([[variance,0],[0,variance]])
        cov2 = np.array([[variance,0],[0,variance]])
        df1 = DataFrame(multivariate_normal(Mu1,cov1,N1),columns=['x','y'])
        df1['type'] = 1
        df2 = DataFrame(multivariate_normal(Mu2,cov2,N2),columns=['x','y'])
        df2['type'] = -1
        df = pd.concat([df1,df2],ignore_index=True)
        df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)
        return df

68. Creating the sample data
■ The function above works as follows.
  - First, the data for t = +1 and t = -1 are prepared as separate DataFrames.
  - np.random.multivariate_normal() draws samples from a two-dimensional normal distribution, which are wrapped in a DataFrame; the type column is added afterwards.

    In [1]: %paste
    cov1 = np.array([[15,0],[0,15]])
    cov2 = np.array([[15,0],[0,15]])
    df1 = DataFrame(multivariate_normal([15,10],cov1,20),columns=['x','y'])
    df1['type'] = 1
    df2 = DataFrame(multivariate_normal([0,0],cov2,30),columns=['x','y'])
    df2['type'] = -1
    ## -- End pasted text --

    In [2]: df1.head()
    Out[2]:
               x          y  type
    0   6.713990   6.220122     1
    1  21.949116   8.753709     1
    2  12.420816  12.581736     1
    3  11.377856  12.559347     1
    4  18.434834   8.899856     1

    In [3]: df2.head()
    Out[3]:
              x         y  type
    0 -3.592348 -3.889078    -1
    1  0.978584  1.349947    -1
    2  1.882370 -0.047328    -1
    3 -2.084037  0.825577    -1
    4 -5.937611  0.547781    -1

69. Creating the sample data
  - The two DataFrames are concatenated and the rows are shuffled to produce the final dataset.

    In [4]: %paste
    df = pd.concat([df1,df2],ignore_index=True)
    df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)
    ## -- End pasted text --

    In [5]: df.head(10)
    Out[5]:
               x          y  type
    0  18.434834   8.899856     1
    1   1.666034  -2.902376    -1
    2  17.400656   7.974810     1
    3  15.674239  11.469079     1
    4   1.101150  -5.781190    -1
    5  -2.198366   0.042669    -1
    6  17.668891  13.150313     1
    7   6.713990   6.220122     1
    8  18.421506   8.640303     1
    9  14.485980  12.099656     1
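For experimenting outside the IPython session, the dataset preparation can be written as a self-contained function. This is a sketch under stated assumptions: the keyword parameters (`n1`, `mu1`, `n2`, `mu2`, `seed`) replace the module-level constants of the sample code and are hypothetical, and a seeded RandomState is used so that the result is repeatable.

```python
import numpy as np
import pandas as pd

def prepare_dataset(variance, n1=20, mu1=(15, 10), n2=30, mu2=(0, 0), seed=0):
    """Return shuffled points of two classes with columns x, y, type."""
    rng = np.random.RandomState(seed)
    cov = np.array([[variance, 0], [0, variance]])

    # Class t=+1 and class t=-1 as separate DataFrames.
    df1 = pd.DataFrame(rng.multivariate_normal(mu1, cov, n1), columns=['x', 'y'])
    df1['type'] = 1
    df2 = pd.DataFrame(rng.multivariate_normal(mu2, cov, n2), columns=['x', 'y'])
    df2['type'] = -1

    # Concatenate, shuffle the rows, and renumber the index.
    df = pd.concat([df1, df2], ignore_index=True)
    return df.reindex(rng.permutation(df.index)).reset_index(drop=True)

train_set = prepare_dataset(15)
counts = train_set['type'].value_counts()
print(counts)
```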

70. Determining the coefficients with the perceptron
■ The following shows how the perceptron algorithm updates the coefficients.
  - The iterrows() method visits each point of the training set, and the coefficients are updated for every point that is not classified correctly. This pass is repeated 30 times, and the coefficients at the end of each pass are recorded in a DataFrame.

    In [2]: train_set = prepare_dataset(15)

    In [3]: train_set.head()
    Out[3]:
               x          y  type
    0  14.241459  11.795794     1
    1  -1.416575   9.704131    -1
    2  -7.508590  -3.397712    -1
    3   4.835040  -4.651390    -1
    4  -4.255462  -2.498300    -1

    In [4]: %paste
    # Initial parameter values and the bias term
    w0 = w1 = w2 = 0.0
    bias = 0.5 * (train_set.x.mean() + train_set.y.mean())

    # Run 30 iterations
    paramhist = DataFrame([[w0,w1,w2]], columns=['w0','w1','w2'])
    for i in range(30):
        for index, point in train_set.iterrows():
            x, y, type = point.x, point.y, point.type
            if type * (w0*bias + w1*x + w2*y) <= 0:
                w0 += type * 1
                w1 += type * x
                w2 += type * y
        paramhist = paramhist.append(
                        Series([w0,w1,w2], ['w0','w1','w2']),
                        ignore_index=True)
    ## -- End pasted text --

  (The if block updates the coefficients for points that are not classified correctly.)

    In [5]: paramhist.head(10)
    Out[5]:
       w0        w1         w2
    0   0  0.000000   0.000000
    1  -8  4.994187   6.120941
    2 -13  8.141115  12.120890
    3 -16  5.401956   6.230897
    4 -16  5.401956   6.230897
    5 -16  5.401956   6.230897
    6 -16  5.401956   6.230897
    7 -16  5.401956   6.230897
    8 -16  5.401956   6.230897
    9 -16  5.401956   6.230897

    In [6]: paramhist.plot()
    Out[6]: <matplotlib.axes._subplots.AxesSubplot at 0x7f77f08cf410>
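The update loop can be tried on its own with the sketch below. It follows the update rule shown in the slide, but several details are assumptions made for illustration: the toy data uses unit variance and a fixed seed, the class label is held in a variable named `t` to avoid shadowing Python's built-in `type`, itertuples() replaces iterrows(), the history is gathered in a list because DataFrame.append() is not available in pandas 2.0 and later, and the error-rate calculation at the end is an addition that is not taken from the slides.

```python
import numpy as np
import pandas as pd

# Toy training set (assumed): 20 points of class +1 and 30 points of class -1.
rng = np.random.RandomState(0)
pos = rng.multivariate_normal([15, 10], np.eye(2), 20)
neg = rng.multivariate_normal([0, 0], np.eye(2), 30)
train_set = pd.DataFrame(np.vstack([pos, neg]), columns=['x', 'y'])
train_set['type'] = [1] * 20 + [-1] * 30

# Initial parameter values and the bias term, as in the slide.
w0 = w1 = w2 = 0.0
bias = 0.5 * (train_set.x.mean() + train_set.y.mean())

history = [[w0, w1, w2]]
for _ in range(30):
    for point in train_set.itertuples(index=False):
        x, y, t = point.x, point.y, point.type
        # A point is misclassified when the sign of the linear function
        # disagrees with its label (or the value is exactly zero).
        if t * (w0 * bias + w1 * x + w2 * y) <= 0:
            w0 += t * 1
            w1 += t * x
            w2 += t * y
    history.append([w0, w1, w2])

paramhist = pd.DataFrame(history, columns=['w0', 'w1', 'w2'])

# Percentage of training points on the wrong side of the final line.
predicted = np.sign(w0 * bias + w1 * train_set.x + w2 * train_set.y)
err_rate = 100.0 * float((predicted != train_set.type).mean())
print(paramhist.tail(1), err_rate)
```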

71. Drawing the graphs
■ The training set and the resulting line are drawn with the scatter() and plot() methods of a subplot object.
  - In the following example, data_graph is the subplot object.
■ The graph of the coefficient history is drawn with the DataFrame's own plotting feature.
  - The ax option specifies the subplot to draw on. In the following example, param_graph is the subplot object.

    train_set = prepare_dataset(variance)
    train_set1 = train_set[train_set['type']==1]
    train_set2 = train_set[train_set['type']==-1]
    ymin, ymax = train_set.y.min()-5, train_set.y.max()+10
    xmin, xmax = train_set.x.min()-5, train_set.x.max()+10
    data_graph.set_ylim([ymin-1, ymax+1])
    data_graph.set_xlim([xmin-1, xmax+1])
    data_graph.scatter(train_set1.x, train_set1.y, marker='o')
    data_graph.scatter(train_set2.x, train_set2.y, marker='x')

    linex = np.arange(xmin-5, xmax+5)
    liney = - linex * w1 / w2 - bias * w0 / w2
    label = "ERR %.2f%%" % err_rate
    data_graph.plot(linex,liney,label=label,color='red')
    data_graph.legend(loc=1)
    paramhist.plot(ax=param_graph)
    param_graph.legend(loc=1)
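The excerpt above assumes that data_graph and param_graph already exist. One way they might be created is sketched below with matplotlib's add_subplot(). The 1x2 layout, the toy values, the label text, and the non-interactive "Agg" backend are all assumptions chosen so that the snippet runs without a display; they are not taken from the sample code.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

fig = plt.figure()
data_graph = fig.add_subplot(1, 2, 1)   # left: data points and a line
param_graph = fig.add_subplot(1, 2, 2)  # right: coefficient history

# Toy points and an arbitrary line, drawn through the subplot's own methods.
points = pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [0.0, 1.5, 1.0]})
data_graph.scatter(points.x, points.y, marker='o')
linex = np.arange(-1, 4)
data_graph.plot(linex, 0.5 * linex, color='red', label='example line')
data_graph.legend(loc=1)

# DataFrame.plot() draws onto the subplot passed through the ax option.
paramhist = pd.DataFrame({'w0': [0, -1, -2], 'w1': [0.0, 0.5, 0.7]})
returned_ax = paramhist.plot(ax=param_graph)
param_graph.legend(loc=1)
```

Passing `ax=` lets a DataFrame plot share a figure with plots drawn directly through matplotlib.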


Etsuji Nakai
