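Introduction

I wanted to classify documents, but automatic classification with fastText did not work as well as I had hoped.
I suspected the cause was that I had too few training examples.
What I want is a classifier that works well even with little training data.
A little research suggested that gensim + scikit-learn can also do classification, so I decided to try that as well.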

Prerequisites

Windows 10 Pro version 1803
Ubuntu 18.04 LTS (Bionic Beaver) on Windows Subsystem for Linux (WSL)
Python 3.6.5 :: Anaconda, Inc.
gensim==3.5.0
scikit-learn==0.19.1
mecab-python3==0.7

Overall flow

Set up Linux
Install Python
Get a Japanese corpus
Install a Japanese parser
Install the classification libraries
Create training data
Build the model
Classify

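Install Ubuntu on Windows Subsystem for Linux (WSL)

See the linked article.

Connect to Windows Subsystem for Linux over ssh

See the linked article.
From here on, I basically work in Tera Term.

Make vi easier for beginners to use

See the linked article.
As a beginner myself, I write all of the source code below in vim.

Install Python

See the linked article.

Get a Japanese corpus

See the linked article.

Install a Japanese parser

An easy install with pip: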

pip install mecab-python3

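If that does not work, see the linked article.
That article also includes a test script.

Install the classification libraries

This time I use gensim + scikit-learn.
Perhaps my timing was bad, but I was also told to upgrade pip, so I ran everything together: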

pip install --upgrade pip
pip install msgpack
pip install msgpack-python
pip install gensim

Installing in this order produced no errors or warnings.

Create training data

See the linked article.

Build the model

First, create a script that builds the dictionary:

vi make_dictionary.py

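To build the dictionary, MeCab first tokenizes every training document, keeping only the nouns.
gensim then turns the token lists into a dictionary.
While at it, the vocabulary is filtered with filter_extremes. Its options are:
no_below: remove words that appear in fewer than N documents
no_above: remove words that appear in more than the given fraction of documents (for example 0.05 for 5%; words at exactly that fraction are kept)
keep_n: after the no_below and no_above filters, keep only the given number of most frequent words
keep_tokens: always keep the specified words
I chose the values roughly by feel this time.
They will probably need to be revisited later.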
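
To make the effect of no_below and no_above concrete, here is a minimal pure-Python sketch of a document-frequency filter. It is not gensim's implementation (gensim may round the no_above threshold differently), and the function name filter_extremes_sketch and the toy documents are made up for illustration.

```python
from collections import Counter


def filter_extremes_sketch(tokens_list, no_below=5, no_above=0.5):
    """Return the words that survive a simple document-frequency filter.

    Illustration only: this mimics the idea behind no_below / no_above,
    not gensim's exact behaviour.
    """
    num_docs = len(tokens_list)
    # Document frequency: each word counts once per document it appears in.
    doc_freq = Counter()
    for tokens in tokens_list:
        doc_freq.update(set(tokens))
    max_docs = no_above * num_docs
    return {word for word, df in doc_freq.items() if no_below <= df <= max_docs}


# Hypothetical, already tokenized documents.
docs = [
    ['election', 'party', 'diet'],
    ['election', 'match', 'diet'],
    ['election', 'match', 'movie'],
    ['election', 'party'],
]
kept = filter_extremes_sketch(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In this toy input, 'election' appears in every document and 'movie' in only one, so both are dropped. The idea is that very common and very rare words carry little information for telling categories apart.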

make_dictionary.py

import sys
from gensim import corpora
import MeCab

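def get_tokens(content):
    # Return the surface forms of all nouns in the text.
    tokens = []
    tagger = MeCab.Tagger('')
    # Parse an empty string once first; a commonly used workaround so that
    # node.surface is read correctly with mecab-python3.
    tagger.parse('')
    node = tagger.parseToNode(content)
    while node:
        # The first feature field is the part of speech; '名詞' means noun.
        if node.feature.split(',')[0] == '名詞':
            tokens.append(node.surface)
        node = node.next
    return tokens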

def get_content(file_name):
    contexts = []
    with open(file_name, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            contexts.append(line.strip())
            line = f.readline()
    return ''.join(contexts)

def get_tokens_list(input_file):
    tokens_list = []
    file_list = []
    with open(input_file, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            file_list.append(line.strip())
            line = f.readline()
    for file_name in file_list:
        content = get_content(file_name)
        tokens = get_tokens(content)
        tokens_list.append(tokens)
    return tokens_list

def main(argv):
    input_file = argv[0]
    output_file = argv[1]
    tokens_list = get_tokens_list(input_file)
    dictionary = corpora.Dictionary(tokens_list)
    dictionary.filter_extremes(no_below=5, no_above=0.05)
    dictionary.save_as_text(output_file)

if __name__ == '__main__':
    main(sys.argv[1:])

Next, write a script that builds the classifier (model):

vi make_model.py

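The model script again splits all of the training data into morphemes and keeps only the nouns, loads the dictionary created earlier, uses it to build a sparse bag-of-words vector for each document, converts that into a dense vector, and trains on the list of dense vectors together with the list of labels.
Labels are integers; the category names are written out to a separate label dictionary file.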
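
The sparse-to-dense step can be pictured with a small pure-Python sketch. It is only a conceptual stand-in for what doc2bow and corpus2dense produce, not gensim's code; the helper names build_token2id, to_bow and to_dense and the sample tokens are hypothetical.

```python
def build_token2id(tokens_list):
    """Assign an integer id to every distinct token, in order of appearance."""
    token2id = {}
    for tokens in tokens_list:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Sparse vector: sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = {}
    for token in tokens:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


def to_dense(bow, num_terms):
    """Dense vector: one slot per dictionary entry, zero where the word is absent."""
    dense = [0.0] * num_terms
    for token_id, count in bow:
        dense[token_id] = float(count)
    return dense


# Hypothetical training tokens and a new document to vectorize.
token2id = build_token2id([['election', 'party'], ['match', 'goal']])
bow = to_bow(['party', 'party', 'goal', 'unknown'], token2id)
dense = to_dense(bow, len(token2id))
print(bow)
print(dense)
```

Every dense vector has the same length as the dictionary, which is what lets the classifier treat each document as a fixed-size feature row.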

make_model.py

import sys, os, re
import MeCab
from gensim import corpora, matutils
from sklearn.ensemble import RandomForestClassifier
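# sklearn.externals.joblib matches the scikit-learn 0.19.1 used here; it was
# removed in later releases, where a plain `import joblib` is used instead.
from sklearn.externals import joblib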

def get_tokens(content):
    tokens = []
    tagger = MeCab.Tagger('')
    tagger.parse('')
    node = tagger.parseToNode(content)
    while node:
        if node.feature.split(',')[0] == '名詞':
            tokens.append(node.surface)
        node = node.next
    return tokens

def get_content(file_name):
    contexts = []
    with open(file_name, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            contexts.append(line.strip())
            line = f.readline()
    return ''.join(contexts)

def get_tokens_list(input_file):
    tokens_list = []
    file_list = []
    with open(input_file, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            file_list.append(line.strip())
            line = f.readline()
    for file_name in file_list:
        content = get_content(file_name)
        tokens = get_tokens(content)
        tokens_list.append(tokens)
    return tokens_list

def get_dense_list(tokens_list, dictionary):
    dense_list = []
    num_terms = len(dictionary)
    for tokens in tokens_list:
        vector = dictionary.doc2bow(tokens)
        dense = list( matutils.corpus2dense([vector], num_terms=num_terms).T[0] )
        dense_list.append( dense )
    return dense_list

def get_label_list(input_list_file, output_label_dictionary):
    label_list = []
    label_dict = {}
    label_num = 0
    with open(input_list_file, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            file_name = os.path.basename(line.strip())
            name, ext = os.path.splitext(file_name)
            category_name = re.sub(r'[0-9]+','',name)
            if category_name in label_dict.keys():
                label = label_dict[category_name]
            else:
                label_dict[category_name] = label_num
                label = label_num
                label_num+=1
            label_list.append(label)
            line = f.readline()
    with open(output_label_dictionary, 'w', encoding='utf-8') as f:
        for key, value in label_dict.items():
            f.write(str(value)+','+key+'\n')
    return label_list

def main(argv):
    input_list_file = argv[0]
    input_dictionary_file = argv[1]
    output_label_dictionary = argv[2]
    output_model = argv[3]
    tokens_list = get_tokens_list(input_list_file)
    dictionary = corpora.Dictionary.load_from_text(input_dictionary_file)
    dense_list = get_dense_list(tokens_list, dictionary)
    label_list = get_label_list(input_list_file, output_label_dictionary)
    estimator = RandomForestClassifier()
    estimator.fit(dense_list, label_list)
    joblib.dump(estimator, output_model)

if __name__ == '__main__':
    main(sys.argv[1:])

The two scripts are used as follows:

python make_dictionary.py training-data-file-list(categorized_file_list.txt) dictionary-file(dictionary)
python make_model.py training-data-file-list(categorized_file_list.txt) dictionary-file(dictionary) label-dictionary-file(label_dict) model-file(model)
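
Note that make_model.py takes the category of each training file from its file name with the digits removed. The sketch below mirrors that logic on its own so the expected naming is easier to see. The paths such as data/Politics001.txt are hypothetical examples, not the actual training files.

```python
import os
import re


def category_from_path(path):
    """Strip the directory, the extension and all digits from a file path."""
    name, _ext = os.path.splitext(os.path.basename(path))
    return re.sub(r'[0-9]+', '', name)


def encode_labels(paths):
    """Map each category name to an integer label in order of first appearance."""
    label_dict = {}
    labels = []
    for path in paths:
        category = category_from_path(path)
        if category not in label_dict:
            label_dict[category] = len(label_dict)
        labels.append(label_dict[category])
    return labels, label_dict


# Hypothetical entries of a file list such as categorized_file_list.txt.
paths = ['data/Politics001.txt', 'data/Sports001.txt', 'data/Politics002.txt']
labels, label_dict = encode_labels(paths)
print(labels)
print(label_dict)
```

Under this scheme a category name that itself contains digits would be altered, so the file names need to follow a "category plus serial number" pattern.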

Classify

Now for the long-awaited classification.

vi predict.py

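The given string is split into morphemes.
The script loads the dictionary and the model, uses the dictionary to turn the morphemes into a dense vector, and passes that vector to the classifier.
The label it gets back is converted to a category name using the label dictionary.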
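
The last step, turning a numeric label back into a category name, can be sketched on its own. This assumes the "label,category" line format that make_model.py writes; the sample content and the predicted value are made up, and read_label_dictionary is a hypothetical helper name.

```python
import io


def read_label_dictionary(lines):
    """Parse 'label,category' lines into a {int label: category name} dict."""
    label_dictionary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split only on the first comma so category names may contain commas.
        label, category = line.split(',', 1)
        label_dictionary[int(label)] = category
    return label_dictionary


# In-memory stand-in for a label dictionary file.
sample = io.StringIO('0,Politics\n1,Sports\n')
label_dictionary = read_label_dictionary(sample)

predicted = [0]  # hypothetical classifier output for one document
print(label_dictionary[predicted[0]])
```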

predict.py

import sys
from gensim import corpora, matutils
import MeCab
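# sklearn.externals.joblib matches the scikit-learn 0.19.1 used here; it was
# removed in later releases, where a plain `import joblib` is used instead.
from sklearn.externals import joblib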

def get_tokens(content):
    tokens = []
    tagger = MeCab.Tagger('')
    tagger.parse('')
    node = tagger.parseToNode(content)
    while node:
        if node.feature.split(',')[0] == '名詞':
            tokens.append(node.surface)
        node = node.next
    return tokens

def get_label_dictionary(input_label_dictionary):
    label_dictionary = {}
    with open(input_label_dictionary, 'r', encoding='utf-8') as f:
        line = f.readline()
        while line:
            label, category = line.strip().split(',',1)
            label_dictionary[int(label)] = category
            line = f.readline()
    return label_dictionary

def main(argv):
    input_dictionary_file = argv[0]
    input_model_file = argv[1]
    input_label_dictionary = argv[2]
    input_text = argv[3]
    dictionary = corpora.Dictionary.load_from_text(input_dictionary_file)
    estimator = joblib.load(input_model_file)
    label_dictionary = get_label_dictionary(input_label_dictionary)
    tokens = get_tokens(input_text)
    vector = dictionary.doc2bow(tokens)
    dense = list( matutils.corpus2dense([vector], num_terms=len(dictionary)).T[0] )
    label = estimator.predict([dense])
    print(label_dictionary[label[0]])

if __name__ == '__main__':
    main(sys.argv[1:])

The following input was classified as "Politics".
Looks good.

python predict.py dictionary model label_dict 自民党の党・政治制度改革実行本部（塩崎恭久本部長）は１３日、党本部で総会を開き、党改革に関する提言案をまとめた。

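Conclusion

Compared with fastText, gensim + scikit-learn seemed able to classify with a smaller amount of text.
However, in terms of perceived speed fastText was clearly faster than gensim + scikit-learn; gensim + scikit-learn kept me waiting for a while even with a small number of documents.
When I get the chance, I would like to measure accuracy and processing time.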
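
As a starting point for that measurement, here is a rough sketch of a harness that reports accuracy and elapsed time for any prediction function. The names evaluate and dummy_predict and the sample texts are hypothetical; a real run would plug in the fastText and gensim + scikit-learn predictors and use held-out labeled documents.

```python
import time


def evaluate(predict, samples):
    """Return (accuracy, elapsed_seconds) for predict over (text, label) pairs."""
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if predict(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    accuracy = correct / len(samples) if samples else 0.0
    return accuracy, elapsed


def dummy_predict(text):
    """Stand-in classifier used only to exercise the harness."""
    return 'Politics' if 'party' in text else 'Sports'


# Made-up labeled samples; the third one is deliberately misclassified.
samples = [
    ('the party held a meeting', 'Politics'),
    ('the team won the match', 'Sports'),
    ('a party for the winning team', 'Sports'),
]
accuracy, elapsed = evaluate(dummy_predict, samples)
print(accuracy, elapsed)
```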

Trying document classification in Python with gensim + scikit-learn
