Getting stuck on Igo, a Java morphological analysis library

2011-03-24

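■ Calbee Gizagiza Potato, Sour Cream & Onion 23:45

The onion aroma is really nice.

The name says "sour", but there is almost no acidity, so it works well as a snack alongside a drink such as a sharply acidic wine.

The ingredients are potatoes (not genetically modified), dextrin, salt, onion powder (contains soy), powdered vinegar, whey powder, sour cream powder, cheese powder, sugar, parsley flakes, garlic powder, flavoring (contains egg and orange), and so on. It has 332 kcal per 60 g, and the sodium content is equivalent to 0.8 g of salt.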

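■ Getting stuck on Igo's NoSuchMethodError 00:09

MeCab and ChaSen are the well-known morphological analyzers for Japanese, and among those written in Java, Sen and GoSen are the well-known ones.

Morphological analysis with Sen on Windows - なぜか数学者にはワイン好きが多い

However, many of these are no longer maintained, so I tried the relatively new Igo.

Igo - a morphological analyzer

What I wanted was to build an Analyzer for Apache Lucene.

I will unpack everything under a directory called Japanese_Lucene.

Lucene is simply downloaded and extracted.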

> wget http://ftp.jaist.ac.jp/pub/apache/lucene/java/lucene-3.0.3.tar.gz
> tar xvf lucene-3.0.3.tar.gz
> mkdir -p Japanese_Lucene/lib
> cp -v lucene-3.0.3/lucene-core-3.0.3.jar Japanese_Lucene/lib/

Fetch the Igo jars as well and put them in lib.

> cd Japanese_Lucene/lib
> wget http://iij.dl.sourceforge.jp/igo/46696/igo-0.4.2.jar
> wget http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1.jar

Fetch the NAIST dictionary and convert it into a dictionary that Igo can use for morphological analysis.

> cd ../..
> wget http://iij.dl.sourceforge.jp/naist-jdic/48487/mecab-naist-jdic-0.6.3-20100801.tar.gz
> tar xvf mecab-naist-jdic-0.6.3-20100801.tar.gz
> cd mecab-naist-jdic-0.6.3-20100801
> grep -v -E '^\"' naist-jdic.csv  > naist-jdic.tmp; mv naist-jdic.tmp naist-jdic.csv
> make clean; ./configure; make; cd ..
> java -cp ./Japanese_Lucene/lib/igo-0.4.2.jar net.reduls.igo.bin.BuildDic ipadic mecab-naist-jdic-0.6.3-20100801 EUC-JP
> ls -l ipadic/

Some files have been generated.

> java -Dfile.encoding=UTF-8 -cp ./Japanese_Lucene/lib/igo-0.4.2.jar net.reduls.igo.bin.Igo ipadic
すもももももももものうち
すもも     名詞,一般,*,*,*,*,すもも,スモモ,スモモ,,
も       助詞,係助詞,*,*,*,*,も,モ,モ,,
もも      名詞,一般,*,*,*,*,もも,モモ,モモ,,
も       助詞,係助詞,*,*,*,*,も,モ,モ,,
もも      名詞,一般,*,*,*,*,もも,モモ,モモ,,
の       助詞,連体化,*,*,*,*,の,ノ,ノ,,
うち      名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ,,
EOS

It looks like it is working.
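Each output line above pairs a surface form with a comma-separated feature string. As a minimal sketch of how that output could be consumed from Java without touching the Igo API, the class below splits such lines into fields. `MorphemeLine` is a hypothetical helper, not part of Igo, and treating field 0 as the part of speech and field 7 as the reading is an assumption based on the IPADIC-style layout seen in the output above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** One "surface<whitespace>feature,feature,..." line (hypothetical helper, not part of Igo). */
final class MorphemeLine {
    final String surface;
    final String[] features;

    private MorphemeLine(String surface, String[] features) {
        this.surface = surface;
        this.features = features;
    }

    /** Returns null for the EOS marker, blank lines, and lines without a feature column. */
    static MorphemeLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.equals("EOS")) return null;
        // Split into surface and feature string at the first run of whitespace.
        String[] cols = trimmed.split("\\s+", 2);
        if (cols.length < 2) return null;
        // Limit -1 keeps trailing empty fields such as the final ",,".
        return new MorphemeLine(cols[0], cols[1].split(",", -1));
    }

    /** Returns the i-th feature, or an empty string if the line has fewer fields. */
    String feature(int i) {
        return i < features.length ? features[i] : "";
    }

    /** Assumption: field 0 holds the part of speech. */
    String partOfSpeech() {
        return feature(0);
    }

    /** Assumption: field 7 holds the reading. */
    String reading() {
        return feature(7);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            MorphemeLine m = parse(line);
            if (m != null) {
                System.out.println(m.surface + " / " + m.partOfSpeech() + " / " + m.reading());
            }
        }
    }
}
```

One way to use it would be to pipe the analyzer's output into it, for example by appending `< input.txt | java MorphemeLine` to the Igo command above, where `input.txt` is a hypothetical UTF-8 text file.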

Now, since I also fetched Igo's Analyzer, let's write a Lucene Indexer and Searcher.

(At first I named the class Indexer instead of MyIndexer, which clashed with a Lucene class of the same name and cost me 3 minutes.)

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import net.reduls.igo.Tagger;
import net.reduls.igo.analysis.ipadic.IpadicAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class MyIndexer
{
  public MyIndexer() {}
  static final File INDEX_DIR = new File("index");
  static public void main(String[] args)
  {
    final File docDir = new File("data");
    try {
      // "ipadic" is the dictionary directory generated by BuildDic above
      final Tagger tagger = new Tagger("ipadic");
      IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new IpadicAnalyzer(tagger), IndexWriter.MaxFieldLength.LIMITED);
      indexDocs(writer, docDir);
      writer.optimize();
      writer.close();
    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +"\n with message: " + e.getMessage());
    }
  }
  static void indexDocs(IndexWriter writer, File file) throws IOException
  {
    if (file.canRead())
    {
      if (file.isDirectory())
      {
        String[] files = file.list();
        if (files != null)
        {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else
      {
        System.out.println("adding " + file);
        writer.addDocument(FileDocument.Document(file));
      }
    }
  }
  private static class FileDocument
  {
    public static Document Document(File f) throws java.io.FileNotFoundException
    {
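      Document doc = new Document();
      // path and modification time are stored as-is, without tokenization
      doc.add(new Field("path", f.getPath(), Field.Store.YES,Field.Index.NOT_ANALYZED));
      doc.add(new Field("modified", DateTools.timeToString(f.lastModified(),DateTools.Resolution.MINUTE),
         Field.Store.YES, Field.Index.NOT_ANALYZED));
      // contents are read through a Reader: tokenized by the Analyzer and indexed, but not stored
      doc.add(new Field("contents", new FileReader(f)));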
      return doc;
    }
     private FileDocument() {}
  }
}
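The recursive walk in indexDocs can be previewed without touching Lucene. The following is a minimal sketch, using only the standard library, that lists the files such a walk would visit. `ListIndexTargets` is a hypothetical name. Unlike indexDocs, `Files.walk` throws if the root directory does not exist rather than silently doing nothing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Dry run of the directory traversal (hypothetical helper, not part of the indexer). */
final class ListIndexTargets {
    /** Returns every readable regular file under root, in sorted order. */
    static List<Path> readableFiles(Path root) throws IOException {
        // Files.walk visits root and everything below it; the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(Files::isReadable)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the "data" directory used by MyIndexer.
        Path root = Paths.get(args.length > 0 ? args[0] : "data");
        for (Path p : readableFiles(root)) {
            System.out.println("would add " + p);
        }
    }
}
```

Running `java ListIndexTargets data` before indexing shows which files would be picked up.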

Build it and pack it into a jar.

> javac -cp Japanese_Lucene/lib/igo-0.4.2.jar:Japanese_Lucene/lib/igo-analyzer-0.0.1.jar:Japanese_Lucene/lib/lucene-core-3.0.3.jar MyIndexer.java
> jar cvf MyIndexer.jar MyIndexer.class 'MyIndexer$FileDocument.class'
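Long hand-typed classpaths like this one are easy to get wrong, and the java launcher silently ignores entries that do not exist, so a misspelled jar name only surfaces later as a missing class. As a minimal sketch, the hypothetical `ClasspathCheck` class below reports classpath entries that are not present on disk. Wildcard entries ending in `*` are skipped rather than expanded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Reports classpath entries that do not exist (hypothetical helper). */
final class ClasspathCheck {
    /** Returns the entries of the given classpath string that are missing on disk. */
    static List<String> missingEntries(String classpath) {
        List<String> missing = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            // Skip empty entries and wildcard entries, which this sketch does not expand.
            if (entry.isEmpty() || entry.endsWith("*")) continue;
            if (!new File(entry).exists()) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Checks the classpath this JVM was started with.
        for (String entry : missingEntries(System.getProperty("java.class.path"))) {
            System.out.println("missing: " + entry);
        }
    }
}
```

Starting it with the same `-cp` argument as the real program, with `ClasspathCheck` as the main class, prints any entry that the JVM would have ignored.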

Create a directory and drop some Japanese text files into it.

> mkdir data
(put whatever text files you like under data)

And then, when I build the Lucene index...

> java -cp .:Japanese_Lucene/lib/igo-0.4.2.jar:Japanese_Lucene/lib/igo-analyzer-0.0.1.jar:Japanese_Lucene/lib/lucene-core-3.0.3.jar MyIndexer
adding data/1040.txt
Exception in thread "main" java.lang.NoSuchMethodError: net.reduls.igo.Tagger.parse(Ljava/lang/String;)Ljava/util/List;
        at net.reduls.igo.analysis.ipadic.IpadicTokenizer.readMorpheme(Unknown Source)
        at net.reduls.igo.analysis.ipadic.IpadicTokenizer.incrementToken(Unknown Source)
        at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:137)
        at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:826)
        at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:802)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1998)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1972)
        at MyIndexer.indexDocs(MyIndexer.java:57)
        at MyIndexer.indexDocs(MyIndexer.java:51)
        at MyIndexer.main(MyIndexer.java:33)

I was stuck on this error for a good 3 hours.

What I finally worked out is that you must not use the prebuilt http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1.jar.
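The error message itself says that code in the analyzer jar tried to call a `parse` method on `net.reduls.igo.Tagger` that takes a `String` and returns a `List`, and that the loaded Tagger class has no method with exactly that signature. A likely reading, though this is an inference rather than something confirmed here, is that the prebuilt analyzer jar was compiled against a different Igo release. As a minimal sketch, the hypothetical `ShowMethods` class below uses reflection to print the signatures a class actually offers, so they can be compared with the one in the error.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Prints the public methods of a class that have a given name (hypothetical helper). */
final class ShowMethods {
    /** Returns entries of the form "returnType name(paramType, ...)" for each match. */
    static List<String> signatures(Class<?> cls, String name) {
        List<String> out = new ArrayList<>();
        for (Method m : cls.getMethods()) {
            if (!m.getName().equals(name)) continue;
            StringBuilder sb = new StringBuilder();
            sb.append(m.getReturnType().getName()).append(' ').append(name).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(params[i].getName());
            }
            out.add(sb.append(')').toString());
        }
        return out;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults match the class and method named in the stack trace.
        String className = args.length > 0 ? args[0] : "net.reduls.igo.Tagger";
        String methodName = args.length > 1 ? args[1] : "parse";
        for (String s : signatures(Class.forName(className), methodName)) {
            System.out.println(s);
        }
    }
}
```

Running it with igo-0.4.2.jar on the classpath, for example `java -cp .:Japanese_Lucene/lib/igo-0.4.2.jar ShowMethods`, lists the `parse` signatures that this Igo jar provides.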

If you instead fetch http://jaist.dl.sourceforge.jp/igo/46413/igo-analyzer-0.0.1-src.tar.gz, build it with ant, and use the resulting igo-analyzer-0.0.1.jar, it works:

> java -cp .:Japanese_Lucene/lib/igo-0.4.2.jar:Japanese_Lucene/lib/igo-analyzer-0.0.1.jar:Japanese_Lucene/lib/lucene-core-3.0.3.jar MyIndexer
adding data/1040.txt
adding data/2498.txt
adding data/5587.txt
adding data/2457.txt
adding data/12296.txt
adding data/6715.txt
adding data/812.txt
adding data/8456.txt
adding data/8917.txt

The output comes pouring out, and ls -l index confirms that some files have been created.
