Natural Language Processing with Wikipedia Using Python


2. STUDIO OUSIA


7. https://github.com/ikuyamada/wikipedia-nlp

8. wiki_corpus.py

    import bz2
    import sys

    from rdflib import Graph


    def read_ttl(f):
        lines = []
        for line in f:
            lines.append(line.decode('utf-8').rstrip())
            if len(lines) == 1000:  # process 1000 lines at a time
                for triple in parse_lines(lines):
                    yield triple
                lines = []
        if lines:
            for triple in parse_lines(lines):
                yield triple


    def parse_lines(lines):
        g = Graph()
        g.parse(data='\n'.join(lines), format='n3')
        return g


    with bz2.BZ2File(sys.argv[1]) as in_file:
        for (_, p, o) in read_ttl(in_file):
            # Print only the article text stored in nif:isString
            if p.toPython() == 'http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#isString':
                print(o.toPython())

    % wget http://downloads.dbpedia.org/2016-10/core-i18n/ja/nif_context_ja.ttl.bz2
    % python wiki_corpus.py nif_context_ja.ttl.bz2 > corpus.txt

9. word2vec.py

    import logging
    import sys

    from gensim.models.word2vec import Word2Vec, LineSentence

    logging.basicConfig(level=logging.INFO)

    # sg=1 selects the skip-gram model
    model = Word2Vec(LineSentence(sys.argv[1]), sg=1)
    model.save(sys.argv[2])

    % mecab -Owakati corpus.txt -o corpus_wakati.txt
    % python word2vec.py corpus_wakati.txt wiki_w2v

    >>> model = Word2Vec.load('wiki_w2v')
    >>> model.wv.most_similar('日本')[:3]
    [('韓国', 0.6719746589660645), ('台湾', 0.6447558403015137), ('英国', 0.6377681493759155)]
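The `most_similar` query on this slide ranks words by how close their vectors are. A minimal pure-Python sketch of that idea follows, assuming cosine similarity as the measure; the function names and the 3-dimensional toy vectors are made up for illustration and are not taken from gensim or from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, query, topn=3):
    """Rank every other key in `vectors` by cosine similarity to `query`."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "japan": [0.9, 0.1, 0.0],
    "korea": [0.8, 0.2, 0.1],
    "taiwan": [0.7, 0.3, 0.2],
    "banana": [0.0, 0.1, 0.9],
}

ranking = most_similar(toy_vectors, "japan")
print(ranking)
```

With real embeddings the dictionary would hold hundreds of dimensions per word, and a library would use vectorized operations instead of Python loops.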


11. Wikipedia2Vec

    % pip install wikipedia2vec

12. Download a prebuilt database:

    % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420.db.bz2 -O jawiki.db.bz2
    % bunzip2 jawiki.db.bz2

    Or build it from a Wikipedia dump:

    % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
    % wikipedia2vec build_dump_db jawiki-latest-pages-articles.xml.bz2 jawiki.db

13. similar_char.py

    import sys
    from collections import Counter

    import Levenshtein
    from wikipedia2vec.dump_db import DumpDB

    dump_db = DumpDB(sys.argv[1])

    pair_counter = Counter()
    for (title1, title2) in dump_db.redirects():
        ops = Levenshtein.editops(title1.lower(), title2.lower())
        # Keep redirect pairs that differ by exactly one substituted character
        if len(ops) == 1:
            (op, p1, p2) = ops[0]
            if op == 'replace':
                pair_counter[frozenset((title1[p1], title2[p2]))] += 1

    for (pair, count) in pair_counter.most_common():
        print('%s\t%s\t%d' % (*list(pair), count))

    % python similar_char.py jawiki.db > out.tsv
    % cat out.tsv
    イ    ー    1857
    澤    沢    1747
    ・    ＝    1124
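The redirect-based counting on this slide can be tried without a Wikipedia dump or the Levenshtein package. The sketch below is an assumption-laden stand-in: it treats "exactly one `replace` edit" as "same length, exactly one differing position", compares characters case-insensitively one at a time, and uses made-up redirect pairs instead of `dump_db.redirects()`.

```python
from collections import Counter


def single_substitution(a, b):
    """Return the differing character pair if `a` and `b` have the same
    length and differ in exactly one position (ignoring case), else None."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x.lower() != y.lower()]
    if len(diffs) == 1:
        return diffs[0]
    return None


# Hypothetical redirect pairs, for illustration only.
toy_redirects = [
    ("黒沢明", "黒澤明"),
    ("滝沢", "滝澤"),
    ("コンピュータ", "コンピューター"),  # different length: an insertion
    ("Tokyo", "tokyo"),  # differs only by case: no substitution
]

pair_counter = Counter()
for title1, title2 in toy_redirects:
    pair = single_substitution(title1, title2)
    if pair is not None:
        # frozenset makes the pair order-independent, as in similar_char.py
        pair_counter[frozenset(pair)] += 1

for pair, count in pair_counter.most_common():
    print("\t".join(sorted(pair)), count, sep="\t")
```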


18. Download the prebuilt dictionary and mention database:

    % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_mention.pkl.bz2 -O jawiki_mention.pkl.bz2
    % wget http://wikipedia2vec.s3.amazonaws.com/models/ja/2018-04-20/jawiki_20180420_dic.pkl.bz2 -O jawiki_dic.pkl.bz2
    % bunzip2 jawiki_dic.pkl.bz2 jawiki_mention.pkl.bz2

    Or build them from a Wikipedia dump:

    % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
    % wikipedia2vec build_dump_db jawiki-latest-pages-articles.xml.bz2 jawiki.db
    % wikipedia2vec build_dictionary jawiki.db jawiki_dic.pkl
    % wikipedia2vec build_mention_db jawiki.db jawiki_dic.pkl jawiki_mention.pkl

19. [Figure: examples of words and phrases and their link probabilities (for example, 自民党)]
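Link probability can be illustrated with a small calculation. The sketch below assumes the usual definition, the fraction of a phrase's occurrences that appear as hyperlink anchors; the function name, the `(text, anchors)` data layout, and the sample sentences are hypothetical and do not reflect how wikipedia2vec stores its counts.

```python
def link_probability(phrase, documents):
    """Fraction of occurrences of `phrase` that are hyperlink anchors.

    `documents` is a list of (text, anchors) pairs, where `anchors` lists
    the anchor texts of the links that appear in that text.
    """
    occurrences = sum(text.count(phrase) for text, _ in documents)
    linked = sum(anchors.count(phrase) for _, anchors in documents)
    return linked / occurrences if occurrences else 0.0


# Hypothetical mini-corpus, for illustration only.
toy_documents = [
    ("Apple released a new phone. Apple is based in Cupertino.", ["Apple"]),
    ("She ate an apple.", []),
    ("Apple stock rose.", ["Apple"]),
]

prob = link_probability("Apple", toy_documents)
print(round(prob, 3))
```

A phrase that is almost always linked when it appears is likely to be a meaningful name, which is why a threshold on this value can be used to filter a dictionary.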

20. word_dic.py

    import sys

    from wikipedia2vec.dictionary import Dictionary
    from wikipedia2vec.mention_db import MentionDB

    dic = Dictionary.load(sys.argv[1])
    db = MentionDB.load(sys.argv[2], dic)

    words = set()
    for mention in db:
        # Keep only words and phrases with a link probability of at least 0.2
        if mention.link_prob >= 0.2:
            if mention.text not in words:
                words.add(mention.text)
                print(mention.text)

    % python word_dic.py jawiki_dic.pkl jawiki_mention.pkl > out.txt
    % cat out.txt | sort -R | less
    % cat out.txt | wc -l
    1441724

21. Example text: "This scientist names a constant that is equal to Loschmidt's Constant times "RT over P" and is equal to the Faraday constant over the elementary charge."

    Linked entities:
    ‣ Wikipedia: Loschmidt_constant
    ‣ Wikipedia: Faraday_constant
    ‣ Wikipedia: Elementary_charge

22. entity_linking.py

    import sys

    from wikipedia2vec.dictionary import Dictionary
    from wikipedia2vec.mention_db import MentionDB
    from wikipedia2vec.utils.tokenizer.mecab_tokenizer import MeCabTokenizer

    # Arguments: dictionary file, mention DB file, input text file
    dic = Dictionary.load(sys.argv[1])
    db = MentionDB.load(sys.argv[2], dic)

    with open(sys.argv[3]) as f:
        text = f.read()

    tokenizer = MeCabTokenizer()
    tokens = tokenizer.tokenize(text)
    for mention in db.detect_mentions(text, tokens):
        print(mention)

    % python entity_linking.py jawiki_dic.pkl jawiki_mention.pkl input.txt
    <Mention NHK連続テレビ小説 -> 連続テレビ小説>
    <Mention 半分、青い。 -> 半分、青い。>
    <Mention 永野芽郁 -> 永野芽郁>
    <Mention コラムニスト -> コラムニスト>
    <Mention 木村隆志 -> 木村隆志>
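To make the idea of dictionary-based mention detection concrete, here is a minimal sketch that uses greedy longest-match lookup over tokens. This is an assumed, simplified strategy for illustration and not the implementation behind `MentionDB.detect_mentions`; the function, the toy dictionary, and the tokenization are all hypothetical.

```python
def detect_mentions(tokens, mention_dict, max_len=4):
    """Greedy longest-match lookup of token spans in `mention_dict`.

    Returns (surface, entity) pairs in order of appearance.
    """
    results = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest span first so that longer names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            # Japanese tokens are joined without spaces.
            surface = "".join(tokens[i:i + n])
            if surface in mention_dict:
                match = (surface, mention_dict[surface], n)
                break
        if match:
            results.append(match[:2])
            i += match[2]  # skip past the matched span
        else:
            i += 1
    return results


# Hypothetical dictionary and tokenization, for illustration only.
toy_dict = {"NHK連続テレビ小説": "連続テレビ小説", "テレビ": "テレビ"}
toy_tokens = ["NHK", "連続", "テレビ", "小説", "を", "見る"]

mentions = detect_mentions(toy_tokens, toy_dict)
print(mentions)
```

Because the longest span is tried first, the shorter entry テレビ is not reported separately once it has been covered by the longer match.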

23. Wikipedia2Vec

24. Wikipedia2Vec: [Figure: the sentence "Aristotle was a philosopher" shown with words and entities. Labels: Logic, Science, Europe, Socrates, Renaissance, Metaphysics, Philosopher, Philosophy, Avicenna, Aristotle, Plato]

25. Wikipedia2Vec: https://wikipedia2vec.github.io

26. Wikipedia2Vec

    f_dot = <float32_t>(blas.sdot(&dim_size, &syn0[index1, 0], &one, &syn1[index, 0], &one))

    cdef inline void _train_pair(int32_t index1, int32_t index2, float32_t alpha, int32_t negative, int32_t [:] neg_table) nogil:
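The Cython fragments on this slide show a dot product between a row of `syn0` and a row of `syn1` and a `_train_pair` signature with `negative` and `neg_table` parameters. Based only on those names, the sketch below assumes a standard skip-gram negative-sampling update as in word2vec; it is a plain-Python illustration with made-up vectors, not a transcription of the library's code, and it takes the negative indices as an explicit list instead of sampling them from a table.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_pair(syn0, syn1, index1, index2, alpha, negatives):
    """One skip-gram negative-sampling step for the pair (index1, index2).

    syn0 holds input vectors and syn1 output vectors (lists of lists).
    `negatives` lists the indices used as negative examples.
    """
    dim = len(syn0[index1])
    grad_in = [0.0] * dim
    for index, label in [(index2, 1.0)] + [(n, 0.0) for n in negatives]:
        # Same role as the blas.sdot call: input vector times output vector.
        f_dot = dot(syn0[index1], syn1[index])
        g = (label - sigmoid(f_dot)) * alpha
        for d in range(dim):
            grad_in[d] += g * syn1[index][d]
            syn1[index][d] += g * syn0[index1][d]
    # Apply the accumulated gradient to the input vector once at the end.
    for d in range(dim):
        syn0[index1][d] += grad_in[d]


# Hypothetical 2-dimensional vectors, for illustration only.
syn0 = [[0.5, 0.1], [0.0, 0.0], [0.0, 0.0]]
syn1 = [[0.0, 0.0], [0.2, 0.4], [-0.3, 0.1]]

before_pos = dot(syn0[0], syn1[1])
before_neg = dot(syn0[0], syn1[2])
train_pair(syn0, syn1, index1=0, index2=1, alpha=0.1, negatives=[2])
after_pos = dot(syn0[0], syn1[1])
after_neg = dot(syn0[0], syn1[2])
print(before_pos, after_pos, before_neg, after_neg)
```

One step pulls the observed pair closer together and pushes the negative example away, which is why the inner loop is the part worth optimizing with BLAS and `nogil`.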

27. Wikipedia2Vec

    % wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
    % wikipedia2vec train jawiki-latest-pages-articles.xml.bz2 OUT_FILE


30. Wikipedia2Vec

31. Human-Computer Question Answering Match @ NIPS 2017

32. Example question: "With the assistance of his chief minister, the Duc de Sully, he lowered taxes on peasantry, promoted economic recovery, and instituted a tax on the Paulette. Victor at Ivry and Arquet, he was excluded from succession by the Treaty of Nemours, but won a great victory at Coutras."

    Answer: Henry IV of France

33. [Figure: model diagram. Labels: Words, Entities, Sum, Average, Dot, Softmax. Candidate answers: Franz Kafka, Tokyo, Calcium]

    Example question: "The protagonist of a novel by this author is evicted from the Bridge Inn and is talked into becoming a school janitor…"

34. ‣ p_w, q_e, a_e
    ‣ p_w, q_e, v_D
    ‣ v_D, a_e

    [Figure: the same model diagram and example question as on slide 33]
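The diagram labels (Words, Entities, Sum, Average, Dot, Softmax, Answers) suggest a model that pools the question's word and entity vectors into one vector, scores each candidate answer with a dot product, and normalizes the scores with softmax. The sketch below is one possible reading under that assumption: it takes v_D to be the sum of the averaged word vectors and the averaged entity vectors, which the slide does not state explicitly, and all vectors and function names are made up for illustration.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_distribution(word_vecs, entity_vecs, answer_vecs):
    """Return a probability for each candidate answer of one question."""
    # Assumption: v_D = average of word vectors + average of entity vectors.
    v_d = [w + e for w, e in zip(average(word_vecs), average(entity_vecs))]
    names = list(answer_vecs)
    scores = [sum(a * b for a, b in zip(v_d, answer_vecs[n])) for n in names]
    return dict(zip(names, softmax(scores)))


# Hypothetical 2-dimensional vectors, for illustration only.
toy_words = [[1.0, 0.0], [0.8, 0.2]]
toy_entities = [[0.9, 0.1]]
toy_answers = {
    "Franz Kafka": [1.0, 0.0],
    "Tokyo": [0.0, 1.0],
    "Calcium": [-1.0, 0.0],
}

probs = answer_distribution(toy_words, toy_entities, toy_answers)
print(max(probs, key=probs.get))
```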


36. [Figures: answer accuracy of the systems in the AI-versus-AI matches; a scene from the match against quiz experts]


Natural Language Processing with Wikipedia Using Python

Ikuya Yamada
