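Classifying News Articles with Machine Learning

What I did

Followed the scikit-learn tutorial "Working With Text Data"
Supervised learning: classification of text (news articles)
Used the 20 newsgroups dataset

Extracting features from text data

Load the training portion of the 20 newsgroups dataset

20 Newsgroups is a wonderfully convenient, pre-annotated dataset of roughly 20,000 news articles sorted into 20 categories.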

import numpy as np
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=1)

Checking the contents of the dataset

The news article category names look like this.

twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Next, look at the data itself. It holds the text of the news articles.

print(len(twenty_train.data))
print(type(twenty_train.data))
print(type(twenty_train.data[1]))
print("===============================")
print(twenty_train.data[1])

11314
<type 'list'>
<type 'unicode'>
===============================
From: timmbake@mcl.ucsb.edu (Bake Timmons)
Subject: Re: Amusing atheists and agnostics
Lines: 66


James Hogan writes:

timmbake@mcl.ucsb.edu (Bake Timmons) writes:
>>Jim Hogan quips:

>>... (summary of Jim's stuff)

>>Jim, I'm afraid _you've_ missed the point.

>>>Thus, I think you'll have to admit that  atheists have a lot
>>more up their sleeve than you might have suspected.

>>Nah.  I will encourage people to learn about atheism to see how little atheists
>>have up their sleeves.  Whatever I might have suspected is actually quite
>>meager.  If you want I'll send them your address to learn less about your
>>faith.
(rest omitted)

This one apparently belongs to class 0, alt.atheism.

print(twenty_train.target_names[twenty_train.target[1]])
print(twenty_train.target[1])

alt.atheism
0

Bag of Words

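Bag of words represents a document as a vector whose features are the number of times each word occurs in it. The number of distinct words equals the number of feature dimensions, so the vectors are high-dimensional, typically more than 100,000 dimensions. With 100,000 dimensions, 10,000 samples, and float32 values, a dense matrix would need 100,000 x 10,000 x 4 bytes = 4 GB of memory. Most feature values are 0, however, so the matrix is high-dimensional but sparse, and storing only the non-zero elements saves a great deal of memory.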
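The arithmetic above can be checked with a short sketch. The figure of 150 non-zero entries per document and the 4-byte index width are assumptions picked for illustration, not measurements from this dataset, and the helper names are made up for this note. The layout shown (values, column indices, row pointers) is the usual compressed sparse row (CSR) idea.

```python
def dense_bytes(n_samples, n_features, itemsize=4):
    """Memory for a dense matrix that stores every entry."""
    return n_samples * n_features * itemsize


def csr_bytes(n_samples, nnz, itemsize=4, index_size=4):
    """Rough memory for CSR storage: values, column indices, row pointers."""
    return nnz * itemsize + nnz * index_size + (n_samples + 1) * index_size


def to_csr(rows):
    """Convert dense rows into (data, indices, indptr), keeping non-zeros only."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # where the next row starts in data/indices
    return data, indices, indptr


n_samples, n_features = 10000, 100000
avg_nonzero_per_doc = 150  # hypothetical value, for illustration only

dense = dense_bytes(n_samples, n_features)
sparse = csr_bytes(n_samples, n_samples * avg_nonzero_per_doc)
print("dense : %.2f GB" % (dense / 1e9))   # dense : 4.00 GB
print("sparse: %.2f MB" % (sparse / 1e6))  # sparse: 12.04 MB

data, indices, indptr = to_csr([[4, 0, 0], [0, 0, 1], [0, 0, 0]])
print(data, indices, indptr)
```

Under these assumed numbers the sparse form is a few hundred times smaller, which is why the vectorizer returns a sparse matrix rather than an ndarray.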

Vectorizer

Use scikit-learn's vectorizer to convert documents into vectors. Notes on how to use it.

from sklearn.feature_extraction.text import CountVectorizer
test = ['aa aa aa aa', 'aa bb cc dd ee', 'aa', 'aa bb', 'cc dd ee']
CountVect = CountVectorizer(min_df=1)
X_count = CountVect.fit_transform(test)
print(X_count.todense())
print('')
# scikit-learn 1.0 and later provide get_feature_names_out() instead
print(CountVect.get_feature_names())
print('')
print(CountVect.vocabulary_.get(u'dd'))

[[4 0 0 0 0]
 [1 1 1 1 1]
 [1 0 0 0 0]
 [1 1 0 0 0]
 [0 0 1 1 1]]

[u'aa', u'bb', u'cc', u'dd', u'ee']

3
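To see what is going on inside, here is a rough pure-Python sketch of the counting step on the same toy corpus. It is not scikit-learn's implementation: the helper name `count_vectorize` is made up for this note, and the tokenizer (lowercased tokens of two or more word characters) reflects my understanding of CountVectorizer's default settings.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(doc.lower())


def count_vectorize(docs):
    """Return (count matrix as list of rows, vocabulary mapping term -> column)."""
    tokenized = [tokenize(doc) for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in terms])
    return matrix, vocabulary


test = ['aa aa aa aa', 'aa bb cc dd ee', 'aa', 'aa bb', 'cc dd ee']
matrix, vocabulary = count_vectorize(test)
for row in matrix:
    print(row)
print(sorted(vocabulary, key=vocabulary.get))
print(vocabulary['dd'])
```

The columns are simply the vocabulary in sorted order, which is why 'dd' lands at index 3.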

So, let's feed it the news article dataset.

CountVect = CountVectorizer()
X_train_counts = CountVect.fit_transform(twenty_train.data)

print(X_train_counts.shape)
print(type(X_train_counts))
X_train_counts[1,:10] # not displayed the way an ndarray would be

(11314, 130107)
<class 'scipy.sparse.csr.csr_matrix'>

<1x10 sparse matrix of type '<type 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>

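Converting features: from counts to frequencies

Using raw occurrence counts as features has a drawback: a long article has higher counts on average than a short one. To correct for this, each count is divided by the total number of words in the article. This feature is called Term Frequency (TF). In addition, words that appear in many documents carry little information. Put another way, words that appear only in some articles characterize those articles well. So words that occur across many articles are given smaller weights. The resulting feature is called Term Frequency times Inverse Document Frequency (TF-IDF).

First, a TF example. Here it is done in two steps: fit, then transform.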
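As a worked example of the two weightings, here is a minimal sketch on a tiny made-up count matrix. TF follows the description above (divide by the row total). The IDF formula is the smoothed form ln((1 + n) / (1 + df)) + 1, which I believe matches scikit-learn's default. Note that, as far as I know, TfidfTransformer normalizes each row to unit L2 length by default rather than dividing by the row sum, so its numbers will differ from these.

```python
import math

# Term counts for five tiny documents (illustrative data, not the newsgroups set)
terms = ['aa', 'bb', 'cc', 'dd', 'ee']
counts = [
    [4, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]


def term_frequency(row):
    """Divide each count by the total number of words in the document."""
    total = float(sum(row))
    return [c / total for c in row]


def inverse_document_frequency(counts):
    """Smoothed IDF per term (assumed formula): ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(counts)
    idf = []
    for j in range(len(counts[0])):
        df = sum(1 for row in counts if row[j] > 0)  # documents containing term j
        idf.append(math.log((1.0 + n_docs) / (1.0 + df)) + 1.0)
    return idf


tf = [term_frequency(row) for row in counts]
idf = inverse_document_frequency(counts)
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

for term, weight in zip(terms, idf):
    print("%s idf=%.3f" % (term, weight))
print(["%.3f" % v for v in tfidf[1]])
```

'aa' occurs in four of the five documents, so it gets the smallest IDF and ends up with less weight than the rarer terms in the second document, even though every term appears once there.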

from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(11314, 130107)

Next, a TF-IDF example. This time fit_transform() is used, which does fit and transform in a single call and gives the same result as calling them separately.

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

Training the classifier

Now for the actual training on the data. The classifier is naive Bayes.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
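For reference, the idea behind a multinomial naive Bayes classifier can be sketched in a few lines of plain Python. This is a simplified illustration with Laplace smoothing (alpha=1.0), not scikit-learn's code. The toy features and the labels 'religion' and 'graphics' are invented for the example.

```python
import math
from collections import defaultdict


def train_multinomial_nb(X, y, alpha=1.0):
    """X: list of count rows, y: class labels. Returns (log priors, log likelihoods)."""
    n_features = len(X[0])
    class_doc_counts = defaultdict(int)
    class_term_counts = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_doc_counts[label] += 1
        totals = class_term_counts[label]
        for j, value in enumerate(row):
            totals[j] += value
    log_prior, log_likelihood = {}, {}
    for label, totals in class_term_counts.items():
        log_prior[label] = math.log(class_doc_counts[label] / float(len(y)))
        denom = sum(totals) + alpha * n_features  # smoothing keeps every probability > 0
        log_likelihood[label] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(row, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            v * w for v, w in zip(row, log_likelihood[label]))
    return max(scores, key=scores.get)


# Hypothetical features: counts of ['god', 'love', 'opengl', 'gpu']
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y = ['religion', 'religion', 'graphics', 'graphics']
log_prior, log_likelihood = train_multinomial_nb(X, y)
print(predict([1, 1, 0, 0], log_prior, log_likelihood))
print(predict([0, 0, 1, 0], log_prior, log_likelihood))
```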

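Give it a set of new documents and classify them. No fit is needed here, only transform, because the vocabulary and IDF weights learned from the training data have to be reused as they are.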

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = CountVect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

Print the classification results

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => rec.autos
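Why transform alone is enough for new documents can be shown with a small sketch. The functions `fit_vocabulary` and `transform` are hypothetical stand-ins written for this note, and the two training sentences are invented. The point is only that the column layout is fixed at fit time and that words never seen during fit are silently dropped.

```python
import re


def tokenize(doc):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


def fit_vocabulary(docs):
    """Learn the term -> column mapping from the training documents."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Count terms using an already-learned vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            idx = vocabulary.get(tok)
            if idx is not None:  # words unseen during fit are dropped
                row[idx] += 1
        rows.append(row)
    return rows


train_docs = ['god is love', 'the gpu is fast']  # stand-in training data
vocabulary = fit_vocabulary(train_docs)
new_rows = transform(['OpenGL on the GPU is fast'], vocabulary)
print(sorted(vocabulary, key=vocabulary.get))
print(new_rows)
```

Calling fit again on the new documents would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.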

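Pipeline

Combine vectorization, normalization, and classification into a single pipeline. Training then takes one line with fit. It is convenient, but until you are used to it the internals are harder to see, so writing the steps out separately may be the better choice.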

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())
                     ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
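To make the internals less opaque, here is a toy sketch of what a pipeline does conceptually: fit_transform through every step but the last during training, and transform only during prediction. `TinyPipeline` and the three toy steps are invented for illustration and are not scikit-learn classes.

```python
class TinyPipeline(object):
    """Illustrative stand-in for a pipeline; not scikit-learn's class."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X)  # learn from and convert training data
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)  # reuse what was learned in fit
        return self.steps[-1][1].predict(X)


class WordCount(object):
    """Toy transformer: one feature, the number of words in the document."""
    def fit_transform(self, docs):
        return self.transform(docs)

    def transform(self, docs):
        return [[len(doc.split())] for doc in docs]


class Scale(object):
    """Toy transformer: divide by the largest value seen during fit."""
    def fit_transform(self, X):
        self.max_ = float(max(row[0] for row in X))
        return self.transform(X)

    def transform(self, X):
        return [[row[0] / self.max_] for row in X]


class Threshold(object):
    """Toy classifier: cut halfway between the two class means."""
    def fit(self, X, y):
        mean0 = sum(r[0] for r, t in zip(X, y) if t == 0) / float(y.count(0))
        mean1 = sum(r[0] for r, t in zip(X, y) if t == 1) / float(y.count(1))
        self.cut_ = (mean0 + mean1) / 2.0
        return self

    def predict(self, X):
        return [1 if row[0] > self.cut_ else 0 for row in X]


docs = ['short', 'a b', 'one two three four five six',
        'a much longer document with many words']
labels = [0, 0, 1, 1]
pipe = TinyPipeline([('count', WordCount()), ('scale', Scale()), ('clf', Threshold())])
pipe.fit(docs, labels)
result = pipe.predict(['tiny text', 'this one has quite a few more words'])
print(result)
```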

Evaluating the model

Evaluate the model's accuracy using the test set that ships with 20newsgroups.

from sklearn import metrics
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=1)
predicted = text_clf.predict(twenty_test.data)
print(np.mean(predicted == twenty_test.target))
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

0.77389803505
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82       396
               sci.space       0.84      0.89      0.87       394
  soc.religion.christian       0.44      0.98      0.61       398
      talk.politics.guns       0.64      0.94      0.76       364
   talk.politics.mideast       0.93      0.91      0.92       376
      talk.politics.misc       0.96      0.42      0.58       310
      talk.religion.misc       0.97      0.14      0.24       251

             avg / total       0.82      0.77      0.77      7532
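The columns of the report can be reproduced by hand. The sketch below computes accuracy, precision, recall, and F1 on a small made-up label list, then plugs two precision/recall pairs from the table above into the F1 formula as a sanity check. The function names are my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / float(len(y_true))


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred, label):
    """One-vs-rest scores for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 0, 1, 1, 1]
acc = accuracy(y_true, y_pred)
scores = precision_recall_f1(y_true, y_pred, label=1)
print(acc, scores)

# Precision/recall pairs taken from the report above
print(round(f1_from(0.44, 0.98), 2))  # soc.religion.christian
print(round(f1_from(0.97, 0.14), 2))  # talk.religion.misc
```

The talk.religion.misc row is a good reminder that high precision alone says little: with recall at 0.14, the F1 score drops to 0.24.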

The accuracy is about 77%. Next, swap the classifier for a linear SVM (SGDClassifier with hinge loss) and repeat the same steps.

from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # ('clf', MultinomialNB())
                     # newer scikit-learn releases use max_iter in place of n_iter
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=1))
                     ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(twenty_test.data)
print(np.mean(predicted == twenty_test.target))
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

0.823552841211
                          precision    recall  f1-score   support

             alt.atheism       0.73      0.71      0.72       319
           comp.graphics       0.81      0.69      0.75       389
 comp.os.ms-windows.misc       0.72      0.78      0.75       394
comp.sys.ibm.pc.hardware       0.74      0.68      0.71       392
   comp.sys.mac.hardware       0.82      0.82      0.82       385
          comp.windows.x       0.84      0.76      0.80       395
            misc.forsale       0.84      0.89      0.87       390
               rec.autos       0.91      0.89      0.90       396
         rec.motorcycles       0.92      0.96      0.94       398
      rec.sport.baseball       0.88      0.91      0.89       397
        rec.sport.hockey       0.89      0.99      0.93       399
               sci.crypt       0.84      0.96      0.90       396
         sci.electronics       0.81      0.63      0.71       393
                 sci.med       0.89      0.85      0.87       396
               sci.space       0.83      0.96      0.89       394
  soc.religion.christian       0.74      0.94      0.83       398
      talk.politics.guns       0.69      0.93      0.79       364
   talk.politics.mideast       0.91      0.93      0.92       376
      talk.politics.misc       0.88      0.54      0.67       310
      talk.religion.misc       0.85      0.39      0.53       251

             avg / total       0.83      0.82      0.82      7532

The SVM takes longer to compute, but its accuracy is higher (about 82% versus 77%).

Parameter tuning with grid search
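For intuition about what hinge loss with an L2 penalty means, here is a heavily simplified binary sketch of stochastic gradient descent on that objective. It uses a fixed learning rate and no shuffling, which are my simplifications. scikit-learn's SGDClassifier uses its own learning-rate schedule and handles multiple classes, so treat this as an illustration only. The toy data is made up.

```python
def train_linear_svm_sgd(X, y, alpha=1e-3, learning_rate=0.1, n_epochs=20):
    """Labels in y must be +1 or -1. Returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [wi * (1.0 - learning_rate * alpha) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from this sample.
                w = [wi + learning_rate * label * xi for wi, xi in zip(w, x)]
                b += learning_rate * label
    return w, b


def predict(X, w, b):
    """Sign of the linear score decides the class."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1 for x in X]


# Linearly separable toy data
X = [(2, 1), (1, 2), (2, 2), (-1, -2), (-2, -1), (-2, -2)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm_sgd(X, y)
print(predict(X, w, b))
print(predict([(3, 3), (-3, -3)], w, b))
```

Samples whose margin is already at least 1 contribute no gradient from the loss term, so only the hard or borderline samples move the decision boundary.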

from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1) # n_jobs=-1 uses all CPU cores

gs_clf = gs_clf.fit(twenty_train.data[:5000], twenty_train.target[:5000])

gs_clf.grid_scores_

[mean: 0.85100, std: 0.01120, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True, 'clf__alpha': 0.01},
 mean: 0.85600, std: 0.00817, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__alpha': 0.01},
 mean: 0.63180, std: 0.02141, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False, 'clf__alpha': 0.01},
 mean: 0.63180, std: 0.02189, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False, 'clf__alpha': 0.01},
 mean: 0.85900, std: 0.01004, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True, 'clf__alpha': 0.001},
 mean: 0.85800, std: 0.00623, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__alpha': 0.001},
 mean: 0.77340, std: 0.01131, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False, 'clf__alpha': 0.001},
 mean: 0.79520, std: 0.01469, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False, 'clf__alpha': 0.001}]

best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
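The grid itself is just the Cartesian product of the candidate values: 2 x 2 x 2 = 8 combinations, one per row of the grid_scores_ output. A small sketch with itertools reproduces the enumeration and the choice of the best setting. The mean scores are copied from the output above, and the cross-validation itself is not re-run here.

```python
from itertools import product

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

# Every combination of the candidate values
names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

# Mean scores copied from the grid_scores_ output,
# keyed by (ngram_range, use_idf, alpha)
mean_scores = {
    ((1, 1), True, 0.01): 0.85100,
    ((1, 2), True, 0.01): 0.85600,
    ((1, 1), False, 0.01): 0.63180,
    ((1, 2), False, 0.01): 0.63180,
    ((1, 1), True, 0.001): 0.85900,
    ((1, 2), True, 0.001): 0.85800,
    ((1, 1), False, 0.001): 0.77340,
    ((1, 2), False, 0.001): 0.79520,
}


def score_of(params):
    """Look up the recorded mean score for one parameter combination."""
    key = (params['vect__ngram_range'], params['tfidf__use_idf'], params['clf__alpha'])
    return mean_scores[key]


best = max(grid, key=score_of)
print(len(grid))
for name in names:
    print("%s: %r" % (name, best[name]))
```

The biggest effect in this grid comes from use_idf: turning it off costs far more than any change to alpha or the n-gram range.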