Machine Learning, Part 12: Ensemble Learning

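This installment introduces ensemble learning (ensemble methods). The form of ensemble learning covered here takes the predictions of several classifiers, each built on a different machine learning algorithm, puts them to a vote, and uses the winning label as the final prediction. Voting comes in two flavors: Majority Vote and Plurality Vote. Majority Vote is the voting scheme for binary classifiers, where there are only two classes. Plurality Vote is the voting scheme for multinomial (multiclass) classifiers, where there are three or more classes.

This article covers Majority Vote for binary classifiers. We prepare an odd number of binary classifiers that use different algorithms and decide the prediction by vote. For example, with three binary classifiers that each predict '0' or '1', the three predictions are either unanimous or split 2 to 1, so the vote always settles on '0' or '1'. Having each binary classifier compute its own prediction and then taking a vote to obtain the final prediction is what Majority Vote means. Let's walk through a concrete example.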
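Before turning to the data, the voting rule described above can be sketched in a few lines of plain Python. This is only an illustration: the function name majority_vote and the example predictions are made up here and are not part of the article's code.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for one sample.

    With an odd number of binary classifiers a tie cannot occur.
    """
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Hypothetical predictions from three binary classifiers for one sample
split_vote = majority_vote([0, 1, 1])      # two votes for 1, one for 0
unanimous_vote = majority_vote([0, 0, 0])  # all three agree on 0
print(split_vote, unanimous_vote)
```

The MajorityVoteClassifier defined in step ① applies the same idea to whole arrays of predictions, and additionally supports weights.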

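The data comes from the publicly available UCI Machine Learning Repository: the Breast Cancer Wisconsin (Diagnostic) dataset. It contains 569 samples. Column 1 is the sample ID, column 2 is the class label, 'M' (malignant) or 'B' (benign), and columns 3 through 32 hold the 30 features.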

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

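There are 30 features, but so that the results can be plotted in two dimensions, we use principal component analysis (PCA) to extract two principal components, and then build a machine learning model in Python that predicts whether a tumor is malignant or benign. The overall flow of the PCA and the machine learning is roughly as follows. Python 3.5 or later is assumed.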

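① Define the Majority Vote classifier.
② Define the plotting function.
③ Load the data.
④ Split the input data into training data and test data.
⑤ Compute the mean and standard deviation from the training data.
⑥ Standardize the training data and the test data using that mean and standard deviation.
⑦ Choose three suitable binary classifiers.
⑧ Use PCA to extract the top two principal components from the input data.
⑨ Train each of the three binary classifiers on the PCA-transformed training data, and hold a vote with Majority Vote.
⑩ Classify the test data, then compare the accuracy of each individual model with the accuracy of Majority Vote.
⑪ Plot the voting result.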

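Let's look at each step in detail.

① Define the Majority Vote classifier.

First, rather than working interactively, prepare the Python code as a script file. I recommend putting the following two lines at the top of the script.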

#!/usr/bin/env python
# -*- coding:utf-8 -*-

Here we define a class named MajorityVoteClassifier that takes a list of binary classifiers as an argument and holds a vote among them. For the details, do read Chapter 7 of the following book.

"Python Machine Learning: Unlock Deeper Insights into Machine Learning With This Vital Guide to Cutting-edge Predictive Analytics"

Author: Sebastian Raschka

Publisher: Packt Publishing (September 23, 2015)

Language: English

ISBN-10: 1783555130

ISBN-13: 978-1783555130

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
# _name_estimators is a private scikit-learn helper that builds
# (name, estimator) pairs; it may change between releases.
from sklearn.pipeline import _name_estimators
import numpy as np


class MajorityVoteClassifier(BaseEstimator,
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        # Compare against None so that NumPy arrays are also accepted
        if (self.weights is not None
                and len(self.weights) != len(self.classifiers)):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        # Each classifier is cloned, so the objects passed in stay unfitted
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of samples to classify.

        Returns
        -------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            # For each sample (row), sum the weights behind each label
            # and pick the label with the largest total
            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        -------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            # dict.items() replaces six.iteritems; sklearn.externals.six
            # no longer ships with recent scikit-learn releases
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep=True).items():
                    out['%s__%s' % (name, key)] = value
            return out
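The two voting modes in predict and predict_proba can be tried in isolation. The following is a minimal sketch with NumPy only; the prediction arrays, the probabilities, and the weights are hypothetical values chosen for illustration, and the helper name hard_vote is not part of the class above.

```python
import numpy as np

# Hypothetical encoded predictions: rows = samples, columns = classifiers
predictions = np.array([[0, 0, 1],
                        [1, 0, 1],
                        [0, 1, 1]])


def hard_vote(predictions, weights=None):
    """Per row, sum the weight behind each label and take the argmax."""
    return np.apply_along_axis(
        lambda row: np.argmax(np.bincount(row, weights=weights)),
        axis=1, arr=predictions)


unweighted = hard_vote(predictions)
# Giving the third classifier more weight lets it override the other two
weighted = hard_vote(predictions, weights=[0.2, 0.2, 0.6])

# Soft voting: average class probabilities over the classifiers.
# Shape is (n_classifiers, n_samples, n_classes); one sample here.
probas = np.array([[[0.9, 0.1]],
                   [[0.6, 0.4]],
                   [[0.2, 0.8]]])
soft_uniform = np.argmax(np.average(probas, axis=0), axis=1)
soft_weighted = np.argmax(
    np.average(probas, axis=0, weights=[0.2, 0.2, 0.6]), axis=1)

print(unweighted, weighted, soft_uniform, soft_weighted)
```

With uniform weights the first sample is labeled 0 because two of three classifiers say so; with the weights above, the third classifier's vote of 1 carries 0.6 against 0.4 and wins.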

② Define the plotting function.

Here we define a plotting function named plot_decision_regions, as shown below. It is exactly the same function that was used in Parts 6 through 10.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                         np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)
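The core of plot_decision_regions is the step that evaluates the classifier on a regular grid and reshapes the result for contourf. A small sketch of just that step follows; ThresholdClassifier is a hypothetical stand-in with a predict method, and the coarse grid is chosen only to keep the arrays small.

```python
import numpy as np


class ThresholdClassifier:
    """Hypothetical stand-in: predicts 1 where x1 + x2 > 0, else 0."""

    def predict(self, X):
        return (X[:, 0] + X[:, 1] > 0).astype(int)


# A coarse 4 x 4 grid; the article's function uses resolution=0.02
xx1, xx2 = np.meshgrid(np.arange(-1, 1, 0.5),
                       np.arange(-1, 1, 0.5))

# Flatten both coordinate grids into one (n_points, 2) sample matrix
grid_points = np.array([xx1.ravel(), xx2.ravel()]).T

# Predict every grid point, then restore the grid shape for contourf
Z = ThresholdClassifier().predict(grid_points).reshape(xx1.shape)
print(grid_points.shape, Z.shape)
```

Each entry of Z is the predicted class at one grid point, which is what contourf colors to draw the decision regions.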

③ Load the data.

Using the pandas library, we read the Breast Cancer Wisconsin dataset introduced at the beginning directly from the URL where it is published. The features go into the array X (569 x 30) and the labels (malignant or benign) into the array y (569 elements), one entry per sample. LabelEncoder from sklearn.preprocessing converts the label 'M' (malignant) to the number 1 and 'B' (benign) to the number 0.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

X = df.loc[:, 2:].values   # columns 3 to 32: the 30 features
y = df.loc[:, 1].values    # column 2: 'M' or 'B'
le = LabelEncoder()
y = le.fit_transform(y)    # 'B' -> 0, 'M' -> 1
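The reason 'B' becomes 0 and 'M' becomes 1 is that the encoder numbers the distinct labels in sorted order. A plain-Python sketch of that mapping, using a short made-up label list rather than the real data:

```python
# Hypothetical labels standing in for column 2 of the dataset
labels = ['M', 'B', 'B', 'M', 'B']

# Distinct labels in sorted order: 'B' comes before 'M'
classes = sorted(set(labels))

# Map each label to its position in the sorted list
mapping = {label: index for index, label in enumerate(classes)}
encoded = [mapping[label] for label in labels]

print(classes, encoded)
```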

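④ Split the input data into training data and test data.

We use the train_test_split function from sklearn.model_selection to split the feature array X and the label array y into training data and test data. X is split into X_train and X_test, and y is split into y_train and y_test. The test_size parameter sets the proportion of test data; specifying 0.2 here makes the test data 20% of the whole. Of the 569 samples, 20% (= 114 samples) become test data and the remaining 455 become training data. The split itself is random; specifying random_state=1 fixes the random seed so that the same split is reproduced on every run.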

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)
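The sample counts quoted above can be checked with simple arithmetic. The sketch below assumes that the test count is obtained by rounding test_size * n_samples up, which matches the 114 / 455 split stated in the text.

```python
import math

n_samples = 569
test_size = 0.2

# 0.2 * 569 = 113.8, rounded up to a whole number of samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```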

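⑤ Compute the mean and standard deviation from the training data.

We use the StandardScaler class from sklearn.preprocessing to standardize the feature arrays X_train and X_test. The mean and standard deviation used for standardization must be computed from the training data only, so that no information from the test data leaks into the model. This is done with the fit method as follows.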

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

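⑥ Standardize the training data and the test data using that mean and standard deviation.

Next, we standardize the training and test feature arrays with the transform method. The standardized arrays are stored in X_train_std and X_test_std, respectively.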

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
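What steps ⑤ and ⑥ do to a single feature column can be written out by hand: subtract the training mean and divide by the training standard deviation, for both the training and the test values. The sketch below uses made-up numbers and hypothetical helper names, and uses the population standard deviation, which is what StandardScaler computes.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Return the mean and population standard deviation of a column."""
    return mean(column), pstdev(column)


def transform(column, col_mean, col_std):
    """Standardize values with statistics computed elsewhere."""
    return [(x - col_mean) / col_std for x in column]


# Hypothetical values of one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics come from the training data only
train_mean, train_std_dev = fit_scaler(train)
train_std = transform(train, train_mean, train_std_dev)
test_std = transform(test, train_mean, train_std_dev)

print(train_std, test_std)
```

The standardized training column has mean 0 and standard deviation 1; the test column generally does not, because it is scaled with the training statistics.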

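⑦ Choose three suitable binary classifiers.

scikit-learn supports a wide range of classifiers that can be used for binary classification, including Perceptron, logistic regression, support vector machines (SVM), decision trees, random forests, and k-nearest neighbors (KNN). Adaptive Linear Neuron (Adaline) is another binary classifier, although scikit-learn does not provide a dedicated class for it.

Here we choose three binary classifiers: k-nearest neighbors (KNN), a support vector machine (SVM), and a decision tree. They are created with the KNeighborsClassifier class from sklearn.neighbors, the SVC class from sklearn.svm, and the DecisionTreeClassifier class from sklearn.tree, as shown below. The list of these three classifiers is then passed to the MajorityVoteClassifier class defined in ①.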

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier(n_neighbors=5,
                            metric='euclidean')
from sklearn.svm import SVC
svm = SVC(kernel='linear',C=1.5, random_state=0)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=5,
                              criterion='entropy',
                              random_state=0)
mv = MajorityVoteClassifier(
                classifiers=[kn,svm,dt])
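For comparison, scikit-learn also ships its own voting ensemble, sklearn.ensemble.VotingClassifier, which is a different class from the MajorityVoteClassifier used in this article. The sketch below shows how the same three classifiers could be combined with it. The data is synthetic (generated with make_classification) purely so that the example runs without a download, so any accuracy it produces says nothing about the breast cancer data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature stand-in data (an assumption, not the WDBC data)
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# voting='hard' counts predicted labels, like vote='classlabel' above
voter = VotingClassifier(
    estimators=[('kn', KNeighborsClassifier(n_neighbors=5)),
                ('svm', SVC(kernel='linear', C=1.5, random_state=0)),
                ('dt', DecisionTreeClassifier(max_depth=5,
                                              criterion='entropy',
                                              random_state=0))],
    voting='hard')
voter.fit(Xd_train, yd_train)
yd_pred = voter.predict(Xd_test)
print(yd_pred[:10])
```

The rest of this article continues with the hand-written MajorityVoteClassifier, which makes the voting logic visible.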

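⑧ Use PCA to extract the top two principal components from the input data.

We use the PCA class from sklearn.decomposition to run principal component analysis on the training data, and extract two principal components for both the training data and the test data. Because a decision tree does not require standardized data, the components for the standalone decision tree are extracted from the unstandardized X_train and X_test and stored in X_train_pca2 and X_test_pca2. A separate PCA object (pca2) is used for this so that the fit on the standardized data is not overwritten.

from sklearn.decomposition import PCA

# PCA is unsupervised, so the labels are not needed for fitting
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Separate PCA fitted on the unstandardized data (standalone decision tree)
pca2 = PCA(n_components=2)
X_train_pca2 = pca2.fit_transform(X_train)
X_test_pca2 = pca2.transform(X_test)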
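What the PCA step computes can be sketched with NumPy alone: center the training data, take its singular value decomposition, and project onto the first two right singular vectors. This is a sketch under stated assumptions, not scikit-learn's implementation; the random arrays stand in for the standardized data, and the signs of the resulting components may differ from what the PCA class returns.

```python
import numpy as np

rng = np.random.RandomState(0)
A_train = rng.normal(size=(50, 5))   # hypothetical training features
A_test = rng.normal(size=(10, 5))    # hypothetical test features

# Center with the training mean only, as with standardization
train_mean = A_train.mean(axis=0)

# Rows of Vt are the principal axes, ordered by decreasing variance
_, S, Vt = np.linalg.svd(A_train - train_mean, full_matrices=False)
components = Vt[:2]

# Project both sets onto the two axes learned from the training data
A_train_pca = (A_train - train_mean) @ components.T
A_test_pca = (A_test - train_mean) @ components.T

# Share of the total variance captured by each kept component
explained_ratio = (S[:2] ** 2) / np.sum(S ** 2)
print(A_train_pca.shape, A_test_pca.shape, explained_ratio)
```

Checking the explained variance ratio is a quick way to see how much information is lost by keeping only two components.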

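⑨ Train each of the three binary classifiers on the PCA-transformed training data, and hold a vote with Majority Vote.

We apply the fit method to the training data to train each of the three binary classifiers, and then fit the Majority Vote classifier. Note that MajorityVoteClassifier.fit trains its own clones of the three classifiers, all on X_train_pca, so the decision tree inside the ensemble is trained on the standardized components, while the standalone decision tree is trained on X_train_pca2.

kn.fit(X_train_pca, y_train)
svm.fit(X_train_pca, y_train)
dt.fit(X_train_pca2, y_train)
mv.fit(X_train_pca, y_train)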

⑩ Classify the test data, then compare the accuracy of each individual model with the accuracy of Majority Vote.

We classify the test data and use the accuracy_score function from sklearn.metrics to evaluate the accuracy of the three individual models and of Majority Vote.

from sklearn.metrics import accuracy_score

y_pred = kn.predict(X_test_pca)
print('KNN Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = svm.predict(X_test_pca)
print('SVM Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = dt.predict(X_test_pca2)
print('DecisionTree Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = mv.predict(X_test_pca)
print('MajorityVoting Accuracy: %.3f' % accuracy_score(y_test,y_pred))

The output looks like this.

KNN Accuracy: 0.956
SVM Accuracy: 0.956
DecisionTree Accuracy: 0.939
MajorityVoting Accuracy: 0.965

In this run, Majority Vote achieves a higher accuracy than KNN, SVM, or the decision tree does on its own.
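Accuracy is simply the fraction of test samples whose predicted label matches the true label. With 114 test samples, the figures above correspond to 5, 5, 7, and 4 misclassified samples, which the following plain-Python sketch reproduces. The function accuracy is a hand-written stand-in for illustration, not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Tiny made-up example: three of four predictions are correct
toy_accuracy = accuracy([0, 1, 1, 0], [0, 1, 0, 0])

# Accuracy for a given number of errors among the 114 test samples
n_test = 114
formatted = {errors: '%.3f' % ((n_test - errors) / n_test)
             for errors in (4, 5, 7)}
print(toy_accuracy, formatted)
```

Seen this way, the ensemble's advantage over KNN and SVM on this split amounts to one additional correctly classified sample.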

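⑪ Plot the voting result.

Using the plot_decision_regions function defined in ②, we plot the Majority Vote result for the training data and for the test data on a two-dimensional plane, with the first principal component (PC1) on the horizontal axis and the second principal component (PC2) on the vertical axis.

First, the plot for the training data.

plot_decision_regions(X_train_pca, y_train, classifier=mv)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()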

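The x markers and the blue region represent "malignant: 1", and the square markers and the red region represent "benign: 0".

A number of samples fall on the wrong side, but for the most part the training data are separated by an intricate, winding decision boundary.

Next, the plot for the test data.

plot_decision_regions(X_test_pca, y_test, classifier=mv)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()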

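For the test data as well, four samples fall on the wrong side, but for the most part the data are separated by the same intricate boundary.

The complete code is shown below. It is intended to run on Python 3.5 or later.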

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
# _name_estimators is a private scikit-learn helper that builds
# (name, estimator) pairs; it may change between releases.
from sklearn.pipeline import _name_estimators
import numpy as np


class MajorityVoteClassifier(BaseEstimator,
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'} (default='classlabel')
      If 'classlabel' the prediction is based on the argmax of
        class labels. Else if 'probability', the argmax of
        the sum of probabilities is used to predict the class label
        (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers], optional (default=None)
      If a list of `int` or `float` values are provided, the classifiers
      are weighted by importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers, vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for key, value
                                  in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        if self.vote not in ('probability', 'classlabel'):
            raise ValueError("vote must be 'probability' or 'classlabel'"
                             "; got (vote=%r)"
                             % self.vote)

        # Compare against None so that NumPy arrays are also accepted
        if (self.weights is not None
                and len(self.weights) != len(self.classifiers)):
            raise ValueError('Number of classifiers and weights must be equal'
                             '; got %d weights, %d classifiers'
                             % (len(self.weights), len(self.classifiers)))

        # Use LabelEncoder to ensure class labels start with 0, which
        # is important for np.argmax call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        # Each classifier is cloned, so the objects passed in stay unfitted
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self

    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Matrix of samples to classify.

        Returns
        -------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis=1)
        else:  # 'classlabel' vote

            #  Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in self.classifiers_]).T

            # For each sample (row), sum the weights behind each label
            # and pick the label with the largest total
            maj_vote = np.apply_along_axis(
                                      lambda x:
                                      np.argmax(np.bincount(x,
                                                weights=self.weights)),
                                      axis=1,
                                      arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Input vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        -------
        avg_proba : array-like, shape = [n_samples, n_classes]
            Weighted average probability for each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas, axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            # dict.items() replaces six.iteritems; sklearn.externals.six
            # no longer ships with recent scikit-learn releases
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep=True).items():
                    out['%s__%s' % (name, key)] = value
            return out


import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                         np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)
                    

from sklearn.model_selection import train_test_split

import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

from sklearn.preprocessing import LabelEncoder
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier(n_neighbors=5,
                            metric='euclidean')
from sklearn.svm import SVC
svm = SVC(kernel='linear',C=1.5, random_state=0)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=5,
                              criterion='entropy',
                              random_state=0)
mv = MajorityVoteClassifier(
                classifiers=[kn,svm,dt])

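from sklearn.decomposition import PCA
# PCA is unsupervised, so the labels are not needed for fitting
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
# Separate PCA fitted on the unstandardized data (standalone decision tree)
pca2 = PCA(n_components=2)
X_train_pca2 = pca2.fit_transform(X_train)
X_test_pca2 = pca2.transform(X_test)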


kn.fit(X_train_pca, y_train)
svm.fit(X_train_pca, y_train)
dt.fit(X_train_pca2, y_train)
mv.fit(X_train_pca, y_train)

from sklearn.metrics import accuracy_score
y_pred = kn.predict(X_test_pca)
print('KNN Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = svm.predict(X_test_pca)
print('SVM Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = dt.predict(X_test_pca2)
print('DecisionTree Accuracy: %.3f' % accuracy_score(y_test,y_pred))
y_pred = mv.predict(X_test_pca)
print('MajorityVoting Accuracy: %.3f' % accuracy_score(y_test,y_pred))

plot_decision_regions(X_train_pca, y_train, classifier=mv)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

plot_decision_regions(X_test_pca, y_test, classifier=mv)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
