Model Evaluation and Validation






from sklearn.model_selection import train_test_split
from numpy import random
X = random.random(size=(12,4))
y = random.random(size=(12,1))
X_train, X_test, y_train,  y_test = train_test_split(X,y,test_size=0.25)
print ('X_train:\n')
print (X_train)
print ('\ny_train:\n')
print (y_train)
print ('\nX_test:\n')
print (X_test)
print ('\ny_test:\n')
print (y_test)

X_train:

[[ 0.4203678   0.33033482  0.20464863  0.61927097]
 [ 0.22030621  0.34982629  0.46778748  0.20174323]
 [ 0.12715997  0.59674531  0.226012    0.10694568]
 [ 0.4359949   0.02592623  0.54966248  0.43532239]
 [ 0.79363745  0.58000418  0.1622986   0.70075235]
 [ 0.13457995  0.51357812  0.18443987  0.78533515]
 [ 0.64040673  0.48306984  0.50523672  0.38689265]
 [ 0.50524609  0.0652865   0.42812233  0.09653092]
 [ 0.85397529  0.49423684  0.84656149  0.07964548]]

y_train:


[[ 0.95374223]
 [ 0.02720237]
 [ 0.40627504]
 [ 0.53560417]
 [ 0.06714437]
 [ 0.08209492]
 [ 0.24717724]
 [ 0.8508505 ]
 [ 0.3663424 ]]

X_test:


[[ 0.29965467  0.26682728  0.62113383  0.52914209]
 [ 0.96455108  0.50000836  0.88952006  0.34161365]
 [ 0.56714413  0.42754596  0.43674726  0.77655918]]

y_test:


[[ 0.54420816]
 [ 0.99385201]
 [ 0.97058031]]

sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.

*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the class labels.
Returns: splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.
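
The `random_state` and `stratify` parameters described above can be sketched as follows. This is a minimal illustration on made-up data (16 samples, balanced binary labels), not the data used in the listing at the top: fixing `random_state` makes the split repeatable, and `stratify=y` keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 16 samples, 4 features, 8 samples per class.
X = np.arange(64).reshape(16, 4)
y = np.array([0, 1] * 8)

# random_state fixes the shuffle; stratify=y preserves the 50/50 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)                 # sizes of the two subsets
print(np.bincount(y_train), np.bincount(y_test))   # class counts in each subset
```

Without `stratify`, a small test set can end up with very few (or no) samples of one class, which makes the metrics in the following sections unreliable.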

Confusion Matrix (Error Matrix)


A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two class classifier.

                     Predicted Negative    Predicted Positive
  Actual Negative            a                     b
  Actual Positive            c                     d

The entries in the confusion matrix have the following meaning in the context of our study:

a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
Several standard terms have been defined for the 2 class matrix:

  • The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

$$AC=\frac{a+d}{a+b+c+d} \qquad (1)$$

  • The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

$$TP=\frac{d}{c+d}$$

  • Precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

$$P=\frac{d}{b+d}$$

The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998).
In other words, accuracy is unsuitable when negative cases far outnumber positive cases, because accuracy can stay very high even when the number of true positives is 0.

Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), and F-Measure (Lewis and Gale, 1994).

$$\text{g-mean}=\sqrt{TP\cdot P}$$

$$F=\frac{(\beta^2+1)\cdot P\cdot TP}{\beta^2\cdot P+TP}$$

The F1 score is the special case of the F-measure with $$\beta = 1$$.
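
The 1000-case example above can be worked through numerically. The sketch below uses the a/b/c/d notation of the confusion matrix and plain Python; treating the undefined precision 0/0 as 0 is an assumption of this sketch, not something the text specifies.

```python
import math

# Counts for the example: 995 negatives and 5 positives, all classified as negative.
a, b = 995, 0   # actual negatives: predicted negative / predicted positive
c, d = 5, 0     # actual positives: predicted negative / predicted positive

accuracy = (a + d) / (a + b + c + d)
recall = d / (c + d)   # "TP" (true positive rate) in the text's notation

# Nothing is predicted positive, so precision is 0/0.
# Assumption of this sketch: report it as 0.0.
precision = d / (b + d) if (b + d) else 0.0

g_mean = math.sqrt(recall * precision)
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, recall, g_mean, f1)
```

Accuracy comes out at 0.995 while recall, g-mean and F1 are all 0, which is the point of the example: measures that multiply by the true positive rate collapse when every positive case is missed.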


Compute confusion matrix to evaluate the accuracy of a classification
By definition a confusion matrix C is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.
Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.

y_true : array, shape = [n_samples]
Ground truth (correct) target values.
y_pred : array, shape = [n_samples]
Estimated targets as returned by a classifier.
labels : array, shape = [n_classes], optional
List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.
sample_weight : array-like of shape = [n_samples], optional
Sample weights.
Returns: C : array, shape = [n_classes, n_classes]
Confusion matrix


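from sklearn.metrics import confusion_matrix
y_true = [1, 0, 0, 1, 0, 1]
# y_pred is missing from the original listing; the values below are an
# assumed example chosen to be consistent with the output shown.
y_pred = [1, 0, 1, 0, 0, 1]
confusion_matrix(y_true, y_pred)
array([[2, 1],
       [1, 2]])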
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
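
For the binary case, the layout described above ($C_{0,0}$ true negatives, $C_{0,1}$ false positives, $C_{1,0}$ false negatives, $C_{1,1}$ true positives) means the four counts can be unpacked with `ravel()`. A minimal sketch on made-up labels, computing accuracy, precision and recall from those counts:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up binary labels for illustration.
y_true = [0, 1, 0, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]

# Row-major flattening of [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
print(accuracy_score(y_true, y_pred))  # should agree with the manual accuracy
```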


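1. ROC curve

ROC curve: receiver operating characteristic. Each point on the ROC curve reflects the sensitivity to the same signal stimulus.

  • Horizontal axis: false positive rate (FPR), i.e. 1 - specificity: the proportion of all negative instances that are classified as positive.

  • Vertical axis: true positive rate (TPR), i.e. sensitivity (coverage of the positive class).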


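  • TP: the number of correct positive predictions.
    An instance that is positive and is predicted positive is a true positive (TP).

  • FN: misses, i.e. positives that were not found.
    An instance that is positive but is predicted negative is a false negative (FN).

  • FP: false alarms, i.e. reported matches that are incorrect.
    An instance that is negative but is predicted positive is a false positive (FP).

  • TN: the number of non-matches that were correctly rejected.
    An instance that is negative and is predicted negative is a true negative (TN).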

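(1) True positive rate (TPR): TP/(TP+FN), the proportion of all actual positive instances that the classifier predicts as positive. Also called sensitivity.

(2) False positive rate (FPR): FP/(FP+TN), the proportion of all actual negative instances that the classifier predicts as positive. Equal to 1 - specificity.

(3) True negative rate (TNR): TN/(FP+TN), the proportion of all actual negative instances that the classifier predicts as negative; TNR = 1 - FPR. Also called specificity.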
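
The three rates can be computed directly from the four counts. The helper below is a hypothetical illustration (the name `rates` and the counts are made up), and it also checks the identity TNR = 1 - FPR.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, TNR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    tnr = tn / (fp + tn)   # specificity
    return tpr, fpr, tnr

# Made-up counts: 50 actual positives and 50 actual negatives.
tpr, fpr, tnr = rates(tp=40, fn=10, fp=5, tn=45)
print(tpr, fpr, tnr)
```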



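The key phrase is "as its discrimination threshold is varied". How should "discrimination threshold" be understood here? We have overlooked an important capability of classifiers: probability output, which expresses how likely the classifier considers a sample to be positive (or negative). By looking more closely at how a given classifier works internally, we can generally find a way to obtain such a probability output. Typically this means mapping a real-valued range onto the interval (0,1) through some transformation.

Suppose we already have the probability output (the probability of being positive) for every sample. The question now is how to vary the "discrimination threshold". We sort the test samples in descending order of their probability of being positive. The figure below shows an example with 20 test samples: the "Class" column gives each sample's true label (p for positive, n for negative), and "Score" gives each sample's probability of being positive.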

[Figure: table of the 20 test samples with their Class (p or n) and Score columns]
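
The threshold sweep described above can be sketched in plain Python. The six labels and scores below are made up for illustration (they are not the 20 samples of the original figure), and `roc_points` is a hypothetical helper: each distinct score is used in turn as the threshold, every sample with a score at or above it is called positive, and the resulting (FPR, TPR) pair is one point of the ROC curve.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by using each score as threshold."""
    pos = sum(1 for label in labels if label == 'p')
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= thr and l == 'p')
        fp = sum(1 for l, s in zip(labels, scores) if s >= thr and l == 'n')
        points.append((fp / neg, tp / pos))
    return points

# Made-up example: true class and score (probability of being positive).
labels = ['p', 'p', 'n', 'p', 'n', 'n']
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

Lowering the threshold can only add predicted positives, so the curve moves up (a positive sample crossed the threshold) or right (a negative sample crossed it) until it reaches (1, 1).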


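AUC (Area Under Curve): the area under the ROC curve, between 0 and 1 (0.5 corresponds to random guessing). As a single number, AUC gives an intuitive measure of classifier quality: the larger, the better.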
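
As a small check of this definition, the sketch below computes AUC three ways on made-up scores: by integrating the ROC curve with `metrics.auc`, directly with `metrics.roc_auc_score`, and as the fraction of (positive, negative) pairs in which the positive sample has the higher score. The last interpretation is a standard property of AUC rather than something stated in the text above.

```python
import numpy as np
from sklearn import metrics

# Made-up labels (1 = positive) and scores for illustration.
y_true = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])

# Area under the ROC curve via the trapezoidal rule.
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
auc_trapezoid = metrics.auc(fpr, tpr)

# The same quantity computed in one call.
auc_direct = metrics.roc_auc_score(y_true, y_score)

# Fraction of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print(auc_trapezoid, auc_direct, auc_pairs)
```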



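from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Assumes X_train, X_test, y_train, y_test hold a split of a labeled dataset
# in which class 2 is the positive class (the loading code is not shown).

# SVM with gamma=1 (fitting step reconstructed; it is missing in the original)
y_pred = svm.SVC(gamma=1).fit(X_train, y_train).predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=2)
print('gamma=1 AUC= ', metrics.auc(fpr, tpr))

# SVM with gamma=10 (fitting step reconstructed as above)
y_pred_rbf = svm.SVC(gamma=10).fit(X_train, y_train).predict(X_test)
fpr_ga1, tpr_ga1, thresholds_ga1 = metrics.roc_curve(y_test, y_pred_rbf, pos_label=2)
print('gamma=10 AUC= ', metrics.auc(fpr_ga1, tpr_ga1))

# k-nearest neighbors with k=3
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_knn = neigh.predict(X_test)
fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve(y_test, y_pred_knn, pos_label=2)
print('knn AUC= ', metrics.auc(fpr_knn, tpr_knn))

# Plot the three ROC curves
plt.plot(fpr, tpr, label='SVM gamma=1')
plt.plot(fpr_ga1, tpr_ga1, label='SVM gamma=10')
plt.plot(fpr_knn, tpr_knn, label='KNN')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()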
gamma=1 AUC=  0.927469135802
gamma=10 AUC=  0.927469135802
knn AUC=  0.936342592593

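According to the figure above, the SVM with gamma = 10 performs noticeably better than with gamma = 1, although the two AUC values printed above are identical.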
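
If the predictions passed to `roc_curve` are hard class labels, the resulting curve has at most one interior point, which can hide differences between models. A minimal sketch of the alternative, passing continuous scores from `decision_function`; the synthetic dataset from `make_classification` and all parameter values here are assumptions for illustration, not the data used above.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption of this sketch).
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=1).fit(X_train, y_train)

# ROC from hard 0/1 predictions: only two possible thresholds.
fpr_hard, tpr_hard, _ = metrics.roc_curve(y_test, clf.predict(X_test))

# ROC from continuous scores: one candidate threshold per distinct score.
fpr_score, tpr_score, _ = metrics.roc_curve(y_test, clf.decision_function(X_test))

auc_hard = metrics.auc(fpr_hard, tpr_hard)
auc_score = metrics.auc(fpr_score, tpr_score)
print(len(fpr_hard), len(fpr_score))
print(auc_hard, auc_score)
```

The score-based curve is the one that matches the threshold-sweep description of ROC given earlier; classifiers that expose `predict_proba` can be used the same way.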

# Author: Tim Head <betatim@gmail.com>
# License: BSD 3 clause

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the logistic regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(
    X_train, y_train, test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)

rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
# Fit the encoder on the leaf indices before using it to transform
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)

# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)

# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
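plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

# Second figure: zoom in on the top-left corner of the ROC plot
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)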
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

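Mean absolute error (MAE)

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

Mean squared error (MSE)

$$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$$

Here $y_i$ is the true value and $\hat{y}_i$ the predicted value of sample i.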

from sklearn import datasets,svm,metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

print('mean_absolute_error: ',metrics.mean_absolute_error(y_test,y_pred))
print('mean_square_error: ',metrics.mean_squared_error(y_test,y_pred))
mean_absolute_error:  0.04
mean_square_error:  0.04
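
Both metrics are simple averages and can be checked by hand. The sketch below uses made-up regression values and compares a NumPy computation with the scikit-learn functions. It also shows one possible reason for two identical values like those printed above: when every error is 0 or ±1, as with adjacent integer class labels, the absolute and squared errors coincide. Whether that is what happened in the listing above is an assumption, since its data is not shown.

```python
import numpy as np
from sklearn import metrics

# Made-up regression targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # mean of |error|
mse = np.mean((y_true - y_pred) ** 2)    # mean of error squared

print(mae, metrics.mean_absolute_error(y_true, y_pred))
print(mse, metrics.mean_squared_error(y_true, y_pred))

# With 0/1 labels every error is 0 or 1, so |e| == e**2 and MAE == MSE.
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
mae_labels = metrics.mean_absolute_error(labels_true, labels_pred)
mse_labels = metrics.mean_squared_error(labels_true, labels_pred)
print(mae_labels, mse_labels)
```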

K Fold

from sklearn.model_selection import KFold
kf=KFold(n_splits=10, random_state=3, shuffle=True)
for train_indices,test_indices in kf.split(X):
    print (train_indices,test_indices)
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 5 6 7 8 9] [4]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 3 4 5 6 7 8] [9]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[1 2 3 4 5 6 7 8 9] [0]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 4 5 6 7 9] [8]
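
The index pairs above are normally fed to a model rather than printed. A minimal sketch, assuming a synthetic dataset from `make_classification` and a 3-nearest-neighbors classifier (both chosen only for illustration): `cross_val_score` fits the model once per fold and returns one score per fold, and the loop confirms that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption of this sketch): 100 samples, binary labels.
X, y = make_classification(n_samples=100, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=3)

# One accuracy score per fold; the mean is the cross-validated estimate.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kf)
print(scores)
print(scores.mean())

# Each sample is used for testing exactly once across the 10 folds.
test_counts = np.zeros(len(X), dtype=int)
for train_indices, test_indices in kf.split(X):
    test_counts[test_indices] += 1
print(test_counts.min(), test_counts.max())
```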
