Logistic regression intuition and conditional probabilities

$$\text{logit}(p) = \ln \frac{p}{1 - p}$$

The logit function takes an input p in the range (0, 1) and returns a value on the whole real line; it can therefore be used to express a linear relationship between the feature values and the log-odds:

$$\text{logit}\left(P(y = 1 \mid x)\right) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{j=0}^{m} w_j x_j = w^T x$$

Here

$$P(y = 1 \mid x)$$

is the conditional probability that a sample belongs to class 1, given its features x.
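
As a quick numeric check, here is a minimal sketch (my addition, using only the standard definitions) showing that the logit maps (0, 1) onto the whole real line, and that the sigmoid function introduced below inverts it:

import numpy as np

def logit(p):
    # log-odds: maps a probability in (0, 1) to the whole real line
    return np.log(p / (1 - p))

def sigmoid(z):
    # the inverse of the logit: maps any real z back into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
print(logit(p))           # [-2.197  0.     2.197]
print(sigmoid(logit(p)))  # recovers [0.1  0.5  0.9]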

Now, how do we predict the probability that a particular sample belongs to a particular class? Simply invert the logit function, which yields the logistic sigmoid:

$$\phi(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^T x$$

(Figure: the S-shaped sigmoid curve φ(z), which maps any real z onto the interval (0, 1); the plotting code at the end of this post reproduces it.)

$$\phi(z) = P(y = 1 \mid x; w)$$
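
For example, a minimal sketch (my own illustration, with made-up weights) of how φ(z) and its complement give the two conditional probabilities, and how a 0.5 threshold turns them into a class label:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights w and one sample x with a bias unit x0 = 1
w = np.array([-0.5, 1.2])
x = np.array([1.0, 0.8])

p1 = sigmoid(w @ x)     # P(y=1 | x)
p0 = 1.0 - p1           # P(y=0 | x)
y_hat = int(p1 >= 0.5)  # predict class 1 iff phi(z) >= 0.5
print(p1, p0, y_hat)    # ~0.613, ~0.387, 1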

Note that this is the probability of y = 1; keep that firmly in mind when we build the likelihood function below. Given x, the label y takes only two values, y = 1 or y = 0, and P(y = 0 | x) = 1 - P(y = 1 | x). Note the trick used in the next step:

The likelihood function:

$$L(w) = P(\mathbf{y} \mid \mathbf{x}; w) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; w\right) = \prod_{i=1}^{n} \left(\phi\left(z^{(i)}\right)\right)^{y^{(i)}} \left(1 - \phi\left(z^{(i)}\right)\right)^{1 - y^{(i)}}$$

Now transform the problem: maximizing ln L is equivalent to minimizing -ln L, and since each factor of L lies in (0, 1), we have ln L < 0, so -ln L > 0. This gives the cost function J(w):

$$J(w) = -\ln L(w) = \sum_{i=1}^{n} \left[ -y^{(i)} \ln \phi\left(z^{(i)}\right) - \left(1 - y^{(i)}\right) \ln\left(1 - \phi\left(z^{(i)}\right)\right) \right]$$
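
To make the formula concrete, here is a minimal sketch (my own toy data, not from the article) that evaluates J(w) directly:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 samples, 2 features (first column is the bias unit x0 = 1)
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, 2.0]])
y = np.array([0, 1, 0, 1])
w = np.array([-1.0, 1.0])

phi = sigmoid(X @ w)  # phi(z) for every sample
# negative log-likelihood, summed over the samples
J = np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))
print(J)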

To get a better grasp of the cost function J(w), look at the cost for a single sample:

$$J(\phi(z), y; w) = -y \ln \phi(z) - (1 - y) \ln\left(1 - \phi(z)\right)$$

Plugging in y = 1 or y = 0, one of the two terms vanishes:

$$J(\phi(z), y; w) = \begin{cases} -\ln \phi(z) & \text{if } y = 1 \\ -\ln\left(1 - \phi(z)\right) & \text{if } y = 0 \end{cases}$$

import matplotlib.pyplot as plt
import numpy as np

# phi(z) ranges over (0, 1); stay away from 0 and 1 so log() stays finite
x = np.linspace(0.001, 0.999, 100)
plt.plot(x, -np.log(x), 'b', label='y=1')        # cost -ln(phi) for y = 1
plt.plot(x, -np.log(1 - x), 'r--', label='y=0')  # cost -ln(1 - phi) for y = 0
plt.ylim(0, 5)
plt.xlim(0, 1)
plt.xlabel(r'$\phi(z)$')
plt.ylabel(r'$J(w)$')
plt.legend()
plt.show()

As the plot shows, if the prediction is wrong, the cost grows toward infinity.

Training a logistic regression model with scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # cross_validation was renamed to model_selection
import numpy as np
import matplotlib
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap
iris = datasets.load_iris()
# print iris
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
# standardize the test set with the same parameters, i.e. apply exactly the
# same shift and scale that were fitted on the training set
X_test_std = sc.transform(X_test)
'''sc.scale_ is the per-feature standard deviation, sc.mean_ the mean, sc.var_ the variance'''

lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)

# predict on the standardized test set
y_pred = lr.predict(X_test_std)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
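
# The conditional probabilities P(y=k | x) discussed above are exposed by
# predict_proba (this extra check is my addition, not part of the original post)
print('Class probabilities of the first test sample:')
print(lr.predict_proba(X_test_std[0, :].reshape(1, -1)))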

x1_min, x1_max = X_train_std[:, 0].min() - 1, X_train_std[:, 0].max() + 1
x2_min, x2_max = X_train_std[:, 1].min() - 1, X_train_std[:, 1].max() + 1

resolution = 0.01
# xx1 spans the x axis: each row is a copy of the x values, so the elements within
# any column are identical; xx2 spans the y axis: each column is a copy of the
# y values, so the elements within any row are identical
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),np.arange(x2_min, x2_max, resolution))

# .ravel() flattens a multi-dimensional array to 1-D (a view of the original
# array where possible); transposing the stacked result gives one grid point
# per row, with two feature columns
z = lr.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
'''
contourf requires that *X* and *Y* must both be 2-D with the same shape as *Z*,
or they must both be 1-D such that ``len(X)`` is the number of columns in *Z*
and ``len(Y)`` is the number of rows in *Z*.
'''
# so z has to be reshaped back to the shape of the grid
z = z.reshape(xx1.shape)

# markers and fill colors for the plot; the level-count argument passed to
# contourf below controls how many filled contour regions are drawn
markers = ('s', 'x', 'o', '^', 'v')
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
cmap = ListedColormap(colors[:len(np.unique(y))])

for i, value in enumerate(np.unique(y)):
    temp = X_train_std[np.where(y_train==value)]
    plt.scatter(x=temp[:,0],y=temp[:,1], marker=markers[value],s=69, c=colors[value], label=value)

plt.scatter(x=X_test_std[:, 0], y=X_test_std[:, 1], marker='o', s=69, c='none', edgecolors='r', label='test set')

plt.xlim(xx1.min(), xx1.max())
plt.ylim(xx2.min(), xx2.max())
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.contourf(xx1, xx2, z, len(np.unique(y)), alpha = 0.4, cmap = cmap)
plt.legend(loc='upper left')
plt.show()

(Figure: decision regions learned by the model on the standardized Iris petal features; test samples are circled in red.)

Taking the derivative of the single-sample cost function with respect to a weight w_j:

Using $\frac{\partial \phi(z)}{\partial z} = \phi(z)\left(1 - \phi(z)\right)$ and $\frac{\partial z}{\partial w_j} = x_j$:

$$\frac{\partial J}{\partial w_j} = \left(-\frac{y}{\phi(z)} + \frac{1 - y}{1 - \phi(z)}\right)\phi(z)\left(1 - \phi(z)\right)x_j = \left(\phi(z) - y\right)x_j$$
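
This gradient leads directly to the gradient-descent update w := w - η Σ(φ(z) - y)x. A minimal batch gradient-descent sketch (my own toy example; eta and n_iter are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data with a bias unit in the first column
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, 2.0]])
y = np.array([0, 1, 0, 1])

w = np.zeros(X.shape[1])
eta, n_iter = 0.1, 1000  # learning rate and iteration count, chosen arbitrarily
for _ in range(n_iter):
    errors = sigmoid(X @ w) - y  # phi(z) - y for every sample
    w -= eta * (X.T @ errors)    # batch gradient step
print(w)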

Tackling overfitting via regularization

Overfitting is a common problem in machine learning: the model performs well on the training data but poorly on unseen test data. A model that overfits is also said to have high variance; it has too many parameters and is more complex than the underlying data warrants. The opposite problem is underfitting (high bias): the model is too simple to capture the pattern in either the training or the test data.

(Figure: decision boundaries illustrating underfitting (high bias), a good compromise, and overfitting (high variance).)

One way of finding a good bias-variance tradeoff is to tune the complexity of the model via regularization. Regularization is a very useful method, for example for handling collinearity, filtering noise out of the data, and preventing overfitting.
The most common form is so-called L2 regularization:

$$\frac{\lambda}{2}\lVert w \rVert^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2$$

To apply regularization, we simply add the regularization term to the cost function:

$$J(w) = \sum_{i=1}^{n}\left[-y^{(i)} \ln \phi\left(z^{(i)}\right) - \left(1 - y^{(i)}\right) \ln\left(1 - \phi\left(z^{(i)}\right)\right)\right] + \frac{\lambda}{2}\lVert w \rVert^2$$

In scikit-learn's LogisticRegression, the parameter C is the inverse of the regularization strength λ, which is why the example above set C=1000.0: a large C makes the regularization effect very weak.
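
A short sketch (my addition, reusing X_train_std and y_train from the scikit-learn example above) of how shrinking C, i.e. increasing λ, shrinks the learned weights:

from sklearn.linear_model import LogisticRegression

# assumes X_train_std, y_train from the example above are in scope
for c in [1000.0, 1.0, 0.01]:
    lr = LogisticRegression(C=c, random_state=0, max_iter=1000)
    lr.fit(X_train_std, y_train)
    print('C=%8.2f  coef=%s' % (c, lr.coef_[1]))  # weights of the class-1 classifier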

Finally, the code that produces the sigmoid plot shown earlier:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter

xmajorLocator = MultipleLocator(2)
xminorLocator = MultipleLocator(1)
ymajorLocator = MultipleLocator(0.5)

N = 7
x = np.linspace(-N, N, 100)
z = 1 / (1 + np.exp(-x))  # the sigmoid phi(x) evaluated over [-N, N]
fig = plt.figure(1)
axes = plt.subplot(111)

plt.ylim(-0.1, 1.1)
plt.axvline(0.0, color='k')
plt.axhline(0, ls='dotted',color='r')
plt.axhline(0.5, ls='dotted',color='r')
plt.axhline(1, ls='dotted',color='r')
#plt.axhspan(0.0,1.0, facecolor='1.0', alpha=1.0, ls='dotted')
axes.xaxis.set_major_locator(xmajorLocator)
axes.xaxis.set_minor_locator(xminorLocator)
axes.yaxis.set_major_locator(ymajorLocator)
plt.xlabel("z")
plt.ylabel(r"$\phi(z)$")
plt.plot(x, z)
plt.show()

    Original author: 14142135623731
    Original article: https://www.jianshu.com/p/58d7fa9c07bf