I have a dataset that looks like this:
   Out  Revolver     Ratio  Num  ...
0    1  0.766127  0.802982    0  ...
1    0  0.957151  0.121876    1
2    0  0.658180  0.085113    0
3    0  0.233810  0.036050    3
4    1  0.907239  0.024926    5
...
Out only takes the values 0 and 1.
I then tried to use the code below to generate PCA and LDA plots, similar to this example: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html
features = Train.columns[1:]
Xf = newTrain[features]
yf = newTrain.Out
pca = PCA(n_components=2)
X_r = pca.fit(Xf).transform(Xf)
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(Xf, yf).transform(Xf)
plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r[yf == i, 0], X_r[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('PCA plt')
plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('LDA plt')
I can get the PCA plot to display, but it doesn't make sense: it shows only 2 points, one at roughly (-4000, 30) and the other at (2400, 23.7), instead of the cloud of data points shown in the plots at that link.
The LDA plot doesn't work at all and raises this error:
IndexError: index 1 is out of bounds for axis 1 with size 1
I also tried the code below to generate the LDA plot, but got the same error:
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(x=X_LDA_sklearn[:, 0][yf==i], y=X_LDA_sklearn[:, 1][yf==i], c=c, label=name)
plt.legend()
Does anyone know what is wrong here?
Edit: here are my imports:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.lda import LDA
As for where the errors occur:
I get
FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(X_r[yf == i,0], X_r[yf == i, 1], c=c, label=name)
on the line inside the for loop for the PCA plot.
For the LDA plot, on the line
plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)
I get
FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
and
IndexError: index 1 is out of bounds for axis 1 with size 1
Best answer
The reason you see this error is that X_r2 contains only one column (at least for the data you provided): LDA can produce at most n_classes - 1 discriminant components, so with a binary Out it returns a single column. However, with y = X_LDA_sklearn[:, 1][yf == i] you try to access a second column, which throws the error you observed.
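As a quick illustration of that constraint, here is a minimal sketch on made-up toy data (not your dataset): scikit-learn's LDA yields at most n_classes - 1 components, so a binary target gives exactly one column and there is no column index 1 to plot.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy binary problem: 4 samples, 2 features, 2 classes (hypothetical values)
X_toy = np.array([[0.8, 0.1], [0.9, 0.2], [0.2, 0.7], [0.1, 0.9]])
y_toy = np.array([0, 0, 1, 1])

# n_components defaults to min(n_classes - 1, n_features) = 1 here
lda_toy = LinearDiscriminantAnalysis()
X_toy_r = lda_toy.fit(X_toy, y_toy).transform(X_toy)

print(X_toy_r.shape)   # (4, 1): there is no second column to index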
I added a third class to your sample data (reducing to two components is not meaningful with only two classes) and also converted the data frame to numpy arrays before indexing. With these changes it runs fine and produces both plots (not very informative given the small amount of data).
Here is the updated code:
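On the conversion to arrays: below is a minimal sketch with hypothetical values showing the pattern of indexing a NumPy result with a plain NumPy boolean mask. Using .values (or as_matrix()) for both features and labels avoids the FutureWarning, which suggests your pandas Series mask was being treated as the integer indices 0 and 1 rather than as a boolean filter; that would also explain why your PCA plot showed only two points.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# hypothetical mini data frame with the same column layout
df = pd.DataFrame({'Out':      [1, 0, 0, 0, 1],
                   'Revolver': [0.766, 0.957, 0.658, 0.233, 0.907],
                   'Ratio':    [0.803, 0.121, 0.085, 0.036, 0.024]})

X = df.drop('Out', axis=1).values   # plain ndarray of features
y = df['Out'].values                # plain ndarray of labels

X_r = PCA(n_components=2).fit_transform(X)

mask = (y == 1)                     # ndarray of booleans, not a Series
print(X_r[mask, 0], X_r[mask, 1])   # all rows where Out == 1, both components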
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
trainDF = pd.DataFrame({'Out':      [1, 0, 0, 0, 1, 3, 3],
                        'Revolver': [0.766, 0.957, 0.658, 0.233, 0.907, 0.1, 0.15],
                        'Ratio':    [0.803, 0.121, 0.085, 0.036, 0.024, 0.6, 0.8],
                        'Num':      [0, 1, 0, 3, 5, 4, 4]})
# drop NA values
trainDF = trainDF.dropna()
# replace outlier values of Num (8 or 17) with the median
trainDF.loc[(trainDF['Num'] == 8) | (trainDF['Num'] == 17), 'Num'] = trainDF['Num'].median()
# convert target and features to numpy arrays
y = trainDF['Out'].as_matrix()
X = trainDF.drop('Out', 1).as_matrix()
target_names = ['out', 'in']
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
# Percentage of variance explained for each components
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
plt.figure()
for c, i, target_name in zip("rgb", [0, 1], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of Out')
plt.figure()
for c, i, target_name in zip("rgb", [0, 1], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA of Out')
plt.show()
So whenever you run into these "index out of bounds" errors, always check the dimensions of your arrays first.