# Kaggle Competition: German Credit Card Default Data Analysis

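### Data Description

German Credit Data. Let's look at the format of the data.

A1 through A15 are 15 features of different types, and A16 is the label column. There are 690 records in total; one of them is shown below as an example: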

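| A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| b | 30.83 | 0 | u | g | w | v | 1.25 | t | t | 01 | f | g | 00202 | 0 | + |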

#### Attribute Information:

``````
A1:    b, a.
A2:    continuous.
A3:    continuous.
A4:    u, y, l, t.
A5:    g, p, gg.
A6:    c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7:    v, h, bb, j, n, z, dd, ff, o.
A8:    continuous.
A9:    t, f.
A10:    t, f.
A11:    continuous.
A12:    t, f.
A13:    g, p, s.
A14:    continuous.
A15:    continuous.
A16: +,-         (class attribute)
``````
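To make the attribute list concrete, here is a minimal sketch that parses the sample record shown earlier and converts the continuous attributes to numbers. It assumes the raw file is comma-separated with no header row; the names `SAMPLE`, `COLUMNS`, and `CONTINUOUS` are introduced only for this illustration.

```python
import io

import pandas as pd

# The sample record from the data description, written as one CSV line.
# Assumption: the raw file is comma-separated and has no header row.
SAMPLE = "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"

COLUMNS = [f"A{i}" for i in range(1, 17)]
# Attributes listed as "continuous" in the attribute information
CONTINUOUS = ["A2", "A3", "A8", "A11", "A14", "A15"]

# Read everything as text first so values such as "01" and "00202"
# are kept exactly as written in the file
raw = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS, dtype=str)

# Convert only the continuous attributes to numeric types
typed = raw.copy()
for col in CONTINUOUS:
    typed[col] = pd.to_numeric(typed[col])

# Everything else except the class attribute A16 is categorical
categorical = [c for c in COLUMNS if c not in CONTINUOUS and c != "A16"]
print(categorical)
```

Reading as text first is a deliberate choice here: it keeps the conversion step explicit, which becomes useful once missing-value markers appear in the numeric columns.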

#### Missing Attribute Values:

``````
37 cases (5%) have one or more missing values.  The missing
values from particular attributes are:

A1:  12
A2:  12
A4:   6
A5:   6
A6:   9
A7:   9
A14: 13
``````
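A count like the one above can be reproduced once the missing-value marker is known. The sketch below assumes missing entries are written as `"?"` (the marker the processing code in this post replaces) and uses a small made-up frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; "?" marks a missing value
toy = pd.DataFrame({
    "A1": ["b", "?", "a", "b"],
    "A2": ["30.83", "58.67", "?", "24.50"],
    "A14": ["00202", "?", "00280", "?"],
})

# Turn the marker into NaN so pandas recognizes it as missing
cleaned = toy.replace("?", np.nan)

# Missing values per attribute, and number of rows with at least one gap
missing_per_column = cleaned.isna().sum()
rows_with_missing = int(cleaned.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```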

#### Class Distribution

``````
+: 307 (44.5%)
-: 383 (55.5%)
``````
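The percentages follow directly from the counts, and they also give a useful reference point: a model that always predicts the majority class would already be right about 55.5% of the time. A short check of that arithmetic (variable names are illustrative):

```python
# Class counts taken from the distribution above
positive, negative = 307, 383
total = positive + negative

positive_share = positive / total
negative_share = negative / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(positive, negative) / total

print(total, round(positive_share * 100, 1), round(negative_share * 100, 1))
```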

#### Data Processing and Analysis

``````
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data (the loading code is not shown here; `data` must be a
# DataFrame holding the 16 raw columns)

# Add column labels to the data
data.columns = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10", "f11", "f12", "f13", "f14", "f15", "label"]

# Map the label to 1/0
label_mapping = {
    "+": 1,
    "-": 0
}

data["label"] = data["label"].map(label_mapping)

# Handle missing values: the raw data marks them with "?"
data = data.replace("?", np.nan)

# Convert the object-typed columns to float
data["f2"] = pd.to_numeric(data["f2"])
data["f14"] = pd.to_numeric(data["f14"])

# Fill missing values in continuous features with the column mean
data["f2"] = data["f2"].fillna(data["f2"].mean())
data["f3"] = data["f3"].fillna(data["f3"].mean())
data["f8"] = data["f8"].fillna(data["f8"].mean())
data["f11"] = data["f11"].fillna(data["f11"].mean())
data["f14"] = data["f14"].fillna(data["f14"].mean())
data["f15"] = data["f15"].fillna(data["f15"].mean())

# Fill missing values in categorical features with a new value,
# so that "missing" becomes its own category
data["f1"] = data["f1"].fillna("c")
data["f4"] = data["f4"].fillna("s")
data["f5"] = data["f5"].fillna("gp")
data["f6"] = data["f6"].fillna("hh")
data["f7"] = data["f7"].fillna("ee")
data["f13"] = data["f13"].fillna("ps")

# Map the binary t/f features to 1/0
tf_mapping = {
    "t": 1,
    "f": 0
}

data["f9"] = data["f9"].map(tf_mapping)
data["f10"] = data["f10"].map(tf_mapping)
data["f12"] = data["f12"].map(tf_mapping)
``````
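The two imputation rules used above (column mean for continuous features, a new placeholder category for categorical ones) can also be applied by dtype instead of column by column. The following is a sketch on made-up values, not the post's code; the placeholder string `"missing"` is an arbitrary choice for this example, whereas the post picks a different unused value per column.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "f2": [30.0, np.nan, 50.0],
    "f4": ["u", np.nan, "y"],
})

# Split the columns by dtype: numeric ones get the mean, the rest a placeholder
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
filled[categorical_cols] = filled[categorical_cols].fillna("missing")

print(filled)
```

Treating "missing" as its own category lets the model learn whether the absence of a value is itself informative, at the cost of one extra column after one-hot encoding.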
``````
# One-hot encode the categorical features
data = pd.get_dummies(data)
``````
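To see what `pd.get_dummies` does to a frame that mixes numeric and categorical columns, consider this small sketch. The category values `g`, `p`, `s` come from the A13 attribute list; the numeric values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "f3": [0.0, 4.46, 0.5],   # numeric column: left untouched
    "f13": ["g", "p", "s"],   # categorical column: expanded to indicators
})

encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by one indicator column per category
print(list(encoded.columns))
```

Each categorical column is replaced by one indicator column per distinct value, named `<column>_<value>`, which is why the placeholder categories used for missing values end up as their own features.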
``````
from sklearn.linear_model import LogisticRegression

# Shuffle the rows
shuffled_rows = np.random.permutation(data.index)
data = data.loc[shuffled_rows]

# Split into a training set and a local test set
highest_train_row = int(data.shape[0] * 0.70)
train = data.iloc[0:highest_train_row]
loc_test = data.iloc[highest_train_row:]

# Everything except the label column is a feature
features = train.drop(["label"], axis=1).columns

model = LogisticRegression()
X_train = train[features]
y_train = train["label"] == 1

model.fit(X_train, y_train)
X_test = loc_test[features]

# Note: predict() returns hard class labels, not probabilities
test_prob = model.predict(X_test)
test_label = loc_test['label']

# Accuracy on the local test set
accuracy_test = (test_prob == loc_test["label"]).mean()
print(accuracy_test)
``````
``````
0.835748792271
``````
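A manual 70/30 split only works as intended if the rows really are shuffled before slicing. As an alternative sketch, scikit-learn's `train_test_split` shuffles and splits in one call and can keep the class ratio the same in both parts. The data below is synthetic (generated with `make_classification`) and only stands in for the encoded feature matrix; the fixed `random_state` and `max_iter=1000` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and labels (not the real data)
X, y = make_classification(n_samples=690, n_features=20, random_state=0)

# Shuffle, then hold out 30% for local testing; stratify keeps the
# proportion of each class equal in the training and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out part
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Fixing `random_state` makes the split reproducible, so a reported accuracy can be regenerated later.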
``````
from sklearn import metrics

# AUC on the validation set (computed here from the hard predictions in test_prob)
test_auc = metrics.roc_auc_score(test_label, test_prob)
print(test_auc)
``````
``````
0.835748792271
``````
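When `roc_auc_score` receives hard 0/1 predictions, the ROC curve collapses to a single operating point, so the result says little about how well the model ranks cases. AUC is usually computed from a continuous score such as the predicted probability of the positive class. The sketch below contrasts the two on synthetic data; it is an illustration under those assumptions, and its numbers are unrelated to the result reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real credit data)
X, y = make_classification(n_samples=690, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard class labels versus probability of the positive class
hard_labels = clf.predict(X_test)
positive_scores = clf.predict_proba(X_test)[:, 1]

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, positive_scores)

print(auc_from_labels, auc_from_scores)
```

`predict_proba` returns one column per class in the order given by `clf.classes_`; column 1 is the positive class for 0/1 labels.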

Original author: 小沙文
Original URL: https://segmentfault.com/a/1190000007607452