Data Mining: Classification Algorithms (Supplement)

May 11, 2019 · Source: 七八音

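01. Rule-based classifier

0.1 Concepts

A rule-based classifier assigns classes using a collection of "if ... then ..." rules.

Rule: (condition) --> y

condition: a conjunction of attribute tests (the rule antecedent); y is the predicted class (the rule consequent).

Rule coverage: the fraction of all records that satisfy the rule's condition.

Rule accuracy: among the records covered by the rule, the fraction whose class the rule predicts correctly.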
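As a minimal sketch of these two definitions, the snippet below computes coverage and accuracy for one rule. The record layout (dictionaries with a "class" key), the attribute names, and the function name are illustrative assumptions, not part of the original notes.

```python
def coverage_and_accuracy(records, condition, predicted_class):
    """Return (coverage, accuracy) of the rule `condition --> predicted_class`.

    `condition` is a dict of attribute -> required value (a conjunction).
    """
    covered = [
        r for r in records
        if all(r.get(attr) == value for attr, value in condition.items())
    ]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for r in covered if r["class"] == predicted_class)
    return len(covered) / len(records), correct / len(covered)


# Hypothetical toy data, invented for this example.
records = [
    {"status": "single", "refund": "no", "class": "yes"},
    {"status": "single", "refund": "no", "class": "no"},
    {"status": "married", "refund": "no", "class": "no"},
    {"status": "single", "refund": "yes", "class": "no"},
]
coverage, accuracy = coverage_and_accuracy(
    records, {"status": "single", "refund": "no"}, "yes"
)
print(coverage, accuracy)
```

Here the rule covers two of the four records, and one of those two has the predicted class.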

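0.2 Properties

Mutually exclusive rules: the rules do not overlap, so every record is covered by at most one rule.

Exhaustive rules: the rules account for every combination of attribute values, so every record is covered by at least one rule.

0.3 Rule simplification and how to handle its side effects

After simplification the rules may no longer be mutually exclusive: a record can satisfy more than one rule.

After simplification the rules may no longer be exhaustive: a record may not trigger any rule.

Remedies:

For overlapping rules, rank the rules by priority (see 0.4) or let the triggered rules vote.

For records that trigger no rule, fall back to a default class.

A decision tree can be converted into a rule-based classifier: each path from the root to a leaf becomes one rule.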

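0.4 Ordered rules

The rules are ranked by priority; the ranked rule set is called a decision list.

At test time, a record is assigned the class of the highest-priority rule it satisfies; if it satisfies no rule, it is assigned the default class.

Rule ordering schemes

(1) Rule-based ordering: individual rules are ranked by their quality (classification ability).

(2) Class-based ordering: rules that predict the same class are grouped together.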
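The prediction procedure for a decision list can be sketched as follows. This is only an illustration: the rule representation (a list of `(condition, label)` pairs, highest priority first), the attribute names, and the default class are assumptions made for the example.

```python
def classify(record, decision_list, default_class):
    """Return the label of the first (highest-priority) rule the record satisfies."""
    for condition, label in decision_list:
        if all(record.get(attr) == value for attr, value in condition.items()):
            return label
    # No rule fired: fall back to the default class.
    return default_class


# Hypothetical rules, listed from highest to lowest priority.
decision_list = [
    ({"gives_birth": "no", "can_fly": "yes"}, "bird"),
    ({"gives_birth": "yes"}, "mammal"),
]

print(classify({"gives_birth": "yes", "can_fly": "yes"}, decision_list, "unknown"))
print(classify({"gives_birth": "no", "can_fly": "no"}, decision_list, "unknown"))
```

Because the list is scanned in order, a record that satisfies several rules is decided by the one that appears first.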

0.5 Building the rules

1. Direct methods: extract rules directly from the data

Examples: RIPPER, CN2, Holte's 1R

Sequential covering

Steps:

(1) Start from an empty rule.

(2) Grow a rule using the Learn-One-Rule function.

(3) Remove the training records covered by the rule.

(4) Repeat steps 2 and 3 until the stopping criterion is met.

Background

(1) Rule growing

Two strategies: general-to-specific and specific-to-general.

(2) Instance elimination

Why do we need to eliminate instances?

- Otherwise the next rule would be identical to the previous rule.

Why do we remove positive instances?

- To ensure that the next rule is different.

Why do we remove negative instances?

- To prevent underestimating the accuracy of the next rule.

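(3) Rule evaluation

accuracy = nc/n

Laplace = (nc+1)/(n+k)

m-estimate = (nc+kp)/(n+k)

n: number of records covered by the rule

nc: number of covered records that belong to the class the rule predicts

k: number of classes

p: prior probability of the class the rule predicts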
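A small worked example may help compare the three measures. The counts below are made up for illustration, and the function names are not from the original notes.

```python
def rule_accuracy(nc, n):
    return nc / n


def rule_laplace(nc, n, k):
    return (nc + 1) / (n + k)


def rule_m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)


# A rule that covers only 2 records, both of the predicted class, with 2 classes.
n, nc, k = 2, 2, 2
print(rule_accuracy(nc, n))              # plain accuracy looks perfect
print(rule_laplace(nc, n, k))            # Laplace pulls the estimate toward 1/k
print(rule_m_estimate(nc, n, k, p=0.5))  # with p = 1/k this equals Laplace
```

Plain accuracy rewards rules that cover very few records; the Laplace and m-estimate corrections shrink the score of such low-coverage rules. When p = 1/k the term kp equals 1, so the m-estimate reduces to the Laplace measure.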

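(4) Stopping criterion

Compute the information gain of the new rule.

If the gain is not significant, discard the new rule.

(5) Rule pruning

Similar to post-pruning of decision trees.

Reduced error pruning:

Remove one of the conjuncts in the rule.

Compare the error rate on the validation set before and after pruning.

If the error rate improves, prune the conjunct; otherwise keep the rule unchanged.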

Summary
Grow a single rule.
Remove the instances covered by the rule.
Prune the rule (if necessary).
Add the rule to the current rule set.
Repeat.
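The loop summarized above can be sketched in a few lines. This is a simplified illustration, not RIPPER or CN2: it grows rules general-to-specific, scores candidates with plain accuracy instead of an information-gain measure, and skips pruning. All function names, the record layout, and the toy data are assumptions.

```python
def covers(condition, record):
    return all(record[attr] == value for attr, value in condition.items())


def accuracy(condition, records, target):
    covered = [r for r in records if covers(condition, r)]
    if not covered:
        return 0.0
    return sum(r["class"] == target for r in covered) / len(covered)


def learn_one_rule(records, attributes, target):
    """General-to-specific: start from the empty rule and greedily add conjuncts."""
    condition = {}
    best = accuracy(condition, records, target)
    while best < 1.0:
        candidates = [
            {**condition, attr: r[attr]}
            for attr in attributes if attr not in condition
            for r in records
        ]
        if not candidates:
            break
        candidate = max(candidates, key=lambda c: accuracy(c, records, target))
        score = accuracy(candidate, records, target)
        if score <= best:  # stopping criterion: no conjunct improves the rule
            break
        condition, best = candidate, score
    return condition


def sequential_covering(records, attributes, target):
    rules, remaining = [], list(records)
    while any(r["class"] == target for r in remaining):
        condition = learn_one_rule(remaining, attributes, target)
        if not condition:  # nothing better than the empty rule: stop
            break
        rules.append((condition, target))
        # Instance elimination: drop every record the new rule covers.
        remaining = [r for r in remaining if not covers(condition, r)]
    return rules


# Hypothetical toy data.
data = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rain", "windy": "no", "class": "play"},
    {"outlook": "rain", "windy": "yes", "class": "stay"},
    {"outlook": "overcast", "windy": "yes", "class": "stay"},
]
rules = sequential_covering(data, ["outlook", "windy"], "play")
print(rules)
```

On this toy data the first rule covers the two sunny records; once they are removed, a second rule is learned for the remaining positive record.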

2. Indirect methods

Extract rules from other classification models (for example decision trees or neural networks).

Example: C4.5rules, which generates rules from a decision tree.

Steps

Extract rules from an unpruned decision tree.

For each rule r: A --> y,

(1) Consider an alternative rule r': A' --> y, where A' is obtained by removing one of the conjuncts in A.

(2) Compare the pessimistic error rate of r against that of every r'.

(3) Prune if one of the r' has a lower pessimistic error rate.

(4) Repeat until the generalization error can no longer be improved.

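Advantages

As highly expressive as decision trees

Easy to understand and interpret

Easy to generate

Fast at classifying new instances

Performance comparable to decision trees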

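02. Instance-based classifiers

0.1 Two approaches

Rote learner: memorizes the entire training data and classifies a new record only if its attributes exactly match one of the training records.

Nearest neighbor: classifies a new record using its k closest training points (its k nearest neighbors).

0.2 Requirements:

The set of stored records

A distance metric to compute the distance between records

The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:

Compute its distance to the training records.

Identify the k nearest neighbors.

Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g. by taking a majority vote).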
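These three steps translate almost directly into code. The following is a minimal sketch that assumes numeric attributes, Euclidean distance, and an unweighted majority vote; the training points are invented for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """train: list of (point, label) pairs; query: a point; k: number of neighbors."""
    # Steps 1 and 2: compute distances and keep the k closest training records.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"),
]
print(knn_classify(train, (1.2, 1.4), k=3))
print(knn_classify(train, (6.5, 6.0), k=3))
```

Sorting the whole training set is the simplest choice for a sketch; practical implementations usually rely on a spatial index or a partial sort.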

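0.3 Other issues

(1) Distance measure

Euclidean distance: the norm of the difference between two vectors.

High-dimensional data: the curse of dimensionality

Remedy: normalization

(2) Choice of k

Too small: the classifier is sensitive to noise points.

Too large: the neighborhood may include points from other classes.

(3) Scaling

Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.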
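To illustrate why scaling matters, the sketch below compares Euclidean distances before and after min-max scaling. Min-max scaling is just one possible choice, and the two attributes (height in meters, income) and their values are made up.

```python
import math


def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# (height in meters, yearly income): income dominates the raw distance.
rows = [(1.5, 30000.0), (1.9, 31000.0), (1.6, 90000.0)]
scaled = min_max_scale(rows)

print(math.dist(rows[0], rows[1]), math.dist(rows[0], rows[2]))
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```

On the raw data the first record looks far closer to the second than to the third purely because of income; after scaling, the large height difference counts as much as the large income difference.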

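03. Bayesian classifier

0.1 Concepts

A probabilistic classifier

Consider each attribute and the class label as random variables.

Given a record with attributes (A1,A2,…,An)

Goal: predict class C

Find the value of C that maximizes P(C|A1,A2,…,An).

Estimate P(C|A1,A2,…,An) from data:

Compute the posterior probability P(C|A1,A2,…,An) for all values of C using Bayes' theorem: P(C|A1,A2,…,An) = P(A1,A2,…,An|C) P(C) / P(A1,A2,…,An).

Choose the value of C that maximizes P(C|A1,A2,…,An).

This is equivalent to choosing the value of C that maximizes P(A1,A2,…,An|C) P(C), because the denominator does not depend on C.

Estimate P(A1,A2,…,An|C):

Assume the attributes are conditionally independent given the class (the naive Bayes assumption), so that P(A1,A2,…,An|C) = P(A1|C) P(A2|C) … P(An|C).

Estimate each P(Ai|C) from the data by counting, then multiply the estimates to obtain P(A1,A2,…,An|C).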
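The counting procedure can be sketched for categorical attributes as follows. This is an unsmoothed illustration under the independence assumption; the function names, the tuple-based record layout, and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(records, labels):
    class_counts = Counter(labels)
    # (attribute index, class) -> counts of each attribute value
    value_counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, record):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / total  # P(C)
        for i, value in enumerate(record):
            score *= value_counts[(i, label)][value] / n_c  # P(Ai|C)
        scores[label] = score  # proportional to P(C|A1,...,An)
    return max(scores, key=scores.get), scores


# Hypothetical data: (outlook, temperature) -> play?
records = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"),
           ("rain", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(records, labels)
label, scores = predict(model, ("rain", "mild"))
print(label, scores)
```

The scores are not normalized, which is fine for picking the maximum. Note that the score for "no" is exactly zero here because "rain" never occurs with that class in the toy data.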

0.2 How can naive Bayes handle continuous attributes?

(1) Discretize the range into bins

One ordinal attribute per bin

This violates the independence assumption.

(2) Two-way split: (A < v) or (A > v)

Choose only one of the two splits as the new attribute.

Probability density estimation (an alternative to discretizing or splitting):

A. Assume the attribute follows a normal distribution.

B. Use the data to estimate the parameters of the distribution (e.g. mean and standard deviation).

C. Once the probability distribution is known, use it to estimate the conditional probability P(Ai|C).
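As a sketch of the density-estimation route, the snippet below fits a normal distribution to the values of one continuous attribute within one class and evaluates the density at a new value. The income figures are illustrative numbers, not data from the original notes, and the sample standard deviation is used as the estimate.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


# Hypothetical incomes (in thousands) of the records belonging to one class.
incomes = [125, 100, 70, 120, 60, 220, 75]
mu = mean(incomes)      # estimated mean
sigma = stdev(incomes)  # estimated (sample) standard deviation

# Used in place of P(income = 120 | class) in the naive Bayes product.
density = gaussian_pdf(120, mu, sigma)
print(mu, sigma ** 2, density)
```

The value returned is a density rather than a probability, but since the same substitution is made for every class, it can still be used to compare classes.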

(3) Problems encountered in practice

A. Zero probability

If one of the conditional probabilities is zero, the entire product becomes zero.

Remedies:

Original: P(Ai|C) = Nic/Nc (among the Nc records of class C, the fraction whose attribute Ai takes the given value)

Laplace: P(Ai|C) = (Nic+1)/(Nc+v) [v: number of distinct values that Ai can take]

m-estimate: P(Ai|C) = (Nic+mp)/(Nc+m) [p: prior estimate of P(Ai|C); m: a parameter]
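A short numeric check shows how both corrections remove the zero. The counts are invented, and exact fractions are used so the results are easy to verify by hand.

```python
from fractions import Fraction


def raw_estimate(n_ic, n_c):
    return Fraction(n_ic, n_c)


def laplace_estimate(n_ic, n_c, v):
    # v: number of distinct values the attribute can take
    return Fraction(n_ic + 1, n_c + v)


def m_estimate(n_ic, n_c, m, p):
    # p: prior estimate of P(Ai|C); m: weight given to that prior
    return (n_ic + m * p) / (n_c + m)


# The attribute value never occurs among the 3 records of the class,
# and the attribute has 3 possible values.
print(raw_estimate(0, 3))                    # 0
print(laplace_estimate(0, 3, v=3))           # 1/6
print(m_estimate(0, 3, m=3, p=Fraction(1, 3)))  # 1/6
```

With m equal to the number of attribute values and p uniform, the m-estimate coincides with the Laplace estimate.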

(4) Summary

Robust to isolated noise points

Handles missing values by ignoring the instance during probability estimate calculations

Robust to irrelevant attributes

The independence assumption may not hold for some attributes; in that case use other techniques such as Bayesian Belief Networks (BBN).

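04. SVM (support vector machine)

Find a linear hyperplane w·x + b = 0 (the decision boundary) that separates the data.

This assumes the data are linearly separable.

Data that are not linearly separable are mapped so that they become linearly separable, using a kernel function.

Criterion: the margin, i.e. the width of the band around the hyperplane that contains no training points

Maximize: margin = 2/||w||

Equivalent to minimizing: L(w) = ||w||^2/2, subject to y_i (w·x_i + b) >= 1 for every training record (x_i, y_i) with y_i in {-1, +1}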
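The relation between ||w|| and the margin can be checked numerically. The sketch below does not train an SVM; it only evaluates the standard hard-margin constraints y_i (w·x_i + b) >= 1 and the margin 2/||w|| for two hand-picked hyperplanes on made-up points.

```python
import math


def margin(w):
    return 2 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_constraints(w, b, points, labels):
    """Check y_i * (w . x_i + b) >= 1 for every training record."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )


# Hypothetical linearly separable data with labels -1 / +1.
points = [(1, 1), (0, 3), (3, 0), (4, 2)]
labels = [-1, -1, 1, 1]

# Two feasible hyperplanes with the same boundary x1 = 2 but different scale.
for w, b in [((1, 0), -2), ((2, 0), -4)]:
    print(w, b, satisfies_constraints(w, b, points, labels), margin(w))
```

Both candidates satisfy the constraints, but the one with the smaller ||w|| reports the larger margin, which is why minimizing ||w||^2/2 under the constraints is the same as maximizing the margin.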

05. Ensemble methods

Construct a set of classifiers from the training data.

Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.

general idea

Comparable to an election: each classifier casts a vote for a class.
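The voting idea can be sketched with a few hand-written base classifiers. The three rules below are hypothetical stand-ins for trained models, and plain majority voting is only one possible way to aggregate predictions.

```python
from collections import Counter


def majority_vote(classifiers, record):
    """Aggregate the predictions of several classifiers by majority vote."""
    predictions = [clf(record) for clf in classifiers]
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical base classifiers; in practice each would be a trained model.
classifiers = [
    lambda r: "spam" if r["links"] > 3 else "ham",
    lambda r: "spam" if r["caps_ratio"] > 0.5 else "ham",
    lambda r: "spam" if r["length"] < 20 else "ham",
]

print(majority_vote(classifiers, {"links": 5, "caps_ratio": 0.2, "length": 10}))
print(majority_vote(classifiers, {"links": 1, "caps_ratio": 0.9, "length": 200}))
```

Using an odd number of classifiers for a two-class problem avoids ties.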

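0.1 Ways to build an ensemble

(1) Bagging

(2) Boosting

An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.

Initially, all N records are assigned equal weights.

Unlike bagging, the weights may change at the end of each boosting round:

Misclassified records: the weight increases.

Correctly classified records: the weight decreases.

Example: AdaBoost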
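The notes do not give the update formulas, so the sketch below uses the commonly cited AdaBoost formulation as an assumption: the classifier weight is alpha = 0.5 ln((1 - error) / error), and each record weight is multiplied by exp(alpha) if misclassified or exp(-alpha) if classified correctly, then renormalized. The function name and the example round are illustrative.

```python
import math


def adaboost_round(weights, correct):
    """One weight update. `correct[i]` tells whether record i was classified correctly.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five records with equal initial weights; the last one is misclassified.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After this round the single misclassified record carries half of the total weight, so the next base classifier is pushed to get it right.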

    Original author: 七八音
    Original URL: https://www.jianshu.com/p/fc738701b598#comments