# Mining Association Features with the Apriori Algorithm
Frequent itemset: a set of items that appear together at least a minimum number of times.
FP-growth: a frequent-itemset mining algorithm (an improvement over Apriori).
Eclat: a frequent-itemset mining algorithm (also an improvement over Apriori).
Before mining the association rules used in affinity analysis, we first generate frequent itemsets with the Apriori algorithm, then generate association rules by testing combinations of premise and conclusion drawn from those frequent itemsets.
(1) Specify the minimum support an itemset needs before Apriori considers it frequent; (2) after the frequent itemsets are found, select association rules by their confidence.
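As a toy illustration of those two quantities (hypothetical data, not part of the MovieLens example below): support counts how many transactions contain an itemset, and confidence measures how often a rule's conclusion holds when its premise does.

```python
# Toy transactions: each frozenset is the set of items one "user" chose.
transactions = [
    frozenset({1, 2, 3}),
    frozenset({1, 2}),
    frozenset({2, 3}),
    frozenset({1, 3}),
]

def support(itemset, transactions):
    # Number of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t)

def confidence(premise, conclusion, transactions):
    # Among transactions matching the premise, the fraction that
    # also contain the conclusion item.
    matching = [t for t in transactions if premise <= t]
    return sum(1 for t in matching if conclusion in t) / float(len(matching))

print(support(frozenset({1, 2}), transactions))     # 2
print(confidence(frozenset({1}), 2, transactions))  # 2/3 = 0.666...
```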
## 1. Background

We build a movie recommender from the GroupLens team's movie data.

## 2. Getting the Data

The data can be downloaded from http://grouplens.org/datasets/movielens/ ; the ml-20m release used here contains about 20 million ratings. After downloading, extract it to a folder.
#import os
#data_folder = os.path.join(os.path.expanduser("~"),"ml_20m")
#ratings_filename = os.path.join(data_folder, "u.data")
## 3. Loading the Data

ratings.csv is a CSV file with the header: userId,movieId,rating,timestamp.
import pandas as pd
all_ratings = pd.read_csv('ratings.csv')
all_ratings['timestamp'] = pd.to_datetime(all_ratings['timestamp'], unit='s')  # convert the Unix timestamps to datetimes
print all_ratings.head()  # take a look at the data
print all_ratings.describe()
print all_ratings[all_ratings['userId'] == 100].sort_values('movieId')  # inspect the ratings of user 100
# ************************** Implementing the Apriori algorithm **************************
all_ratings['Favorable'] = all_ratings['rating'] > 3  # ratings above 3 are marked as favorable
print all_ratings[10:15]
print all_ratings[all_ratings['userId'] == 100].head()  # look at the reviews of user 100
ratings = all_ratings[all_ratings['userId'].isin(range(200))]  # keep only the first 200 user IDs
favorable_ratings = ratings[ratings['Favorable']]  # build a dataset of only the favorable reviews
# We need to know which movies each user liked, so group by user ID and iterate over each user's movies.
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby('userId')['movieId'])
# Storing v.values as a frozenset (an immutable set) makes it fast to check whether a user
# has rated a given movie: set membership tests are much faster than list membership tests.
print len(favorable_reviews_by_users)
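A quick micro-benchmark (illustrative only; absolute numbers depend on the machine) shows why the frozenset matters here: membership tests on a set hash in O(1) on average, while a list is scanned in O(n).

```python
import timeit

setup = "data_list = list(range(10000)); data_set = frozenset(data_list)"
# Look up the worst-case element (the last one) a thousand times each.
list_time = timeit.timeit("9999 in data_list", setup=setup, number=1000)
set_time = timeit.timeit("9999 in data_set", setup=setup, number=1000)
print(set_time < list_time)  # True: the set lookup wins comfortably
```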
# Build a data frame showing how many fans each movie has.
num_favorable_by_movie = ratings[['movieId', 'Favorable']].groupby('movieId').sum()
# Look at the five most popular movies.
num_favorable_by_movie.sort_values('Favorable', ascending=False)[:5]
   userId  movieId  rating           timestamp
0       1        2     3.5 2005-04-02 23:53:47
1       1       29     3.5 2005-04-02 23:31:16
2       1       32     3.5 2005-04-02 23:33:39
3       1       47     3.5 2005-04-02 23:32:07
4       1       50     3.5 2005-04-02 23:29:40

             userId       movieId        rating
count  2.000026e+07  2.000026e+07  2.000026e+07
mean   6.904587e+04  9.041567e+03  3.525529e+00
std    4.003863e+04  1.978948e+04  1.051989e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    3.439500e+04  9.020000e+02  3.000000e+00
50%    6.914100e+04  2.167000e+03  3.500000e+00
75%    1.036370e+05  4.770000e+03  4.000000e+00
max    1.384930e+05  1.312620e+05  5.000000e+00

       userId  movieId  rating           timestamp
11049     100       14     3.0 1996-06-25 16:40:02
11050     100       25     4.0 1996-06-25 16:31:02
11051     100       32     3.0 1996-06-25 16:24:49
...
11100     100     1527     4.0 1997-06-09 16:40:04

    userId  movieId  rating           timestamp  Favorable
10       1      293     4.0 2005-04-02 23:31:43       True
11       1      296     4.0 2005-04-02 23:32:47       True
12       1      318     4.0 2005-04-02 23:33:18       True
13       1      337     3.5 2004-09-10 03:08:29       True
14       1      367     3.5 2005-04-02 23:53:00       True

       userId  movieId  rating           timestamp  Favorable
11049     100       14     3.0 1996-06-25 16:40:02      False
11050     100       25     4.0 1996-06-25 16:31:02       True
11051     100       32     3.0 1996-06-25 16:24:49      False
11052     100       39     3.0 1996-06-25 16:25:12      False
11053     100       50     5.0 1996-06-25 16:24:49       True

199
| movieId | Favorable |
|---|---|
| 296 | 80.0 |
| 356 | 78.0 |
| 318 | 76.0 |
| 593 | 63.0 |
| 480 | 58.0 |
# The Apriori algorithm is designed to find frequent itemsets in a dataset. The basic flow: build new
# candidate itemsets from the frequent itemsets found in the previous step, test whether the candidates
# occur often enough, and iterate.
# (1) Put each item in its own singleton itemset to form the initial frequent itemsets. Keep only the items that reach the minimum support.
# (2) Look for supersets of the current frequent itemsets to discover new frequent itemsets, and use them as new candidate itemsets.
# (3) Test how often the new candidates occur; discard any that are not frequent enough. If no new frequent itemsets were found, jump to the last step.
# (4) Store the newly discovered frequent itemsets and go back to step (2).
# (5) Return all frequent itemsets that were discovered.
# Next, a function implements steps (2) and (3): it takes the newly discovered frequent itemsets and tests how often they occur.
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over every user and the movies they reviewed favorably.
    for user, reviews in favorable_reviews_by_users.items():
        # For each previously found itemset, check whether it is a subset of the current user's
        # reviews; if so, the user has rated every movie in that itemset.
        # Gather the candidate supersets into a set first, so that each candidate is counted at
        # most once per user even when it extends several of the previous itemsets.
        supersets = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    supersets.add(itemset | frozenset((other_reviewed_movie,)))
        for current_superset in supersets:
            counts[current_superset] += 1
    # Finally, keep only the candidates that reach the minimum support and return them.
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
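To see what one growth step of the algorithm produces, here is a condensed, self-contained sketch of the candidate-generation idea on toy data (hypothetical user and movie IDs; each candidate is counted at most once per user):

```python
from collections import defaultdict

def grow_itemsets(reviews_by_users, prev_itemsets, min_support):
    counts = defaultdict(int)
    for reviews in reviews_by_users.values():
        # Gather this user's candidate supersets into a set so that no
        # candidate is counted twice for the same user.
        candidates = set()
        for itemset in prev_itemsets:
            if itemset.issubset(reviews):
                for movie in reviews - itemset:
                    candidates.add(itemset | frozenset((movie,)))
        for candidate in candidates:
            counts[candidate] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

reviews_by_users = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
}
singletons = [frozenset({10}), frozenset({20}), frozenset({30})]
print(grow_itemsets(reviews_by_users, singletons, min_support=2))
# {10, 20} and {10, 30} each occur for two users; {20, 30} occurs for only one user and is dropped.
```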
import sys
frequent_itemsets = {}  # itemsets keyed by their length
min_support = 50  # minimum support (an absolute count here); when experimenting, adjust it in small steps, e.g. 10 at a time
# Step one: build a singleton itemset for each movie and test whether it is frequent.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row['Favorable'])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
print "There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support)
sys.stdout.flush()
# The main loop of the Apriori algorithm: store each batch of newly found itemsets as the run proceeds.
# k is the length of the frequent itemsets about to be discovered; the itemsets found in the previous
# round are retrieved from the frequent_itemsets dict under the key k-1.
# Newly discovered frequent itemsets are saved in the dict keyed by their length.
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    # If no new frequent itemsets were found, stop.
    if len(cur_frequent_itemsets) == 0:
        print "Did not find any frequent itemsets of length {}".format(k)
        sys.stdout.flush()  # force buffered output to the terminal; don't overuse it, since flushing slows the run
        break
    # If frequent itemsets were found, report them.
    else:
        print "I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# Remove the itemsets of length 1; they are of no use for generating association rules.
del frequent_itemsets[1]
print "Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values()))
There are 11 movies with more than 50 favorable reviews
I found 34 frequent itemsets of length 2
I found 49 frequent itemsets of length 3
I found 36 frequent itemsets of length 4
I found 12 frequent itemsets of length 5
I found 1 frequent itemsets of length 6
Did not find any frequent itemsets of length 7
Found a total of 132 frequent itemsets
# Extracting association rules
# When the Apriori algorithm finishes we have a collection of frequent itemsets, not association rules.
# A frequent itemset is a group of items with at least the minimum support, while an association rule has a premise and a conclusion.
# To extract rules from a frequent itemset, take some of its movies as the premise and another movie as the
# conclusion, forming the rule: if a user likes every movie in the premise, they will also like the movie in the conclusion.
# Every frequent itemset can generate rules this way.
# Iterate over the frequent itemsets of every length, generating rules for each itemset.
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Take each movie in the itemset in turn as the conclusion; the remaining movies
        # form the premise. Premise and conclusion together make a candidate rule.
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
# That yields a large number of candidates; look at the first five rules.
print "There are {} candidate rules".format(len(candidate_rules))
# The frozenset holds the movie IDs of the premise; the number after it is the movie ID of the conclusion.
candidate_rules[:5]
There are 425 candidate rules
[(frozenset({47}), 50), (frozenset({50}), 47), (frozenset({318}), 480), (frozenset({480}), 318), (frozenset({356}), 480)]
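The premise/conclusion split can be seen on a single toy itemset (the same logic as the loop above, on made-up movie IDs): every member takes one turn as the conclusion, with the rest as the premise.

```python
itemset = frozenset({47, 50, 318})
rules = []
for conclusion in itemset:
    premise = itemset - frozenset((conclusion,))
    rules.append((premise, conclusion))
print(len(rules))  # 3 rules from a 3-item set
print((frozenset({47, 50}), 318) in rules)  # True
```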
# Compute the confidence of each rule.
# Keep separate counts of how often each rule holds (correct) and fails (incorrect).
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Iterate over every user and the movies they like, testing each candidate rule along the way.
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # A rule only applies to a user who likes every movie in the premise.
        if premise.issubset(reviews):
            # The rule holds if the user also likes the conclusion movie, and fails otherwise.
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Confidence = number of times the rule held / number of times the premise applied.
rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
print len(rule_confidence)
425
#min_confidence = 0.9
#rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
#print len(rule_confidence)
# Sort the confidence dictionary and print the five rules with the highest confidence.
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
print sorted_confidence[0]   # the rule with the highest confidence
print sorted_confidence[-1]  # the rule with the lowest confidence
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)
    print " - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
((frozenset([296, 593, 50, 318, 47]), 356), 0.11055276381909548)
((frozenset([296]), 47), 0.4020100502512563)
Rule #1
Rule: If a person recommends frozenset([296]) they will also recommend 527
 - Confidence: 0.402
Rule #2
Rule: If a person recommends frozenset([296]) they will also recommend 2858
 - Confidence: 0.402
Rule #3
Rule: If a person recommends frozenset([296]) they will also recommend 480
 - Confidence: 0.402
Rule #4
Rule: If a person recommends frozenset([296]) they will also recommend 50
 - Confidence: 0.402
Rule #5
Rule: If a person recommends frozenset([296]) they will also recommend 593
 - Confidence: 0.402
# Analyzing the movie metadata
# Data: movies.csv
# Header: movieId,title,genres
movie_name_data = pd.read_csv("movies.csv")
movie_name_data.head()
# A helper that returns a movie's title given its ID.
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['movieId'] == movie_id]['title']
    title = title_object.values[0]
    return title
get_movie_name(4)
'Waiting to Exhale (1995)'
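One caveat: get_movie_name raises an IndexError when the ID is missing from movies.csv, because .values[0] indexes into an empty array. A defensive variant (a sketch, using a hypothetical stand-in frame for movie_name_data) can fall back to a placeholder instead:

```python
import pandas as pd

# Hypothetical stand-in for the frame loaded from movies.csv above.
movie_name_data = pd.DataFrame({
    'movieId': [1, 4],
    'title': ['Toy Story (1995)', 'Waiting to Exhale (1995)'],
})

def get_movie_name_safe(movie_id, default='<unknown movie>'):
    # .values is empty when no row matches, so guard before indexing.
    titles = movie_name_data.loc[movie_name_data['movieId'] == movie_id, 'title'].values
    return titles[0] if len(titles) else default

print(get_movie_name_safe(4))    # Waiting to Exhale (1995)
print(get_movie_name_safe(999))  # <unknown movie>
```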
# Show movie titles in the printed rules.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print " - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
 - Confidence: 0.402
Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
 - Confidence: 0.402
Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
 - Confidence: 0.402
Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
 - Confidence: 0.402
Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
 - Confidence: 0.402
# Evaluation
# A simple look at how each rule performs on held-out data.
# All users who were not used for training form the test set.
test_dataset = all_ratings[~all_ratings['userId'].isin(range(200))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('userId')['movieId'])
test_dataset.head()
| | userId | movieId | rating | timestamp | Favorable |
|---|---|---|---|---|---|
| 25048 | 200 | 6 | 5.0 | 1996-08-11 12:59:30 | True |
| 25049 | 200 | 10 | 3.0 | 1996-08-11 12:53:11 | False |
| 25050 | 200 | 17 | 4.0 | 1996-08-11 12:57:25 | True |
| 25051 | 200 | 19 | 2.0 | 1996-08-11 12:54:08 | False |
| 25052 | 200 | 20 | 4.0 | 1996-08-11 13:05:27 | True |
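The ~isin trick used for the split can be checked on a tiny frame (hypothetical data): isin(range(200)) selects the training users, and ~ flips the mask to select everyone else.

```python
import pandas as pd

ratings = pd.DataFrame({'userId': [1, 150, 250, 300],
                        'movieId': [10, 20, 30, 40]})
train = ratings[ratings['userId'].isin(range(200))]   # userId < 200
test = ratings[~ratings['userId'].isin(range(200))]   # everyone else
print(list(train['userId']))  # [1, 150]
print(list(test['userId']))   # [250, 300]
```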
# Count on the test data how often each rule applies and holds.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
print len(correct_counts)
425
# Compute each rule's confidence on the test data.
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}
print len(test_confidence)
sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print sorted_test_confidence[:5]
425
[((frozenset([296]), 2858), 0.4020100502512563), ((frozenset([296]), 480), 0.4020100502512563), ((frozenset([296]), 50), 0.4020100502512563), ((frozenset([296]), 593), 0.4020100502512563), ((frozenset([296]), 47), 0.4020100502512563)]
# Print the best rules, by movie title, with both train and test confidence.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print " - Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1))
    print " - Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1))
    print ""
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402
Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Train Confidence: 0.402
- Test Confidence: 0.402
Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402
Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Train Confidence: 0.402
- Test Confidence: 0.402
Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Train Confidence: 0.402
- Test Confidence: 0.402