Optimizing an algorithm in Python for building a list of items rated together

Given a list of purchase events (customer_id, item)

1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws

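I'm trying to build a data structure that tells me how many times an item was bought together with another item. Not bought at the same time, but bought at any point since I started saving data. The result would look like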

{
       hammer : {screwdriver : 1, nails : 2}, 
  screwdriver : {hammer : 1, screws : 1, nails : 1}, 
       screws : {screwdriver : 1, nails : 1}, 
        nails : {hammer : 2, screws : 1, screwdriver : 1}
}

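meaning that a hammer was bought with nails twice (persons 1 and 2) and with a screwdriver once (person 1), screws were bought with a screwdriver once (person 3), and so on...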

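My current approach is

users = dict where userid is the key and the list of purchased items is the value
usersForItem = dict where itemid is the key and the list of users who bought the item is the value
userlist = temporary list of users who have rated the current item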

pseudo:
for each event(customer,item)(sorted by item):
  add user to users dict if not exists, and add the items
  add item to items dict if not exists, and add the user
----------

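# rows: (item, user) pairs, sorted by item
users = {}          # user -> list of items bought
usersForItem = {}   # item -> list of users who bought it
items = []          # distinct items, in the order encountered
userlist = []       # users who bought the item currently being processed
last_item = None

for item,user in rows:

  # add the user to the users dict if they don't already exist.
  users[user]=users.get(user,[])

  # append the current item_id to the list of items rated by the current user
  users[user].append(item)

  if item != last_item:
    # we just started a new item which means we just finished processing an item
    # write the userlist for the last item to the usersForItem dictionary.
    if last_item is not None:
      usersForItem[last_item]=userlist

    userlist=[user]

    last_item = item
    items.append(item)
  else:
    userlist.append(user)

# write out the userlist for the final item
usersForItem[last_item]=userlist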
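
For comparison, the same two dictionaries can probably be built in a single pass without sorting the events by item first. The following is only a minimal sketch: `build_indexes` is a hypothetical helper name, and it assumes the events arrive as (customer_id, item) tuples as in the list at the top.

```python
from collections import defaultdict

def build_indexes(events):
    """Build both lookup tables in one pass; events may be in any order."""
    users = defaultdict(list)           # customer_id -> items bought
    users_for_item = defaultdict(list)  # item -> customers who bought it
    for customer, item in events:
        users[customer].append(item)
        users_for_item[item].append(customer)
    return users, users_for_item

# The sample purchase events from the question, as (customer_id, item) tuples.
events = [(1, "hammer"), (1, "screwdriver"), (1, "nails"),
          (2, "hammer"), (2, "nails"),
          (3, "screws"), (3, "screwdriver"),
          (4, "nails"), (4, "screws")]

users, users_for_item = build_indexes(events)
print(users_for_item["nails"])  # customers who bought nails
```

Because `defaultdict` creates the empty list on first access, the `last_item` bookkeeping is no longer needed.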

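So, at this point, I have two dicts: who bought what, and what was bought by whom. Here's where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I acknowledge that this is not the most Pythonic way of doing things. I'm trying to make sure I get the correct result (which I do) before getting fancy with the Python.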

relatedItems = {}
for key,listOfUsers in usersForItem.items():
  relatedItems[key]={}
  related=[]

  # look at every other item bought by each user who bought this item
  for ux in listOfUsers:
    for itemRead in users[ux]:
      if itemRead != key:
        if itemRead not in related:
          related.append(itemRead)
        relatedItems[key][itemRead]= relatedItems[key].get(itemRead,0) + 1

  # calc jaccard/tanimoto similarity between relatedItems[key] and its values
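
The similarity step is left as pseudocode above. One plausible reading, sketched here under assumptions, is to compare two items by the sets of customers who bought them: Jaccard similarity is the size of the intersection divided by the size of the union. The `jaccard` helper and the hard-coded `users_for_item` sample are illustrative only, not the code actually in use.

```python
def jaccard(buyers_a, buyers_b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(buyers_a), set(buyers_b)
    union = a | b
    # Two empty sets are treated as having similarity 0.0 (an assumption).
    return len(a & b) / len(union) if union else 0.0

# item -> customers who bought it, taken from the sample events
users_for_item = {
    "hammer": [1, 2],
    "screwdriver": [1, 3],
    "nails": [1, 2, 4],
    "screws": [3, 4],
}

# hammer and nails share customers 1 and 2 out of {1, 2, 4}
print(jaccard(users_for_item["hammer"], users_for_item["nails"]))
```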

Is there a more efficient way to do this? Also, if there is a proper academic name for this type of operation, I'd love to hear it.

Edit: clarified to include the fact that I'm not restricting this to items bought at the same time. Items can be purchased at any time.

Best answer

events = """\
1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws""".splitlines()
# split each line into [key, item], strip stray whitespace, and sort by key
# so that groupby sees each key's events as one contiguous run
events = sorted(list(map(str.strip, e.split('-'))) for e in events)

from collections import defaultdict
from itertools import groupby

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            # count the pair in both directions so either item can be looked up
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results
for k,v in sorted(pairmap.items()):
    print(k, ':', v)

Gives (Python 3.7+, where dicts keep insertion order):

hammer : {'nails': 2, 'screwdriver': 1}
nails : {'hammer': 2, 'screwdriver': 1, 'screws': 1}
screwdriver : {'hammer': 1, 'nails': 1, 'screws': 1}
screws : {'screwdriver': 1, 'nails': 1}

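(This addresses your initial request of grouping items by purchase event. To group by user, just change the first key of the events list from event number to user ID.)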
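
As a possible variation on the tally above, the pair counting can also be sketched with `itertools.combinations` and `collections.Counter`, grouping by customer through a dict so that no sorting or `groupby` is needed. This is only a sketch: `co_purchase_counts` is a hypothetical name, and using a set per customer assumes that repeat purchases of the same item by one customer should count once.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_purchase_counts(events):
    """Return {item: {other_item: number of customers who bought both}}."""
    # Collect each customer's distinct items; input order does not matter.
    baskets = defaultdict(set)
    for customer, item in events:
        baskets[customer].add(item)

    # Count every unordered pair of items once per customer.
    pair_counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            pair_counts[a, b] += 1

    # Mirror the counts so that either item of a pair can be looked up.
    related = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        related[a][b] = n
        related[b][a] = n
    return related

events = [(1, "hammer"), (1, "screwdriver"), (1, "nails"),
          (2, "hammer"), (2, "nails"),
          (3, "screws"), (3, "screwdriver"),
          (4, "nails"), (4, "screws")]

related = co_purchase_counts(events)
for item in sorted(related):
    print(item, ':', related[item])
```

The work is still quadratic in the number of distinct items per customer, so the gain over the loop above is mostly in readability rather than asymptotic speed.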
