Intro to Recommender Systems: Collaborative Filtering

This post builds a collaborative-filtering recommender step by step on the MovieLens ml-100k dataset: loading the ratings, computing user and item similarities, predicting ratings, and displaying recommended movie posters.

Prerequisites for this walkthrough:

  1. The MovieLens ml-100k dataset
  2. Jupyter notebook
  3. A themoviedb.org API key

 This walkthrough is translated from: http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/

 

  1. Import the Python dependencies
    import numpy as np
    import pandas as pd
  2. Change into the directory holding the MovieLens ml-100k data
    cd F:\Master\MachineLearning\kNN\ml-100k
  3. Read the data: each row of u.data has four fields: user id, item id, rating, and timestamp
    names = ['user_id', 'item_id', 'rating', 'timestamp']
    df = pd.read_csv('u.data', sep='\t', names=names)
    df.head()

     

       user_id  item_id  rating  timestamp
    0      196      242       3  881250949
    1      186      302       3  891717742
    2       22      377       1  878887116
    3      244       51       2  880606923
    4      166      346       1  886397596
  4. Count the total number of users and movies in the file
    n_users = df.user_id.unique().shape[0]
    n_items = df.item_id.unique().shape[0]
    print(str(n_users) + ' users')
    print(str(n_items) + ' items')
    943 users
    1682 items
  5. Build the user-item rating matrix
    ratings = np.zeros((n_users, n_items))
    for row in df.itertuples():
        ratings[row[1]-1, row[2]-1] = row[3]
    ratings
    array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
           [ 4.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ...,
           [ 5.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  5.,  0., ...,  0.,  0.,  0.]])
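The same matrix can also be filled without the Python loop, using NumPy fancy indexing on the id columns. A minimal sketch on a made-up frame (it assumes, as in ml-100k, that the ids are 1-based):

```python
import numpy as np
import pandas as pd

# toy frame with the same columns as u.data
df = pd.DataFrame({'user_id': [1, 1, 2, 3],
                   'item_id': [1, 3, 2, 1],
                   'rating':  [5, 3, 4, 2]})

n_users = df.user_id.max()
n_items = df.item_id.max()

ratings = np.zeros((n_users, n_items))
# fancy indexing assigns every (user, item) -> rating triple at once
ratings[df.user_id.values - 1, df.item_id.values - 1] = df.rating.values
print(ratings)
```

For the 100k-row file this avoids 100,000 iterations of the interpreter loop.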
  6. Compute the sparsity of the data
    sparsity = float(len(ratings.nonzero()[0]))
    sparsity /= (ratings.shape[0] * ratings.shape[1])
    sparsity *= 100
    print('Sparsity: {:4.2f}%'.format(sparsity))

    Sparsity: 6.30%
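As an aside, the nonzero count can be taken in a single call with np.count_nonzero; a sketch on a made-up matrix:

```python
import numpy as np

# toy matrix: 3 of the 6 entries are rated
ratings = np.array([[5., 0., 3.],
                    [0., 0., 4.]])

sparsity = np.count_nonzero(ratings) / ratings.size * 100
print('Sparsity: {:4.2f}%'.format(sparsity))  # Sparsity: 50.00%
```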

  7. With 6.3% sparsity, 943 users, and 1682 items, each user has rated roughly 100 movies on average. Randomly withhold 10 ratings per user (about 10% of the data) to split it into a training set and a test set
    def train_test_split(ratings):
        test = np.zeros(ratings.shape)
        train = ratings.copy()
        for user in range(ratings.shape[0]):
            test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                            size=10, replace=False)
            train[user, test_ratings] = 0.
            test[user, test_ratings] = ratings[user, test_ratings]
        # Test and training are truly disjoint
        assert(np.all((train * test) == 0))
        return train, test
    train, test = train_test_split(ratings)
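The split is random, so MSE numbers will vary between runs; seeding NumPy's RNG makes them reproducible. A small self-contained sketch on a made-up, fully-rated matrix (the seed value is arbitrary):

```python
import numpy as np

np.random.seed(0)  # fix the RNG so the split is reproducible

# toy matrix: 3 users x 12 items, every entry rated (values 1..5)
ratings = (np.arange(1, 37, dtype=float).reshape(3, 12) % 5) + 1

def train_test_split(ratings, size=10):
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        # hold out `size` of this user's rated items
        test_idx = np.random.choice(ratings[user, :].nonzero()[0],
                                    size=size, replace=False)
        train[user, test_idx] = 0.
        test[user, test_idx] = ratings[user, test_idx]
    assert np.all(train * test == 0)  # disjoint by construction
    return train, test

train, test = train_test_split(ratings)
print((test != 0).sum(axis=1))  # 10 held-out ratings per user
```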

     

  8. The cosine similarity between users or between items could be computed with explicit Python for loops, but that runs very slowly; expressing the formula with NumPy's vectorized operations is far faster
    def slow_similarity(ratings, kind='user'):
        if kind == 'user':
            axmax = 0
            axmin = 1
        elif kind == 'item':
            axmax = 1
            axmin = 0
        sim = np.zeros((ratings.shape[axmax], ratings.shape[axmax]))
        for u in range(ratings.shape[axmax]):
            for uprime in range(ratings.shape[axmax]):
                rui_sqrd = 0.
                ruprimei_sqrd = 0.
                for i in range(ratings.shape[axmin]):
                    sim[u, uprime] += ratings[u, i] * ratings[uprime, i]
                    rui_sqrd += ratings[u, i] ** 2
                    ruprimei_sqrd += ratings[uprime, i] ** 2
                sim[u, uprime] /= np.sqrt(rui_sqrd * ruprimei_sqrd)
        return sim

    def fast_similarity(ratings, kind='user', epsilon=1e-9):
        # epsilon -> small number for handling divide-by-zero errors
        if kind == 'user':
            sim = ratings.dot(ratings.T) + epsilon
        elif kind == 'item':
            sim = ratings.T.dot(ratings) + epsilon
        norms = np.array([np.sqrt(np.diagonal(sim))])
        return (sim / norms / norms.T)
    %timeit fast_similarity(train, kind='user')
    1 loop, best of 3: 171 ms per loop
  9. Compute the user and item similarity matrices, and print the top-left 4x4 corner of the item similarity matrix

    user_similarity = fast_similarity(train, kind='user')
    item_similarity = fast_similarity(train, kind='item')
    print(item_similarity[:4, :4])
    [[ 1.          0.42176871  0.3440934   0.4551558 ]
     [ 0.42176871  1.          0.2889324   0.48827863]
     [ 0.3440934   0.2889324   1.          0.33718518]
     [ 0.4551558   0.48827863  0.33718518  1.        ]]
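As a sanity check, the vectorized fast_similarity should agree with a direct cosine computation. A sketch on a made-up matrix, with the function repeated so the snippet is self-contained (the epsilon term makes the match approximate rather than exact):

```python
import numpy as np

def fast_similarity(ratings, kind='user', epsilon=1e-9):
    # epsilon guards against division by zero for all-zero rows
    if kind == 'user':
        sim = ratings.dot(ratings.T) + epsilon
    elif kind == 'item':
        sim = ratings.T.dot(ratings) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return sim / norms / norms.T

# made-up 3x3 rating matrix
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [0., 2., 2.]])

# direct cosine: dot products divided by the outer product of row norms
row_norms = np.linalg.norm(R, axis=1)
expected = R.dot(R.T) / np.outer(row_norms, row_norms)

print(np.allclose(fast_similarity(R, kind='user'), expected, atol=1e-6))  # True
```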
  10. Predict ratings; predict_fast_simple expresses the computation with NumPy's vectorized math and runs much faster

    def predict_slow_simple(ratings, similarity, kind='user'):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            for i in range(ratings.shape[0]):
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[i, :].dot(ratings[:, j]) \
                                 / np.sum(np.abs(similarity[i, :]))
            return pred
        elif kind == 'item':
            for i in range(ratings.shape[0]):
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[j, :].dot(ratings[i, :].T) \
                                 / np.sum(np.abs(similarity[j, :]))
            return pred

    def predict_fast_simple(ratings, similarity, kind='user'):
        if kind == 'user':
            return similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        elif kind == 'item':
            return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    %timeit predict_slow_simple(train, user_similarity, kind='user')
    1 loop, best of 3: 1min 52s per loop
    %timeit predict_fast_simple(train, user_similarity, kind='user')
    1 loop, best of 3: 279 ms per loop
  11. Compute the MSE with sklearn: first mask out the unrated (zero) entries, then call sklearn's mean_squared_error directly

    from sklearn.metrics import mean_squared_error

    def get_mse(pred, actual):
        # Ignore zero (unrated) terms.
        pred = pred[actual.nonzero()].flatten()
        actual = actual[actual.nonzero()].flatten()
        return mean_squared_error(pred, actual)
    item_prediction = predict_fast_simple(train, item_similarity, kind='item')
    user_prediction = predict_fast_simple(train, user_similarity, kind='user')
    print('User-based CF MSE: ' + str(get_mse(user_prediction, test)))
    print('Item-based CF MSE: ' + str(get_mse(item_prediction, test)))
    User-based CF MSE: 8.44170489251
    Item-based CF MSE: 11.5717812485
  12. To improve the MSE, use only the k users most similar to the target user: make top-k predictions and compute the MSE

    def predict_topk(ratings, similarity, kind='user', k=40):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            for i in range(ratings.shape[0]):
                top_k_users = [np.argsort(similarity[:, i])[:-k-1:-1]]
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users])
                    pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
        if kind == 'item':
            for j in range(ratings.shape[1]):
                top_k_items = [np.argsort(similarity[:, j])[:-k-1:-1]]
                for i in range(ratings.shape[0]):
                    pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T)
                    pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))
        return pred
    pred = predict_topk(train, user_similarity, kind='user', k=40)
    print('Top-k User-based CF MSE: ' + str(get_mse(pred, test)))

    pred = predict_topk(train, item_similarity, kind='item', k=40)
    print('Top-k Item-based CF MSE: ' + str(get_mse(pred, test)))

     

    The results:

    Top-k User-based CF MSE: 6.47059807493
    Top-k Item-based CF MSE: 7.75559095568

    Compared with the simple approach, the MSE has dropped considerably.

  13. To lower the MSE further, try a range of k values in search of the minimum, and visualize the results with matplotlib
    k_array = [5, 15, 30, 50, 100, 200]
    user_train_mse = []
    user_test_mse = []
    item_test_mse = []
    item_train_mse = []

    for k in k_array:
        user_pred = predict_topk(train, user_similarity, kind='user', k=k)
        item_pred = predict_topk(train, item_similarity, kind='item', k=k)
        user_train_mse += [get_mse(user_pred, train)]
        user_test_mse += [get_mse(user_pred, test)]
        item_train_mse += [get_mse(item_pred, train)]
        item_test_mse += [get_mse(item_pred, test)]
    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    pal = sns.color_palette("Set2", 2)
    plt.figure(figsize=(8, 8))
    plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
    plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
    plt.legend(loc='best', fontsize=20)
    plt.xticks(fontsize=16);
    plt.yticks(fontsize=16);
    plt.xlabel('k', fontsize=30);
    plt.ylabel('MSE', fontsize=30);

     

     
    From the plot, the test error reaches its minimum around k = 50 for user-based and k = 15 for item-based collaborative filtering.

     

  14. Compute the MSE with the rating bias subtracted
    def predict_nobias(ratings, similarity, kind='user'):
        if kind == 'user':
            user_bias = ratings.mean(axis=1)
            ratings = (ratings - user_bias[:, np.newaxis]).copy()
            pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
            pred += user_bias[:, np.newaxis]
        elif kind == 'item':
            item_bias = ratings.mean(axis=0)
            ratings = (ratings - item_bias[np.newaxis, :]).copy()
            pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
            pred += item_bias[np.newaxis, :]
        return pred

     

    user_pred = predict_nobias(train, user_similarity, kind='user')
    print('Bias-subtracted User-based CF MSE: ' + str(get_mse(user_pred, test)))

    item_pred = predict_nobias(train, item_similarity, kind='item')
    print('Bias-subtracted Item-based CF MSE: ' + str(get_mse(item_pred, test)))
    Bias-subtracted User-based CF MSE: 8.67647634245
    Bias-subtracted Item-based CF MSE: 9.71148412222



  15. Combine top-k with bias subtraction; compute the user- and item-based MSE for k = 5, 15, 30, 50, 100, 200 and visualize the results with matplotlib
    def predict_topk_nobias(ratings, similarity, kind='user', k=40):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            user_bias = ratings.mean(axis=1)
            ratings = (ratings - user_bias[:, np.newaxis]).copy()
            for i in range(ratings.shape[0]):
                top_k_users = [np.argsort(similarity[:, i])[:-k-1:-1]]
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users])
                    pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
            pred += user_bias[:, np.newaxis]
        if kind == 'item':
            item_bias = ratings.mean(axis=0)
            ratings = (ratings - item_bias[np.newaxis, :]).copy()
            for j in range(ratings.shape[1]):
                top_k_items = [np.argsort(similarity[:, j])[:-k-1:-1]]
                for i in range(ratings.shape[0]):
                    pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T)
                    pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))
            pred += item_bias[np.newaxis, :]
        return pred
    k_array = [5, 15, 30, 50, 100, 200]
    user_train_mse = []
    user_test_mse = []
    item_test_mse = []
    item_train_mse = []

    for k in k_array:
        user_pred = predict_topk_nobias(train, user_similarity, kind='user', k=k)
        item_pred = predict_topk_nobias(train, item_similarity, kind='item', k=k)
        user_train_mse += [get_mse(user_pred, train)]
        user_test_mse += [get_mse(user_pred, test)]
        item_train_mse += [get_mse(item_pred, train)]
        item_test_mse += [get_mse(item_pred, test)]
    pal = sns.color_palette("Set2", 2)
    plt.figure(figsize=(8, 8))
    plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
    plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
    plt.legend(loc='best', fontsize=20)
    plt.xticks(fontsize=16);
    plt.yticks(fontsize=16);
    plt.xlabel('k', fontsize=30);
    plt.ylabel('MSE', fontsize=30);



  16. Import requests and fetch the IMDB search URL; the movie ID can be read off the redirected URL
    import requests
    import json

    response = requests.get('http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)')
    print(response.url.split('/')[-2])
    Output (the IMDB movie ID): tt0114709
  17. Using the themoviedb.org API, look up the poster file path for a given movie id
    # Get base url filepath structure. w185 corresponds to size of movie poster.
    headers = {'Accept': 'application/json'}
    payload = {'api_key': 'your API key here'}
    response = requests.get("http://api.themoviedb.org/3/configuration", params=payload, headers=headers)
    response = json.loads(response.text)
    base_url = response['images']['base_url'] + 'w185'

    def get_poster(imdb_url, base_url):
        # Get IMDB movie ID
        response = requests.get(imdb_url)
        movie_id = response.url.split('/')[-2]
        # Query themoviedb.org API for movie poster path.
        movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
        headers = {'Accept': 'application/json'}
        payload = {'api_key': 'your API key here'}
        response = requests.get(movie_url, params=payload, headers=headers)
        try:
            file_path = json.loads(response.text)['posters'][0]['file_path']
        except:
            # IMDB movie ID is sometimes no good. Need to get correct one.
            movie_title = imdb_url.split('?')[-1].split('(')[0]
            payload['query'] = movie_title
            response = requests.get('http://api.themoviedb.org/3/search/movie', params=payload, headers=headers)
            movie_id = json.loads(response.text)['results'][0]['id']
            payload.pop('query', None)
            movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
            response = requests.get(movie_url, params=payload, headers=headers)
            file_path = json.loads(response.text)['posters'][0]['file_path']
        return base_url + file_path
    from IPython.display import Image
    from IPython.display import display

    toy_story = 'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)'
    Image(url=get_poster(toy_story, base_url))

     

    The movie's poster image is displayed inline.

     

  18. Load the movie information from MovieLens's u.item file; given a movie, find the k most similar movies and display their posters

    # Load in movie data
    idx_to_movie = {}
    with open('u.item', 'r') as f:
        for line in f.readlines():
            info = line.split('|')
            idx_to_movie[int(info[0])-1] = info[4]

    def top_k_movies(similarity, mapper, movie_idx, k=6):
        return [mapper[x] for x in np.argsort(similarity[movie_idx, :])[:-k-1:-1]]
    idx = 0 # Toy Story
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)

     

    display(*posters)


  19. Display the posters of the k (default 6) movies most similar to movie id 1 (GoldenEye)
    idx = 1 # GoldenEye
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

     

  20. Display the posters of the k (default 6) movies most similar to movie id 20 (Muppet Treasure Island)
    idx = 20 # Muppet Treasure Island
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

     


  22. Display the posters of the k (default 6) movies most similar to movie id 40 (Billy Madison)
    idx = 40 # Billy Madison
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
  23. The recommendations are not always convincing: the movie most similar to Star Wars is Toy Story? Hugely popular movies like Star Wars earn high predicted ratings across the board, so try a different similarity measure, the Pearson correlation, to remove some of this bias
    from sklearn.metrics import pairwise_distances
    # Convert from distance to similarity
    item_correlation = 1 - pairwise_distances(train.T, metric='correlation')
    item_correlation[np.isnan(item_correlation)] = 0.
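Why correlation helps: it mean-centers each vector before taking the cosine, so an item that is uniformly rated high no longer looks similar to everything. A toy check with made-up vectors (np.corrcoef is used here as an equivalent of the `correlation` metric above):

```python
import numpy as np

a = np.array([1., 2., 3., 4.])
b = a + 10.                      # same rating pattern, shifted up by a constant

# plain cosine: the constant offset drags the similarity below 1
cos = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson correlation: the cosine of the mean-centered vectors
corr = np.corrcoef(a, b)[0, 1]

print(cos < 1.0)        # True: cosine is sensitive to the offset
print(round(corr, 6))   # ~1: correlation ignores the offset entirely
```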

     

  24. Recompute the k most similar movies for movie ids 0, 1, 20, and 40 using the correlation matrix
    idx = 0 # Toy Story
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
    idx = 1 # GoldenEye
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
    idx = 20 # Muppet Treasure Island
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
    idx = 40 # Billy Madison
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

     

 

For reference, the cosine similarity between two users u and u' used throughout is

sim(u, u') = \cos(\theta) = \frac{r_u \cdot r_{u'}}{\|r_u\| \, \|r_{u'}\|} = \frac{\sum_i r_{ui} r_{u'i}}{\sqrt{\sum_i r_{ui}^2} \sqrt{\sum_i r_{u'i}^2}}
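Spelled out numerically on two made-up rating vectors, the sum form and the dot-product form of the cosine similarity agree:

```python
import numpy as np

# two made-up rating vectors
ru = np.array([5., 3., 0., 1.])
rv = np.array([4., 0., 0., 1.])

# numerator: sum_i r_ui * r_vi ; denominator: product of the two vector norms
num = (ru * rv).sum()
den = np.sqrt((ru ** 2).sum()) * np.sqrt((rv ** 2).sum())
sim = num / den

# agrees with the dot-product form of the same formula
assert np.isclose(sim, ru.dot(rv) / (np.linalg.norm(ru) * np.linalg.norm(rv)))
print(sim)
```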
