Predicting user-item ratings based on data from friends

Companies often analyze the ratings that users give to particular products or items, so they can recommend more relevant ones and in turn increase their revenue. For this reason, recommender systems have become popular on many sites, including Amazon, Netflix, Yelp and others.

Knowing people's preferences allows companies to develop better products, serve customers better and offer more enjoyable experiences. Gaining additional insight into what people value enables companies to invest strategically in areas with higher growth potential. Additionally, this knowledge gives them the ability to offer personalized service that is adjusted to the needs of each individual.

Each new thing learned about a customer is an opportunity to approach them in a more targeted way. If they have shown interest in one item, they might be interested in similar items as well. The product line might be so large that they are not willing to spend the time exploring all options by themselves. An intelligent algorithm could help by focusing their attention on specific items, based on their past actions and, possibly, on the preferences that are characteristic of their social circle.

But sometimes the ratings are missing or incomplete. For instance, we might know how other people have rated a certain item, but not how a particular person would rate it. In some cases we might be able to predict that rating from the context, and this is what this post will focus on.

Suppose that we have the following two matrices: one with the ratings that a group of individuals gave to a set of items, and one with the friendships among those individuals.

Ratings
4 3 2 2   3 2 4 1
5 2 1 5   1 3 2 3
5 2 1 5   3 1 3 4
3 5 3 3   4 2 5 1
1 3 4 3   5 2 2 4
3 1 1 n/a 3 2 4 1
3 2 1 2   4 1 1 2
4 3 2 2   3 5 1 1
4 4 2 3   1 1 5 3
Friendships
0 0 1 0 0 1 0 0 1
0 0 0 0 1 0 1 1 0
1 0 0 1 0 1 0 0 1
0 0 1 0 1 0 0 1 0
0 1 0 1 0 1 0 0 1
1 0 1 0 1 0 0 0 0
0 1 0 0 0 0 0 1 0
0 1 0 1 0 0 1 0 0
1 0 1 0 1 0 0 0 0

The first matrix contains values from 1 to 5, indicating the number of stars that a particular person gave to an item. The rows correspond to users and the columns correspond to the items that were rated. In this case we have the ratings of 9 individuals who rated 8 different items. We notice that the 6th person hasn't rated the 4th item yet, which is indicated with "n/a". We would like to find a reasonable rating for this item given the preferences of the other people and the friendships among them. The friendship matrix contains only zeros and ones, indicating whether one person is friends with another. The values on the diagonal of this matrix are zeros, since we assume that people can't be friends with themselves.
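The approach taken below is a simple form of user-based collaborative filtering: to estimate the missing rating, we start from the person's own average rating and add a correction based on how their most similar friends rated that item, where the similarity between two people is the cosine similarity of their rating vectors, computed over the items they have both rated. As a minimal sketch of that formula, using made-up similarities and ratings rather than the data above:

# Illustration only: hypothetical numbers, not taken from the matrices above.
user_avg = 2.0  # average rating of the person with the missing value
neighbours = [
    # (cosine similarity to that person, friend's rating of the item, friend's average rating)
    (0.95, 4, 3.5),
    (0.90, 2, 2.5),
]

numerator = sum(sim * (rating - avg) for sim, rating, avg in neighbours)
denominator = sum(sim for sim, _, _ in neighbours)

predicted_rating = user_avg + numerator / denominator
print(round(predicted_rating, 2))  # 2.01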

Here is a possible way to approach the problem in code:

def avg_ratings(user_name):
    # Average of a user's known ratings, ignoring missing values.
    ratings = [r for r in user_item_ratings[user_name] if r is not None]
    return sum(ratings) / len(ratings)

def dot_product(v1, v2):
    return sum(v1k * v2k for v1k, v2k in zip(v1, v2))

def root_squared_sum_ratings(ratings):
    # Euclidean norm of a ratings vector.
    return sum(r ** 2 for r in ratings if r is not None) ** 0.5

def find_friends_of(user):
    user_vector = friendships[user]
    return [people[i] for i, is_friend in enumerate(user_vector) if is_friend]

def cosine_similarity(user1, user2):
    if user1 == user2:
        return 1.0
    # Compare only those items that both users have rated.
    ratings1, ratings2 = [], []
    for rating1, rating2 in zip(user_item_ratings[user1], user_item_ratings[user2]):
        if None not in (rating1, rating2):
            ratings1.append(rating1)
            ratings2.append(rating2)
    dot_vectors = dot_product(ratings1, ratings2)
    root1 = root_squared_sum_ratings(ratings1)
    root2 = root_squared_sum_ratings(ratings2)
    return dot_vectors / (root1 * root2)

def find_most_similar_users(to_user, how_many):
    other_users = [person for person in people if person != to_user]
    similarities = {other_user: cosine_similarity(to_user, other_user)
                    for other_user in other_users}
    most_similar_first = sorted(similarities.items(), reverse=True, key=lambda pair: pair[1])
    return [user for user, _ in most_similar_first[:how_many]]

def find_missing_ratings(user_item_ratings):
    # Collect a (user, item) pair for every rating that is still missing.
    missing = []
    for user, ratings in user_item_ratings.items():
        for idx, rating in enumerate(ratings):
            if rating is None:
                missing.append((user, user_item_indices[idx]))
    return missing

def predict_missing_ratings(user_item_ratings):
    missing_rating_estimates = {}
    for user, item in find_missing_ratings(user_item_ratings):
        item_index = user_item_indices.index(item)
        # Only friends who are also among the most similar users contribute to the estimate.
        user_friends = find_friends_of(user)
        most_similar_users = find_most_similar_users(user, 3)
        most_similar_users = set(most_similar_users) & set(user_friends)
        numerator_terms, denominator_terms = [], []
        for similar_user in most_similar_users:
            similarity = cosine_similarity(user, similar_user)
            similar_user_rating = user_item_ratings[similar_user][item_index]
            rating_relative_to_avg = similar_user_rating - avg_ratings(similar_user)
            numerator_terms.append(similarity * rating_relative_to_avg)
            denominator_terms.append(similarity)
        # The user's own average, corrected by the similarity-weighted deviations
        # of the selected friends from their respective averages.
        missing_rating_estimates[(user, item)] = (
            avg_ratings(user) + sum(numerator_terms) / sum(denominator_terms))
    return missing_rating_estimates

people = ['Jocelyn', 'Garnet', 'Ashanti', 'Donna', 'Wei', 'Cleo', 'Velva', 'Ignacio', 'Boyce']

user_item_indices = ['Item 1', 'Item 2', 'Item 3', 'Item 4',
                     'Item 5', 'Item 6', 'Item 7', 'Item 8']

user_item_ratings = {
    'Jocelyn': [4, 3, 2, 2, 3, 2, 4, 1],
    'Garnet':  [5, 2, 1, 5, 1, 3, 2, 3],
    'Ashanti': [5, 2, 1, 5, 3, 1, 3, 4],
    'Donna':   [3, 5, 3, 3, 4, 2, 5, 1],
    'Wei':     [1, 3, 4, 3, 5, 2, 2, 4],
    'Cleo':    [3, 1, 1, None, 3, 2, 4, 1],
    'Velva':   [3, 2, 1, 2, 4, 1, 1, 2],
    'Ignacio': [4, 3, 2, 2, 3, 5, 1, 1],
    'Boyce':   [4, 4, 2, 3, 1, 1, 5, 3]
}

friendships = {
    'Jocelyn': [0, 0, 1, 0, 0, 1, 0, 0, 1],
    'Garnet':  [0, 0, 0, 0, 1, 0, 1, 1, 0],
    'Ashanti': [1, 0, 0, 1, 0, 1, 0, 0, 1],
    'Donna':   [0, 0, 1, 0, 1, 0, 0, 1, 0],
    'Wei':     [0, 1, 0, 1, 0, 1, 0, 0, 1],
    'Cleo':    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    'Velva':   [0, 1, 0, 0, 0, 0, 0, 1, 0],
    'Ignacio': [0, 1, 0, 1, 0, 0, 1, 0, 0],
    'Boyce':   [1, 0, 1, 0, 1, 0, 0, 0, 0]
}

print(predict_missing_ratings(user_item_ratings))
# {('Cleo', 'Item 4'): 2.7700836128372934}

Ideally, we would only have to define the two matrices and run a single function to find predictions for all missing values. Here I have used dictionaries mainly for illustrative purposes and to make it clearer which vector belongs to which person; in the real world such labeled data is unlikely to be available, or would be costly to provide for large matrices. In this code you can see how we could combine different types of information to obtain more accurate results. Knowing what the rating of this item could have been allows us to decide whether it is worth promoting this item to this person (in this case it is not). As you can see, the code is quite simple and has no external dependencies, but if we needed more flexibility (switching from lists to NumPy arrays, for instance), this would still be possible. Great code is documentation in itself, which is why an attempt has been made to make all variable and function names as descriptive as possible.
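For instance, the cosine-similarity step could look roughly like this with NumPy. This is a hypothetical variant of the function above, not part of the code itself, and the rest of the computation could be vectorised in a similar fashion:

import numpy as np

def cosine_similarity_np(ratings1, ratings2):
    # Keep only the positions where both users have a rating.
    pairs = [(r1, r2) for r1, r2 in zip(ratings1, ratings2)
             if r1 is not None and r2 is not None]
    v1 = np.array([p[0] for p in pairs], dtype=float)
    v2 = np.array([p[1] for p in pairs], dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Cleo's and Ashanti's ratings from the matrices above:
print(cosine_similarity_np([3, 1, 1, None, 3, 2, 4, 1],
                           [5, 2, 1, 5, 3, 1, 3, 4]))  # ≈ 0.87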