Have you ever wondering how Netflix know what movies you probably like? Or if you watch Stranger Things that Netflix is likely to suggest to you is the ET: The Extra-Terrestrial. Basic on the customer viewing behavior data, we can predict users discover the exact content they want to watch. Similar products method is one of the ways to run the prediction models. Especially, brand new users haven’t rated the movies.

First, Read the movie database files in python.

import numpy as np
import pandas as pd
import matrix_factorization_utilities

# Load user ratings
df = pd.read_csv('movie_ratings_data_set.csv')
# Load movie titles
movies_df = pd.read_csv('movies.csv', index_col='movie_id')

Let’s have a closer look at the datasets.  Use pandas read csv command to load the data set into a data table. The first column is the ID of the user who made the rating. The second column is the ID of the movie that the user rated. And the third column is the rating that the user gave the movies. Files download : movie_ratings_data_set and movies.csv.

import pandas as pd
df = pd.read_csv("/Users/annettechiu/Desktop/Ex_Files_ML_EssT_Recommendations/Exercise Files/Chapter 4/movie_ratings_data_set.csv")
movies_df = pd.read_csv('/Users/annettechiu/Desktop/Ex_Files_ML_EssT_Recommendations/Exercise Files/Chapter 4/movies.csv', index_col='movie_id')

Second, convert the running list of user ratings into a matrix.

Use the user ID field for the rows or index in the pivot table, and use the movie ID for columns in the table. When summarizing data with a pivot table, it’s possible that we’ll have duplicates when the same user viewed the same movie twice but gave it two different ratings. In this case, we can use the aggregate function to resolve duplicates. We’ll pass in the parameter called aggfunc=np.max or can use aggfunc=np.mean. Therefore, if a user rated the same movie twice, we’ll take the higher rating or the mean of the rating. 

# Convert the running list of user ratings into a matrix
ratings_df = pd.pivot_table(df, index='user_id', columns='movie_id', aggfunc=np.max)

In the real world, most users will not review all products so there will always be a lot of black data. Don’t worry! sparse datasets are normal for recommendation systems. With Python, we can find out those missing data.

Third, Apply matrix factorization to find the latest features.

In order to find out the missing data in the movie rating matrix, we assign attributes to each user and each movie and then multiply them together and adding up the results. Attributes can be “Action”, “Drama”, “Romance”, “Music” , “Dark”…etc. All we know is that each attribute represents some characteristic that made users feel attracted to certain movies. these vectors are hidden information that we found by looking at review data.  U ( user attributes) x M (movie attributes) = Movie ratings

For example, User 1 is rating two movies M1 and M2. User1 like crowd-pleaser movies and don’t like too much drama. M1 is an action movie like Pirates of The Caribbean: The Curse of the Black Pearl and M2 is an Arthouse movie and not crowd-pleaser like Lost in Translation. As a result, user 1 will give M1 score 82 and M2 score -38.

We can actually use the movie ratings we know so far to work backward and find the U matrix and an M matrix that satisfy this equation. Finally, we’ll multiply the U and M matrices we found back together to get review scores for every user and every movie.

Matrix Factorization

  • U( user attributes) x M(movie attributes) = Movie ratings
  • Set all elements in U and M to random numbers. Right now, U and M will result in random numbers.
  • Create a “cost function” that checks how far off U*M. Currently is from equaling the known values of the movie rating matrix
  • Using a numerical optimization algorithm, tweak the numbers in U and M a little at a time. The goal is to get the “cost function” a little closer to zero. Scipy’s fmin_cg() optimization function to find the minimum cost.
  • Repeat step 3 until we can’t reduce the cost function further. The U and M. Values we find will be estimated U*M=Movie Ratings
# Apply matrix factorization to find the latent features
U, M = matrix_factorization_utilities.low_rank_matrix_factorization(ratings_df.as_matrix(),

Use matrix factorization to calculate the U and M matrices. We define 15 attributes in each U and M matrix.  num_features=15


A control in the model that limits how much weight to place on one single attribute when modeling users and products. The higher the regulation amount, the less weight we can put on any single attribute. Regularization limits how much weight we will place on one single attribute and reduce emphasizes specific data points too much. For example, if we have two romance movies.  The first is romance movie like Notting Hill and the second is historical romance movie. Both have romance elements, but some viewers might prefer funny movie and other viewers might prefer the serious movie. They both have some romance similar elements. However, they are very different movies that appeal to different audiences. If we place too much weight on romance, then the system will recommend “Pearl Harbor” to “Notting Hill” audiences.

Regularization helps the system to recognized for both its romance and comedy elements. The higher we set the regularization amount, the less weight we’ll put on any single attribute. We are using an amount of 0.1 in the code because we are using small dataset. For larger datasets, we can use 1.0, 10.0 etc. We can later experiment with different regulation values to see how it affects the quality of your recommendations.






Root-Mean-Square Error  RMSE is a measurement of the difference between the user’s real rating and the rating we predicted. The lower the more accurate the model. An RMSE of zero means our model perfectly guesses user ratings.  When measuring the accuracy of our recommendation system, we’ll randomly split our movie ratings data into two groups. The first 70% of data will be our training dataset. The other 30% data will be our testing dataset.

import numpy as np
import pandas as pd
import matrix_factorization_utilities

# Load user ratings
raw_training_dataset_df = pd.read_csv('movie_ratings_data_set_training.csv')
raw_testing_dataset_df = pd.read_csv('movie_ratings_data_set_testing.csv')
# Convert the running list of user ratings into a matrix
ratings_training_df = pd.pivot_table(raw_training_dataset_df, index='user_id', columns='movie_id', aggfunc=np.max)
ratings_testing_df = pd.pivot_table(raw_testing_dataset_df, index='user_id', columns='movie_id', aggfunc=np.max)
# Apply matrix factorization to find the latent features
U, M = matrix_factorization_utilities.low_rank_matrix_factorization(ratings_training_df.as_matrix(),
# Find all predicted ratings by multiplying U and M
predicted_ratings = np.matmul(U, M)
# Measure RMSE
rmse_training = matrix_factorization_utilities.RMSE(ratings_training_df.as_matrix(),
rmse_testing = matrix_factorization_utilities.RMSE(ratings_testing_df.as_matrix(),
print("Training RMSE: {}".format(rmse_training))
print("Testing RMSE: {}".format(rmse_testing))
Optimization terminated successfully.
         Current function value: 315.538580
         Iterations: 1062
         Function evaluations: 1594
         Gradient evaluations: 1594
Training RMSE: 0.24952555662048573
Testing RMSE: 1.2096517096071573

Out Put: we got a training RMSE of 0.24 and a testing RMSE of about 1.2. The low training RMSE shows that our basic algorithm is working, the testing RMSE is the more important number because it tells us how good our predictions are.  We can adjust the regularization amount parameter to see the difference of the RMSE. Moreover, larger the movies review dataset we have the more accurate prediction.

Use Intent representations( U x  M ) to find similar products. 

Use numpy’s transpose function to flip-flop the matrix so each column becomes a row. This just makes the data easier to work with, it doesn’t change the data itself.

# Swap the rows and columns of product_features just so it's easier to work with
M = np.transpose(M)
# Choose a movie to find similar movies to. Let's find movies similar to movie #5:
movie_id = 3
# Get movie #3's name and genre
movie_information = movies_df.loc[movie_id]
print("We are finding movies similar to this movie:")
print("Movie title: {}".format(movie_information.title))
print("Genre: {}".format(movie_information.genre))
# Get the features for movie #3 we found via matrix factorization
current_movie_features = M[movie_id - 1]
print("The attributes for this movie are:")

To find other movies similar to this one, we just have to find the other movies whose numbers are closest to this movie’s numbers.

# The main logic for finding similar movies:
# 1. Subtract the current movie's features from every other movie's features
difference = M - current_movie_features

# 2. Take the absolute value of that difference (so all numbers are positive)
absolute_difference = np.abs(difference)
# 3. Each movie has 15 features. Sum those 15 features to get a total 'difference score' for each movie
total_difference = np.sum(absolute_difference, axis=1)
# 4. Create a new column in the movie list with the difference score for each movie
movies_df['difference_score'] = total_difference

# 5. Sort the movie list by difference score, from least different to most different
sorted_movie_list = movies_df.sort_values('difference_score')
# 6. Print the result, showing the 5 most similar movies to movie_id #1
print("The five most similar movies are:")
print(sorted_movie_list[['title', 'difference_score']][0:5])

This gives us the difference in scores between the current movie and every other movie in the database and takes the absolute value of the difference we calculated. Next, pandas provide a convenient sort value function.


We are finding movies similar to this movie:
Movie title: The Sheriff 2
Genre: crime drama, western
The attributes for this movie are:
[ 0.54346996 -0.9705157  -1.02934284  0.23430794 -0.92440832 -1.68229039
 -0.72160543  0.35075559  0.43500803  0.47638858 -0.00959097 -0.2267184
 -0.28894699 -0.89375743  0.87709183]
The five most similar movies are:
                         title  difference_score
3                The Sheriff 2          0.000000
9                  Biker Gangs          2.100256
1                The Sheriff 1          2.689085
5         The Big City Judge 2          2.695672
28               The Sheriff 4          2.719307

Process finished with exit code 0

Finally, we can print out the first five movies on the list. The first movie in this list is the movie itself. That’s because a movie is most similar to itself. The other four movies look pretty similar to our movie. We can use them to find similar products and recommend them to the audience.   



Business Insider: Netflix lifted the lid on how the algorithm that recommends you titles to watch actually works

Linkedin Learning: Machine Learning & AI Foundations: Recommendations

Python Tutorial: Pandas Pivot Table Explained

Matrix Factorization: A Simple Tutorial and Implementation in Python

Movie popularity classification based on inherent movie attributes using C4.5, PART and correlation coefficient