Content-Based Recommendation System Implementation

Arushi Khokhar
Published in ACM JUIT
7 min read · Jun 14, 2020


A content-based recommendation system revolves around a user’s profile. It draws on the user’s activity, such as the ratings they have given, the number of times they have clicked on different items, or the items they have liked. The recommendations are then based on the similarity between those items.

Implementing a content-based movie recommendation system

Now that we have a very basic idea of what a content-based recommendation system is, let’s get to the coding stuff! We’ll be making a movie recommendation system which is based on ratings given to a movie by different users.

Start by importing the necessary libraries:
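The original import cell is not reproduced here. As a minimal sketch, assuming pandas and NumPy handle all the data work (the plotting libraries come in later, at the visualization step):

```python
# Assumed core imports for the data-handling steps of this walkthrough.
import numpy as np   # numerical helpers, NaN handling
import pandas as pd  # data frames, grouping, pivoting, correlation
```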

Recommendation systems work best with a very large dataset. For this model, we’ll be using a dataset of 100k movie ratings.

Load the dataset. The columns of this data frame do not have names. So, we’ll give a name to every column.
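Here is a rough sketch of that loading step. The tab-separated layout and the column names `user_id`, `item_id`, `rating` and `timestamp` are assumptions, and the rows are made up so the snippet runs on its own. With the real download, pass the path of the ratings file instead of the `StringIO` object.

```python
import io
import pandas as pd

# Made-up rows in the assumed layout: user id, item id, rating, timestamp.
sample_ratings = io.StringIO(
    "1\t50\t5\t881250949\n"
    "1\t181\t5\t881250950\n"
    "2\t50\t4\t891717742\n"
    "3\t1\t3\t878887116\n"
)

# The file has no header row, so the column names are supplied explicitly.
column_names = ["user_id", "item_id", "rating", "timestamp"]
df = pd.read_csv(sample_ratings, sep="\t", names=column_names)
print(df.head())
```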

This is what our data looks like. People are given user ids and movies are given item ids. Every person has rated one or more movies.

This data frame, however, does not tell which item id corresponds to which movie name. To get the names of movies, we’ll load another dataset (both these datasets are available in the file downloaded earlier).

Give names to the columns of this data frame as well.
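A small sketch of loading the titles. The comma-separated layout and the names `item_id` and `title` are assumptions, and the item ids below are placeholders rather than values taken from the real file.

```python
import io
import pandas as pd

# Made-up rows in the assumed layout: item id, movie title.
sample_titles = io.StringIO(
    "1,Toy Story (1995)\n"
    "50,Star Wars (1977)\n"
    "181,Return of the Jedi (1983)\n"
)

# No header row here either, so name the two columns by hand.
movie_titles = pd.read_csv(sample_titles, names=["item_id", "title"])
print(movie_titles.head())
```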

Now we’ll merge both these data frames.
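The merge can be sketched like this, assuming both frames share an `item_id` column (an assumed name). The rows are made up for illustration.

```python
import pandas as pd

# Ratings keyed by item id (made-up rows).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [50, 181, 50, 1],
    "rating": [5, 5, 4, 3],
})

# Lookup table from item id to title (made-up rows).
movie_titles = pd.DataFrame({
    "item_id": [1, 50, 181],
    "title": ["Toy Story (1995)", "Star Wars (1977)", "Return of the Jedi (1983)"],
})

# Attach the matching title to every rating row.
df = pd.merge(df, movie_titles, on="item_id")
print(df.head())
```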

As already mentioned, each user has rated one or more movies. Likewise, a movie may have been rated by more than one user. We’ll have to find the average rating of each movie to get some meaningful data.
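One way to compute the averages is a group-by on the title. This is a sketch on made-up ratings, with `title` and `rating` as assumed column names and "Obscure Film" as a hypothetical title.

```python
import pandas as pd

# Made-up merged data: one row per (user, movie) rating.
df = pd.DataFrame({
    "title": ["Star Wars (1977)", "Star Wars (1977)", "Star Wars (1977)",
              "Toy Story (1995)", "Toy Story (1995)", "Obscure Film"],
    "rating": [5, 4, 5, 4, 3, 5],
})

# Average rating per movie, best-rated first.
mean_ratings = df.groupby("title")["rating"].mean().sort_values(ascending=False)
print(mean_ratings)
```

Notice that the movie with a single rating lands at the top with a perfect average.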

Movies with an average rating of exactly 5 or 1 are likely to have been reviewed by only 1 or 2 people, because when more people rate a movie, the average is very unlikely to be a perfect 5 or 1. A movie that has been rated by only a few people cannot be reliably recommended to a user. Let’s gradually get rid of the movies that have been rated by very few users. Start by finding out the number of ratings each movie has received.

Now that we have both the attributes (average rating and number of ratings) for every movie, let’s create a separate data frame for them.
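A sketch of building that data frame, again on made-up ratings. The column name `num_of_ratings` is an assumption, and "Obscure Film" is a hypothetical title.

```python
import pandas as pd

# Made-up merged data: one row per (user, movie) rating.
df = pd.DataFrame({
    "title": ["Star Wars (1977)", "Star Wars (1977)", "Star Wars (1977)",
              "Toy Story (1995)", "Toy Story (1995)", "Obscure Film"],
    "rating": [5, 4, 5, 4, 3, 5],
})

# Average rating per movie becomes the first column...
ratings = pd.DataFrame(df.groupby("title")["rating"].mean())
# ...and the number of ratings per movie becomes the second.
ratings["num_of_ratings"] = df.groupby("title")["rating"].count()
print(ratings)
```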

Great! Now we have the final data frame that we’ll be using to make our prediction model. But before that, it’s time for some data visualization. Enter matplotlib and seaborn!

Let’s plot a histogram of number of ratings to check the distribution.
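The original figure is not reproduced here. As a stand-in, this sketch draws the histogram with plain matplotlib on made-up counts. The `Agg` backend is only there so the snippet runs without a display, and the bin count is arbitrary. The same call on the `rating` column would give a histogram of average ratings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe on machines without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-movie summary (assumed column names).
ratings = pd.DataFrame({
    "rating": [5.0, 1.0, 3.2, 3.9, 4.1, 4.4],
    "num_of_ratings": [1, 2, 3, 8, 120, 300],
})

fig, ax = plt.subplots(figsize=(10, 4))
# x-axis: how many ratings a movie received; y-axis: how many movies fall in each bin.
counts, bin_edges, _ = ax.hist(ratings["num_of_ratings"], bins=5)
ax.set_xlabel("number of ratings")
ax.set_ylabel("number of movies")
```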

Looks neat! Along the x-axis, we have the number of times a movie has been rated. Along the y-axis, we have the number of movies that have been rated that many times. For example, more than 500 movies have been rated by only 0–10 people.

Now let’s plot another histogram. This time it is going to be a histogram of average ratings.

Along the x-axis, we have the average movie ratings and along the y-axis, we have the number of movies. This is roughly a normal distribution.

And now a final plot!
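The original figure is not reproduced here either. As a sketch, a plain matplotlib scatter of average rating against number of ratings shows the same relationship (seaborn’s `jointplot` is another option). The data and column names below are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe on machines without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up per-movie summary (assumed column names).
ratings = pd.DataFrame({
    "rating": [5.0, 1.0, 3.2, 3.9, 4.1, 4.4],
    "num_of_ratings": [1, 2, 30, 80, 120, 300],
})

fig, ax = plt.subplots(figsize=(6, 6))
# One dot per movie: average rating on x, number of ratings on y.
ax.scatter(ratings["rating"], ratings["num_of_ratings"], alpha=0.5)
ax.set_xlabel("average rating")
ax.set_ylabel("number of ratings")
```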

The above plot shows that movies with higher average ratings also tend to have more ratings. In addition, the isolated dots at the ends of the x-axis clearly show the movies that have been rated by very few people.

Recommendation system with reference to a particular movie

We’ll implement our recommendation model for a single movie at first. Let’s pick Star Wars (1977).

First, make a matrix that shows which user has given what rating to which movie. There are a lot of NaN values because not every user has watched every movie.
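A pivot table is one way to build such a matrix. This is a sketch on made-up rows, with `user_id`, `title` and `rating` as assumed column names.

```python
import pandas as pd

# Made-up merged data: one row per (user, movie) rating.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "title": ["Star Wars (1977)", "Toy Story (1995)", "Star Wars (1977)",
              "Return of the Jedi (1983)", "Toy Story (1995)"],
    "rating": [5, 4, 4, 5, 3],
})

# Rows are users, columns are movies, cells hold the rating (NaN if unrated).
moviemat = df.pivot_table(index="user_id", columns="title", values="rating")
print(moviemat)
```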

From this matrix, let’s extract the ratings for Star Wars.

Now we will find out how the ratings given to every other movie correlate with the ratings given by the users who rated Star Wars. To understand it better, do read about the corr( ) function.

There are NaN values for the movies that share no raters with Star Wars (or too few to compute anything), since the correlation is undefined in those cases.
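As a sketch of this step, pandas’ `corrwith` computes the correlation of every column of the matrix with one chosen column, using only the users who rated both movies. Whether the original used exactly this call is an assumption. The matrix below is made up, and "Obscure Film" is a hypothetical title.

```python
import numpy as np
import pandas as pd

# Made-up user-by-movie matrix (rows: five users, NaN means "not rated").
moviemat = pd.DataFrame({
    "Star Wars (1977)":          [5, 4, 3, 2, np.nan],
    "Return of the Jedi (1983)": [4, 3, 2, 1, np.nan],
    "Toy Story (1995)":          [2, np.nan, 3, 4, 5],
    "Obscure Film":              [np.nan, np.nan, np.nan, np.nan, 3],
}, index=[1, 2, 3, 4, 5])

# Pull out the column of Star Wars ratings.
starwars_user_ratings = moviemat["Star Wars (1977)"]

# Correlate every movie's ratings with the Star Wars ratings.
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
print(similar_to_starwars)
```

"Obscure Film" has no raters in common with Star Wars here, so its correlation comes out as NaN.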

Next, we’ll make a separate data frame for movies and their correlation with the movie Star Wars.

The basic idea behind this is that if a person likes Star Wars, they can be recommended the movies with the highest correlation.

We had NaN values in our data frame. We’ll remove those using the dropna( ) function.
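A sketch of both steps together. The correlation values are made up, and the column name `Correlation` and the title "Obscure Film" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up correlations with Star Wars, including one undefined (NaN) entry.
similar_to_starwars = pd.Series({
    "Return of the Jedi (1983)": 0.9,
    "Toy Story (1995)": 0.3,
    "Hollow Reed": 1.0,
    "Obscure Film": np.nan,
})

# Turn the series into a one-column data frame...
corr_starwars = similar_to_starwars.to_frame(name="Correlation")
# ...and drop the movies whose correlation could not be computed.
corr_starwars = corr_starwars.dropna()
print(corr_starwars)
```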

Great! Now let’s take a little look at our data.

Suppose only 6 people have rated the movie Hollow Reed, and their ratings happen to move in step with the ratings those same 6 people (out of the 583 who rated Star Wars) gave Star Wars. Then the two movies have a correlation of 1. But this is not meaningful, as very few people have rated Hollow Reed. To overcome this, we will put a threshold at 100 people so that only the movies that have been rated by more than 100 people are recommended. Here, the movie with the title ’Til There Was You (1997) cannot be recommended as it has been rated by only 9 people.
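The threshold can be sketched as a join followed by a filter. The correlation values and most of the counts below are made up, and the column names are assumptions.

```python
import pandas as pd

# Made-up correlations with Star Wars.
corr_starwars = pd.DataFrame(
    {"Correlation": [1.0, 0.9, 0.3]},
    index=["Hollow Reed", "Return of the Jedi (1983)", "Toy Story (1995)"],
)

# Made-up rating counts, apart from the 6 ratings for Hollow Reed mentioned above.
ratings = pd.DataFrame(
    {"num_of_ratings": [6, 300, 250]},
    index=["Hollow Reed", "Return of the Jedi (1983)", "Toy Story (1995)"],
)

# Attach the number of ratings to each correlation (joined on the movie title).
corr_starwars = corr_starwars.join(ratings["num_of_ratings"])

# Keep only movies rated by more than 100 people, most correlated first.
recommended = corr_starwars[corr_starwars["num_of_ratings"] > 100]
recommended = recommended.sort_values("Correlation", ascending=False)
print(recommended)
```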

Now, this data looks pretty good as only movies that have been rated by more than 100 people will be recommended.

RECOMMENDATION FUNCTION

Let’s make a general recommendation function now which will give us 5 movie recommendations based on the movie we enter.
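Here is one possible shape for such a function, as a sketch rather than the original implementation. The name `recommend`, its parameters, and the choice to leave the query movie out of its own results are all assumptions. The tiny matrix is made up, so the threshold is lowered to 2 in the usage line.

```python
import numpy as np
import pandas as pd

def recommend(movie, moviemat, ratings, min_ratings=100, top_n=5):
    """Return up to top_n movies whose ratings correlate most with `movie`."""
    # Correlate every movie's ratings with the chosen movie's ratings.
    corr = moviemat.corrwith(moviemat[movie]).to_frame(name="Correlation").dropna()
    # Attach rating counts, then drop rarely rated movies and the movie itself.
    corr = corr.join(ratings["num_of_ratings"])
    corr = corr[(corr["num_of_ratings"] > min_ratings) & (corr.index != movie)]
    return corr.sort_values("Correlation", ascending=False).head(top_n)

# Made-up user-by-movie matrix (NaN means "not rated").
moviemat = pd.DataFrame({
    "Star Wars (1977)":          [5, 4, 3, 2, np.nan],
    "Return of the Jedi (1983)": [4, 3, 2, 1, np.nan],
    "Fargo (1996)":              [5, 3, 4, 2, 1],
    "Toy Story (1995)":          [2, np.nan, 3, 4, 5],
    "Obscure Film":              [np.nan, np.nan, np.nan, np.nan, 3],
}, index=[1, 2, 3, 4, 5])

# Number of non-missing ratings per movie.
ratings = pd.DataFrame({"num_of_ratings": moviemat.count()})

result = recommend("Star Wars (1977)", moviemat, ratings, min_ratings=2)
print(result)
```

On the real data, the default threshold of 100 ratings matches the cutoff discussed earlier.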

Here’s what the output looks like:

CONCLUSION

This was a content-based recommendation system that we made using very basic Python libraries. The ones used in real life are far more complicated. This was just to give you some idea of how recommendation systems work.
