Mastering E-commerce Product Recommendations in Python

Using RFM Scores and TF-IDF Scores for Product Recommendations

Sadrach Pierre, Ph.D.
DataFabrica

--

Image by cottonbro studio on Pexels

Product recommendation is the task of suggesting products to customers based on past purchases, demographic information, social media engagement and more. For example, if a customer purchases an iPad on Amazon, a reasonable product recommendation could be another Apple product such as AirPods or a MacBook. If a customer fits a specific age demographic, product recommendations can be made to suit that customer profile. For example, older customers may be interested in home appliances while younger customers may be more interested in trendy clothing and accessories.

E-commerce companies, like Amazon, are interested in generating tailored product recommendations for many reasons. Tailored recommendations can help companies retain customers, increase revenue, improve marketing strategies and more. For example, if a company knows that a specific customer hasn’t made a purchase in a while, the marketing team can leverage the customer’s past purchases to create tailored product recommendations that encourage future purchases.

A common approach to product recommendation is a technique called collaborative filtering. Collaborative filtering can be done using an item-based approach or a customer-based approach. An item-based approach suggests new products to a customer based on their similarity to products the customer has already purchased. A customer-based approach suggests new products to a customer based on purchases made by similar customers.

Here we will see how to generate customized product recommendations using an E-commerce transaction dataset. We will see how to use customer segmentation and collaborative filtering to construct personalized product recommendations, which can be used for custom marketing.

For our purposes we will be working with the Synthetic E-commerce Electronics Sales Data available on DataFabrica. The data contains synthetic e-commerce sales data for different electronic products such as phones, tablets and laptops.

Getting Started with Electronic Products E-Commerce Sales Data

To start let’s import the pandas library, read our data into a pandas dataframe and display the first five rows of data:

import pandas as pd 
df = pd.read_csv("ecommerce_data_electronics.csv")
print(df.head())
Screenshot taken by Author

We see that the data contains prices, product names, product descriptions, product IDs, category, time_stamp and card holder names.

To generate product recommendations, we will use a combination of RFM score, TF-IDF scores and cosine similarity calculations. Specifically, we will use a combination of both item and customer based collaborative filtering to generate our product recommendations.

First, we will use RFM scores to generate customer segments, following the steps in a previous article, Mastering Customer Segmentation Using Credit Card Transaction Data.

RFM stands for recency, frequency and monetary. They have the following meanings:

Recency, Frequency and Monetary

  1. Recency is the number of days between the last purchase made by a customer and a reference date, usually the current date or the max date available in the data.
  2. Frequency is the total number of purchases made by a customer in the data.
  3. Monetary is the total amount of money a customer spent between the dates of their first and last purchases.

RFM scores are strings formed by concatenating a customer’s quartile values for recency, frequency and monetary. For example, a customer in recency quartile 4, frequency quartile 3 and monetary quartile 2 has an RFM score of “432”.

Let’s start our analysis on our electronics data. We can use the pandas agg function to generate each of these fields:

df['time_stamp'] = pd.to_datetime(df['time_stamp']) # ensure time_stamp is a datetime so we can subtract dates
NOW = df['time_stamp'].max()
rfmTable = df.groupby('card_name').agg({'time_stamp': lambda x: (NOW - x.max()).days,
                                        'product_id': lambda x: len(x),
                                        'price': lambda x: x.sum()})
rfmTable['time_stamp'] = rfmTable['time_stamp'].astype(int)
rfmTable.rename(columns={'time_stamp': 'recency',
                         'product_id': 'frequency',
                         'price': 'monetary_value'}, inplace=True)

We can display the resulting table:

print(rfmTable.head())

Next we calculate the quartiles for recency, frequency and monetary values:

rfmTable['r_quartile'] = pd.qcut(rfmTable['recency'], q=4, labels=range(1,5), duplicates='raise')
rfmTable['f_quartile'] = pd.qcut(rfmTable['frequency'], q=4, labels=range(1,5), duplicates='drop')
rfmTable['m_quartile'] = pd.qcut(rfmTable['monetary_value'], q=4, labels=range(1,5), duplicates='drop')
print(rfmTable.head())

We will then reset the index, store the result in a new dataframe called rfm_data, and convert each of the quartile columns to string datatypes:

rfm_data = rfmTable.reset_index()
rfm_data['r_quartile'] = rfm_data['r_quartile'].astype(str)
rfm_data['f_quartile'] = rfm_data['f_quartile'].astype(str)
rfm_data['m_quartile'] = rfm_data['m_quartile'].astype(str)

Which will then allow us to concatenate the quartile values and construct our RFM scores:

rfm_data['RFM_score'] = rfm_data['r_quartile'] + rfm_data['f_quartile'] + rfm_data['m_quartile']

Next, let’s display our RFM table:

print(rfm_data.head())

Using this RFM score we can generate customer segments using the following mappings:

  1. Premium Customer: r, f, and m all >= 3
  2. Repeat Customer: f >= 3 and r or m >= 3
  3. Top Spender: m >= 3 and r or f >= 3
  4. At-Risk Customer: two or more of r, f and m <= 2
  5. Inactive Customer: two or more of r, f and m = 1
  6. Other: anything else

rfm_data['customer_segment'] = 'Other'

rfm_data.loc[rfm_data['RFM_score'].isin(['334', '443', '444', '344', '434', '433', '343', '333']), 'customer_segment'] = 'Premium Customer' #nothing <= 2
rfm_data.loc[rfm_data['RFM_score'].isin(['244', '234', '232', '332', '143', '233', '243']), 'customer_segment'] = 'Repeat Customer' # f >= 3 & r or m >=3
rfm_data.loc[rfm_data['RFM_score'].isin(['424', '414', '144', '314', '324', '124', '224', '423', '413', '133', '323', '313', '134']), 'customer_segment'] = 'Top Spender' # m >= 3 & r or f >= 3
rfm_data.loc[rfm_data['RFM_score'].isin([ '422', '223', '212', '122', '222', '132', '322', '312', '412', '123', '214']), 'customer_segment'] = 'At Risk Customer' # two or more <=2
rfm_data.loc[rfm_data['RFM_score'].isin(['411','111', '113', '114', '112', '211', '311']), 'customer_segment'] = 'Inactive Customer' # two or more =1
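If hard-coding lists of RFM scores becomes unwieldy, the same mapping can be expressed as a rule-based function. The sketch below applies the threshold rules above directly (rule order matters, since the rules overlap; the function name and structure are illustrative, not from the original code):

```python
def label_segment(score: str) -> str:
    # Unpack the three quartile digits of an RFM score like "432".
    r, f, m = (int(c) for c in score)
    if min(r, f, m) >= 3:
        return "Premium Customer"      # r, f, and m all >= 3
    if f >= 3 and (r >= 3 or m >= 3):
        return "Repeat Customer"
    if m >= 3 and (r >= 3 or f >= 3):
        return "Top Spender"
    if sum(v == 1 for v in (r, f, m)) >= 2:
        return "Inactive Customer"     # two or more quartiles equal to 1
    if sum(v <= 2 for v in (r, f, m)) >= 2:
        return "At Risk Customer"      # two or more quartiles <= 2
    return "Other"

print(label_segment("444"))  # Premium Customer
print(label_segment("214"))  # At Risk Customer
```

This could then be applied with `rfm_data['RFM_score'].apply(label_segment)` instead of the `.loc` assignments above.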

Let’s display the counts for each customer segment:

from collections import Counter 
print(Counter(rfm_data['customer_segment']))

We can use these customer segments to generate customer-based collaborative filters. For example, we can select a “Premium Customer” and recommend new products to this customer based on purchases made by other Premium Customers. Next we will use TF-IDF and cosine similarity calculations (using product descriptions) to add item-based collaborative filtering to our recommender.

Using TF-IDF and Cosine Similarity to Generate Collaborative Filters

We will proceed by defining a function called generate_recommendations which will take target_customer, cohort, and num_recommendations as inputs:

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # code truncated for clarity

Next we will perform a groupby operation to combine products for each unique customer. This will allow us to construct a user-item matrix that maps each customer to their past purchases. We will do this for both the product and product description:

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # code truncated for clarity
    user_item_matrix = cohort.groupby('card_name')['product'].apply(lambda x: ', '.join(x)).reset_index()
    user_item_matrix['product_descriptions'] = cohort.groupby('card_name')['product_description'].apply(lambda x: ', '.join(x)).reset_index()['product_description']

Next we will generate the TF-IDF matrix. TF-IDF stands for term frequency-inverse document frequency. The “TF” in TF-IDF measures how frequently a word appears in a document, in this case a set of product descriptions:

TF = (number of times keyword, K, appears in product description set (d)) /
     (total number of words in product description set (d))

The “IDF” in TF-IDF stands for inverse document frequency. This measures the importance of a word across a collection of product description sets:

IDF = log((total number of product description sets in corpus (D)) /
      (number of product description sets (d) containing keyword, K, + 1)) + 1

The TF-IDF score is generated by multiplying the TF and IDF metrics:

tfidf_scores = TF*IDF
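As a minimal sketch of these formulas, here is a from-scratch TF-IDF calculation on a toy corpus of made-up product descriptions (note that libraries like scikit-learn, which we use below, apply a smoothed variant of the IDF and L2-normalize the resulting vectors):

```python
import math

# Toy corpus of hypothetical product description sets (one per customer).
docs = [
    "apple tablet with retina display",
    "apple laptop with touch bar",
    "android phone with oled display",
]

def tf(term, doc):
    # Term frequency: share of words in the description set matching the term.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency per the formula above: log(D / (d + 1)) + 1.
    containing = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / (containing + 1)) + 1

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "apple" appears in two of three description sets, so its IDF (and hence
# its TF-IDF score) is lower than that of "phone", which appears in only one.
print(tfidf("apple", docs[0], docs))
print(tfidf("phone", docs[2], docs))
```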

As a result, each product description set will be represented by a non-zero vector. From here, we can calculate how “similar” products are to each other using cosine similarity calculations on the TF-IDF representations of the product descriptions. Cosine similarity measures the cosine of the angle between two non-zero vectors. Similar products will have a high cosine similarity and dissimilar products will have a low cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # code truncated for clarity
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(user_item_matrix['product_descriptions'])
    similarity_matrix = cosine_similarity(tfidf_matrix)
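To make this step concrete, here is a small standalone sketch (using made-up product descriptions, not the DataFabrica data) showing that TF-IDF vectors of similar descriptions have a high cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two similar tablet descriptions and one unrelated product.
descriptions = [
    "apple tablet with retina display",
    "apple tablet with large retina display",
    "noise cancelling wireless headphones",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(descriptions)
sim = cosine_similarity(tfidf_matrix)

# The two tablet descriptions are far more similar to each other
# than either is to the headphones description.
print(sim.round(2))
```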

Next we will find the input target customer’s row in our user_item_matrix and identify the most similar customers by sorting the cosine similarities calculated from the TF-IDF scores. We will also define a set containing all of the products purchased by the target customer:

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # code truncated for clarity
    target_customer_index = user_item_matrix[user_item_matrix['card_name'] == target_customer].index[0]
    similar_customers = similarity_matrix[target_customer_index].argsort()[::-1][1:num_recommendations+1]
    target_customer_purchases = set(user_item_matrix[user_item_matrix['card_name'] == target_customer]['product'].iloc[0].split(', '))

Next we will initialize a list called recommendations, iterate over our list of similar customers, get the products purchased by similar customers, and return the set of new products to recommend to the target customer:

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # code truncated for clarity
    recommendations = []
    for customer_index in similar_customers:
        customer_purchases = set(user_item_matrix.iloc[customer_index]['product'].split(', '))
        new_items = customer_purchases.difference(target_customer_purchases)
        recommendations.extend(new_items)
    return list(set(recommendations))[:num_recommendations]
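Putting the truncated pieces together, the complete function (assembled from the snippets above, with the scikit-learn imports it depends on) looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_recommendations(target_customer, cohort, num_recommendations=5):
    # Build the user-item matrix: one row per customer, purchases joined into strings.
    user_item_matrix = cohort.groupby('card_name')['product'].apply(lambda x: ', '.join(x)).reset_index()
    user_item_matrix['product_descriptions'] = cohort.groupby('card_name')['product_description'].apply(lambda x: ', '.join(x)).reset_index()['product_description']

    # TF-IDF vectors and pairwise cosine similarities between customers.
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(user_item_matrix['product_descriptions'])
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # Most similar customers to the target (excluding the target itself).
    target_customer_index = user_item_matrix[user_item_matrix['card_name'] == target_customer].index[0]
    similar_customers = similarity_matrix[target_customer_index].argsort()[::-1][1:num_recommendations+1]
    target_customer_purchases = set(user_item_matrix[user_item_matrix['card_name'] == target_customer]['product'].iloc[0].split(', '))

    # Recommend products that similar customers bought but the target has not.
    recommendations = []
    for customer_index in similar_customers:
        customer_purchases = set(user_item_matrix.iloc[customer_index]['product'].split(', '))
        recommendations.extend(customer_purchases.difference(target_customer_purchases))
    return list(set(recommendations))[:num_recommendations]
```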

Next we will filter our RFM data to only include ‘Premium Customers’:

rfm_data = rfm_data[rfm_data['customer_segment']== 'Premium Customer']

We will then select a list of five ‘Premium Customers’ and filter our electronics data to include only these customers:

premium = list(set(rfm_data['card_name']))[:5]
df_premium = df[df['card_name'].isin(premium)]

Finally, we will generate recommendations for a target customer. Let’s generate product recommendations for Ashley Perry:

recommendations = generate_recommendations("Ashley Perry", df_premium, num_recommendations=5)
print(recommendations)

Let’s also generate recommendations for Clifford Stanley:

recommendations = generate_recommendations("Clifford Stanley", df_premium, num_recommendations=5)
print(recommendations)

The code in this post is available on GitHub.

Conclusions

In this post we walked through how to generate customer-based collaborative filters using RFM customer segments, TF-IDF scores and cosine similarity. The RFM score calculations are a useful way to generate distinct customer segments based on customer purchasing patterns. Further, customer segments generated by RFM score can be leveraged to generate cohort-specific TF-IDF representations of product descriptions. The vector representations of product descriptions within a customer segment can then be used to calculate cosine similarity, which allowed us to generate product recommendations tailored to a given customer’s purchasing preferences.

The data used in this blog post is available on DataFabrica and can be found here.


Writer for Built In & Towards Data Science. Cornell University Ph. D. in Chemical Physics.