
RFM Analysis for Effective Segmentation Using Python

Introduction

In this article, we explore the powerful technique of RFM analysis for customer segmentation using Python programming. RFM analysis can provide valuable insights into customer behavior, allowing businesses to segment their customer base and develop targeted retention strategies. We cover the key steps involved in conducting RFM analysis, including data preparation, RFM score calculation, and segmentation. We also demonstrate how to visualize the results using Python libraries such as Pandas, Matplotlib, and Seaborn. By the end of this article, you will have a solid understanding of how to use RFM analysis to gain valuable customer insights and develop effective segmentation strategies.

We will perform RFM analysis on a dataset of online retail transactions. The dataset can be found at the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx

First, we will import the necessary libraries and load the dataset:

# import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# read and show the data
data = pd.read_excel("Online Retail.xlsx")
data.head()

  InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1    536365     71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom

# check data information
data.info()

# check missing data
data.isna().sum()

# check shape
data.shape

(541909, 8)

Next, we will clean and preprocess the data. We will remove rows with no CustomerID and drop duplicate data.

# remove rows with no CustomerID
data = data[pd.notnull(data['CustomerID'])]

# drop duplicate Country/CustomerID pairs for the country counts
filtered_data = data[['Country', 'CustomerID']].drop_duplicates()

# check shape
data.shape

(406829, 8)

Next, we will look at the top ten countries our customers purchase from.

# customers per country, top ten
filtered_data.Country.value_counts()[:10].plot(kind='bar')

# keep only United Kingdom transactions
uk_data = data[data.Country == 'United Kingdom']
uk_data.info()
uk_data.describe()

Quantity UnitPrice CustomerID
count 361878.000000 361878.000000 361878.000000
mean 11.077029 3.256007 15547.871368
std 263.129266 70.654731 1594.402590
min -80995.000000 0.000000 12346.000000
25% 2.000000 1.250000 14194.000000
50% 4.000000 1.950000 15514.000000
75% 12.000000 3.750000 16931.000000
max 80995.000000 38970.000000 18287.000000

# keep only rows with positive quantity and positive unit price
uk_data = uk_data[uk_data['Quantity'] > 0]
uk_data = uk_data[uk_data['UnitPrice'] > 0]
uk_data.info()
uk_data.describe()

Quantity UnitPrice CustomerID
count 354321.000000 354321.000000 354321.000000
mean 12.013795 2.963994 15552.486392
std 189.267956 17.862655 1594.527150
min 1.000000 0.001000 12346.000000
25% 2.000000 1.250000 14194.000000
50% 4.000000 1.950000 15522.000000
75% 12.000000 3.750000 16931.000000
max 80995.000000 8142.750000 18287.000000


uk_data = uk_data[['CustomerID', 'InvoiceDate', 'InvoiceNo', 'Quantity', 'UnitPrice']]
uk_data

CustomerID InvoiceDate InvoiceNo Quantity UnitPrice
0 17850.0 2010-12-01 08:26:00 536365 6 2.55
1 17850.0 2010-12-01 08:26:00 536365 6 3.39
2 17850.0 2010-12-01 08:26:00 536365 8 2.75
3 17850.0 2010-12-01 08:26:00 536365 6 3.39
4 17850.0 2010-12-01 08:26:00 536365 6 3.39
541889 15804.0 2011-12-09 12:31:00 581585 12 1.95
541890 13113.0 2011-12-09 12:49:00 581586 8 2.95
541891 13113.0 2011-12-09 12:49:00 581586 24 1.25
541892 13113.0 2011-12-09 12:49:00 581586 24 8.95
541893 13113.0 2011-12-09 12:49:00 581586 10 7.08

354321 rows × 5 columns

uk_data['TotalPrice'] = uk_data['Quantity'] * uk_data['UnitPrice']
uk_data

CustomerID InvoiceDate InvoiceNo Quantity UnitPrice TotalPrice
0 17850.0 2010-12-01 08:26:00 536365 6 2.55 15.30
1 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
2 17850.0 2010-12-01 08:26:00 536365 8 2.75 22.00
3 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
541889 15804.0 2011-12-09 12:31:00 581585 12 1.95 23.40
541890 13113.0 2011-12-09 12:49:00 581586 8 2.95 23.60
541891 13113.0 2011-12-09 12:49:00 581586 24 1.25 30.00
541892 13113.0 2011-12-09 12:49:00 581586 24 8.95 214.80
541893 13113.0 2011-12-09 12:49:00 581586 10 7.08 70.80

354321 rows × 6 columns

uk_data['InvoiceDate'].min(), uk_data['InvoiceDate'].max()

(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:49:00'))

# snapshot date: one day after the last invoice in the dataset
PRESENT = dt.datetime(2011, 12, 10)
# ensure InvoiceDate is a datetime type
uk_data['InvoiceDate'] = pd.to_datetime(uk_data['InvoiceDate'])
uk_data

CustomerID InvoiceDate InvoiceNo Quantity UnitPrice TotalPrice
0 17850.0 2010-12-01 08:26:00 536365 6 2.55 15.30
1 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
2 17850.0 2010-12-01 08:26:00 536365 8 2.75 22.00
3 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
541889 15804.0 2011-12-09 12:31:00 581585 12 1.95 23.40
541890 13113.0 2011-12-09 12:49:00 581586 8 2.95 23.60
541891 13113.0 2011-12-09 12:49:00 581586 24 1.25 30.00
541892 13113.0 2011-12-09 12:49:00 581586 24 8.95 214.80
541893 13113.0 2011-12-09 12:49:00 581586 10 7.08 70.80

354321 rows × 6 columns

RFM Analysis

Here, we are going to perform the following operations:

  • Recency: calculate the number of days between the present date and the date of each customer's last purchase.
  • Frequency: calculate the number of orders for each customer.
  • Monetary: calculate the sum of purchase prices for each customer.

# RFM calculation
rfm = uk_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda date: (PRESENT - date.max()).days,
    'InvoiceNo': 'count',
    'TotalPrice': lambda price: price.sum()})
rfm


CustomerID InvoiceDate InvoiceNo TotalPrice
12346.0 325 1 77183.60
12747.0 2 103 4196.01
12748.0 0 4595 33719.73
12749.0 3 199 4090.88
12820.0 3 59 942.34
18280.0 277 10 180.60
18281.0 180 7 80.82
18282.0 7 12 178.05
18283.0 3 756 2094.88
18287.0 42 70 1837.28

3920 rows × 3 columns

# rename 'InvoiceDate', 'InvoiceNo', 'TotalPrice' to recency, frequency, monetary
rfm.rename(columns={'InvoiceDate': 'recency',
                    'InvoiceNo': 'frequency',
                    'TotalPrice': 'monetary'}, inplace=True)
rfm

CustomerID recency frequency monetary
12346.0 325 1 77183.60
12747.0 2 103 4196.01
12748.0 0 4595 33719.73
12749.0 3 199 4090.88
12820.0 3 59 942.34
18280.0 277 10 180.60
18281.0 180 7 80.82
18282.0 7 12 178.05
18283.0 3 756 2094.88
18287.0 42 70 1837.28

rfm['recency'] = rfm['recency'].astype(int)
rfm['frequency'] = rfm['frequency'].astype(int)
rfm['monetary'] = rfm['monetary'].astype(int)
rfm


CustomerID recency frequency monetary
12346.0 325 1 77183
12747.0 2 103 4196
12748.0 0 4595 33719
12749.0 3 199 4090
12820.0 3 59 942
18280.0 277 10 180
18281.0 180 7 80
18282.0 7 12 178
18283.0 3 756 2094
18287.0 42 70 1837

3920 rows × 3 columns

Computing Quantiles of RFM Values

Customers with the lowest recency and the highest frequency and monetary amounts are considered top customers. We will assign quartile scores using data binning. Note that the recency labels ascend (the most recent quartile gets label 1), while the frequency and monetary labels descend (the highest quartiles get label 1), so that 1 always marks the best customers.

rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, labels=['1', '2', '3', '4'])
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, labels=['4', '3', '2', '1'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, labels=['4', '3', '2', '1'])
rfm.head()

CustomerID recency frequency monetary r_quartile f_quartile m_quartile
12346.0 325 1 77183 4 4 1
12747.0 2 103 4196 1 1 1
12748.0 0 4595 33719 1 1 1
12749.0 3 199 4090 1 1 1
12820.0 3 59 942 1 2 2

rfm['RFM_Score'] = rfm.r_quartile.astype(str) + rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)
rfm.head()

CustomerID recency frequency monetary r_quartile f_quartile m_quartile RFM_Score
12346.0 325 1 77183 4 4 1 441
12747.0 2 103 4196 1 1 1 111
12748.0 0 4595 33719 1 1 1 111
12749.0 3 199 4090 1 1 1 111
12820.0 3 59 942 1 2 2 122

# RFM Score
rfm['RFM_Score'].value_counts()

111    409
444    345
211    186
433    178
344    169
      ... 
241      7
141      5
431      4
413      4
114      1
Name: RFM_Score, Length: 61, dtype: int64

rfm['RFM_Score_num'] = rfm.r_quartile.astype(int) + rfm.f_quartile.astype(int) + rfm.m_quartile.astype(int)
rfm.head()

CustomerID recency frequency monetary r_quartile f_quartile m_quartile RFM_Score RFM_Score_num
12346.0 325 1 77183 4 4 1 441 9
12747.0 2 103 4196 1 1 1 111 3
12748.0 0 4595 33719 1 1 1 111 3
12749.0 3 199 4090 1 1 1 111 3
12820.0 3 59 942 1 2 2 122 5

# Creating custom segments

# Define rfm_level function
def rfm_level(df):
    if df['RFM_Score_num'] >= 10:
        return 'Low'
    elif (df['RFM_Score_num'] >= 6) and (df['RFM_Score_num'] < 10):
        return 'Middle'
    else:
        return 'Top'

# Create a new variable RFM_Level
rfm['RFM_Level'] = rfm.apply(rfm_level, axis=1)

# Print the header with top 5 rows to the console
rfm.head()

CustomerID recency frequency monetary r_quartile f_quartile m_quartile RFM_Score RFM_Score_num RFM_Level
12346.0 325 1 77183 4 4 1 441 9 Middle
12747.0 2 103 4196 1 1 1 111 3 Top
12748.0 0 4595 33719 1 1 1 111 3 Top
12749.0 3 199 4090 1 1 1 111 3 Top
12820.0 3 59 942 1 2 2 122 5 Top

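For reference, the per-level averages shown below come from grouping on RFM_Level; this is the same aggregation call that appears again later in the article:

# average RFM values per level
rfm.groupby('RFM_Level').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean'})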
RFM_Level recency frequency monetary
Low 190.045918 14.948129 258.973639
Middle 71.130299 49.714464 1059.506234
Top 19.335088 225.438596 4651.303509

rfm.RFM_Level.value_counts()

Middle    1604
Low       1176
Top       1140
Name: RFM_Level, dtype: int64

Segmentation using k-means clustering (unsupervised)

K-means assumptions:

  • Symmetric distribution of features/variables
  • Variables with the same average values
  • Variables with the same variance
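Before plotting, a quick numeric check of the first assumption can help; this is a minimal sketch using pandas' built-in skew(), where values far from zero indicate skewed features:

# skewness of the raw RFM variables (0 means symmetric)
rfm[['recency', 'frequency', 'monetary']].skew()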

# Plot the distribution of each RFM variable
plt.figure(figsize=(7, 10))

plt.subplot(3, 1, 1)
sns.histplot(rfm['recency'], kde=True, bins=20)

plt.subplot(3, 1, 2)
sns.histplot(rfm['frequency'], kde=True, bins=20)

plt.subplot(3, 1, 3)
sns.histplot(rfm['monetary'], kde=True, bins=20)

# Show the plot
plt.show()

rfm.columns

Index(['recency', 'frequency', 'monetary', 'r_quartile', 'f_quartile', 'm_quartile', 'RFM_Score', 'RFM_Score_num', 'RFM_Level'], dtype='object')

# Print the average values of the variables in the dataset
print('mean : \n', rfm[['recency', 'frequency', 'monetary']].mean())

mean :
recency        91.742092
frequency      90.388010
monetary     1863.899745
dtype: float64

# Print the standard deviation of the variables in the dataset
print('std: \n', rfm[['recency', 'frequency', 'monetary']].std())

std:
recency        99.533485
frequency     217.808385
monetary     7482.810958
dtype: float64

Next, we apply a power transform (Yeo-Johnson) to remove skewness, then a StandardScaler so that every feature has the same mean and variance.

rfm.head()

CustomerID recency frequency monetary r_quartile f_quartile m_quartile RFM_Score RFM_Score_num RFM_Level
12346.0 325 1 77183 4 4 1 441 9 Middle
12747.0 2 103 4196 1 1 1 111 3 Top
12748.0 0 4595 33719 1 1 1 111 3 Top
12749.0 3 199 4090 1 1 1 111 3 Top
12820.0 3 59 942 1 2 2 122 5 Top

# apply the Yeo-Johnson transformer to each RFM variable
from scipy.stats import yeojohnson

df = pd.DataFrame()
df['CustomerID'] = rfm.index

for col in ['recency', 'frequency', 'monetary']:
    y, lmbda = yeojohnson(rfm[col])
    df[col] = y

df


CustomerID   recency  frequency  monetary
0 12346.0 9.481158 0.694879 7.350116
1 12747.0 1.200124 4.722890 6.042678
2 12748.0 0.000000 8.694124 7.009467
3 12749.0 1.550490 5.400639 6.029753
4 12820.0 1.550490 4.155270 5.241050
3915 18280.0 9.087391 2.418707 4.232865
3916 18281.0 8.074691 2.095081 3.690085
3917 18282.0 2.463324 2.588772 4.225605
3918 18283.0 1.550490 6.790070 5.681812
3919 18287.0 5.143613 4.328745 5.611509

3920 rows × 4 columns

# Plot the distribution of each transformed variable
plt.figure(figsize=(7, 10))

plt.subplot(3, 1, 1)
sns.histplot(df['recency'], kde=True, bins=10)

plt.subplot(3, 1, 2)
sns.histplot(df['frequency'], kde=True, bins=10)

plt.subplot(3, 1, 3)
sns.histplot(df['monetary'], kde=True, bins=10)

# Show the plot
plt.show()

# give all features the same mean and variance
import sklearn.preprocessing as preproc

features = ['recency', 'frequency', 'monetary']
# Standardization - note that by definition, some outputs will be negative
df[features] = preproc.StandardScaler().fit_transform(df[features])
df

CustomerID recency frequency monetary
0 12346.0 1.623997 -2.382059 3.217638
1 12747.0 -1.738788 0.730526 1.405623
2 12748.0 -2.226138 3.799237 2.745524
3 12749.0 -1.596511 1.254246 1.387711
4 12820.0 -1.596511 0.291906 0.294625
3915 18280.0 1.464095 -1.049997 -1.102648
3916 18281.0 1.052855 -1.300074 -1.854901
3917 18282.0 -1.225824 -0.918582 -1.112709
3918 18283.0 -1.596511 2.327908 0.905489
3919 18287.0 -0.137405 0.425956 0.808054

3920 rows × 4 columns

# Print the average values of the variables in the dataset
print('mean : \n', df[['recency', 'frequency', 'monetary']].mean().astype(int))

# Print the standard deviation of the variables in the dataset
print('std: \n', df[['recency', 'frequency', 'monetary']].std().astype(int))

mean :
recency      0
frequency    0
monetary     0
dtype: int64

std:
recency      1
frequency    1
monetary     1
dtype: int64

Now our features are ready, and we can apply k-means clustering.

# Import KMeans
from sklearn.cluster import KMeans

# assume k (number of clusters/groups) = 3
# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1)

# Fit k-means clustering on the normalized data set
kmeans.fit(df[features])

# Extract cluster labels
rfm['cluster_labels'] = kmeans.labels_
df['cluster_labels'] = kmeans.labels_

/usr/local/lib/python3.8/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to ‘auto’ in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(

rfm

CustomerID recency frequency monetary r_quartile f_quartile m_quartile RFM_Score RFM_Score_num RFM_Level cluster_labels
12346.0 325 1 77183 4 4 1 441 9 Middle 1
12747.0 2 103 4196 1 1 1 111 3 Top 0
12748.0 0 4595 33719 1 1 1 111 3 Top 0
12749.0 3 199 4090 1 1 1 111 3 Top 0
12820.0 3 59 942 1 2 2 122 5 Top 0
18280.0 277 10 180 4 4 4 444 12 Low 2
18281.0 180 7 80 4 4 4 444 12 Low 2
18282.0 7 12 178 1 4 4 144 9 Middle 1
18283.0 3 756 2094 1 1 1 111 3 Top 0
18287.0 42 70 1837 2 2 1 221 5 Top 1

3920 rows × 10 columns

df.cluster_labels.value_counts()

1    1624
2    1218
0    1078
Name: cluster_labels, dtype: int64

rfm.groupby('RFM_Level').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean'})

RFM_Level recency frequency monetary
Low 190.045918 14.948129 258.973639
Middle 71.130299 49.714464 1059.506234
Top 19.335088 225.438596 4651.303509

rfm.groupby('cluster_labels').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean'})

cluster_labels recency frequency monetary
0 19.401670 234.763451 5072.400742
1 65.764163 51.161330 929.342365
2 190.404762 14.909688 270.268473

Finding the best value of k

## Calculate sum of squared errors
sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, 21):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(df[features])
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_

sse


{1: 11760.000000000013,
 2: 6080.567689041685,
 3: 4709.937307327478,
 4: 3870.116603724802,
 5: 3319.490869277227,
 6: 2916.5075068200385,
 7: 2655.5748530820592,
 8: 2459.2276248472426,
 9: 2299.9953369555833,
 10: 2154.4760094876638,
 11: 2016.785702586328,
 12: 1896.2115029782149,
 13: 1814.3350578718046,
 14: 1733.105393862509,
 15: 1664.5963048437225,
 16: 1597.9977424250183,
 17: 1531.5951097080115,
 18: 1489.225635962716,
 19: 1436.2061701139564,
 20: 1391.496724909009}

## Plot sum of squared errors
plt.figure(1, figsize=(6, 7))
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')
# Add X-axis label "k"
plt.xlabel('k')
# Add Y-axis label "SSE"
plt.ylabel('sse')

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

from sklearn.decomposition import PCA

# project the three standardized features onto two components for visualization
pca = PCA(n_components=2)
points = pca.fit_transform(df[features])

df['PC_1'] = points[:, 0]
df['PC_2'] = points[:, 1]

df

CustomerID recency frequency monetary cluster_labels PC_1 PC_2
0 12346.0 1.623997 -2.382059 3.217638 1 0.327716 -1.753851
1 12747.0 -1.738788 0.730526 1.405623 0 -2.186094 0.716165
2 12748.0 -2.226138 3.799237 2.745524 0 -5.116733 -0.440139
3 12749.0 -1.596511 1.254246 1.387711 0 -2.421441 0.416584
4 12820.0 -1.596511 0.291906 0.294625 0 -1.172168 1.160378
3915 18280.0 1.464095 -1.049997 -1.102648 2 2.056443 -0.480331
3916 18281.0 1.052855 -1.300074 -1.854901 2 2.455089 0.240255
3917 18282.0 -1.225824 -0.918582 -1.112709 1 0.608274 1.789366
3918 18283.0 -1.596511 2.327908 0.905489 0 -2.782606 0.218528
3919 18287.0 -0.137405 0.425956 0.808054 1 -0.819899 -0.331485

3920 rows × 7 columns

plt.figure(1, figsize=(4, 4))
sns.scatterplot(x='PC_1', y='PC_2', data=df)
plt.show()

Apply k-means clustering

# Initialize KMeans
# k=4 chosen from the line chart / elbow method
kmeans = KMeans(n_clusters=4, random_state=1)

# Fit k-means clustering on the normalized data set
kmeans.fit(df[features])

# Extract cluster labels
rfm['cluster_labels'] = kmeans.labels_
df['cluster_labels'] = kmeans.labels_

/usr/local/lib/python3.8/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to ‘auto’ in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(

rfm.cluster_labels.value_counts()

0    1086
1    1021
2     928
3     885
Name: cluster_labels, dtype: int64

rfm.groupby('cluster_labels').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': 'mean'})

cluster_labels recency frequency monetary
0 91.289134 76.705341 1426.324125
1 216.774731 15.449559 280.197845
2 12.912716 249.543103 5484.925647
3 30.710734 26.744633 430.967232

plt.figure(1, figsize=(8, 6))
sns.scatterplot(x='PC_1', y='PC_2', hue='cluster_labels', data=df, palette='Set1')

<AxesSubplot:xlabel='PC_1', ylabel='PC_2'>

df.head()

CustomerID recency frequency monetary cluster_labels PC_1 PC_2
0 12346.0 1.623997 -2.382059 3.217638 0 0.327716 -1.753851
1 12747.0 -1.738788 0.730526 1.405623 2 -2.186094 0.716165
2 12748.0 -2.226138 3.799237 2.745524 2 -5.116733 -0.440139
3 12749.0 -1.596511 1.254246 1.387711 2 -2.421441 0.416584
4 12820.0 -1.596511 0.291906 0.294625 2 -1.172168 1.160378

# Melt the normalized dataset and reset the index
# Assign CustomerID and Cluster as ID variables
# Assign RFM values as value variables
# Name the variable and value
df_melt = pd.melt(df, id_vars=['CustomerID', 'cluster_labels'],
                  value_vars=['recency', 'frequency', 'monetary'],
                  var_name='Metric', value_name='Value')
df_melt.sample(5)

CustomerID cluster_labels Metric Value
9676 15402.0 0 monetary -0.033478
7558 17897.0 0 frequency 0.810887
2836 16771.0 2 recency -0.307374
4230 13248.0 1 frequency -0.341760
1517 14973.0 0 recency 0.593712

## Visualize snake plot

# Add the plot title
plt.title('Snake plot of normalized variables')
# Add the x axis label
plt.xlabel('Metric')
# Add the y axis label
plt.ylabel('Value')

# Plot a line for each value of the cluster variable
sns.lineplot(data=df_melt, x='Metric', y='Value', hue='cluster_labels', palette='Set1')
plt.show()

Customer Cluster

Based on the RFM analysis that we conducted using the Online Retail dataset, we can identify several customer segments and develop retention policies for each of them. Here are some examples:

High-value customers (111): These are customers who have a high monetary value, high purchase frequency, and made a purchase recently. To retain these customers, businesses can offer them loyalty programs, special discounts, and personalized communication. Since these customers are already loyal and have high spending potential, businesses can focus on building long-term relationships with them to ensure their continued loyalty.

New customers (144): These are customers who have made their first purchase recently. To retain these customers, businesses can offer them welcome discounts, personalized recommendations, and a seamless buying experience. Since these customers are still testing the waters and have not yet formed any brand loyalty, businesses need to focus on building a positive first impression to encourage repeat purchases.

At-risk customers (441): These are customers who have not made a purchase in a long time but have a high historical value. To retain these customers, businesses can offer them win-back campaigns, personalized offers, and targeted communication. Since these customers have not made a purchase in a while, it is important to remind them of the value that the business can offer and incentivize them to make another purchase.

Low-value customers (444): These are customers who have a low monetary value, low purchase frequency, and have not made a purchase recently. To retain these customers, businesses can offer them targeted promotions, personalized recommendations, and a simplified buying experience. Since these customers have not yet demonstrated a high spending potential, businesses need to focus on building loyalty by providing them with a positive buying experience and personalized recommendations.
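As a starting point for operationalizing these segments, here is a minimal sketch that attaches the four illustrative RFM_Score codes above to named segments in the rfm DataFrame built earlier; segment_map is a hypothetical mapping, and a real deployment would cover more codes:

# hypothetical mapping from example RFM_Score codes to segment names
segment_map = {'111': 'High-value', '144': 'New', '441': 'At-risk', '444': 'Low-value'}
rfm['Segment'] = rfm['RFM_Score'].map(segment_map).fillna('Other')
rfm['Segment'].value_counts()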

By developing retention policies for each of these customer segments, businesses can improve customer loyalty, increase repeat purchases, and drive revenue growth. However, it is important to remember that customer behavior is constantly evolving, and retention policies need to be adapted and updated regularly to remain effective.

Limitations

Lack of context: RFM analysis only considers a customer’s recency, frequency, and monetary value of purchases. It does not take into account other factors that may influence customer behavior, such as demographics, psychographics, or external market conditions. This can limit the accuracy of the insights gained from RFM analysis.

Lack of predictive power: RFM analysis is based on historical data and does not provide predictive insights into future customer behavior. While it can help identify customer segments that are more likely to make a purchase in the future, it cannot predict with certainty when or how much they will spend.

Homogeneity of segments: RFM analysis may create customer segments that are too homogeneous, meaning that they may not capture the full diversity of customer behavior. For example, two customers may have the same RFM scores but have different motivations for making a purchase. This can limit the effectiveness of retention policies developed based on these segments.

Data limitations: RFM analysis relies on clean and accurate data. If there are missing or incorrect data points, it can lead to inaccurate insights and segmentation. Additionally, if the data is not up-to-date or lacks sufficient historical data, it may not accurately reflect customer behavior.

Assumption of equal weights: RFM analysis assumes that recency, frequency, and monetary value are equally important factors in determining customer behavior. However, this may not be true for all businesses and may vary based on industry, customer base, and business objectives. Therefore, it is important to consider other factors when developing retention policies.
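To illustrate, relaxing the equal-weights assumption is a small change to the numeric scoring step shown earlier; the weights below are arbitrary placeholders, not recommendations:

# hypothetical weighted RFM score; weights must be chosen per business context
w_r, w_f, w_m = 0.5, 0.3, 0.2
rfm['RFM_Score_weighted'] = (w_r * rfm.r_quartile.astype(int)
                             + w_f * rfm.f_quartile.astype(int)
                             + w_m * rfm.m_quartile.astype(int))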

Despite these limitations, RFM analysis can still provide valuable insights into customer behavior and help businesses develop effective retention policies. However, it is important to be aware of its limitations and supplement it with other data sources and analysis methods to gain a more complete understanding of customer behavior.

Applications

RFM analysis can be applied in a variety of fields and industries where businesses have access to customer transaction data. Here are some examples:

E-commerce: Online retailers can use RFM analysis to segment their customers and develop targeted retention policies to encourage repeat purchases.

Banking and financial services: Banks and financial institutions can use RFM analysis to segment their customers based on their usage patterns and develop personalized services and offers.

Telecommunications: Telecommunications companies can use RFM analysis to segment their customers based on their usage patterns and develop targeted communication and retention policies.

Hospitality and travel: Hotels and travel companies can use RFM analysis to segment their customers based on their booking behavior and develop targeted marketing and retention policies.

Healthcare: Healthcare providers can use RFM analysis to segment their patients based on their utilization patterns and develop targeted communication and engagement strategies.

Subscription-based businesses: Subscription-based businesses can use RFM analysis to segment their customers based on their subscription behavior and develop targeted retention and upsell strategies.

Overall, any business that has access to transaction data can use RFM analysis to gain insights into customer behavior and develop targeted retention policies. However, it is important to supplement RFM analysis with other data sources and analysis methods to gain a more complete understanding of customer behavior.

Conclusion

RFM analysis is a powerful technique for customer segmentation that can provide valuable insights into customer behavior. By using Python programming to conduct RFM analysis, businesses can segment their customer base and develop targeted retention strategies. The key steps involved in conducting RFM analysis include data preparation, RFM score calculation, and segmentation. By visualizing the results using Python libraries such as Pandas, Matplotlib, and Seaborn, businesses can gain a better understanding of their customer base and develop effective marketing strategies.

RFM analysis is not without its limitations, and it should be supplemented with other data sources and analysis methods to gain a more complete understanding of customer behavior. However, when used in conjunction with other techniques, RFM analysis can be a valuable tool for businesses looking to increase customer loyalty and drive revenue growth.

Overall, RFM analysis is a practical technique for customer segmentation that gives businesses genuine insight into customer behavior. By using Python to conduct RFM analysis, businesses can gain a competitive edge in today's rapidly evolving marketplace.
