Friday, July 21, 2017

Customer Analysis : RFM Scoring with R



Hello! In this post, I want to write about some basic topics in customer analysis. I am currently reading The Power of Habit by Charles Duhigg, and the book contains a great example of customer analysis. It is about the company Target and its statisticians' ability to detect pregnant customers so that the company can send them relevant coupons. They do this by analyzing people's shopping patterns. Once they recognize who is pregnant and approximately when the customer is going to give birth, they can win those customers over and get them to do their shopping at Target. I found that fantastic. I cannot do that kind of analysis yet, but it made me curious about the topic. Later on I learned about RFM scoring and some other basic types of analysis. In this post I want to work through an example of RFM.

Let's load our data. It can be downloaded from here.

> customerDf <- read.table("CDNOW_sample.txt")
#The second column is the customer's id within the sample data set. We can delete it.
> customerDf <- customerDf[,-2]
> colnames(customerDf) <- c("ID", "Date", "Unit", "Total Amount")
> class(customerDf$Date)
[1] "integer"
#The dates were read as integers such as 19970101, so we need to convert them to Date.

> customerDf[,2] <- as.Date(as.character(customerDf[,2]),"%Y%m%d")

Here is our data. What can we do with these columns? One possible answer is RFM scoring. RFM stands for Recency, Frequency and Monetary. Recency answers the question: how recently did the customer purchase? Once we decide on a reference date, recency is the number of days between the reference date and the customer's last purchase. Frequency is simply how often the customer purchased in that interval. Monetary is how much the customer spent.

In RFM scoring, we compute those RFM values and then we score them. We set the score boundaries according to our expectations and the average sales we are making. The scores are usually in the interval [1,5], where 5 is the highest score, so a customer scored 555 is our best customer. In other words, it is a customer value metric. I will give some further sources at the end of the blog.


Let's start with computing our Recency value.


> refDay  <- c("19980101")
> refDay <-as.Date(refDay,"%Y%m%d")
> df <- customerDf[customerDf$Date<refDay,]
> min(customerDf$Date)
[1] "1997-01-01"

The time interval is now fixed. Since the earliest purchase in the data is from 1997-01-01, the interval runs from 1997-01-01 up to, but not including, 1998-01-01.

> uniqCust <- unique(df$ID)
> dateDf <- rep(NA,length(uniqCust))
> for(i in 1:length(uniqCust)){
+     dateDf[i] <- max(df[df$ID==uniqCust[i],2])
+ }

> head(dateDf)
[1] 10207  9874  9862  9862  9862 10201
#Here we have customers' last shopping days in numeric type.
#We can compute the recency by simply subtracting it from our reference day.
> recency = as.numeric(refDay) - dateDf
> head(recency)
[1]  20 353 365 365 365  26

Frequency is just how many times they shopped. 


#Count purchases per customer inside the interval (df), not over the whole data set.
> frequency <- data.frame(table(df$ID))
> colnames(frequency) <- c("ID", "Frequency")
> head(frequency)
  ID Frequency
1  4         4
2 18         1
3 21         2
4 50         1
5 60         1
6 71         1

Monetary value:

> revenue <- rep(0, length(uniqCust))
> for(j in seq_along(uniqCust)){
+     #Sum the amounts of all purchases made by customer j.
+     revenue[j] <- sum(df$`Total Amount`[df$ID == uniqCust[j]])
+ }
> head(revenue)
[1] 100.50  75.11   6.79  13.97  23.94 714.12
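
As a side note, the three quantities can also be computed without explicit loops. The block below is only a minimal sketch on a made-up toy data frame (the values and the names toy, lastDate, monetary and rfm are my own assumptions, not the CDNOW data). It uses tapply, which groups by ID, so recency, frequency and monetary value stay aligned customer by customer.

```r
# Toy transactions: made-up values for illustration, not the CDNOW data.
toy <- data.frame(
  ID     = c(1, 1, 2, 3, 3, 3),
  Date   = as.Date(c("1997-01-05", "1997-12-20", "1997-03-10",
                     "1997-02-01", "1997-06-15", "1997-12-30")),
  Amount = c(10, 20, 5, 7, 8, 9)
)
refDay <- as.Date("1998-01-01")

# tapply() groups by ID, so every result is ordered by the same sorted IDs.
lastDate  <- tapply(as.numeric(toy$Date), toy$ID, max)   # last purchase day
recency   <- as.numeric(refDay) - lastDate               # days since last purchase
frequency <- tapply(toy$Amount, toy$ID, length)          # number of purchases
monetary  <- tapply(toy$Amount, toy$ID, sum)             # total amount spent

rfm <- data.frame(ID        = as.numeric(names(recency)),
                  recency   = as.vector(recency),
                  frequency = as.vector(frequency),
                  monetary  = as.vector(monetary))
print(rfm)
```

Because every column comes from the same grouping, there is no risk of pairing one customer's frequency with another customer's revenue.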

We have computed the necessary values. Now we are ready for scoring.


> rfmTable <- cbind.data.frame(uniqCust,recency,frequency$Frequency,revenue)
> head(rfmTable)
  uniqCust recency frequency$Frequency revenue
1        4      20                   4  100.50
2       21     353                   1   75.11
3       50     365                   2    6.79
4       71     365                   1   13.97
5       86     365                   1   23.94
6      111      26                   1  714.12

Let's start scoring. A last purchase within 5 days of the reference day gets the best score, 5. The remaining interval boundaries are 28, 84, 168 and 366 days, with the score dropping by one at each boundary.


> rankR <- cut(as.numeric(rfmTable$recency), breaks = c(0,5,28,84,168,366))
> levels(rankR) <- c(5,4,3,2,1)
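
To see what cut is doing here, below is a small sketch with made-up recency values (recencyToy, scoreR, codes and scores are names I chose for illustration). It also shows one thing to be careful about: a factor stores internal codes 1 to 5 in level order, and for recency those codes run in the opposite direction to the labels.

```r
# Made-up recency values, one for each interval.
recencyToy <- c(3, 20, 60, 100, 353)

# Same breaks as above; the labels are given directly instead of via levels().
scoreR <- cut(recencyToy,
              breaks = c(0, 5, 28, 84, 168, 366),
              labels = as.character(5:1))

codes  <- as.integer(scoreR)                 # internal factor codes
scores <- as.integer(as.character(scoreR))   # the actual score labels

print(data.frame(recencyToy, codes, scores))
```

So whenever the scores are needed as numbers, going through as.character first returns the labels rather than the codes.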

For frequency, (0,3] gets the worst score, 1. Then (3,4] gets 2, (4,7] gets 3, (7,9] gets 4 and more than 9 purchases gets 5 points.

> rankF <- cut(rfmTable$`frequency$Frequency`, breaks = c(0,3,4,7,9,10000))
> levels(rankF) <- c(1,2,3,4,5)

Lastly, for monetary value, intervals will be like (0, 99], (99, 299], (299,599], (599,1000], (1000,1000000].

> rankM <- cut(rfmTable$revenue, breaks = c(0,99,299,599,1000,1000000))
> levels(rankM) <- c(1,2,3,4,5)

Now we are ready to construct our RFM score matrix.

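#cbind() on factors stores their internal codes (1 to 5 in level order), which would reverse rankR.
#So we convert the factor labels to numbers first.
> rfmScores <- cbind(uniqCust,
+                    rankR = as.numeric(as.character(rankR)),
+                    rankF = as.numeric(as.character(rankF)),
+                    rankM = as.numeric(as.character(rankM)))
> head(rfmScores)
     uniqCust rankR rankF rankM
[1,]        4     4     2     2
[2,]       21     1     1     1
[3,]       50     1     1     1
[4,]       71     1     1     1
[5,]       86     1     1     1
[6,]      111     4     1     4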

Now that we have these scores, we know our customers better and can act accordingly. For instance, if we have customers scored 353, it would be wise to send them some coupons, since they have shopped with us very frequently but not that recently.
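
Here is a rough sketch of how such a selection could look. The customers and their scores are hypothetical (the IDs 101 to 104 are made up), and the rule "frequent but not recent" is just one possible rule that I picked for illustration.

```r
# Hypothetical customers with their three scores (made-up values).
scores <- data.frame(
  ID    = c(101, 102, 103, 104),
  rankR = c(5, 3, 1, 3),
  rankF = c(5, 5, 1, 5),
  rankM = c(5, 3, 1, 2)
)

# Combine the three digits into a single code such as "353".
scores$RFM <- paste0(scores$rankR, scores$rankF, scores$rankM)

# One possible rule: frequent buyers (F >= 4) who have not bought recently (R <= 3).
couponTargets <- scores[scores$rankF >= 4 & scores$rankR <= 3, ]
print(couponTargets)
```

The thresholds in the rule are assumptions. In practice they would depend on the business and on how the score intervals were chosen.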

Tuesday, July 11, 2017

Image Compression Using SVD with R




I just finished the great Linear Algebra course by Prof. Gilbert Strang, thanks to MIT OpenCourseWare. It was a fantastic series of lectures. The reason I started the course in the first place was to understand Singular Value Decomposition, but I learned a lot more along the way. I definitely recommend watching some of the lectures even if you have already taken a course on Linear Algebra. After finishing the course, I decided to write a post using concepts that I learned from it. Let's start!


Singular Value Decomposition (SVD) is another matrix factorization. Applying SVD to a matrix gives three matrices: orthogonal, diagonal, orthogonal. Let Q' denote the transpose of Q. For a symmetric positive semidefinite matrix, the eigenvalue-eigenvector factorization QΛQ' is an example of SVD: Q has orthonormal columns (orthogonal eigenvectors become orthonormal once each one is divided by its length) and Λ is diagonal. But not all matrices can be factorized into that great form. If matrix A is not symmetric, or not even square, we cannot get such a factorization with orthonormal eigenvectors. Instead we need to look at AA' and A'A, which are always symmetric, and get their eigenvectors.

So with matrix A, what we are looking for is orthonormal bases. SVD factors the matrix into three matrices, A = UDV'. The columns of U that belong to nonzero singular values form an orthonormal basis for the column space, and the matching columns of V form an orthonormal basis for the row space. The columns of U are eigenvectors of AA' and the columns of V are eigenvectors of A'A. The nonzero eigenvalues of AA' and A'A are equal.

After applying those steps we get our SVD factorization. But why is it necessary? Where can we use it? In the SVD factorization, the diagonal matrix D holds the square roots of the eigenvalues of A'A (equivalently, of AA'), not of A itself. Those square roots are the singular values of A. The singular values vary in size: some are significant and others are extremely small. If we keep only the significant singular values, we can still represent our matrix to a good extent. This concept can be applied to image compression. If we have an image of 512 x 512 pixels, we do not reduce the number of pixels. Instead we store far fewer numbers by keeping only the first few singular values and their vectors. In other words, we replace the image with a low-rank approximation.
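
These relations are easy to check numerically. The block below is a small sketch on an arbitrary 4 x 3 matrix that I made up for the purpose, using only base R (svd, eigen, crossprod, tcrossprod). Each check prints TRUE if the relation holds up to rounding error.

```r
# An arbitrary 4 x 3 example matrix (made up for this check).
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 1,
              1, 1, 2), nrow = 4, byrow = TRUE)

s <- svd(A)   # s$u, s$d, s$v with A = U D V'

rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
eigAtA  <- eigen(crossprod(A),  symmetric = TRUE)$values   # eigenvalues of A'A (3 values)
eigAAt  <- eigen(tcrossprod(A), symmetric = TRUE)$values   # eigenvalues of AA' (4 values)

print(all.equal(rebuilt, A))                  # U D V' gives back A
print(all.equal(s$d^2, eigAtA))               # singular values squared = eigenvalues of A'A
print(all.equal(eigAAt[1:3], eigAtA))         # nonzero eigenvalues of AA' and A'A agree
print(all.equal(crossprod(s$u), diag(3)))     # columns of U are orthonormal
```

The fourth eigenvalue of AA' is zero up to rounding, which is why only the first three are compared.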

Let's apply this using R.
The jpeg package can be used to read and write jpg files. I will compress an image of bacteria.




> library(jpeg)
> bacteria <- readJPEG('bac.jpg')
> ncol(bacteria)
[1] 600
> nrow(bacteria)
[1] 431
> dim(bacteria)
[1] 431 600   3
The RGB color model uses red, green and blue channels. As can be seen from the dimensions, our image has 431x600 pixels for each of the three channels. We can split them into three different matrices.
> r <- bacteria[,,1] 
> g <- bacteria[,,2]
> b <- bacteria[,,3] 
Now we have our RGB colors. We are ready to apply SVD.
> bacteria.r <- svd(r)
> bacteria.g <- svd(g)
> bacteria.b <- svd(b)
Now let's make a list using them.
> rgb.list <- list(bacteria.r, bacteria.g, bacteria.b)
rgb.list contains 3 lists, and each of them is itself a list of 3 elements: d (the singular values), u and v. We need to reach those elements to rebuild each channel. For this, the sapply function is useful, since it applies a function to every element of the list.
> comp <- sapply(rgb.list, function(i){
+     i$u[,1:3] %*% diag(i$d[1:3]) %*% t(i$v[,1:3])
+ }, simplify = 'array')
Here we select only the first 3 components of the SVD instead of all 431. In other words, we have a rank 3 matrix for each channel. Using all the components would give back our original image.
#paste0() joins the pieces without spaces. The 'compressed' folder must already exist.
> writeJPEG(comp, paste0('compressed/', 'bacteria_svd_level_', 3, '.jpg'))
Here is our rank 3 image.




This is the rank 53 version of the same process. The only change in the code is replacing 1:3 with 1:53 in all three places.
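
Since only the rank changes, the step can be wrapped in a small function. This is just a sketch: rankKApprox is a hypothetical helper name of mine, and I test it on a random matrix instead of the bacteria image. The drop = FALSE and diag(..., nrow = k) parts are there so that it also works for k = 1, where diag would otherwise be called with a single number and build an identity matrix instead.

```r
# Hypothetical helper: rank-k approximation of a matrix from its SVD.
rankKApprox <- function(m, k) {
  s <- svd(m)
  k <- min(k, length(s$d))   # cannot keep more components than exist
  # drop = FALSE and nrow = k keep everything a matrix even when k = 1.
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k, drop = FALSE])
}

# Random 20 x 30 test matrix standing in for one color channel.
set.seed(42)
m <- matrix(runif(20 * 30), nrow = 20)
d <- svd(m)$d

# The approximation error shrinks as the rank grows.
for (k in c(1, 3, 10, 20)) {
  err <- sqrt(sum((m - rankKApprox(m, k))^2))
  cat("rank", k, "error", err, "\n")
}
```

With the image, something like sapply(rgb.list.of.channels, rankKApprox, k = 53, simplify = 'array') should then work on a list of the raw r, g and b matrices (not on the svd results), where rgb.list.of.channels is a placeholder name for list(r, g, b).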





With rank 53 we can get most of the image. Even less can be enough to see those bacteria. SVD helped us to decrease the amount of data needed to represent the image. That is helpful when we are working with lots of images, since it makes the data easier to store and retrieve.
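
To put a number on that, here is a quick back-of-the-envelope sketch. It assumes we store the factors themselves: for rank k that is k columns of U, k columns of V and k singular values per channel. The function name storedNumbers is my own. Note that this counts stored numbers only. The jpg file written above still contains every pixel, so its size on disk depends on the JPEG encoding, not on this count.

```r
# Numbers needed per channel for a rank-k factorization:
# k columns of U (nRows each), k columns of V (nCols each), k singular values.
storedNumbers <- function(nRows, nCols, k) k * (nRows + nCols + 1)

nRows <- 431
nCols <- 600
full  <- nRows * nCols   # numbers in one full channel

for (k in c(3, 53)) {
  kept <- storedNumbers(nRows, nCols, k)
  cat("rank", k, ":", kept, "numbers,",
      round(100 * kept / full, 1), "% of", full, "\n")
}
```

So per channel, rank 3 keeps 3096 numbers and rank 53 keeps 54696, compared with 258600 for the full matrix.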








