Wednesday, October 25, 2017

Normality Assumption and Log Transformation


Normal Distribution

The normal distribution is the most common distribution in most circumstances. If we observe enough data (which is often the hard part), more often than not we can see that the distribution of the data is close to normal. This is because of the central limit theorem. It states that if we take the average of enough independent random variables, the distribution of that average approaches the Normal Distribution.

The problem with collecting data is that you do not know what distribution the data follow. What we have is a sample, without a known distribution to help figure it out. The true distribution is generally not knowable: there are tons of distributions and ours may match none of them. This is where the central limit theorem comes in. It states that the mean of our sample does have a known distribution even if we do not know the distribution of the population. That 'known distribution' is the Normal Distribution.
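
As a quick illustration (a small sketch I am adding here, using the exponential distribution simply because it is clearly non-normal), we can watch sample means become bell-shaped:

> set.seed(1)
> # raw draws from an exponential are strongly right-skewed
> hist(rexp(10000, rate = 1))
> # but the means of many such samples already look close to normal
> sampleMeans <- replicate(10000, mean(rexp(50, rate = 1)))
> hist(sampleMeans)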

In real life we don't get exactly the Normal Distribution, but what we get is close to it. It is close enough to treat the distribution as normal, and that allows us to predict things. This is where the Normal Distribution is crucial. The Normal Distribution is simple: it takes only two parameters, and they are easy to compute. One is the mean and the other is the variance, which is the square of the standard deviation.

The normal distribution allows us to make predictions and test their validity. For instance, in the case of linear regression we have this question: "Are our least squares estimators optimal?" Under normally distributed error terms, we can show that these estimators are optimal using maximum likelihood estimation. Maximum likelihood assumes that the sampling distribution of the errors is normal with some unknown mean and variance; that mean and variance can be estimated with MLE knowing only the errors of the sample. MLE finds the parameter values that maximize the likelihood of making the observations given the parameters, so it will pick the Normal Distribution under which your data are most likely.
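
As a small sketch of that idea (not the regression derivation itself), we can let R find the normal parameters that maximize the likelihood of some simulated "errors" and compare them with the usual sample estimates; fitdistr from the MASS package does the optimization:

> library(MASS)
> set.seed(1)
> errors <- rnorm(500, mean = 0, sd = 2)   # pretend these are regression residuals
> fitdistr(errors, "normal")               # maximum likelihood estimates of mean and sd
> mean(errors); sd(errors)                 # close to the ML estimates (sd() divides by n - 1, MLE by n)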

Also, if we want to construct confidence intervals or test hypotheses, we can use the normality assumption. A confidence interval is a range of values that gives the user a sense of how precisely a statistic estimates a parameter; we can use it as a "margin of error". For instance, at the 95% level we can say "we are 95% confident that the parameter lies inside this interval". Confidence intervals can also be used with distributions that are not normal, for example highly skewed ones, but it is easiest to understand what confidence intervals are about with symmetric distributions. (The normal distribution is not the only option for confidence intervals; we can also use bootstrapping.)

Let's see a little example using R,

> set.seed(1)
> x <- 2
> s <- 1
> n <- 1000
> error <- qnorm(0.975) * s / sqrt(n)
> error
[1] 0.0619795
> leftSide <- x - error
> rightSide <- x + error
> leftSide
[1] 1.93802
> rightSide
[1] 2.06198

x is the sample mean, s is the sample standard deviation and n is the number of observations drawn from the normal distribution (here n = 1000). Assuming normality, we can say that we are 95% confident that the true mean is between 1.93802 and 2.06198.

> set.seed(1)
> dist <- rnorm(1000,2,1)
> hist(dist)
> abline(v=leftSide, col="red",lwd=3 )
> abline(v=rightSide, col="blue",lwd=3 )






So, if we want to use those properties, we need to make our data closer to normal. To do that, we can run a skewness analysis and then transform our data.

SKEWNESS

In a normal distribution, the graph appears as a symmetrical "bell-shaped" curve, and the mean is equal to the median and the mode. So we can say that skewness is asymmetry in a statistical distribution: the curve is not "bell-shaped" but skewed either to the left or to the right.

In a right skewed distribution, the mean of the distribution exceeds mode while in a left skewed distribution the mode of the distribution exceeds the mean.

Let's have a look at a right skewed distribution from Kaggle's House Prices data.

> library(e1071)
> data <- read.csv("train.csv")
> SalePrice <- data$SalePrice
> skewness(SalePrice)
[1] 1.879009
> hist(SalePrice)


Here we can see that the distribution is skewed to the right. To make it closer to normal, we can apply a log transformation.

> skewness(log(SalePrice))
[1] 0.1210859

> hist(log(SalePrice))

We can also generate skewed data ourselves.

> library(sn)
> a <- rsn(n=100000, xi=20.24, omega=73.84, alpha=50.14)

> skewness(a)
[1] 0.9900123

> hist(a)



> skewness(log(a))

[1] -0.1412755

> hist(log(a))




The question is: can we use the log transform in every case? The answer is no. The log scale works on relative (multiplicative) changes. If we are interested in absolute (additive) changes, we need a different transformation, for example the square root. Let's say we have 10 dollars; after 1 dollar of interest we have 11 dollars. The change is 10%, or 1 dollar. The first way of measuring the change is relative, while the latter is additive.
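
As a small sketch (using simulated count-like data instead of the house prices above), the square root also reduces right skew and, unlike the log, it copes with zeros:

> library(e1071)
> set.seed(1)
> counts <- rpois(10000, lambda = 3)   # right-skewed and contains zeros
> skewness(counts)
> skewness(sqrt(counts))               # noticeably less skewed
> # log(counts) would produce -Inf for the zeros; log1p(counts) is the usual workaround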


So, this was an introductory blog about the normality assumption and data transformation techniques. I hope you got something useful from it!

Saturday, September 9, 2017

Data Analysis and Feature Engineering with Titanic Dataset using R

Hi! In this blog we are going to explore the famous Titanic data set. There will be visualisations using ggplot, and we are going to use the dplyr package, which is a grammar of data manipulation. New features will also be engineered from the given data! The data comes from Kaggle's Titanic competition. The objective is to create a model that predicts which passengers survived. Solutions are submitted to Kaggle: we don't know who survived in the test set, so we need to submit to see how our model works. The train and test sets are separated.

Let's start with loading libraries and files,

> library('corrgram')
> library(easyGgplot2)
> library('forcats')
> library(neuralnet)
> library(ggplot2)
> library(dplyr)
> library(randomForest)
> library(readr)   # for read_csv
> train <- read_csv('train.csv')
> test <- read_csv('test.csv')
> test$Survived <- NA
> combine <- rbind(train,test)

Let's check NA's.

> apply(combine, 2, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0         418           0           0           0         263 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           1        1014           2 

> glimpse(combine)
Observations: 1,309
Variables: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1...
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2...
$ Name        <chr> "Braund, Mr. Owen Harris", "...
$ Sex         <chr> "male", "female", "female", ...
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, ...
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1...
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0...
$ Ticket      <chr> "A/5 21171", "PC 17599", "ST...
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53....
$ Cabin       <chr> NA, "C85", NA, "C123", NA, N...
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q"...

Pclass is the passenger class, which is 1, 2 or 3. SibSp is the number of siblings or spouses travelling together. Parch is the number of parents or children travelling together. Ticket is the ticket number and Embarked is the port of embarkation.

Now we convert Survived, Pclass, Embarked and Sex to factors and rebuild the combine data frame.

> train <- train %>% mutate(Survived = factor(Survived),
                           Pclass=factor(Pclass),Embarked=factor(Embarked),Sex = factor(Sex))
> test <- test %>% mutate(Pclass=factor(Pclass),Embarked=factor(Embarked),Sex = factor(Sex))
> combine <- rbind(train,test)

How many passengers survived the accident?

> train %>% count(Survived)
# A tibble: 2 × 2
Survived     n
    <fctr> <int>
1        0   549
2        1   342

> p1 <- ggplot(train, aes(x=Pclass,fill=Survived)) + geom_bar(position='fill')
> p2 <- ggplot(train,aes(x=Age,color=Survived)) + geom_freqpoly(binwidth=1) + theme(legend.position = "none")
> p3 <- ggplot(train,aes(x=Age,color=Survived)) + geom_density() + theme(legend.position = "none")
> p4 <- ggplot(train,aes(x = SibSp+Parch,fill=Survived)) + geom_bar(position='fill') + theme(legend.position = "none")
> p5 <- ggplot(train,aes(x = Fare, color = Survived)) + geom_freqpoly(binwidth=0.04) + scale_x_log10() + theme(legend.position = "none")
> ggplot2.multiplot(p1,p2,p3,p4,p5)



Obviously the survival rate is higher as Pclass goes towards 1. For Age we have two plots, a frequency polygon and a density plot. The most notable pattern in them is the survival rate among passengers who are under 10 years old. SibSp + Parch, which can be thought of as family size, is also correlated with survival: small families with SibSp + Parch less than 4 seem to have a better chance to survive. Fare has a very wide range, even on the logarithmic scale, and its plot shows that as Fare goes up, survival chances also go up.

Let's see how the features are correlated with each other. In order to create a correlation matrix, all features must be numeric. We don't need PassengerId, Name, Cabin and Ticket.

> pCor <- train[,-c(1,4,9,11)] %>%
       mutate(Sex = fct_recode(Sex, "0" = "male", "1" = "female")) %>%
       mutate(Sex = as.integer(Sex),
              Pclass = as.integer(Pclass),
              Survived = as.integer(Survived),
              Embarked = as.integer(Embarked))
> corrgram(pCor, lower.panel=panel.pie, main = "Correlation Among Features")

As seen in the previous plots, Sex, Pclass and Fare are more correlated with Survived than the other features. There is also a certain amount of correlation between Fare and Pclass, and another between Pclass and Age. Did elderly people choose better classes? Maybe.

1. Handling Missing Cells

> apply(combine,2,function(x) sum(is.na(x)) )
PassengerId    Survived      Pclass        Name         Sex         Age       SibSp       Parch      Ticket 
          0         418           0           0           0         263           0           0           0 
       Fare       Cabin    Embarked 
          1        1014           2 

1.1 Age

How can we deal with Age? There are 263 missing values, which is considerably high. But there is something in the data set that we can use: all names carry a title like Mr, Master or Miss. We can estimate ages using those titles, since a person with the title 'Master' is at most 8 years old. We can simply take the median age of each title group and assign it to the missing cells. One can argue that this may cause overfitting; it is worth exploring.

Let's find the titles using regular expressions.

> Title <- sub(".*,.([^.]*)\\..*","\\1",combine$Name)
> combine <- combine %>% mutate(Title = factor(Title)) %>%
       mutate(Title = fct_collapse(Title, "Miss" = c("Mlle", "Ms"), "Mrs" = "Mme",
                                   "Ranked" = c("Major", "Dr", "Capt", "Col", "Rev"),
                                   "Royalty" = c("Lady", "Dona", "the Countess", "Don", "Sir", "Jonkheer")))
> levels(combine$Title)
[1] "Ranked"  "Royalty" "Master"  "Miss"    "Mrs"     "Mr"  

Let's see the survival rate of Title owners on train data set.

> train$Title <- combine$Title[1:891]
> p6 <- ggplot(train,aes(x=Title,fill = Survived)) + geom_bar(position="fill")
> p6




This plot also shows that females had a higher chance to survive than males. The survival rate in the Ranked group is low; we can say that the Titanic's personnel mostly died. We will use the Title feature to fill the missing values in Age. Since the Title feature largely overlaps with the Sex feature, including both may cause overfitting.

> combine <- combine %>% group_by(Title) %>% mutate(Age = ifelse(is.na(Age),round(median(Age, na.rm=T),1),Age))

We are done with Age. The next step is Cabin, which has 1014 NA values. Instead of filling it, let's look at survival rates based on whether the cabin is known or not.

1.2 Cabin

> combine$cabinKnown <- ifelse(is.na(combine$Cabin),FALSE,TRUE)
> p7 <- ggplot(combine[1:891,], aes(x=cabinKnown,fill=Survived)) + geom_bar(position="fill")



WOW! Seems like the survival rate is quite high if the cabin is known. I wonder what proportion of cabins is known per Pclass.

> p8 <- ggplot(combine[1:891,], aes(x=cabinKnown,fill=Survived)) + geom_bar() + facet_grid(~Pclass)



As suspected! If you are from first class, they value your information more! And, for sure, you have better chances to survive. I also want to know the relationship between Pclass, cabin number and Fare. Before that, we need to fill the empty cells in Fare and Embarked.

1.3 Fare

For fare, there is only one empty cell. Let's find that person.

> combine %>% filter(is.na(Fare))
Source: local data frame [1 x 14]
Groups: Title [1]

PassengerId Survived Pclass               Name    Sex   Age SibSp Parch Ticket  Fare  Cabin Embarked
<int>   <fctr> <fctr>             <fctr> <fctr> <dbl> <int> <int> <fctr> <dbl> <fctr>   <fctr>
1        1044       NA      3 Storey, Mr. Thomas   male  60.5     0     0   3701    NA     NA        S
# ... with 2 more variables: Title <fctr>, cabinKnown <lgl>

He is a third-class passenger who is 60.5 years old. We will fill the empty Fare cell as follows: each passenger class (Pclass) has fairly distinct fare values. Let's visualize that first.

> p9 <- ggplot(combine,aes(x=Pclass,y=Fare,color=Pclass)) + geom_boxplot() + scale_y_log10() + ggtitle('Fare v Pclass')



Here we can see that Fare in class 1 is distinctly higher than in classes 2 and 3. Similarly, there is a considerable difference between classes 2 and 3. Class 3 has some outliers. We will fill the NA with the median of the Fares in class 3.

> Fare = ifelse(is.na(combine$Fare) & combine$Pclass==3,
                round(median(combine$Fare[combine$Pclass==3], na.rm=TRUE),1), combine$Fare)
> combine$Fare = Fare

Great! Now let's look at the 2 missing cells in Embarked.

1.4 Embarked

> combine%>%filter(is.na(Embarked))
Source: local data frame [2 x 15]
Groups: Fare [1]

PassengerId Survived Pclass                                      Name    Sex   Age SibSp Parch Ticket
<int>   <fctr> <fctr>                                    <fctr> <fctr> <dbl> <int> <int> <fctr>
  1          62        1      1                       Icard, Miss. Amelie female    38     0     0 113572
2         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn) female    62     0     0 113572
# ... with 6 more variables: Fare <dbl>, Cabin <fctr>, Embarked <fctr>, Title <fctr>, cabinKnown <lgl>,
#   fareFull <dbl>

They are first-class female passengers, and their ticket number is the same, so they should have embarked from the same point. These two ladies also make us curious about shared tickets in general!

> p10 <- ggplot(combine, aes(x=Embarked,y=Pclass)) + geom_jitter(colour=combine$Pclass)




The plot shows that first-class passengers embarked mostly at C and S. The probability that a given first-class passenger embarked at point C is higher than at point S; we can see this by comparing the density of the other classes' passengers. By the way, S is Southampton, C is Cherbourg and Q is Queenstown, as noted on Wikipedia. A large number of passengers embarked at Southampton, but they were mostly third-class passengers. For this reason we decide to fill the empty cells with Cherbourg, C.

> combine <- combine %>% mutate(Embarked = as.character(Embarked)) %>% mutate(Embarked = ifelse(is.na(Embarked),'C',Embarked)) %>% mutate(Embarked = as.factor(Embarked))

Great! Now let's look into the not-so-secret groups among passengers. We are going to look at ticket groups, families and children.

2. Feature Engineering via Groups 

There are certain types of groups among passengers that had better or worse chances to survive. These include ticket groups, family size, being a child and travelling alone.

2.1 Ticket Group and Family

> ticketGroup <- combine %>% group_by(Ticket) %>% summarise(count=n())
> combine <- left_join(combine, ticketGroup, by="Ticket")   # adds the count column to combine
> familySize <- combine$SibSp + combine$Parch

What do you think is the chance that shared tickets are actually shared by families? Let's compute the correlation between these two vectors and see.

> cor(combine$count,familySize)

[1] 0.8005555

Looks like they are highly correlated. For this reason we will continue with only count, which is the number of people who share the same ticket.

> combine <- combine %>% mutate(count = as.factor(as.character(count)))
> p11 <- ggplot(combine[1:891,],aes(x=count,fill=Survived)) + geom_bar(position='fill')



This plot gives us a clue about the advantage of being in a small group. If you share your ticket with one, two or three other people, you have better chances to survive.

Let's look at the family size versus survival rate.

> p12 <- ggplot(combine[1:891,],aes(Parch,SibSp,color=Survived)) + geom_count()



Here we can see that passengers in groups smaller than 4 had higher chances of survival. Another thing we can see from the plot is that a huge number of passengers were travelling alone, which also needs to be explored. Let's create a large-family feature based on that plot.

> combine$large_family <- (combine$Parch > 3) | (combine$SibSp > 2)
> ggplot(combine[1:891,],aes(x=large_family,fill=Survived)) + geom_bar(position="fill")

Being a member of a large family decreases the odds of survival.

2.2 Fare

The Fare feature holds the ticket's fare. People who shared a ticket should also have shared its price.

> combine$fareShared = combine$Fare / as.numeric(as.character(combine$count))

2.3 Children

In the density plot above we showed that children had better chances to survive, so let's create another feature for being a child.

> combine$child <- combine$Age < 10
> p13 <- ggplot(combine[1:891,],aes(x=child,fill=Survived)) + geom_bar(position='fill')



Here we can see that passengers who are below 10 have better chances to survive.

2.4 No Family Member on Board - Alone

Now let's find the passengers that don't belong to any family group.

> combine <- combine %>% mutate(alone=(SibSp==0)&(Parch==0))
> p14 <- ggplot(combine[1:891,],aes(x=alone,fill=Survived)) + geom_bar(position = 'fill')



We can see that passengers who are alone had lower chances of survival than those who are not.

3. Model

Even if there is still a lot of room for improvement in the features, let's stop there and take a look at how our features perform. We will construct a prediction model using bagging. Bagging is a method for decreasing the variance of a statistical learning method; here we are going to use it with decision trees. Decision trees suffer from high variance, and bagging builds separate prediction models on bootstrapped training sets and averages them. Further information can be found here. The advantage of using bagged decision trees, that is, a random forest, is that it gives us importance values for each feature, so we can see each feature's role in predicting the survived passengers.

> train <- combine[1:891,c(2,3,5,6,12,14,17,19,20,21)]
> test <- combine[892:1309,c(3,5,6,12,14,17,19,20,21)]

> rf <- randomForest(Survived~Pclass+Sex+Embarked+cabinKnown+large_family+child+fareShared+alone+Age, data=train, mtry=3, importance=T)
> varImpPlot(rf)

Alone and Embarked seem to be less useful features. Since the main objective of this blog is to see how the engineered features perform, let's keep them.

We will predict the survived passengers using a 'Neural Network'. Neural networks accept only numeric variables; furthermore, the data needs to be in matrix format, not a data frame. The model.matrix command does both.

> matrix.train <- model.matrix(~Survived+Pclass+Sex+Age+Embarked+cabinKnown+large_family+child+fareShared+alone,data=train)
> matrix.train <- matrix.train[,-1]

Neural networks use 'squashing functions' for prediction. Squashing functions map the whole real axis into a finite interval. A famous example of a squashing function is the sigmoid, whose output lies between zero and one; think of logistic regression. For this reason, neural networks work more efficiently if we scale our variables between zero and one. Scaling also matters when features live on very different scales: in our data set, Age and fareShared are on a different scale than the rest, so we need to scale them.

> maxs = apply(matrix.train,2,max)
> mins = apply(matrix.train,2,min)
> scaled.train <- as.data.frame(scale(matrix.train,center=mins,scale=maxs-mins))

> matrix.test <- model.matrix(~ Pclass+Sex+Age+Embarked+cabinKnown+large_family+child+fareShared+alone,data=test)
> matrix.test <- matrix.test[,-1]
> maxs <- apply(matrix.test,2,max)
> mins <- apply(matrix.test,2,min)
> scaled.test <- as.data.frame(scale(matrix.test,center=mins,scale=maxs-mins))

Now we are ready to train our model. 

> nn <- neuralnet(Survived1~Pclass2+Pclass3+Sexmale+EmbarkedS+EmbarkedQ+cabinKnownTRUE+large_familyTRUE+childTRUE+fareShared+aloneTRUE+Age,data=scaled.train,hidden = 4,linear.output = F, act.fct = "tanh")
> pr.nn <- neuralnet::compute(nn,scaled.test)
> x = (pr.nn$net.result > 0.5)*1

"Tanh" is another squashing function that is equal to 2*sigmoid(x) - 1. Hidden is the number of units in the hidden layer. If we code plot(nn) we can see the structure.

> plot(nn)
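
We can check the tanh identity quickly in R (plogis is R's logistic, i.e. sigmoid, function):

> z <- seq(-3, 3, by = 0.5)
> all.equal(tanh(z), 2 * plogis(2 * z) - 1)   # should be TRUE
> range(tanh(z))                              # tanh squashes into (-1, 1) instead of (0, 1)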



To submit our prediction to Kaggle, run

> submit <- cbind(combine$PassengerId[892:1309], x)
> submit = data.frame(submit)
> colnames(submit) = c('PassengerId','Survived')
> write.csv(submit,'submit.csv',row.names = F)

and submit! The score will be about 0.72, which is an average score among competitors.

In this blog we saw how to use 'dplyr' at a basic level, ggplot, the very basics of feature engineering and a short introduction to some machine learning algorithms. This blog is highly inspired by other kernels that can be found on Kaggle, like this one. In the next blog I would like to talk more about neural networks! Stay tuned until next time!


Wednesday, August 23, 2017

Bayesian Data Analysis : Conjugate Prior with R

In the last blog, I talked about non-informative priors. As you might guess, there is one other type of prior, the informative prior. Informative priors are based on expert opinion, historical data and so on. After selecting the type of prior, we draw from it, and then we get our posterior distribution using a specific model. If our analysis is sequential, in other words if we add new data as we go, we can use the posterior distribution as the prior distribution in the next step: we add the new values and keep going like that. In this blog I want to use this property in the example.

Conjugate priors are a key concept for prior and posterior distributions. A simple definition of a conjugate prior is that the prior distribution belongs to the same family as the posterior distribution. For example, if we choose a Beta prior we will end up with a Beta posterior. This makes our lives easier.

Here is a simple example. Let's say we have a coin and we don't know whether it is fair or not, so we are not sure that the probability of getting heads is 1/2. We start with Uniform(0, 1), which means the probability of getting heads is anywhere between 0 and 1. Uniform(0, 1) is equal to Beta(1, 1), and I am going to use the Beta because it is the conjugate prior. After choosing our prior we toss the coin 6 times and get 2 heads and 4 tails, so we have two successful outcomes. With a Beta(α, β) prior the posterior is Beta(α + Σy, β + n - Σy), where Σy is the number of successes. Here α = 1, β = 1, Σy = 2 and n = 6, so our new Beta will be Beta(1 + 2, 1 + 6 - 2) = Beta(3, 5). This is the posterior distribution: we started with a Beta and ended with a Beta.
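
A quick sketch of this update in R, just plotting the two densities from the example above:

> theta <- seq(0, 1, length.out = 200)
> plot(theta, dbeta(theta, 3, 5), type = "l", col = "blue",
+      ylab = "density", main = "Beta(1,1) prior vs Beta(3,5) posterior")
> lines(theta, dbeta(theta, 1, 1), col = "red", lty = 2)   # the flat Beta(1,1) prior
> abline(v = 2/6, lty = 3)                                 # observed proportion of heads, 2 out of 6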

I want to perform a simple Bayesian data analysis using the property explained in the example. Let's say we have a company called Jacob's Eggs that sells organic eggs. After discussing with the marketing department how to sell them, our CEO comes up with a plan of handing out brochures to houses in specific areas. After handing them out to 25 houses, we sold 9 of our 3-month organic egg subscriptions. That is a 36% sales rate. But our CEO, whose name is Jacob John Blacksea, is not sure how to interpret those results. Is it a successful sales rate? Furthermore, a good friend of Jacob John, a very good sales specialist named Bahadirius, told him that the sales rate is often between 20 and 25%. Jacob got confused and wants a simple analysis to interpret the results. Here is the analysis,



>n_draws <- 100000
>prior <- runif(n_draws, 0,1)
>hist(prior)
 

>simulated_data <- rep(NA, n_draws)
>model <- function(parameter){
  return(rbinom(1, size = 25, p = parameter)) } 
>#rbinom takes probability from the prior and gives simulated # of success for each iteration 
>for( i in 1:n_draws){
    simulated_data[i] <- model(prior[i])
}
> posterior <- prior[simulated_data == 9]

> hist(posterior)

> abline(v=mean(posterior), col = "red")


What happened above is that we drew 100,000 values from a uniform distribution as the prior. Our model function returns the number of successes out of 25 trials for each prior probability, and we store that number in the simulated_data vector. Our posterior consists of the prior draws for which the simulated number of successes equals 9.


> mean(posterior)

[1] 0.3692231

> 9/25
[1] 0.36
> quantile(posterior, c(0.025,0.975))
     2.5%     97.5% 
0.2049356 0.5511299 

95% of the probabilities are distributed between 0.2 and 0.55.

> #95% of the probabilities falls between 0.2 and 0.55
> sum(posterior > 0.3) / length(posterior)
[1] 0.7666064
> sum(posterior > 0.4) / length(posterior)
[1] 0.3528043
> sum(posterior > 0.25) / length(posterior)
[1] 0.9092789

We had a 36% observed sales rate. The probability that our sales rate is higher than 0.25 is 0.909, so we are quite certain it is more than 0.25. But in the case of 0.4 we only have a probability of 0.35, so we can expect it to be less than 0.4. For now, 36% seems like a good sales rate.


In the above analysis we used the prior from before Bahadirius shared his information. To add the new information I will select Σy = 4 and n = 17, since 4 successes out of 17 is a rate of about 23.5%, right in the 20-25% range he mentioned. Remember that Uniform(0,1) is equal to Beta(1,1).

>#Beta(1 + 4, 1 + 17 - 4)
>priorB <- rbeta(n_draws, 5, 14) 
>hist(priorB)

This prior is more informative than the uniform distribution.

> simulated_data2 <- rep(NA,n_draws)
> for(i in 1:n_draws){
+     simulated_data2[i] <- model(priorB[i])
+ }
> posteriorB <- priorB[simulated_data2 == 9]
> hist(posteriorB)
> mean(posteriorB)
[1] 0.3188646

> hist(posteriorB)
> abline(v=mean(posteriorB), col = "red", lwd = 2)
Here the posterior mean of the sales rate is near 32%, whereas we had 36% in the first place. To interpret it we can do something like

> sum(posteriorB > 0.35) / length(posteriorB)
[1] 0.3180726

The probability that the sales rate is higher than 35% is 0.32. So, according to that analysis, we can say that our sales people have done a good job by selling 9 subscriptions!

The example in this blog is inspired by Rasmus Bååth's Research Blog, where there is a great series of lectures on Bayesian data analysis. If you are looking for an introductory lecture on Bayesian statistics, Coursera has some effective courses like this one.




Thursday, August 10, 2017

Bayesian : Non-informative priors & Jeffreys Prior & Fisher Information

Priors are what make the Bayesian approach strong. But building priors on knowledge we are not sure about is not recommended. If we want to take advantage of the Bayesian approach, like having posterior probabilities, we can use non-informative priors when we do not have the necessary data for an informative prior.

Using a non-informative prior lets the data have maximum influence on the posterior, rather than the prior. What makes a prior non-informative? The answer is the effective sample size, which is the amount of data our prior distribution is worth. As it approaches zero, our prior becomes non-informative.

Let's say we have a coin and we do not know whether it is fair or not, but we have flipped it two times and got one head and one tail. With this information we can use a uniform distribution θ ~ U[0, 1], which is equal to Beta(1, 1). The effective sample size of that Beta distribution is 2. This is not telling us much about whether the coin is fair, but it still carries some information, so we need to get closer to zero. Taking the limit, we eventually end up with Beta(0, 0). The density of Beta(0, 0) is proportional to θ^(-1)(1-θ)^(-1). If we integrate this, the integral is infinite, so it is not a true density. This is called an 'improper prior': it does not have a proper density, but it is still usable. As long as we get some data later on, we can get proper posteriors from improper priors. The posterior density will be proportional to θ^(y-1)(1-θ)^(n-y-1), that is, Beta(y, n-y).

For instance, after choosing the improper Beta(0, 0) prior, let's say we observe one head and one tail. If getting a head is a success, our posterior will be Beta(1, 1) and the posterior mean will be y/n, which is 1/2 in our example. This is exactly equal to the maximum likelihood estimate. So, by using this improper prior, we get a posterior whose point estimate is the same as the frequentist one, but at the same time we have a full posterior. If we want to make interval or probability statements, we can actually find an interval and say that there is a 95% probability that θ is in that interval.
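
For example (a small sketch; the 40-out-of-100 numbers below are made up just to show how the interval tightens), the Beta(y, n - y) posterior gives such an interval directly:

> # posterior from the improper Beta(0,0) prior after y heads in n tosses is Beta(y, n - y)
> y <- 1; n <- 2
> qbeta(c(0.025, 0.975), y, n - y)    # 95% interval for theta: still very wide
> qbeta(c(0.025, 0.975), 40, 60)      # with 40 heads in 100 tosses it tightens considerably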

Improper priors have a problem with parameterization. If we have a parameter that can take any value between 0 and 1, we can give it a uniform U(0, 1) prior. Let's call this parameter Z. Z falls anywhere between 0 and 1, but how about Z^200? It still lies between 0 and 1, yet it is no longer uniform; it concentrates near 0. So a prior that looks non-informative in one parameterization can be informative in another: the choice of parameterization affects the prior. We need a prior that does not depend on the choice of parameterization.
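
A quick way to see the problem in R (a sketch with simulated draws): Z is uniform, but Z^200 is anything but.

> set.seed(1)
> z <- runif(100000)
> par(mfrow = c(1, 2))
> hist(z, main = "Z ~ U(0,1)")    # flat
> hist(z^200, main = "Z^200")     # piles up near zero
> par(mfrow = c(1, 1))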

The Jeffreys prior has this property. The Jeffreys prior is proportional to the square root of the Fisher information: π(θ) ∝ √I(θ), where I(θ) is the Fisher information. Fisher information measures the expected amount of information the data carry about θ.


The Fisher information formula is

I(θ) = E[ (∂/∂θ log f(X | θ))² | θ ] = −E[ ∂²/∂θ² log f(X | θ) | θ ],
where X may contain more than one data point. If I(θ) is large, the likelihood changes quickly as we change the parameter; we can draw that conclusion because the Fisher information is built from the second derivative of the log-likelihood, which measures how curved it is. A large I(θ) means the likelihood is sharply peaked near the true parameter θ and small everywhere else, so the posterior is determined mostly by the data generated under θ rather than by the prior. This supports our idea of a non-informative prior: the posterior should be determined by the data, not the prior.

An example using the normal model follows. We will use a single point from a normal model with known variance and compute its Fisher information for the mean μ.


First, the log-likelihood of a single point from a normal model is, up to an additive constant, −(X − μ)² ∕ 2σ². Next we take the second derivative with respect to μ and then the negative of its expected value; the result is 1/σ². If we extend it to a sample of n points from the normal model it becomes n/σ². As can be seen, this is a constant in μ, and so is its square root: it does not vary as the parameter varies. And since the prior is just a constant over a parameter that can take any value between minus and plus infinity, it is an improper, non-informative prior. Priors of this kind are Jeffreys priors.
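
As a rough numerical check (a sketch, not part of the derivation), we can approximate the expected negative second derivative of the log-likelihood at the true μ and compare it with 1/σ²:

> set.seed(1)
> mu <- 0; sigma <- 2; h <- 1e-3
> x <- rnorm(100000, mu, sigma)
> loglik <- function(m) dnorm(x, m, sigma, log = TRUE)    # log-likelihood of each point
> secondDeriv <- (loglik(mu + h) - 2 * loglik(mu) + loglik(mu - h)) / h^2
> mean(-secondDeriv)    # approximately 0.25
> 1 / sigma^2           # Fisher information of a single point: 0.25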



A more detailed explanation of the Jeffreys prior is here.


There is a great blog about Jeffreys prior here.

Friday, July 21, 2017

Customer Analysis : RFM Scoring with R



Hello! In this blog I want to write about some basic topics in customer analysis. I am currently reading The Power of Habit by Charles Duhigg, and I came across a great example of customer analysis in that book. It is about the company Target and its statisticians' ability to detect pregnant customers so that they can send them coupons accordingly. They do this by analyzing people's shopping patterns. Once they recognize who is pregnant and approximately when she is going to give birth, they can win those customers and make them buy their stuff from Target. I found that fantastic. I cannot do such an analysis for now, but it made me curious about the topic. Later on I learned about RFM scoring and some other basic types of analysis, and I want to do an RFM example in this blog.

Let's load our data. It can be downloaded from here.

> customerDf <- read.table("CDNOW_sample.txt")
> customerDf <- customerDf[,-2]
#Second column is their id in sample data set. we can delete it.
> colnames(customerDf) <- c("ID", "Date", "Unit", "Total Amount")
> class(customerDf$Date)
[1] "integer"
#We need to make it Date

> customerDf[,2] <- as.Date(as.character(customerDf[,2]),"%Y%m%d")

Here is our data. What can we do with those columns? One possible answer is RFM scoring. RFM stands for Recency, Frequency and Monetary. Recency refers to how recently the customer purchased: once we decide on a reference date, recency is the number of days between that reference day and the customer's last purchase. Frequency is simply how often they purchased in that interval, and Monetary is how much they spent.

In RFM scoring, we compute those RFM values and then score them. We score according to our expectations and the average sales we are making. The scores usually lie in the interval [1, 5], with 5 the highest, so a customer scored 555 is our best customer. In other words, it is a customer value metric. I will give some further sources at the end of the blog.


Let's start with computing our Recency value.


> refDay  <- c("19980101")
> refDay <-as.Date(refDay,"%Y%m%d")
> df <- customerDf[customerDf$Date<refDay,]
> min(customerDf$Date)
[1] "1997-01-01"

The time interval is now decided: since the very first customer data comes from 1997-01-01, it runs from 1997-01-01 to 1998-01-01.

> uniqCust <- unique(df$ID)
> dateDf <- rep(NA,length(uniqCust))
> for(i in 1:length(uniqCust)){
+     dateDf[i] <- max(df[df$ID==uniqCust[i],2])
+ }

> head(dateDf)
[1] 10207  9874  9862  9862  9862 10201
#Here we have customers' last shopping days in numeric type.
#We can compute the recency by simply subtracting it from our reference day.
> recency = as.numeric(refDay) - dateDf
> head(recency)

[1]  20 353 365 365 365  26

Frequency is just how many times they shopped. 


> frequency <- data.frame(table(customerDf$ID))
> colnames(frequency) <- c("ID", "Frequency")
> head(frequency)
  ID Frequency
1  4         4
2 18         1
3 21         2
4 50         1
5 60         1
6 71         1

Monetary value;

> revenue <- rep(0, length(uniqCust))
> for(j in 1:length(uniqCust)){
  +     for(i in 1:length(df$ID)){
    +         if(uniqCust[j] == df$ID[i]){
      +             revenue[j] = revenue[j] + df$`Total Amount`[i]
      +         }
    +     }
  + }
> head(revenue)
[1] 100.50  75.11   6.79  13.97  23.94 714.12
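
The nested loop above works but gets slow on larger data; as a sketch, the same per-customer totals can be computed in one line with aggregate (rows come back sorted by ID, so match() realigns them with uniqCust):

> revenueAgg <- aggregate(df$`Total Amount`, by = list(ID = df$ID), FUN = sum)
> revenue2 <- revenueAgg$x[match(uniqCust, revenueAgg$ID)]
> all.equal(revenue, revenue2)    # should be TRUE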

We have computed the necessary values. Now we are ready for scoring.


> frequency <- frequency[match(uniqCust, frequency$ID),]   # align frequency rows with uniqCust
> rfmTable <- cbind.data.frame(uniqCust,recency,frequency$Frequency,revenue)
> head(rfmTable)
  uniqCust recency frequency$Frequency revenue
1        4      20                   4  100.50
2       21     353                   1   75.11
3       50     365                   2    6.79
4       71     365                   1   13.97
5       86     365                   1   23.94
6      111      26                   1  714.12

Let's start scoring. A customer whose last purchase was within 5 days of the reference day gets the best score, 5. The remaining recency break points are 28, 84, 168 and 366 days.


> rankR <- cut(as.numeric(rfmTable$recency), breaks = c(0,5,28,84,168,366))
> levels(rankR) <- c(5,4,3,2,1)

In the case of frequency, 0 to 3 purchases get the worst score, 1; (3,4] gets 2, (4,7] gets 3, (7,9] gets 4 and more than 9 gets 5 points.

> rankF <- cut(rfmTable$`frequency$Frequency`, breaks = c(0,3,4,7,9,10000))
> levels(rankF) <- c(1,2,3,4,5)

Lastly, for monetary value, intervals will be like (0, 99], (99, 299], (299,599], (599,1000], (1000,1000000].

> rankM <- cut(rfmTable$revenue, breaks = c(0,99,299,599,1000,1000000))
> levels(rankM) <- c(1,2,3,4,5)

Now we are ready to construct our RFM score matrix.

> rfmScores <- cbind(uniqCust, rankR, rankF, rankM)
> head(rfmScores)
     uniqCust rankR rankF rankM
[1,]        4     2     2     2
[2,]       21     5     1     1
[3,]       50     5     1     1
[4,]       71     5     1     1
[5,]       86     5     1     1
[6,]      111     2     1     4

Now that we have these scores, we know our customers better and can take action accordingly. For instance, for customers scored 353 it would be wise to send some coupons, since they shopped with us very frequently but not that recently.
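
To find groups like those "353" customers directly, one option (a small sketch) is to paste the three ranks into a single code and tabulate it:

> rfmCode <- paste0(rankR, rankF, rankM)            # three-digit codes like "353"
> head(sort(table(rfmCode), decreasing = TRUE))     # most common score combinations
> uniqCust[rfmCode == "353"]                        # candidates for the coupon campaign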
