Wednesday, October 25, 2017

Normality Assumption and Log Transformation


Normal Distribution

The normal distribution is the distribution we meet most often in practice. If we observe enough data (and collecting enough is a problem in itself), more often than not the quantity we measure, when it is the sum or average of many small independent effects, has a distribution close to normal. This is a consequence of the central limit theorem. It states that the average of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed.

The problem with collected data is that we do not know which distribution it follows. All we have is a sample, with nothing to tell us where it came from, and the true distribution is generally not knowable. There are many named distributions, and our data may match none of them. This is where the central limit theorem comes in: even if we do not know the distribution of the population, the mean of a sufficiently large sample has an approximately known distribution, and that distribution is the normal distribution.
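
As a quick illustration of the central limit theorem, the sketch below draws many samples from a clearly non-normal (exponential) population and looks at the distribution of their means. The exponential population, the sample size of 50 and the variable names are assumptions chosen for this illustration.

```r
set.seed(42)

# Population: exponential with rate 1 (strongly right skewed, mean 1, sd 1)
n_samples   <- 5000   # number of repeated samples
sample_size <- 50     # observations per sample

# One mean per simulated sample
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The means cluster around the population mean (1), with a standard
# deviation close to 1 / sqrt(sample_size)
mean(sample_means)
sd(sample_means)
1 / sqrt(sample_size)

# The histogram of the means looks roughly bell-shaped,
# even though the raw data are far from normal
hist(sample_means, breaks = 40)
```

Increasing `sample_size` should make the histogram of the means look more symmetric, while the raw exponential draws stay skewed.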

In real life we never get an exactly normal distribution, but we often get something close to it, close enough to treat the distribution as normal and to make predictions on that basis. This is what makes the normal distribution so useful. It is also simple: it is fully described by two parameters that are easy to estimate, the mean and the variance (the variance is the square of the standard deviation).
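
For reference, the density of a normal distribution depends only on these two parameters. The symbols mu (mean) and sigma^2 (variance) are notation introduced here for illustration:

```latex
f(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```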

The normal distribution allows us to make predictions and test their validity. In linear regression, for instance, we can ask: "Are our least squares estimators optimal?" If the error terms are normally distributed, we can show that they are, using maximum likelihood estimation (MLE). We assume that the errors are normally distributed with mean zero and some unknown variance; the regression coefficients and the variance can then be estimated from the sample alone. MLE finds the parameter values that maximize the likelihood of making the observations given the parameters. In other words, MLE picks the normal distribution under which our data are most likely, and for the coefficients this gives the same answer as least squares.
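
The link between least squares and maximum likelihood can be written out explicitly. As a sketch, assume a linear model y_i = x_i' beta + e_i with independent errors e_i that are normal with mean zero and variance sigma^2 (this notation is introduced here for illustration). The log-likelihood of n observations is then:

```latex
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top\beta\right)^2
```

For any fixed sigma^2, only the second term depends on beta, so maximizing the log-likelihood over beta is the same as minimizing the sum of squared residuals.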

The normality assumption is also useful when we want to construct confidence intervals or carry out hypothesis tests. A confidence interval is a range of values that gives the user a sense of how precisely a statistic estimates a parameter. We can think of it as a "margin of error". For instance, for a 95% confidence interval we can say: "If we repeated the sampling many times, about 95% of the intervals built this way would contain the true parameter." Confidence intervals can also be used with distributions that are not normal, such as highly skewed distributions, but they are easiest to understand for symmetric distributions. (The normal distribution is not the only route to a confidence interval; bootstrapping is another.)
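
Bootstrapping can be sketched in a few lines of base R. The example below builds a percentile bootstrap interval for the mean of a skewed sample and compares it with the normal-approximation interval. The simulated exponential data, the 2000 resamples and the variable names are assumptions made for illustration.

```r
set.seed(1)

# Skewed sample: 200 draws from an exponential distribution with mean 2
obs <- rexp(200, rate = 0.5)

# Resample the data with replacement and record the mean each time
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(obs, replace = TRUE)))

# Percentile bootstrap 95% interval: the 2.5% and 97.5% quantiles
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci

# For comparison, the normal-approximation interval
normal_ci <- mean(obs) + c(-1, 1) * qnorm(0.975) * sd(obs) / sqrt(length(obs))
normal_ci
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it is the shortest to write down.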

Let's see a little example using R:

> set.seed(1)
> x <- 2
> s <- 1
> n <- 1000
> error <- qnorm(0.975) * s / sqrt(n)
> error
[1] 0.0619795
> leftSide <- x - error
> rightSide <- x + error
> leftSide
[1] 1.93802
> rightSide
[1] 2.06198

Here x is the sample mean, s is the sample standard deviation, and n is the sample size, that is, the number of observations (1000) drawn from the normal distribution. Assuming normality, we can say we are 95% confident that the true mean lies between 1.93802 and 2.06198.

> set.seed(1)
> dist <- rnorm(1000,2,1)
> hist(dist)
> abline(v=leftSide, col="red",lwd=3 )
> abline(v=rightSide, col="blue",lwd=3 )
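
To see what "95% confident" means in practice, the sketch below repeats the whole procedure many times and counts how often the interval contains the true mean. The true mean of 2 and standard deviation of 1 mirror the example above; the number of repetitions and the variable names are assumptions made for illustration.

```r
set.seed(123)

true_mean <- 2
true_sd   <- 1
n         <- 1000
n_repeats <- 2000

# For each repetition, draw a sample, build the interval, and check
# whether it contains the true mean
covered <- replicate(n_repeats, {
  smp    <- rnorm(n, mean = true_mean, sd = true_sd)
  margin <- qnorm(0.975) * sd(smp) / sqrt(n)
  (mean(smp) - margin <= true_mean) && (true_mean <= mean(smp) + margin)
})

# Proportion of intervals that contain the true mean; should be close to 0.95
mean(covered)
```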






So, if we want to use these properties, we need our data to be approximately normal. To get there, we can analyze the skewness of the data and then transform it.

Skewness

The graph of a normal distribution is a symmetrical "bell-shaped" curve, and in such a distribution the mean is equal to the mode. Skewness is asymmetry in a statistical distribution: the curve is no longer a symmetrical bell but is skewed, with a longer tail either to the left or to the right.

In a right skewed distribution the mean typically exceeds the mode, while in a left skewed distribution the mode typically exceeds the mean.
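
One common way to put a number on this is the moment-based sample skewness: the third central moment divided by the second central moment raised to the power 3/2. The helper `sample_skewness` below is a hypothetical base R function written for this illustration (it is not the `e1071` implementation, which offers several variants), and the data are simulated. Because the mode of a continuous sample is awkward to estimate, the last line compares the mean with the median instead.

```r
set.seed(7)

# Moment-based skewness: third central moment over (second central moment)^(3/2)
sample_skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^(3/2)
}

right_skewed <- rexp(10000, rate = 1)   # long right tail
left_skewed  <- -right_skewed           # mirror image: long left tail
symmetric    <- rnorm(10000)            # roughly symmetric

sample_skewness(right_skewed)   # positive
sample_skewness(left_skewed)    # negative
sample_skewness(symmetric)      # close to zero

# In the right skewed sample the mean is pulled above the median
mean(right_skewed) > median(right_skewed)
```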

Let's have a look at a right skewed distribution from Kaggle's House Prices data.

> library(e1071)
> data <- read.csv("train.csv")
> SalePrice <- data$SalePrice
> skewness(SalePrice)
[1] 1.879009
> hist(SalePrice)


Here we can see that the distribution is skewed to the right. To make it closer to normal, we can apply a log transformation.

> skewness(log(SalePrice))
[1] 0.1210859

> hist(log(SalePrice))
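
For readers without the Kaggle file at hand, the same effect can be reproduced on simulated data. The sketch below uses log-normal "prices", for which the log transform yields exactly normal values; the parameters, the `skew` helper and the variable names are assumptions made for illustration, and real data will rarely behave this cleanly. A normal Q-Q plot is added as a visual check alongside the skewness number.

```r
set.seed(2017)

# Simulated right skewed "prices": log-normal, so log(prices) is normal
prices <- rlnorm(5000, meanlog = 12, sdlog = 0.4)

# Moment-based skewness, written in base R for this example
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^(3/2)

skew(prices)        # clearly positive
skew(log(prices))   # close to zero

# Normal Q-Q plot of the transformed values; points lying near the
# reference line suggest an approximately normal distribution
qqnorm(log(prices))
qqline(log(prices))
```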

We can also generate skewed data ourselves.

> library(sn)
> a <- rsn(n=100000, xi=20.24, omega=73.84, alpha=50.14)

> skewness(a)
[1] 0.9900123

> hist(a)



> skewness(log(a))

[1] -0.1412755

> hist(log(a))




The question is whether we can use the log transform in every case. The answer is no. The log scale works on relative (multiplicative) changes. If we are interested in absolute (additive) changes, the log transform is not the right tool, and we should stay on the linear scale or consider another transformation, such as the square root. Let's say we have 10 dollars. After earning 1 dollar of interest, we have 11 dollars. The change is 10%, or 1 dollar. The first way of measuring the change is relative, while the second is absolute.
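
The multiplicative nature of the log scale is easy to check numerically. In the sketch below (the amounts are made up for illustration), a 10% increase produces the same difference on the log scale whatever the starting amount, while the absolute difference changes.

```r
before <- c(10, 100, 1000)
after  <- before * 1.10     # each amount grows by 10%

# Absolute (additive) change depends on the starting amount
after - before

# The difference on the log scale is the same for all three: log(1.1)
log(after) - log(before)
log(1.10)
```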


This was an introductory post about the normality assumption and data transformation techniques. I hope you got something useful from it!
