Normal Distribution
The normal distribution is the most common distribution in most circumstances. If we observe enough data (which is a problem in itself), more often than not we see that the distribution of the data is close to normal.
This is because of the central limit theorem. It states that if we average enough independent random variables drawn from the same distribution, the distribution of that average approaches the Normal Distribution.
The problem with collecting data is that we do not know what distribution the data follows. All we have is the sample, with no known distribution to help us figure it out. The true distribution is generally not knowable; there are tons of distributions and ours may match none of them. This is where the central limit theorem comes in. It tells us that the mean of our sample does have an (approximately) known distribution even if we do not know the distribution of the population. That "known distribution" is the Normal Distribution.
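As a quick illustration (a minimal sketch, not from the original example, using the exponential distribution as an arbitrary non-normal population), the means of repeated samples pile up in a roughly bell-shaped histogram:
# Assumed example: sample means of a non-normal (exponential) population
# are approximately normally distributed, as the central limit theorem predicts.
set.seed(1)
sampleMeans <- replicate(10000, mean(rexp(50, rate = 1)))
hist(sampleMeans)   # roughly bell-shaped around the population mean of 1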
In real life we never get an exactly Normal Distribution, but what we get is close to it, close enough to treat the distribution as normal and use it to predict things. This is where the Normal Distribution is crucial. It is simple: it takes only two parameters, and they are easy to compute. One is the mean and the other is the variance, where the variance is the square of the standard deviation.
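For example (a small sketch with simulated data, not part of the original example), we can estimate both parameters directly from a sample:
# Assumed example: the two parameters estimated from a simulated sample;
# the variance is just the squared standard deviation.
set.seed(1)
x <- rnorm(1000, mean = 2, sd = 1)
mean(x)    # estimate of the mean
var(x)     # estimate of the variance
sd(x)^2    # identical to var(x)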
The normal distribution allows us to make predictions and test their validity. For instance, in linear regression we face the question "Are our least squares estimators optimal?". If the error terms are normally distributed, we can show that these estimators are optimal using maximum likelihood estimation (MLE). Maximum likelihood assumes the errors are normally distributed with some unknown mean and variance; that mean and variance can then be estimated with MLE using only the errors observed in the sample. MLE finds the parameter values that maximize the likelihood of the observations given the parameters, so it picks the Normal Distribution under which your data are most likely.
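Here is a minimal sketch of that equivalence (simulated data and a hand-rolled likelihood, not from the original post): fitting a line with lm() and maximizing the normal likelihood with optim() give essentially the same intercept and slope.
# Assumed example: under normal errors, least squares and maximum
# likelihood give the same regression coefficients.
set.seed(1)
x <- runif(200)
y <- 1 + 3 * x + rnorm(200, sd = 0.5)

ols <- coef(lm(y ~ x))                      # least squares estimates

negLogLik <- function(p) {                  # p = (intercept, slope, log(sd))
  -sum(dnorm(y, mean = p[1] + p[2] * x, sd = exp(p[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), negLogLik)$par     # maximize the normal likelihood

ols         # intercept and slope from lm()
mle[1:2]    # essentially the same values from MLE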
Also, if we want to construct confidence intervals or hypothesis tests, we can rely on the normality assumption. A confidence interval is a range of values that gives a sense of how precisely a statistic estimates a parameter; we can think of it as a "margin of error". For instance, at the 95% level we fail to reject any null hypothesis whose parameter value lies inside the interval. Confidence intervals can also be used with distributions that are not normal, for example highly skewed ones, but it is easiest to understand what they are about with symmetric distributions.
(The normal distribution is not the only route to confidence intervals; we can also use bootstrapping, as in the sketch below.)
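As a small sketch of the bootstrap alternative (assumed example with simulated skewed data, not from the original post), we can build a 95% interval for the mean from resampled sample means, with no normality assumption:
# Assumed example: a 95% percentile bootstrap interval for the mean
# of a skewed sample.
set.seed(1)
x <- rexp(1000, rate = 0.5)                        # skewed sample, true mean 2
bootMeans <- replicate(10000, mean(sample(x, replace = TRUE)))
quantile(bootMeans, c(0.025, 0.975))               # 95% percentile interval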
Let's see a little example using R:
> set.seed(1)
> x <- 2
> s <- 1
> n <- 1000
> error <- qnorm(0.975) * s / sqrt(n)
> error
[1] 0.0619795
> leftSide <- x - error
> rightSide <- x + error
> leftSide
[1] 1.93802
> rightSide
[1] 2.06198
Here x is the sample mean, s is the sample standard deviation and n is the number of observations drawn from the normal distribution (a sample size of 1000). Assuming normality, we can say we are 95% confident that the real mean is between 1.93802 and 2.06198.
> set.seed(1)
> dist <- rnorm(1000, 2, 1)
> hist(dist)
> abline(v=leftSide, col="red", lwd=3)
> abline(v=rightSide, col="blue", lwd=3)
So, if we want to use these properties, we need our data to be (approximately) normal. To get there, we can analyse the skewness of the data and then transform it.
SKEWNESS
In a normal distribution, the graph appears as a symmetrical "bell-shaped" curve, and the mean is equal to the mode. So we can describe skewness as asymmetry in a statistical distribution whose curve is not "bell-shaped"; instead, it is skewed either to the left or to the right.
In a right-skewed distribution the mean exceeds the mode, while in a left-skewed distribution the mode exceeds the mean.
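A quick sketch of this (an assumed example with simulated data; the median is used as a stand-in for the mode, since the mode of a continuous sample is awkward to estimate):
# Assumed example: in a right-skewed sample the mean is pulled above the
# centre of the distribution; mirroring the sample reverses the effect.
set.seed(1)
rightSkewed <- rexp(100000, rate = 1)
mean(rightSkewed)      # larger ...
median(rightSkewed)    # ... than the median

leftSkewed <- -rightSkewed
mean(leftSkewed)       # smaller ...
median(leftSkewed)     # ... than the median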
Let's have a look at a right-skewed distribution from Kaggle's House Prices data.
> library(e1071)
> data <- read.csv("train.csv")
> SalePrice <- data$SalePrice
> skewness(SalePrice)
[1] 1.879009
> hist(SalePrice)
Here we can see that the distribution is skewed to the right. What we can do to make it more normal is perform a log transformation.
> skewness(log(SalePrice))
[1] 0.1210859
> hist(log(SalePrice))
We can also generate skewed data ourselves.
> library(sn)
> a <- rsn(n=100000, xi=20.24, omega=73.84, alpha=50.14)
> skewness(a)
[1] 0.9900123
> hist(a)
> skewness(log(a))
[1] -0.1412755
> hist(log(a))
The question is: can we use the log transform in every case? The answer is no. The log scale works on relative (multiplicative) changes. If we are interested in absolute changes, we need a transformation that works on a linear scale, such as the square root transformation. Let's say we have 10 dollars. After an interest payment of 1 dollar, we have 11 dollars. The change is 10%, or 1 dollar; the first way of measuring the change is relative while the latter is additive.
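Here is a minimal sketch of the comparison (an assumed example with simulated right-skewed data, not the House Prices data):
# Assumed example: the square root transformation tames right skew more
# gently than the log, which over-corrects this particular sample.
library(e1071)
set.seed(1)
z <- rexp(100000, rate = 1)   # right-skewed data
skewness(z)                   # strongly positive
skewness(sqrt(z))             # much reduced, still positive
skewness(log(z))              # pushed negative (left-skewed)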
So, this was an introductory blog post about the normality assumption and data transformation techniques. I hope you got something useful out of it!