Tuesday, June 2, 2015

The Sample Variance and Sample Standard Deviation

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  

The sample variance and standard deviation measure how far the points in a sample spread around the sample mean. The sample variance is defined as the sum of the squared deviations from the mean, divided by \(n - 1\) rather than \(n\), an adjustment to the sample size that makes the statistic an “unbiased” estimator of the population variance:
\[ s^2_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]
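To see what “unbiased” means in practice, a small simulation can help. The sketch below is only an illustration: it assumes standard normal data (true variance 1), a sample size of 5, and 20,000 repetitions, none of which come from this post, and it compares dividing by n - 1 with dividing by n.

```r
# Illustrative simulation; the seed, sample size, and repetition count are
# assumptions chosen for this sketch.
set.seed(42)
n    <- 5
reps <- 20000

# Each row is one sample of size n from a standard normal (true variance = 1)
samples <- matrix(rnorm(n * reps), nrow = reps)

# var() divides by n - 1; rescale it to get the divide-by-n version
var_n_minus_1 <- apply(samples, 1, var)
var_n         <- var_n_minus_1 * (n - 1) / n

# Average of each estimator across all simulated samples
mean(var_n_minus_1)
mean(var_n)
```

Averaged over many samples, the n - 1 version should land close to the true variance of 1, while the divide-by-n version should come out near (n - 1)/n = 0.8, which is systematically too small.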
The sample standard deviation is the square root of the sample variance:
\[ s_x = \sqrt{s_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
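As a quick check of the two formulas, here is a minimal worked example on five made-up values (the vector x_small is hypothetical and chosen so the arithmetic is easy to follow by eye).

```r
# Five made-up values with an easy mean
x_small <- c(2, 4, 4, 4, 6)

n           <- length(x_small)            # 5
x_bar       <- mean(x_small)              # 4
squared_dev <- (x_small - x_bar)^2        # 4 0 0 0 4
s2          <- sum(squared_dev) / (n - 1) # 8 / 4 = 2
s           <- sqrt(s2)                   # square root of 2, about 1.414

s2
s
```

The sum of squared deviations is 8, so dividing by n - 1 = 4 gives a sample variance of 2 and a sample standard deviation of about 1.414.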
Visually, for normally distributed data, the standard deviation can be interpreted as the arrow from the mean:

Another view with 1 standard deviation on either side of the mean shaded:

We’ll talk more about dispersion measures in the posts on random sampling distributions.
There are a few guidelines to using the variance/standard deviation:
  • The variance and standard deviation are measures of dispersion/spread for data measured on a continuous (interval/ratio) scale, as opposed to an ordinal/nominal scale (review data classification here)
  • The standard deviation and variance are generally not appropriate for ordinal/nominal scale data. Using the variance/standard deviation on ordinal/nominal scale data can lead to meaningless statements of the form:
    • “20% of the time we would expect to see the number of bee hives per acre less than -1.”
    • “Our survey yielded a standard deviation for Satisfaction of 2, meaning that a large percentage of our survey respondents are off scale” [Satisfaction is a nominal (and at best ordinal) measure. Statistical procedures involving means and standard deviations have no place in survey analysis, most of the time…]
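One way a statement like the bee hive example can arise is by plugging a mean and standard deviation into a normal model without checking whether the scale supports it. The sketch below uses hypothetical numbers (a mean of 1 hive per acre and a standard deviation of 2, neither taken from real data) to show that the arithmetic will return a probability for an impossible outcome without complaint.

```r
# Hypothetical summary statistics for hives per acre (not real data)
mean_hives <- 1
sd_hives   <- 2

# Probability a normal model assigns to fewer than -1 hives per acre
p_below_minus_1 <- pnorm(-1, mean = mean_hives, sd = sd_hives)
p_below_minus_1   # about 0.16
```

The calculation itself is valid; the problem is the interpretation, since a count of hives can never be negative.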
Given the following 30 normally distributed values:
#Get 30 standard normal values
x <- rnorm(30)
#Display the values
x
##  [1] -1.06697008 -0.99386337 -0.04848545 -0.43638720  0.91749364
##  [6]  0.28708520  1.67698404  0.97897077 -0.59151972  1.85470415
## [11]  0.54272426 -0.39907658  0.50406143 -0.18513261  0.56133026
## [16]  1.03735518 -0.04948910  1.65030321 -0.66896995 -1.02599162
## [21] -0.57135208  0.80466583  1.09648810  0.98932572 -0.16181664
## [26] -1.91483513  0.88486131  1.87548053 -0.17144575 -0.35227207
Displayed visually:

The sample mean of the values is 0.2341409. We can construct a table for computation:
#Calculate the variance by hand using a table
#start with our sample values
df                      <- data.frame(x)
#Fill the column with the mean value
df$mean_x               <- rep(mean(x),length(x))
#Find the difference between the individual point and the mean
df$x_minus_mean         <- df$x - df$mean_x
#Square the difference
df$x_minus_mean_squared <- df$x_minus_mean^2

#Show the table
df
##              x    mean_x x_minus_mean x_minus_mean_squared
## 1  -1.06697008 0.2341409  -1.30111095          1.692889710
## 2  -0.99386337 0.2341409  -1.22800424          1.507994418
## 3  -0.04848545 0.2341409  -0.28262633          0.079877642
## 4  -0.43638720 0.2341409  -0.67052807          0.449607898
## 5   0.91749364 0.2341409   0.68335276          0.466971000
## 6   0.28708520 0.2341409   0.05294432          0.002803101
## 7   1.67698404 0.2341409   1.44284316          2.081796396
## 8   0.97897077 0.2341409   0.74482989          0.554771569
## 9  -0.59151972 0.2341409  -0.82566060          0.681715423
## 10  1.85470415 0.2341409   1.62056328          2.626225331
## 11  0.54272426 0.2341409   0.30858338          0.095223705
## 12 -0.39907658 0.2341409  -0.63321746          0.400964352
## 13  0.50406143 0.2341409   0.26992055          0.072857105
## 14 -0.18513261 0.2341409  -0.41927349          0.175790256
## 15  0.56133026 0.2341409   0.32718939          0.107052895
## 16  1.03735518 0.2341409   0.80321431          0.645153222
## 17 -0.04948910 0.2341409  -0.28362998          0.080445965
## 18  1.65030321 0.2341409   1.41616233          2.005515753
## 19 -0.66896995 0.2341409  -0.90311083          0.815609166
## 20 -1.02599162 0.2341409  -1.26013250          1.587933907
## 21 -0.57135208 0.2341409  -0.80549296          0.648818902
## 22  0.80466583 0.2341409   0.57052495          0.325498719
## 23  1.09648810 0.2341409   0.86234722          0.743642728
## 24  0.98932572 0.2341409   0.75518485          0.570304155
## 25 -0.16181664 0.2341409  -0.39595751          0.156782352
## 26 -1.91483513 0.2341409  -2.14897600          4.618097858
## 27  0.88486131 0.2341409   0.65072043          0.423437081
## 28  1.87548053 0.2341409   1.64133966          2.693995874
## 29 -0.17144575 0.2341409  -0.40558662          0.164500509
## 30 -0.35227207 0.2341409  -0.58641295          0.343880149
#The variance is:
var.x <- sum(df$x_minus_mean_squared)/(length(x)-1)
var.x
## [1] 0.924833
#The standard deviation is:
sd.x  <- sqrt(var.x)
sd.x
## [1] 0.9616824
Most software packages include a built-in way to compute variances and standard deviations. In R, this is done with var:
#The variance
var.x <- var(x)
var.x
## [1] 0.924833
#The standard deviation
sd.x  <- sqrt(var.x)
sd.x
## [1] 0.9616824
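Base R also provides sd, which returns the sample standard deviation directly. The sketch below checks that sd, the square root of var, and the by-hand formula all agree. It runs on a freshly generated vector, and the seed and the name y are assumptions for illustration, so its values differ from the ones shown above.

```r
# A fresh sample for the check (assumed seed; not the values from the post)
set.seed(123)
y <- rnorm(30)

# Three routes to the sample standard deviation
sd_by_hand  <- sqrt(sum((y - mean(y))^2) / (length(y) - 1))
sd_from_var <- sqrt(var(y))
sd_builtin  <- sd(y)

# all.equal() compares within floating-point tolerance
all.equal(sd_by_hand, sd_from_var)
all.equal(sd_by_hand, sd_builtin)
```

Both comparisons should return TRUE, since var and sd use the same n - 1 divisor as the table-based computation.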
 
