Sunday, May 31, 2015

Sample Median Calculation

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  

The median is a measure of center that cuts a distribution of values into two equal parts, each representing 50% of the sample. The notation that I will use when talking about medians (and percentiles) is the \( x_p \) notation, where \( x \) represents a sample value and \( p \) represents the proportion of the sample that is less than \( x \). In our case, the median will be denoted as \( x_{.5} \) or \( x_{50\%} \). When we calculate the median, we are looking for the “middle” value instead of the “balancing” value (the mean), though for symmetrically distributed data the mean and median tend to be close or identical.
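To connect the \( x_p \) notation to R, here is a minimal sketch using the built-in quantile function with a probability of 0.5. The vector x is an assumption made up for this illustration (it is not one of the samples used later), and the comparison relies on quantile's default interpolation method, which agrees with median at the 50% point:

```r
#Hypothetical sorted sample, used only for this illustration
x <- c(2, 4, 6, 8, 10)
#x_.5: the value at the 50th percentile (unname drops the "50%" label)
x_50 <- unname(quantile(x, 0.5))
x_50
## [1] 6
#The built-in median function returns the same value
median(x)
## [1] 6
```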
There are a few guidelines to using the median:
  • The median is a measure of center for data that is measured on an ordinal scale (Review data classification here)
  • The median is not appropriate for nominal scale data; however, continuous scale data can be treated as ordinal, so statistical procedures that use the median can also be applied to continuous scale data.
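As a rough illustration of the ordinal case, the sketch below finds the middle response in an odd-size set of ordered categories. The level names and responses are assumptions invented for this example. Note that R's median function does not accept factors, so the sketch sorts by level order and picks the middle element directly:

```r
#Hypothetical ordinal responses (an assumption for illustration only)
lvls <- c("low", "medium", "high")
responses <- factor(c("low", "high", "medium", "medium", "low"),
                    levels = lvls, ordered = TRUE)
#Sort by level order, then take the middle element of the odd-size sample
sorted <- sort(responses)
mid <- sorted[(length(sorted) + 1) / 2]
as.character(mid)
## [1] "medium"
```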
For these examples, we will use the following two samples:
#Get 10 random values between 1 and 100, rounded to the nearest integer
x_even<-round(runif(10,1,100))
#sort and display the values 
x_even<-sort(x_even)
x_even
##  [1] 11 17 44 56 69 71 74 75 78 94
#Create an odd-size sample by drawing 9 of the 10 values without replacement
x_odd<-sample(x_even,9)
x_odd<-sort(x_odd)
x_odd
## [1] 11 44 56 69 71 74 75 78 94
The computation of the median depends on whether the sample size is odd or even. For an odd-size sample (a sample size that is not evenly divisible by 2), the median is exactly the middle value of the sorted sample.

Odd-Size Sample Median Calculation

As a reminder, these are the sorted values from the odd-size sample:
x_odd
## [1] 11 44 56 69 71 74 75 78 94
If we have sorted the sample from smallest to largest and it has an odd size, we want the \( \frac{n+1}{2} \)th element. In the case of our odd-size sample, we want the \( \frac{9+1}{2} = 5 \)th element.
Using this definition, we can directly get the median:
x_odd[((length(x_odd)+1)/2)]
## [1] 71
#This is the same as
x_odd[5]
## [1] 71
We can also use the median function:
median(x_odd)
## [1] 71
The median value 71 has 4 values below it (11, 44, 56, and 69) and 4 values above it (74, 75, 78, and 94), so the same number of sample values falls on each side of 71.
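This split can be checked directly by counting the values on each side of the median. The sketch below re-enters the odd-size sample by hand so that it runs on its own; the name m is just a placeholder for this example:

```r
#The odd-size sample from above, typed in so the example is self-contained
x_odd <- c(11, 44, 56, 69, 71, 74, 75, 78, 94)
m <- median(x_odd)
#Count the values strictly below and strictly above the median
sum(x_odd < m)
## [1] 4
sum(x_odd > m)
## [1] 4
```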

Even-Size Sample Median Calculation

As a reminder, these are the sorted values from the even-size sample:
x_even
##  [1] 11 17 44 56 69 71 74 75 78 94
If our sample has an even size, we can't cleanly pick out a single “middle” value to use as the median. Instead, we compute a value that has 50% of the sample below it and 50% above it. To do this, we average the middle two numbers.
If we have sorted the sample from smallest to largest and it has an even size, we want to find the midpoint between the \( \frac{n}{2} \)th and \( \frac{n+2}{2} \)th elements. For our sample of size 10, we want the midpoint between the \( \frac{10}{2} = 5 \)th and the \( \frac{10+2}{2} = 6 \)th elements, which is the midpoint between 69 and 71.
Using this definition, we can directly get the median:
(x_even[((length(x_even))/2)] + x_even[((length(x_even)+2)/2)])/2
## [1] 70
#This is the same as
(x_even[5]+x_even[6])/2
## [1] 70
We can also use the median function:
median(x_even)
## [1] 70
The median value 70 has 5 values below it (11, 17, 44, 56, and 69) and 5 values above it (71, 74, 75, 78, and 94), so 50% of the sample lies below 70 and 50% lies above it.
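The odd and even rules can be combined into one function. The following is a minimal sketch, not a replacement for the built-in median; the name sample_median is hypothetical, and the function assumes a non-empty numeric vector (sort drops any missing values by default):

```r
#Hypothetical helper; in practice, use the built-in median function
sample_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    #odd size: the (n+1)/2 th element
    x[(n + 1) / 2]
  } else {
    #even size: midpoint of the n/2 th and (n+2)/2 th elements
    (x[n / 2] + x[(n + 2) / 2]) / 2
  }
}
#The two samples from above, typed in so the example is self-contained
x_odd <- c(11, 44, 56, 69, 71, 74, 75, 78, 94)
x_even <- c(11, 17, 44, 56, 69, 71, 74, 75, 78, 94)
sample_median(x_odd)
## [1] 71
sample_median(x_even)
## [1] 70
```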

Back To Mike's Big Data, Data Mining, and Analytics Tutorial  

Friday, May 29, 2015

How to Install R on Windows

Installation of R on Windows


R installation is straightforward on Windows. First, download the latest version of R from CRAN (http://cran.r-project.org).

Then launch the installer and follow the prompts until R is installed.

Next, it might be a good idea to install RStudio:

How to Install RStudio on Windows


Calculating a Mean/Average in R


The mean can be thought of as a “balancing point” between values smaller than the mean and larger than the mean. It can also be thought of as a “typical value.” Statisticians/data scientists may refer to the mean of a set as the ‘location’ parameter for a set.
The mean or average of a set of data values is defined as the sum of the values divided by the count of values.
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
There are a few guidelines to using the mean:
  • The mean is a measure of center for data that is measured on a continuous scale (Review data classification here)
  • The mean is not appropriate for ordinal/nominal scale data. Using the mean leads to meaningless statements of the form:
    • “The average gender in the world is somewhere between male and female (1.2)”
    • “The average satisfaction was 2.3”
Given the following 10 values generated between 1 and 20:
#Get 10 random values between 1 and 20, rounded to the nearest integer
x<-round(runif(10,1,20))
#sort and display the values 
x<-sort(x)
x
##  [1]  1  3  3 10 11 14 15 16 19 19
These values can be summarized as frequencies of individual values (frequency referring to the number [count] of times each individual value appears in the set):
table(x)
## x
##  1  3 10 11 14 15 16 19 
##  1  2  1  1  1  1  1  2
This table of values can be visualized in a histogram (a bar chart that shows the relative frequency of each value or a summarization within ranges of values [called bins]). In the chart below, the red line is drawn at the mean of the values:

The chart below shows the same information, but using R’s default binning/summarization algorithm:
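Charts like the two described above could be drawn with a sketch along these lines. The break points, titles, and line color are assumptions, and are not necessarily the settings used for the original figures:

```r
#The sample values from above, typed in so the example is self-contained
x <- c(1, 3, 3, 10, 11, 14, 15, 16, 19, 19)
#One bin per integer value from 1 to 20
h_each <- hist(x, breaks = seq(0.5, 20.5, by = 1), main = "One bin per value")
#Draw a red vertical line at the mean
abline(v = mean(x), col = "red")
#Same data using R's default binning
h_default <- hist(x, main = "Default binning")
abline(v = mean(x), col = "red")
```

The objects returned by hist (here h_each and h_default) hold the bin boundaries and counts, which can be inspected with h_each$breaks and h_each$counts.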

The mean of this set is:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ \bar{x} = \frac{1 + 3 + 3 + 10 + 11 + 14 + 15 + 16 + 19 + 19}{10} \]
\[ \bar{x} = \frac{111}{10} \]
\[ \bar{x} = 11.1 \]
In R, it is easy to find the average of a set of numbers using the built-in mean function:
mean(x)
## [1] 11.1
It is also possible to write (mostly) equivalent, but less efficient functions that compute the mean/average in R:
average<-function(x) {
  sum(x)/length(x)
}
average(x)
## [1] 11.1
Or, with even worse performance but demonstrating the mechanics of the for loop:
average<-function(x) {
  sum_x<-0
  count_x<-0
  #seq_along(x) is safe for empty vectors, unlike 1:length(x)
  for (i in seq_along(x)) {
     sum_x<-sum_x+x[i]
     count_x<-count_x + 1
  }
  sum_x/count_x
}
average(x)
## [1] 11.1
There’s really not a good reason in most cases to write your own function that calculates the mean, but you may occasionally find a special reason for doing so.
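One place where a hand-written version differs from the built-in function is missing values. As a minimal sketch (the vector x_na and the function name average_na are assumptions for illustration), mean accepts na.rm = TRUE, while a sum(x)/length(x) version has to drop missing values itself:

```r
#Hypothetical sample containing a missing value
x_na <- c(1, 3, NA, 10)
#By default the built-in mean returns NA when any value is missing
mean(x_na)
## [1] NA
#na.rm = TRUE drops missing values before averaging
mean(x_na, na.rm = TRUE)
## [1] 4.666667
#A hand-written version must remove missing values explicitly
average_na <- function(x) {
  x <- x[!is.na(x)]
  sum(x)/length(x)
}
average_na(x_na)
## [1] 4.666667
```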
