Friday, May 29, 2015

Calculating a Mean/Average in R


The mean can be thought of as a “balancing point” between the values smaller than the mean and the values larger than it. It can also be thought of as a “typical value.” Statisticians and data scientists may refer to the mean as a measure of “location” for a set of values.
The mean or average of a set of data values is defined as the sum of the values divided by the count of values.
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
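The “balancing point” description can be checked numerically: the deviations of the values from their mean cancel out. The short sketch below uses a made-up vector `v` (not part of the original post) purely for illustration.

```r
# Hypothetical example values, chosen only for illustration
v <- c(2, 4, 9)

# The mean, computed directly from the definition: sum / count
v_bar <- sum(v) / length(v)
v_bar

# Deviations below the mean are negative, deviations above are positive
v - v_bar

# They balance: the sum is zero (up to floating-point rounding)
sum(v - v_bar)
```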
There are a few guidelines to using the mean:
  • The mean is a measure of center for numeric data measured on an interval or ratio scale, whether continuous or discrete (Review data classification here)
  • The mean is not appropriate for ordinal/nominal scale data. Using the mean on such data leads to meaningless statements of the form:
    • “The average gender in the world is somewhere between male and female (1.2)”
    • “The average satisfaction was 2.3”
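For data like these, counts, proportions, or the median are usually more defensible summaries. The sketch below uses invented responses (the vectors `gender` and `satisfaction` are assumptions for illustration, not data from this post).

```r
# Hypothetical nominal data stored as a factor
gender <- factor(c("male", "female", "female", "male", "female"))

# Counts and proportions are meaningful summaries for nominal data
table(gender)
prop.table(table(gender))

# Note: mean(gender) would return NA with a warning,
# because a factor is not numeric

# Hypothetical ordinal data: satisfaction ratings on a 1-5 scale
satisfaction <- c(1, 2, 2, 3, 5)

# The median depends only on the ordering of the ratings
median(satisfaction)
```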
Consider the following 10 values, generated at random between 1 and 20:
# Draw 10 values from a continuous uniform distribution on [1, 20] and round each to the nearest integer
x <- round(runif(10, 1, 20))
# Sort the values, then display them
x <- sort(x)
x
##  [1]  1  3  3 10 11 14 15 16 19 19
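Because the values are random, rerunning the code gives a different set each time. A minimal sketch of a reproducible alternative is shown below; the seed value is arbitrary, and the values it produces will not match the ones listed above. It uses `sample()` rather than `round(runif(10, 1, 20))` because rounding a continuous draw makes the endpoints 1 and 20 about half as likely as the other integers.

```r
# Fix the random number generator state (42 is an arbitrary choice)
set.seed(42)

# Draw 10 integers from 1..20, each equally likely, with replacement
x_repro <- sort(sample(1:20, 10, replace = TRUE))
x_repro
```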
These values can be summarized as frequencies of individual values (frequency referring to the number [count] of times each individual value appears in the set):
table(x)
## x
##  1  3 10 11 14 15 16 19 
##  1  2  1  1  1  1  1  2
This table of values can be visualized in a histogram (a bar chart that shows the frequency of each value, or of ranges of values [called bins]). In the chart below, the red line is drawn at the mean of the values:
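The plotting code is not shown in the post, so the following is only one possible way to draw such a chart with base R graphics. It assumes one bin per integer from 1 to 20; the title, color, and line width are arbitrary choices.

```r
# The ten values listed above
x <- c(1, 3, 3, 10, 11, 14, 15, 16, 19, 19)

# One bin per integer: bin edges at 0.5, 1.5, ..., 20.5
h <- hist(x, breaks = seq(0.5, 20.5, by = 1),
          main = "Histogram of x", xlab = "x", col = "grey")

# Vertical red line at the mean
abline(v = mean(x), col = "red", lwd = 2)
```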

The chart below shows the same information, but using R’s default binning/summarization algorithm:
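A sketch of the default-binning version, again assuming base R graphics: when no `breaks` argument is given, `hist()` uses Sturges' rule to suggest the number of bins. The returned object can be inspected to see which bin edges were chosen.

```r
# The ten values listed above
x <- c(1, 3, 3, 10, 11, 14, 15, 16, 19, 19)

# Let hist() choose the bins itself
h_default <- hist(x, main = "Histogram of x (default bins)",
                  xlab = "x", col = "grey")

# Vertical red line at the mean
abline(v = mean(x), col = "red", lwd = 2)

# Inspect the bin edges and the count in each bin
h_default$breaks
h_default$counts
```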

The mean of this set is:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ \bar{x} = \frac{1 + 3 + 3 + 10 + 11 + 14 + 15 + 16 + 19 + 19}{10} \]
\[ \bar{x} = \frac{111}{10} \]
\[ \bar{x} = 11.1 \]
In R, it is easy to find the average of a set of numbers using the built-in mean function:
mean(x)
## [1] 11.1
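The same result can be recovered from the frequency table: multiply each distinct value by its count, add the products, and divide by the total count. The sketch below does this by hand and with the built-in `weighted.mean` function; the helper names `values` and `counts` are introduced here for illustration. Both expressions give 11.1 for these data.

```r
# The ten values listed above
x <- c(1, 3, 3, 10, 11, 14, 15, 16, 19, 19)

freq   <- table(x)
values <- as.numeric(names(freq))  # the distinct values
counts <- as.vector(freq)          # how many times each one occurs

# (1*1 + 3*2 + 10*1 + ... + 19*2) / 10
sum(values * counts) / sum(counts)

# Equivalent, using the counts as weights
weighted.mean(values, counts)
```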
It is also possible to write (mostly) equivalent, but less efficient functions that compute the mean/average in R:
average<-function(x) {
  sum(x)/length(x)
}
average(x)
## [1] 11.1
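One reason the equivalence is only “mostly” true is missing values: the built-in `mean` accepts an `na.rm` argument, while the simple version above does not. The sketch below adds that behavior to a hypothetical variant named `average_na` (a name invented here, not part of R).

```r
# Hypothetical variant of average() that mimics mean()'s na.rm argument
average_na <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]  # drop missing values before summing
  }
  sum(x) / length(x)
}

y <- c(2, NA, 4)

mean(y)                      # NA: missing values propagate by default
mean(y, na.rm = TRUE)        # 3
average_na(y)                # NA, same as mean(y)
average_na(y, na.rm = TRUE)  # 3
```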
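An explicit for loop performs even worse, but it demonstrates the mechanics of the calculation:
average<-function(x) {
  sum_x<-0    # running total of the values
  count_x<-0  # number of values seen so far
  # seq_along(x) is safe for empty vectors, unlike 1:length(x)
  for (i in seq_along(x)) {
     sum_x<-sum_x+x[i]
     count_x<-count_x + 1
  }
  sum_x/count_x
}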
average(x)
## [1] 11.1
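The performance difference can be observed with `system.time`. The sketch below is only a rough comparison: the function names `average_vec` and `average_loop` are introduced here so both versions can coexist, the vector size is arbitrary, and timings vary by machine and R version, so no numbers are shown.

```r
# Hypothetical names, used only so both versions can be compared side by side
average_vec <- function(x) sum(x) / length(x)

average_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total / length(x)
}

# One million random values (size chosen arbitrarily)
big <- runif(1e6)

system.time(mean(big))
system.time(average_vec(big))
system.time(average_loop(big))

# All three agree up to floating-point tolerance
all.equal(mean(big), average_vec(big))
all.equal(mean(big), average_loop(big))
```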
In most cases there is no good reason to write your own function to calculate the mean, but occasionally a special requirement may call for one.
