Thursday, March 17, 2016

The lolcat Statistical Package - Public Release

Background and Project Goals

The world needs a statistical tool that is valuable both for teachers and practitioners. Many of the statistical tools in use today are expensive and proprietary, and they carry a large amount of baggage. Many tools can’t be reasonably automated. Tools that can be automated (often written in R, Python, Java, and other languages) are incomplete from both a teaching and working standpoint.
The main objective of this project is to provide a tool set that is accurate, reliable, and efficient (for the end user), with enough additional functionality to be valuable above and beyond many of the commercial tools in existence today. Ideally, users of this tool should not need to be proficient developers. In fact, with the code samples included in the online package documentation and a couple of small sections of my tutorial on R, practitioners and students (at all levels) should need to know very little R.
You’ll find something useful in this package regardless of whether you are a student just starting your journey or an experienced practitioner analyzing large amounts of data.

Cost to Use the Package

The public release of the package itself is offered free of charge and is open source. You are free to take it and modify it any way that you choose.
The online documentation is not included under the open source license and may not be redistributed or stored without express written consent; however it is free to access online anytime.
See the LICENSE file in the package or on GitHub for more details on the license.

Where to Obtain the Package Source

If you want to browse the package source, see the official location:
https://mikeburr.visualstudio.com/DefaultCollection/lolcat-public/_git/lolcat

Installing the Package

Make sure that you have the following installed:
  • R
  • RStudio
The easiest way to install the latest package version is to use devtools and run the following R script:
# Install the devtools package
install.packages("devtools")

# Load the devtools package
require(devtools)

# Download the latest lolcat public release
install_git("https://mikeburr.visualstudio.com/DefaultCollection/lolcat-public/_git/lolcat")

# Load the lolcat package
require(lolcat)
If the package is successfully loaded, you will likely see something like this:
> require(lolcat)
Loading required package: lolcat
>
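As a quick check after installation, something like the following sketch reports whether the package can be found without attaching it. It relies only on base R's requireNamespace; the variable name lolcat_available is just for illustration.

```r
# Check whether lolcat is installed without attaching it to the search path
lolcat_available <- requireNamespace("lolcat", quietly = TRUE)

if (lolcat_available) {
  message("lolcat is installed and can be loaded.")
} else {
  message("lolcat was not found; re-run the installation script above.")
}
```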

Notes for System Administrators (Most People Can Skip This)

System administrators seeking to install the package system-wide (i.e. for all users) should install the package in the “site” package directory or use a network location for shared packages.
  • Windows: C:Files-{version}
  • Linux: Varies…
    • On the latest Fedora Core Release, the site package directory is /usr/lib64/R/library/. I’d expect this to be the same for other Red Hat/CentOS variants using yum/dnf for package management.
    • On the latest Ubuntu release, the site package directory is /usr/lib/R/library/. I’d expect the same for other Debian/Ubuntu variants using apt for package management.
    • If you build R yourself or use a non-standard distribution, you’re your own best hope to locate the site package directory.
  • Mac OS: Someone will have to tell me…
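On any platform, R itself can report the library directories it searches, which may help locate the site directory. A small sketch using only base R (the output differs from machine to machine, so none is shown; lib_paths is an illustrative name):

```r
# All library directories R searches, in order
lib_paths <- .libPaths()
print(lib_paths)

# The default system library that ships with R
print(.Library)

# Site-wide library directories (may be empty if none are configured)
print(.Library.site)
```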

Resources for Teaching with R and lolcat

TODO

Resources for Learning R and lolcat

TODO

A-Z Function Documentation

TODO

Monday, June 15, 2015

Numeric Operators in R

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  



R provides a number of operators for performing mathematical operations on numbers and vectors. Unlike most other programming languages, R can seamlessly take objects of different forms (single numbers, vectors, arrays) and perform operations with them.
In the case when two inputs have different lengths, R will “recycle” elements from the shorter input by repeating the shorter input to get the correct length. This is best demonstrated by example:
a<-rep(1,10) # 1 1 1 1 1 1 1 1 1 1 
b<-1:3       # 1 2 3
#b treated as# 1 2 3 1 2 3 1 2 3 1
#Note the warning...
a+b          # 2 3 4 2 3 4 2 3 4 2
## Warning in a + b: longer object length is not a multiple of shorter object
## length
##  [1] 2 3 4 2 3 4 2 3 4 2
In general, I think recycling elements makes R code harder to understand, and I would generally treat it as an antipattern in R (i.e. don’t do it…). In addition to making code harder to understand, it encourages a bad habit of ignoring errors and warnings in R. There are better methods (rep, expand.grid, combn, etc.) that preserve code readability if you (or someone else) need to review or modify the code in the future.
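As a sketch of the rep alternative, the shorter vector from the example above can be extended explicitly before the addition, so the intent is visible in the code and no warning is raised. The names b_full and result are illustrative.

```r
a <- rep(1, 10)   # 1 1 1 1 1 1 1 1 1 1
b <- 1:3          # 1 2 3

# Repeat b explicitly until it matches the length of a
b_full <- rep(b, length.out = length(a))  # 1 2 3 1 2 3 1 2 3 1

# Both inputs now have the same length, so no recycling (or warning) occurs
result <- a + b_full
result            # 2 3 4 2 3 4 2 3 4 2
```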
Like most other programming languages, R operators follow a precedence similar to the precedence taught to math students (i.e. parentheses first, then exponents, then multiplication/division, then addition/subtraction). This section discusses the most common operators in order of precedence from high to low (see ?Syntax for the full list).
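A few short expressions illustrate how precedence plays out in practice. This is only a sketch of common cases; ?Syntax remains the reference for the full ordering.

```r
2 + 3 * 4    # multiplication before addition
## [1] 14
(2 + 3) * 4  # parentheses first
## [1] 20
-2^2         # exponentiation before unary minus
## [1] -4
(-2)^2       # parentheses force the negation first
## [1] 4
2^3^2        # ^ groups right to left: 2^(3^2)
## [1] 512
1:3 * 2      # the : operator binds tighter than *
## [1] 2 4 6
```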

^ Exponentiation

Exponentiation can be defined as repeated multiplication: one number (the base) is used as a factor a number of times (the exponent). Here 4 is the exponent and 2 is the base:
\[ 2^4 = 2 \cdot 2 \cdot 2 \cdot 2 = 16 \]
In the case of a non-whole exponent, exponentiation works as an nth root operation. For example:
\[ 8^{0.\bar{3}} = 8^{\frac{1}{3}} = \sqrt[3]{8^1} = \sqrt[3]{8} = 2 \]
Negative exponents work as division operations. For example:
\[ 2^{-3} = \frac{1}{2^3} = \frac {1}{8} = 0.125 \]
R accomplishes exponentiation via the ^ operator. A few examples:
2^3#Exponent of a single number
## [1] 8
(1:10)^2#Exponent over an entire vector
##  [1]   1   4   9  16  25  36  49  64  81 100
c(4,9,16,25,36)^(1/2)#nth root operation
## [1] 2 3 4 5 6
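The fractional and negative exponent equations above can be checked directly in R. The sketch below also shows one behavior not covered above: a fractional power of a negative base returns NaN rather than a real root. The variable names are illustrative.

```r
cube_root <- 8^(1/3)    # fractional exponent: cube root of 8
cube_root
## [1] 2
inverse <- 2^-3         # negative exponent: 1 / 2^3
inverse
## [1] 0.125
neg_root <- (-8)^(1/3)  # fractional power of a negative base
neg_root
## [1] NaN
```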

* Multiplication and / Division Operators

Multiplication can be defined as the repeated addition of a number. Example:
\[ 5 * 3 = 5 + 5 + 5 = 15 \]
Division can be defined in terms of splitting a quantity (called the numerator) between a set of groups (denominator). Example: 20 split between 5 groups yields 4 for each of the 5 groups:
\[ 20 / 5 = 4 \]
Various rules for multiplying and dividing non-whole numbers by hand exist and are taught in introductory mathematics courses.
R accomplishes multiplication with the * operator. A few examples:
2*2                    #Multiplication of two numbers
## [1] 4
(1:5)*rep(2,5)         #Multiplication of vectors
## [1]  2  4  6  8 10
c(1,2,3,4,5)*c(2,2,2,2,2) #Same as above
## [1]  2  4  6  8 10
(1:5)*2                #Multiplication of a vector by a number (all elements multiplied)
## [1]  2  4  6  8 10
[Figure: the vector multiplication example above, illustrated; O are inputs, X are outputs]

R accomplishes division with the / operator. A few examples:
4/2                    #Division of two numbers
## [1] 2
seq(2,10,by=2)/rep(2,5)#Division of vectors
## [1] 1 2 3 4 5
c(2,4,6,8,10)/c(2,2,2,2,2) #Same as above
## [1] 1 2 3 4 5
seq(2,10,by=2)/2       #Division of a vector by a number (all elements divided by number)
## [1] 1 2 3 4 5
[Figure: the vector division example above, illustrated; O are inputs, X are outputs]
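Two related points are worth a quick sketch alongside the division examples: R has separate operators for integer division and remainder, and dividing by zero returns a special value instead of stopping with an error.

```r
7 %/% 2   # integer (floor) division
## [1] 3
7 %% 2    # remainder (modulus)
## [1] 1
1 / 0     # division by zero does not raise an error
## [1] Inf
0 / 0     # undefined result
## [1] NaN
```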

The input objects need to be numeric to use numeric operators. For example, lists can’t be multiplied or divided directly:
list(x=2,y=3) * list(y=2,x=3) #Might yield list(x=6,y=6)? Nope...
## Error in list(x = 2, y = 3) * list(y = 2, x = 3): non-numeric argument to binary operator
Matrix multiplication uses a different operator (%*%). See the post on Matrix operations for more details.
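If an element-by-element product of two lists really is what you want, one possible approach is base R's Map, applied after matching the elements by name. This is a sketch only; the names p, q, and products are made up for illustration.

```r
p <- list(x = 2, y = 3)
q <- list(y = 2, x = 3)

# Reorder q to match the names of p, then multiply element by element
products <- Map(`*`, p, q[names(p)])

# Flatten to a named vector for display
unlist(products)
## x y 
## 6 6
```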

+ Addition and - Subtraction Operators

Addition can be thought of as finding the magnitude of two combined quantities.
\[ 1 + 1 = 2 \]
Subtraction can generally be thought of as finding the difference between two values.
\[ 2 - 1 = 1 \]
R accomplishes addition with the + operator. A few examples:
+5        #Unary addition operator - no change/effect
## [1] 5
2+10      #Addition of 2 numbers
## [1] 12
1:10+2    #Addition of number and vector
##  [1]  3  4  5  6  7  8  9 10 11 12
1:10+11:20#Addition of 2 vectors
##  [1] 12 14 16 18 20 22 24 26 28 30
[Figure: the vector addition example above, illustrated; O are inputs, X are outputs]

R accomplishes subtraction with the - operator. A few examples:
-5        #Unary negation operator - creates a negated quantity
## [1] -5
12-2      #Subtraction of 2 numbers
## [1] 10
3:12-2    #Subtraction of number and vector
##  [1]  1  2  3  4  5  6  7  8  9 10
11:20-1:10#Subtraction of 2 vectors
##  [1] 10 10 10 10 10 10 10 10 10 10
[Figure: the vector subtraction example above, illustrated; O are inputs, X are outputs]

R has more operators that will be considered in other posts.
  • Matrix Operations in R
  • Logical Operators in R
  • Bitwise Operators in R


Saturday, June 6, 2015

Introduction to R Data Types

Data Types in R

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  



An important first step in understanding the R programming language is gaining an understanding of how R represents different types of data. This post will give an introduction to the range of data types available.

Primitive Values

The R programming language has a few primitive (called atomic) types that are going to be relevant to most people who aspire to develop in R:

Raw: Holds raw byte values

Logical: boolean values TRUE or FALSE

a <- TRUE # a True Value
a <- T    # another way
a
## [1] TRUE
b <- FALSE # a False Value
b <- F     # another way
b
## [1] FALSE

Numeric/Integer: non floating-point numbers like 1, 2, 3, …

a <- 42L # an integer value (the L suffix is required; a plain 42 is stored as a double)

Numeric/Double: floating-point numbers like 1.1, 1.2, .99, …

a <- 3.14159 # a floating point Value

Complex: Complex numbers… \(a + bi \) where \( i = \sqrt{-1} \)

a <- complex(real = 1, imaginary=2.1) # a complex Value

Character: String values like “a” “badly” “documented” “language” “is” “hard” “to” “learn”

a <- "programmer" # a string value
There are a few additional types that appear, but will be addressed in other parts of the tutorial:
  • Expressions: Parsed strings of R code
  • Symbols: Names that refer to R objects; they also appear in the expressions used to insert mathematical notation into plots
  • Functions: Perform a set of operations on a set of inputs and may or may not return a result
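The atomic types listed above can be inspected with typeof. A short sketch, with the expected result of each call noted in the trailing comment:

```r
typeof(TRUE)          # "logical"
typeof(42L)           # "integer" (note the L suffix)
typeof(42)            # "double"  (a plain number is a double)
typeof(3.14159)       # "double"
typeof(1 + 2.1i)      # "complex"
typeof("programmer")  # "character"
typeof(as.raw(255))   # "raw"
is.numeric(42L)       # TRUE: integers and doubles both count as numeric
```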

Vectors

The need arises frequently (perhaps more frequently than storing a single value) to store more than one value and access the values in an efficient way. This is accomplished using vectors, arrays (covered later), data frames (covered later), and lists (covered later).
The easiest way to create a vector is with the rep, c, or seq functions or the : operator:
rep creates a vector with the first argument repeated n times:
a<-rep(1,10) #Fill a vector of length 10 with the value 1
a
##  [1] 1 1 1 1 1 1 1 1 1 1
c concatenates values/vectors into a larger vector:
a<-c(1,2,3) # a simple vector
a
## [1] 1 2 3
b<-c(4,5,6) # another vector
b
## [1] 4 5 6
a_b<-c(a,b) # combination of vectors
a_b
## [1] 1 2 3 4 5 6
: builds ascending or descending numeric sequences/vectors (with a step size of 1):
a<-1:10 #Ascending Sequence from 1 to 10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
a<-5:-5 #Descending Sequence from 5 to -5
a
##  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5
a<-1.1:10.1 #Non-Integer Ascending Sequence
a
##  [1]  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1
seq generates ascending or descending sequences using a start point and an end point, then either a length or an increment:
a<-seq(from=1,to=10,by=1) #Equivalent to 1:10
a<-seq(1,10,1) #shorter version with from, to, increment...
a<-seq(1,10,length.out=10) #Equivalent to 1:10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
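Unlike the : operator, seq is not limited to a step size of 1. A brief sketch with a fractional increment, a negative increment, and a fixed number of points (variable names are illustrative):

```r
quarter_steps <- seq(0, 1, by = 0.25)       # non-integer increment
quarter_steps
## [1] 0.00 0.25 0.50 0.75 1.00
countdown <- seq(10, 1, by = -3)            # negative increment
countdown
## [1] 10  7  4  1
five_points <- seq(0, 100, length.out = 5)  # fixed number of points
five_points
## [1]   0  25  50  75 100
```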
Individual vector elements can be accessed with [ and ]. Some illustrative examples:
a[5]  #The fifth element
## [1] 5
a[-5] #Everything except the fifth element
## [1]  1  2  3  4  6  7  8  9 10
a[1:3] #Elements 1,2, and 3
## [1] 1 2 3
a[c(1,2,5,7,10)] #Elements 1,2,5,7, and 10
## [1]  1  2  5  7 10
Another interesting thing is that more than one element of a vector can be changed at a time:
a<-1:10
a #starting point
##  [1]  1  2  3  4  5  6  7  8  9 10
a[5]<-20 #Change 1 element
a
##  [1]  1  2  3  4 20  6  7  8  9 10
a[c(1,2,3)] <- c(10,20,30) #Change 3 elements
a
##  [1] 10 20 30  4 20  6  7  8  9 10
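Elements can also be selected or changed by condition rather than by position, by placing a logical test inside [ and ]. A minimal sketch of that idea:

```r
a <- 1:10
a[a > 7]            # elements greater than 7
## [1]  8  9 10
a[a %% 2 == 0]      # even elements
## [1]  2  4  6  8 10
a[a > 7] <- 0L      # replace every element matching the condition
a
##  [1] 1 2 3 4 5 6 7 0 0 0
```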

Matrices and Data Frames

R has a rich capability to natively build and manipulate matrices (memory efficient structures of only one primitive type) and data frames (structures that can hold vectors of differing primitive types). Selection of data frame vs. matrix usually depends on what libraries you are using and/or what the functions you develop expect as input.
Matrices can be created and filled at the same time:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
a
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
a<-matrix(data=1:9,nrow=3,ncol=3, byrow=TRUE) #Fill by row using data
a
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
Assuming that the example above is a data matrix, an equivalent data frame can be built:
#Note: c1,c2,c3 become the column names. Data frames are filled by columns
a<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
a
##   c1 c2 c3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9
a<-data.frame(c1=c(1,4,7),c2=c(2,5,8),c3=c(3,6,9))
a
##   c1 c2 c3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
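Rather than retyping the values, a matrix can be converted to a data frame (and back) with as.data.frame and as.matrix. A sketch, assuming the same 3x3 matrix as above; the names m, df, and m2 are illustrative:

```r
m  <- matrix(data = 1:9, nrow = 3, ncol = 3)
df <- as.data.frame(m)            # columns get default names V1, V2, V3
names(df)
## [1] "V1" "V2" "V3"
names(df) <- c("c1", "c2", "c3")  # rename to match the example above
m2 <- as.matrix(df)               # back to a matrix (column names are kept)
dim(m2)
## [1] 3 3
```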
Individual elements in both matrices and data frames can be accessed using [ and ]:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#Use [row_indices, column_indices]:
a[3,1]
## [1] 3
b[3,1]
## [1] 3
#Entire rows and columns can be selected
a[1,] #Entire first row of a matrix
## [1] 1 4 7
b[1,] #Entire first row of a data frame
##   c1 c2 c3
## 1  1  4  7
a[,1] #Entire first column of a matrix
## [1] 1 2 3
b[,1] #Entire first column of a data frame
## [1] 1 2 3
b$c1  #Data frames only - access by column name
## [1] 1 2 3
a[c(1,3),] #rows 1 and 3 of a matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    3    6    9
b[c(1,3),] #rows 1 and 3 of a data frame
##   c1 c2 c3
## 1  1  4  7
## 3  3  6  9
a[,c(1,3)] #columns 1 and 3 of a matrix
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
b[,c(1,3)] #columns 1 and 3 of a data frame
##   c1 c3
## 1  1  7
## 2  2  8
## 3  3  9
#It is possible to do selection of multiple rows/columns:
a[c(1,3),c(1,3)] #Selects combinations of rows/columns
##      [,1] [,2]
## [1,]    1    7
## [2,]    3    9
b[c(1,3),c(1,3)] #Selects combinations of rows/columns
##   c1 c3
## 1  1  7
## 3  3  9
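Note that selecting a single row or column above returned a plain vector. When the matrix or data frame structure should be kept, the drop = FALSE argument can be added, as in this sketch:

```r
a <- matrix(data = 1:9, nrow = 3, ncol = 3)
b <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))

col_vec <- a[, 1]                # plain vector, dimensions dropped
col_mat <- a[, 1, drop = FALSE]  # still a 3x1 matrix
dim(col_mat)
## [1] 3 1
col_df <- b[, 1, drop = FALSE]   # still a data frame with one column
class(col_df)
## [1] "data.frame"
```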
The number of rows/columns can be determined using nrow, ncol, and dim:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#The most straightforward way...
nrow(a) #Number of rows
## [1] 3
ncol(a) #Number of columns
## [1] 3
dim(a) #The dimension of the matrix
## [1] 3 3
dim(b) #The dimension of the data frame
## [1] 3 3
#Another way that isn't so readable...
length(a[,1]) #Number of rows
## [1] 3
length(a[1,]) #Number of columns
## [1] 3

Arrays - Higher Dimensional Objects

Sometimes it is convenient to be able to index an object by 3 or more indices. In this case, arrays are needed. For 2 or fewer dimensions, use matrices and data frames (above).
a<-array(1:27, dim = c(3,3,3)) #Create a 3x3x3 data structure
a
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
dim(a) #In the case of arrays, use dim instead of nrow/ncol to get the maximum indices
## [1] 3 3 3
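Array elements are accessed the same way as matrix elements, with one index per dimension. A short sketch using the 3x3x3 array from above (slice2 is an illustrative name):

```r
a <- array(1:27, dim = c(3, 3, 3))
a[2, 3, 1]          # row 2, column 3 of the first slice
## [1] 8
a[1, 2, 3]          # row 1, column 2 of the third slice
## [1] 22
slice2 <- a[, , 2]  # the entire second slice, returned as a 3x3 matrix
dim(slice2)
## [1] 3 3
```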

Lists

Lists are the closest thing in R to true associative arrays/hash tables. Lists allow a general arbitrary mapping to be created/accessed:
a<-list(e1 = c(1,2,3), e2 = matrix(1:4,2,2), e3 = FALSE)
a
## $e1
## [1] 1 2 3
## 
## $e2
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $e3
## [1] FALSE
Lists are accessed a little bit differently. Similar to data frames, named elements can be accessed:
# How to determine names of list elements
names(a)
## [1] "e1" "e2" "e3"
# Access elements by name
a$e1 #The first list element
## [1] 1 2 3
a$e2 #The second list element
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a$e3 #The third list element
## [1] FALSE
#Access elements by name in a different way
a[["e1"]]
## [1] 1 2 3
a[["e2"]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a[["e3"]]
## [1] FALSE
Lists can also be accessed with the [[ and ]] operators and numeric indices:
#How to determine a list length
length(a)
## [1] 3
#Access elements by index:
a[[1]] #The first list element
## [1] 1 2 3
a[[2]] #The second list element
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a[[3]] #The third list element
## [1] FALSE
Lists are most commonly used in R to represent more complex data structures and to implement a version of typing for complicated data structures that can have states and behaviors of their own (a different post will discuss object-oriented R in more detail).
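Two further details about lists tend to come up quickly in practice: the difference between single and double brackets, and how to add or remove elements. A sketch using the same example list (the element name e4 is made up for illustration):

```r
a <- list(e1 = c(1, 2, 3), e2 = matrix(1:4, 2, 2), e3 = FALSE)

class(a["e1"])     # single brackets return a (smaller) list
## [1] "list"
class(a[["e1"]])   # double brackets return the element itself
## [1] "numeric"

a$e4 <- "added"    # add a new named element
a$e3 <- NULL       # remove an element by assigning NULL
names(a)
## [1] "e1" "e2" "e4"
```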

Tuesday, June 2, 2015

How to Install RStudio on Windows

Installation of RStudio on Windows

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  


RStudio installation is pretty straightforward on Windows. First, the latest version of R needs to be downloaded and installed:

How To Install R on Windows

Second, the latest version of RStudio needs to be downloaded from the website (http://www.rstudio.com).

Then launch the installer and follow the prompts until RStudio is installed.


The Sample Variance and Sample Standard Deviation

The Sample Variance and Standard Deviation

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  

The sample variance and standard deviation can be thought of as measures of how far the points in the sample spread around the mean. The sample variance is defined as the sum of the squared deviations from the mean, divided by an adjusted sample size to make the statistic “unbiased”:
\[ s^2_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]
The sample standard deviation is the square root of the sample variance:
\[ s_x = \sqrt{s_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
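To see the two formulas at work on numbers small enough to check by hand, here is a sketch with eight made-up values (the data and variable names are purely illustrative):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative data, mean is 5
deviations <- x - mean(x)        # -3 -1 -1 -1  0  0  2  4
sum_sq <- sum(deviations^2)      # 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32
s2 <- sum_sq / (length(x) - 1)   # 32 / 7
s2
## [1] 4.571429
s <- sqrt(s2)                    # sample standard deviation, roughly 2.14
```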
[Figure: for normally distributed data, the standard deviation interpreted as the arrow from the mean]

[Figure: another view with 1 standard deviation on either side of the mean shaded]

We’ll talk more about dispersion measures in the posts on random sampling distributions.
There are a few guidelines to using the variance/standard deviation:
  • The variance and standard deviation are measures of dispersion/spread for data that is measured on a continuous scale (i.e. interval/ratio scale; review data classification here)
  • The standard deviation and variance are generally not appropriate for ordinal/nominal scale data. Using the variance/standard deviation on ordinal/nominal scale data can lead to meaningless statements of the form:
    • 20% of the time we would expect to see the number of bee hives per acre less than -1.
    • “Our survey yielded a standard deviation for Satisfaction of 2, meaning that a large percentage of our survey respondents are off scale” [Satisfaction is a nominal (and at best ordinal) measure. Statistical procedures involving means and standard deviations have no place in survey analysis, most of the time…]
Given the following 30 normally distributed values:
#Get 30 standard normal values
x<-rnorm(30)
#Display the values
x
##  [1] -1.06697008 -0.99386337 -0.04848545 -0.43638720  0.91749364
##  [6]  0.28708520  1.67698404  0.97897077 -0.59151972  1.85470415
## [11]  0.54272426 -0.39907658  0.50406143 -0.18513261  0.56133026
## [16]  1.03735518 -0.04948910  1.65030321 -0.66896995 -1.02599162
## [21] -0.57135208  0.80466583  1.09648810  0.98932572 -0.16181664
## [26] -1.91483513  0.88486131  1.87548053 -0.17144575 -0.35227207
[Figure: the 30 sample values displayed visually]

The sample mean of the values is 0.2341409. We can construct a table for computation:
#Calculate the variance by hand using a table
#start with our sample values
df                      <- data.frame(x)
#Fill the column with the mean value
df$mean_x               <- rep(mean(x),length(x))
#Find the difference between the individual point and the mean
df$x_minus_mean         <- df$x - df$mean_x
#Square the difference
df$x_minus_mean_squared <- df$x_minus_mean^2

#Show the table
df
##              x    mean_x x_minus_mean x_minus_mean_squared
## 1  -1.06697008 0.2341409  -1.30111095          1.692889710
## 2  -0.99386337 0.2341409  -1.22800424          1.507994418
## 3  -0.04848545 0.2341409  -0.28262633          0.079877642
## 4  -0.43638720 0.2341409  -0.67052807          0.449607898
## 5   0.91749364 0.2341409   0.68335276          0.466971000
## 6   0.28708520 0.2341409   0.05294432          0.002803101
## 7   1.67698404 0.2341409   1.44284316          2.081796396
## 8   0.97897077 0.2341409   0.74482989          0.554771569
## 9  -0.59151972 0.2341409  -0.82566060          0.681715423
## 10  1.85470415 0.2341409   1.62056328          2.626225331
## 11  0.54272426 0.2341409   0.30858338          0.095223705
## 12 -0.39907658 0.2341409  -0.63321746          0.400964352
## 13  0.50406143 0.2341409   0.26992055          0.072857105
## 14 -0.18513261 0.2341409  -0.41927349          0.175790256
## 15  0.56133026 0.2341409   0.32718939          0.107052895
## 16  1.03735518 0.2341409   0.80321431          0.645153222
## 17 -0.04948910 0.2341409  -0.28362998          0.080445965
## 18  1.65030321 0.2341409   1.41616233          2.005515753
## 19 -0.66896995 0.2341409  -0.90311083          0.815609166
## 20 -1.02599162 0.2341409  -1.26013250          1.587933907
## 21 -0.57135208 0.2341409  -0.80549296          0.648818902
## 22  0.80466583 0.2341409   0.57052495          0.325498719
## 23  1.09648810 0.2341409   0.86234722          0.743642728
## 24  0.98932572 0.2341409   0.75518485          0.570304155
## 25 -0.16181664 0.2341409  -0.39595751          0.156782352
## 26 -1.91483513 0.2341409  -2.14897600          4.618097858
## 27  0.88486131 0.2341409   0.65072043          0.423437081
## 28  1.87548053 0.2341409   1.64133966          2.693995874
## 29 -0.17144575 0.2341409  -0.40558662          0.164500509
## 30 -0.35227207 0.2341409  -0.58641295          0.343880149
#The variance is:
var.x <- sum(df$x_minus_mean_squared)/(length(x)-1)
var.x
## [1] 0.924833
#The standard deviation is:
sd.x  <- sqrt(var.x)
sd.x
## [1] 0.9616824
Most software packages include a built-in way to compute variances and standard deviations. In R, the variance is computed using var:
#the variance
var.x<-var(x)
var.x
## [1] 0.924833
#the standard deviation
sd.x<-sqrt(var.x)
sd.x
## [1] 0.9616824
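R also provides sd, which returns the sample standard deviation directly. The sketch below compares it with the square root of var on a fresh sample; the seed value is an arbitrary choice made only so the sample is reproducible, and it will not reproduce the 30 values shown above.

```r
set.seed(42)                # arbitrary seed, for reproducibility only
x <- rnorm(30)

sd_direct <- sd(x)          # built-in standard deviation
sd_manual <- sqrt(var(x))   # square root of the variance, as above

all.equal(sd_direct, sd_manual)
## [1] TRUE
```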
 
