Saturday, June 6, 2015

Introduction to R Data Types

Data Types in R

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  



An important first step in understanding the R programming language is gaining an understanding of how R represents different types of data. This post will give an introduction to the range of data types available.

Primitive Values

The R programming language has a few primitive (called atomic) types that are going to be relevant to most people who aspire to develop in R:

Raw: Holds raw byte values

Logical: boolean values TRUE or FALSE

a <- TRUE # a true value
a <- T    # another way; T is an ordinary variable that can be reassigned, so TRUE is safer
a
## [1] TRUE
b <- FALSE # a false value
b <- F     # another way; the same caveat applies to F
b
## [1] FALSE

Numeric/Integer: whole numbers like 1, 2, 3, …

a <- 42L # an integer value; without the L suffix, 42 would be stored as a double

Numeric/Double: floating-point numbers like 1.1, 1.2, .99, …

a <- 3.14159 # a floating point Value

Complex: complex numbers \(a + bi\), where \(i = \sqrt{-1}\)

a <- complex(real = 1, imaginary=2.1) # a complex Value

Character: String values like “a” “badly” “documented” “language” “is” “hard” “to” “learn”

a <- "programmer" # a string value
There are a few additional types that appear, but will be addressed in other parts of the tutorial:
  • Expressions: Parsed strings of R code
  • Symbols: Names that refer to R objects; they also appear in the expressions used to add mathematical notation to plots
  • Functions: Perform a set of operations on a set of inputs and may or may not return a result
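
A quick way to confirm which atomic type a value has is the typeof function. The short sketch below only uses base R; the literal values are arbitrary examples chosen for illustration.

```r
# Check the storage type of one example value per atomic type
typeof(TRUE)
## [1] "logical"
typeof(42L)          # the L suffix requests an integer
## [1] "integer"
typeof(42)           # a plain number literal is a double
## [1] "double"
typeof(complex(real = 1, imaginary = 2.1))
## [1] "complex"
typeof("programmer")
## [1] "character"
typeof(as.raw(255))  # as.raw converts a number in 0..255 to a raw byte
## [1] "raw"
is.integer(42)       # FALSE, because 42 without L is a double
## [1] FALSE
```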

Vectors

The need arises frequently (perhaps more frequently than storing a single value) to store more than one value and access the values in an efficient way. This is accomplished using vectors, arrays (covered later), data frames (covered later), and lists (covered later).
The easiest way to create a vector is with the rep, c, or seq functions or the : operator:
rep creates a vector with the first argument repeated n times:
a<-rep(1,10) #Fill a vector of length 10 with the value 1
a
##  [1] 1 1 1 1 1 1 1 1 1 1
c concatenates values/vectors into a larger vector:
a<-c(1,2,3) # a simple vector
a
## [1] 1 2 3
b<-c(4,5,6) # another vector
b
## [1] 4 5 6
a_b<-c(a,b) # combination of vectors
a_b
## [1] 1 2 3 4 5 6
: builds ascending or descending numeric sequences/vectors (with a step size of 1):
a<-1:10 #Ascending Sequence from 1 to 10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
a<-5:-5 #Descending Sequence from 5 to -5
a
##  [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5
a<-1.1:10.1 #Non-Integer Ascending Sequence
a
##  [1]  1.1  2.1  3.1  4.1  5.1  6.1  7.1  8.1  9.1 10.1
seq generates ascending or descending sequences using a start point and an end point, then either a length or an increment:
a<-seq(from=1,to=10,by=1) #Equivalent to 1:10
a<-seq(1,10,1) #shorter version with from, to, increment...
a<-seq(1,10,length.out=10) #Equivalent to 1:10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
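The examples above all use a step size of 1 and repeat a single value. As a small sketch of the other options, rep also accepts the times and each arguments, and seq accepts a fractional or negative increment (the values here are arbitrary):

```r
rep(c(1, 2), times = 3)  # repeat the whole vector three times
## [1] 1 2 1 2 1 2
rep(c(1, 2), each = 2)   # repeat each element twice
## [1] 1 1 2 2
seq(0, 1, by = 0.25)     # non-integer increment
## [1] 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)      # descending sequence with a negative increment
## [1] 10  7  4  1
```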
Individual vector elements can be accessed with [ and ]. Some illustrative examples:
a[5]  #The fifth element
## [1] 5
a[-5] #Not the fifth element
## [1]  1  2  3  4  6  7  8  9 10
a[1:3] #Elements 1,2, and 3
## [1] 1 2 3
a[c(1,2,5,7,10)] #Elements 1,2,5,7, and 10
## [1]  1  2  5  7 10
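Besides numeric positions, a vector can also be indexed with a logical condition. The sketch below uses a separate vector x (a name chosen only for this illustration) so that it does not disturb the a used above:

```r
x <- c(3, 8, 1, 9, 4)
x[x > 3]        # values greater than 3
## [1] 8 9 4
which(x > 3)    # positions where the condition holds
## [1] 2 4 5
x[-c(1, 2)]     # everything except the first two elements
## [1] 1 9 4
```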
More than one element of a vector can be changed in a single assignment:
a<-1:10
a #starting point
##  [1]  1  2  3  4  5  6  7  8  9 10
a[5]<-20 #Change 1 element
a
##  [1]  1  2  3  4 20  6  7  8  9 10
a[c(1,2,3)] <- c(10,20,30) #Change 3 elements
a
##  [1] 10 20 30  4 20  6  7  8  9 10
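The same assignment syntax works with a logical condition, and a single replacement value is recycled across every selected position. A minimal sketch, using a separate vector v so the example stands alone:

```r
v <- 1:10
v[v > 5] <- 0       # set every element greater than 5 to 0
v
##  [1] 1 2 3 4 5 0 0 0 0 0
v[c(1, 2)] <- 99    # one value is recycled across both positions
v
##  [1] 99 99  3  4  5  0  0  0  0  0
```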

Matrices and Data Frames

R has a rich capability to natively build and manipulate matrices (memory efficient structures of only one primitive type) and data frames (structures that can hold vectors of differing primitive types). Selection of data frame vs. matrix usually depends on what libraries you are using and/or what the functions you develop expect as input.
Matrices can be created and filled at the same time:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
a
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
a<-matrix(data=1:9,nrow=3,ncol=3, byrow=TRUE) #Fill by row using data
a
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
Assuming that the example above is a data matrix, an equivalent data frame can be built:
#Note: c1,c2,c3 become the column names. Data frames are filled by columns
a<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
a
##   c1 c2 c3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9
#The data frame equivalent of the matrix filled by row
a<-data.frame(c1=c(1,4,7),c2=c(2,5,8),c3=c(3,6,9))
a
##   c1 c2 c3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
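To illustrate the difference in typing between the two structures, the sketch below builds a matrix and a data frame from mixed inputs. The names m, d, n, s, and l are arbitrary, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version:

```r
m <- matrix(c(1, "a", TRUE), nrow = 1)  # mixed inputs are coerced to one type
typeof(m)
## [1] "character"
d <- data.frame(n = c(1, 2), s = c("a", "b"), l = c(TRUE, FALSE),
                stringsAsFactors = FALSE)
class(d$n)                              # each column keeps its own type
## [1] "numeric"
class(d$s)
## [1] "character"
class(d$l)
## [1] "logical"
```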
Individual elements in both matrices and data frames can be accessed using [ and ]:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#Use [row_indices, column_indices]:
a[3,1]
## [1] 3
b[3,1]
## [1] 3
#Entire rows and columns can be selected
a[1,] #Entire first row of a matrix
## [1] 1 4 7
b[1,] #Entire first row of a data frame
##   c1 c2 c3
## 1  1  4  7
a[,1] #Entire first column of a matrix
## [1] 1 2 3
b[,1] #Entire first column of a data frame
## [1] 1 2 3
b$c1  #Data frames only - access by column name
## [1] 1 2 3
a[c(1,3),] #rows 1 and 3 of a matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    3    6    9
b[c(1,3),] #rows 1 and 3 of a data frame
##   c1 c2 c3
## 1  1  4  7
## 3  3  6  9
a[,c(1,3)] #columns 1 and 3 of a matrix
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
b[,c(1,3)] #columns 1 and 3 of a data frame
##   c1 c3
## 1  1  7
## 2  2  8
## 3  3  9
#It is possible to do selection of multiple rows/columns:
a[c(1,3),c(1,3)] #Selects combinations of rows/columns
##      [,1] [,2]
## [1,]    1    7
## [2,]    3    9
b[c(1,3),c(1,3)] #Selects combinations of rows/columns
##   c1 c3
## 1  1  7
## 3  3  9
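Two further selection idioms are worth a short sketch: keeping a single matrix row as a matrix with drop = FALSE, and filtering data frame rows with a logical condition. The objects m and d below are fresh copies of the matrix and data frame used above, renamed only for this illustration:

```r
m <- matrix(data = 1:9, nrow = 3, ncol = 3)
d <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
m[1, , drop = FALSE]  # keep the first row as a 1x3 matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
d[d$c1 > 1, ]         # rows where column c1 is greater than 1
##   c1 c2 c3
## 2  2  5  8
## 3  3  6  9
d[["c2"]]             # a column by name, returned as a plain vector
## [1] 4 5 6
```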
The number of rows/columns can be determined using nrow, ncol, and dim:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#The most straightforward way...
nrow(a) #Number of rows
## [1] 3
ncol(a) #Number of columns
## [1] 3
dim(a) #The dimension of the matrix
## [1] 3 3
dim(b) #The dimension of the data frame
## [1] 3 3
#Another way that isn't so readable...
length(a[,1]) #Number of rows
## [1] 3
length(a[1,]) #Number of columns
## [1] 3
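One thing to watch for when using length directly on the whole object: it means different things for the two structures. A brief sketch, with m and d as stand-ins for the matrix and data frame above:

```r
m <- matrix(data = 1:9, nrow = 3, ncol = 3)
d <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))
length(m)  # total number of elements in the matrix
## [1] 9
length(d)  # number of columns in the data frame
## [1] 3
```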

Arrays - Higher Dimensional Objects

Sometimes it is convenient to be able to index an object by 3 or more indices. In this case, arrays are needed. For 2 or fewer dimensions, use matrices and data frames (above).
a<-array(1:27, dim = c(3,3,3)) #Create a 3x3x3 data structure
a
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
dim(a) #In the case of arrays, use dim instead of nrow/ncol to get the maximum indices
## [1] 3 3 3
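Array elements are selected with one index per dimension, in the same way as matrices. The sketch below rebuilds the 3x3x3 array under the name arr (chosen only for this illustration):

```r
arr <- array(1:27, dim = c(3, 3, 3))
arr[2, 3, 1]  # row 2, column 3 of the first slice
## [1] 8
arr[, , 2]    # the whole second slice, returned as a 3x3 matrix
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
arr[1, 1, ]   # element [1,1] of every slice
## [1]  1 10 19
```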

Lists

Lists are the closest thing among R's basic data structures to associative arrays: each element can be of any type and can be given a name. Lists allow an arbitrary mapping from names to values to be created and accessed:
a<-list(e1 = c(1,2,3), e2 = matrix(1:4,2,2), e3 = FALSE)
a
## $e1
## [1] 1 2 3
## 
## $e2
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $e3
## [1] FALSE
Lists are accessed a little bit differently. Similar to data frames, named elements can be accessed:
# How to determine names of list elements
names(a)
## [1] "e1" "e2" "e3"
# Access elements by name
a$e1 #The first list element
## [1] 1 2 3
a$e2 #The second list element
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a$e3 #The third list element
## [1] FALSE
#Access elements by name in a different way
a[["e1"]]
## [1] 1 2 3
a[["e2"]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a[["e3"]]
## [1] FALSE
Lists can also be accessed with the [[ and ]] operators and numeric indices:
#How to determine a list length
length(a)
## [1] 3
#Access elements by index:
a[[1]] #The first list element
## [1] 1 2 3
a[[2]] #The second list element
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
a[[3]] #The third list element
## [1] FALSE
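It may help to see how single brackets differ from double brackets, and how elements are added or removed. This is a minimal sketch on a smaller list named lst (a name chosen only for this example):

```r
lst <- list(e1 = c(1, 2, 3), e3 = FALSE)
class(lst["e1"])    # single brackets return a list containing the element
## [1] "list"
class(lst[["e1"]])  # double brackets return the element itself
## [1] "numeric"
lst$e4 <- "new"     # add a new named element
names(lst)
## [1] "e1" "e3" "e4"
lst$e3 <- NULL      # assigning NULL removes an element
names(lst)
## [1] "e1" "e4"
```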
Lists are most commonly used in R to represent more complex data structures and to implement a version of typing for data structures that have states and behaviors of their own (a different post will discuss object-oriented R in more detail).
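As a rough sketch of what that typing can look like, a class label can be attached to a list and a function can be written for that label. The class name person, its fields, and the print.person function are all hypothetical names invented for this illustration:

```r
# A hypothetical "person" record built on a list
p <- list(name = "Ada", age = 36)
class(p) <- "person"  # attach a class label to the list

# A print method that R dispatches to for objects of class "person"
print.person <- function(x, ...) {
  cat("Person:", x$name, "- age", x$age, "\n")
  invisible(x)
}

inherits(p, "person")
## [1] TRUE
print(p)              # uses print.person instead of the default list printing
```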
