Data Types in R
An important first step in understanding the R programming language is gaining an understanding of how R represents different types of data. This post will give an introduction to the range of data types available.
Primitive Values
The R programming language has a few primitive (called atomic) types that are going to be relevant to most people who aspire to develop in R:
Raw: Holds raw byte values
Logical: Boolean values TRUE or FALSE
a <- TRUE # a true value
a <- T # shorthand; T is an ordinary variable that can be reassigned, so TRUE is safer
a
## [1] TRUE
b <- FALSE # a false value
b <- F # shorthand; F can be reassigned just like T, so FALSE is safer
b
## [1] FALSE
Numeric/Integer: whole numbers like 1, 2, 3, …
a <- 42L # an integer value (the L suffix is needed; a bare 42 is stored as a double)
Numeric/Double: floating-point numbers like 1.1, 1.2, .99, …
a <- 3.14159 # a floating point Value
Complex: complex numbers of the form \(a + bi\), where \(i = \sqrt{-1}\)
a <- complex(real = 1, imaginary=2.1) # a complex Value
Character: String values like “a” “badly” “documented” “language” “is” “hard” “to” “learn”
a <- "programmer" # a string value
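As a quick check of the types listed above, the built-in typeof function reports how R stores a value. The variable names x_dbl and x_int are only for illustration:

```r
# Inspect how R stores a few literal values
x_dbl <- 42            # a bare number is stored as a double
x_int <- 42L           # the L suffix requests an integer
typeof(x_dbl)
## [1] "double"
typeof(x_int)
## [1] "integer"
typeof(TRUE)
## [1] "logical"
typeof(complex(real = 1, imaginary = 2.1))
## [1] "complex"
typeof("programmer")
## [1] "character"
typeof(as.raw(255))    # as.raw converts a number in 0..255 to a raw byte
## [1] "raw"
is.numeric(x_int)      # integers and doubles both count as "numeric"
## [1] TRUE
```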
There are a few additional types that appear, but will be addressed in other parts of the tutorial:
- Expressions: Parsed strings of R code
- Symbols: Names that refer to R objects; they appear in unevaluated code, such as the expressions used to insert mathematical notation into plots
- Functions: Performs a set of operations on a set of inputs and may or may not return a result
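For a first impression of these three types, here is a minimal sketch using base R only. The names ex, sym, and square are made up for this example:

```r
# An expression: R code parsed from a string but not yet evaluated
ex <- parse(text = "1 + 2")
typeof(ex)
## [1] "expression"
eval(ex[[1]])          # evaluate the parsed code
## [1] 3

# A symbol: the name of an object, not its value
sym <- as.name("x")
typeof(sym)
## [1] "symbol"

# A function: takes inputs and (here) returns a result
square <- function(x) x^2
square(4)
## [1] 16
```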
Vectors
The need arises frequently (perhaps more frequently than storing a single value) to store more than one value and access the values in an efficient way. This is accomplished using vectors, arrays (covered later), data frames (covered later), and lists (covered later).
The easiest way to create a vector is with the rep, c, and seq commands or the : operator:
rep creates a vector by repeating its first argument the number of times given by its second argument:
a<-rep(1,10) #Fill a vector of length 10 with the value 1
a
## [1] 1 1 1 1 1 1 1 1 1 1
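rep also accepts a vector as its first argument. The sketch below uses its times and each arguments to contrast repeating the whole vector with repeating each element. The variable names are illustrative:

```r
r_times <- rep(c(1, 2), times = 3)  # repeat the whole vector 3 times
r_times
## [1] 1 2 1 2 1 2
r_each <- rep(c(1, 2), each = 3)    # repeat each element 3 times
r_each
## [1] 1 1 1 2 2 2
```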
c concatenates values/vectors into a larger vector:
a<-c(1,2,3) # a simple vector
a
## [1] 1 2 3
b<-c(4,5,6) # another vector
b
## [1] 4 5 6
a_b<-c(a,b) # combination of vectors
a_b
## [1] 1 2 3 4 5 6
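Because a vector holds only one atomic type, c converts mixed inputs to a common type. A small sketch of that behavior, with illustrative variable names:

```r
mixed <- c(1, "two", TRUE)   # numbers and logicals are converted to strings
typeof(mixed)
## [1] "character"
nums <- c(1, TRUE, FALSE)    # logicals become 1 and 0 among numbers
nums
## [1] 1 1 0
```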
: builds ascending or descending numeric sequences/vectors (with a step size of 1):
a<-1:10 #Ascending Sequence from 1 to 10
a
## [1] 1 2 3 4 5 6 7 8 9 10
a<-5:-5 #Descending Sequence from 5 to -5
a
## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
a<-1.1:10.1 #Non-Integer Ascending Sequence
a
## [1] 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1
seq generates ascending or descending sequences using a start point and an end point, then either a length or an increment:
a<-seq(from=1,to=10,by=1) #Equivalent to 1:10
a<-seq(1,10,1) #shorter version with from, to, increment...
a<-seq(1,10,length.out=10) #Equivalent to 1:10
a
## [1] 1 2 3 4 5 6 7 8 9 10
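seq is most useful when the step is not 1. A few more cases, stored under illustrative names so that a keeps its value:

```r
s_frac <- seq(0, 1, by = 0.25)        # fractional increment
s_frac
## [1] 0.00 0.25 0.50 0.75 1.00
s_down <- seq(10, 1, by = -3)         # a negative increment counts down
s_down
## [1] 10  7  4  1
s_len <- seq(0, 1, length.out = 5)    # same values, specified by length
s_len
## [1] 0.00 0.25 0.50 0.75 1.00
```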
Individual vector elements can be accessed with [ and ]. Some illustrative examples:
a[5] #The fifth element
## [1] 5
a[-5] #Not the fifth element
## [1] 1 2 3 4 6 7 8 9 10
a[1:3] #Elements 1,2, and 3
## [1] 1 2 3
a[c(1,2,5,7,10)] #Elements 1,2,5,7, and 10
## [1] 1 2 5 7 10
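Two further indexing forms go slightly beyond the examples above: several elements can be excluded at once, and a logical condition can serve as the index. The vector v is illustrative:

```r
v <- 1:10
v[-(1:3)]         # everything except elements 1, 2, and 3
## [1]  4  5  6  7  8  9 10
v[v > 7]          # only the elements greater than 7
## [1]  8  9 10
v[v %% 2 == 0]    # only the even elements
## [1]  2  4  6  8 10
```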
Elements can be changed by assigning to an index, and more than one element of a vector can be changed at a time:
a<-1:10
a #starting point
## [1] 1 2 3 4 5 6 7 8 9 10
a[5]<-20 #Change 1 element
a
## [1] 1 2 3 4 20 6 7 8 9 10
a[c(1,2,3)] <- c(10,20,30) #Change 3 elements
a
## [1] 10 20 30 4 20 6 7 8 9 10
Matrices and Data Frames
R has a rich capability to natively build and manipulate matrices (memory efficient structures of only one primitive type) and data frames (structures that can hold vectors of differing primitive types). Selection of data frame vs. matrix usually depends on what libraries you are using and/or what the functions you develop expect as input.
Matrices can be created and filled at the same time:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
a
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
a<-matrix(data=1:9,nrow=3,ncol=3, byrow=TRUE) #Fill by row using data
a
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Assuming that the example above is a data matrix, an equivalent data frame can be built:
#Note: c1,c2,c3 become the column names. Data frames are filled by columns
a<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
a
## c1 c2 c3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
a<-data.frame(c1=c(1,4,7),c2=c(2,5,8),c3=c(3,6,9))
a
## c1 c2 c3
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9
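The practical difference between the two structures shows up when the columns have different types. In this sketch the data and names (people, m) are invented for illustration. stringsAsFactors = FALSE is passed explicitly because older R versions would otherwise convert strings to factors:

```r
# A data frame keeps each column's own type
people <- data.frame(name = c("Ann", "Bob"),
                     age = c(31, 45),
                     member = c(TRUE, FALSE),
                     stringsAsFactors = FALSE)
class(people$name)
## [1] "character"
class(people$age)
## [1] "numeric"
class(people$member)
## [1] "logical"

# A matrix holds one type, so the numbers are converted to strings
m <- cbind(c("Ann", "Bob"), c(31, 45))
typeof(m)
## [1] "character"
```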
Individual elements in both matrices and data frames can be accessed using [ and ]:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#Use [row_indices, column_indices]:
a[3,1]
## [1] 3
b[3,1]
## [1] 3
#Entire rows and columns can be selected
a[1,] #Entire first row of a matrix
## [1] 1 4 7
b[1,] #Entire first row of a data frame
## c1 c2 c3
## 1 1 4 7
a[,1] #Entire first column of a matrix
## [1] 1 2 3
b[,1] #Entire first column of a data frame
## [1] 1 2 3
b$c1 #Data frames only - access by column name
## [1] 1 2 3
a[c(1,3),] #rows 1 and 3 of a matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 3 6 9
b[c(1,3),] #rows 1 and 3 of a data frame
## c1 c2 c3
## 1 1 4 7
## 3 3 6 9
a[,c(1,3)] #columns 1 and 3 of a matrix
## [,1] [,2]
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
b[,c(1,3)] #columns 1 and 3 of a data frame
## c1 c3
## 1 1 7
## 2 2 8
## 3 3 9
#It is possible to do selection of multiple rows/columns:
a[c(1,3),c(1,3)] #Selects combinations of rows/columns
## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
b[c(1,3),c(1,3)] #Selects combinations of rows/columns
## c1 c3
## 1 1 7
## 3 3 9
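As the outputs above show, selecting a single column returns a plain vector rather than a one-column matrix or data frame. The drop argument of [ controls this. A short sketch with illustrative names m and d:

```r
m <- matrix(data = 1:9, nrow = 3, ncol = 3)
d <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6), c3 = c(7, 8, 9))

dim(m[, 1])                         # a plain vector has no dimensions
## NULL
dim(m[, 1, drop = FALSE])           # still a 3x1 matrix
## [1] 3 1
is.data.frame(d[, 1])               # a plain vector
## [1] FALSE
is.data.frame(d[, 1, drop = FALSE]) # still a data frame with one column
## [1] TRUE
```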
The number of rows/columns can be determined using nrow, ncol, and dim:
a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
b<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
#The most straightforward way...
nrow(a) #Number of rows
## [1] 3
ncol(a) #Number of columns
## [1] 3
dim(a) #The dimension of the matrix
## [1] 3 3
dim(b) #The dimension of the data frame
## [1] 3 3
#Another way that isn't so readable...
length(a[,1]) #Number of rows
## [1] 3
length(a[1,]) #Number of columns
## [1] 3
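nrow and ncol work on data frames as well, but length behaves differently for the two structures. The non-square example below makes the difference visible. The names m and d are illustrative:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)
d <- data.frame(c1 = c(1, 2, 3), c2 = c(4, 5, 6))

nrow(d)     # rows of the data frame
## [1] 3
ncol(d)     # columns of the data frame
## [1] 2
length(m)   # a matrix counts every element
## [1] 6
length(d)   # a data frame counts its columns
## [1] 2
```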
Arrays - Higher Dimensional Objects
Sometimes it is convenient to be able to index an object by 3 or more indices. In this case, arrays are needed. For 2 or fewer dimensions, use matrices and data frames (above).
a<-array(1:27, dim = c(3,3,3)) #Create a 3x3x3 data structure
a
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 19 22 25
## [2,] 20 23 26
## [3,] 21 24 27
dim(a) #In the case of arrays, use dim instead of nrow/ncol to get the maximum indices
## [1] 3 3 3
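Array elements are accessed the same way as matrix elements, with one index per dimension. A brief sketch, using the illustrative name arr for the same 3x3x3 array:

```r
arr <- array(1:27, dim = c(3, 3, 3))
arr[2, 3, 1]        # row 2, column 3 of the first slice
## [1] 8
arr[1, 2, 3]        # row 1, column 2 of the third slice
## [1] 22
dim(arr[, , 2])     # leaving indices empty selects a whole 3x3 slice
## [1] 3 3
```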
Lists
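Lists are ordered collections whose elements can be of any type, including vectors, matrices, and other lists. When the elements are named, a list can be used much like an associative array/hash table, mapping names to arbitrary values: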
a<-list(e1 = c(1,2,3), e2 = matrix(1:4,2,2), e3 = FALSE)
a
## $e1
## [1] 1 2 3
##
## $e2
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $e3
## [1] FALSE
Lists are accessed a little bit differently. Similar to data frames, named elements can be accessed:
# How to determine names of list elements
names(a)
## [1] "e1" "e2" "e3"
# Access elements by name
a$e1 #The first list element
## [1] 1 2 3
a$e2 #The second list element
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
a$e3 #The third list element
## [1] FALSE
#Access elements by name in a different way
a[["e1"]]
## [1] 1 2 3
a[["e2"]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
a[["e3"]]
## [1] FALSE
Lists can also be accessed with the [[ and ]] operators and numeric indices:
#How to determine a list length
length(a)
## [1] 3
#Access elements by index:
a[[1]] #The first list element
## [1] 1 2 3
a[[2]] #The second list element
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
a[[3]] #The third list element
## [1] FALSE
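Single brackets also work on lists, but they return a smaller list rather than the element itself. The sketch below shows that difference, and how elements can be added or removed by name. The list lst is illustrative:

```r
lst <- list(e1 = c(1, 2, 3), e3 = FALSE)

class(lst["e1"])     # single brackets: a list containing e1
## [1] "list"
class(lst[["e1"]])   # double brackets: the element itself
## [1] "numeric"

lst$e4 <- "new"      # assigning to a new name adds an element
names(lst)
## [1] "e1" "e3" "e4"
lst$e3 <- NULL       # assigning NULL removes an element
length(lst)
## [1] 2
```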
Lists are most commonly used in R to represent more complex data structures and to implement a version of typing for complicated data structures that can have states and behaviors of their own (a different post will discuss object-oriented R in more detail).
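As a small preview of that idea, one common approach (R's S3 system) attaches a class attribute to a list and defines functions for that class. The account class and the names new_account and print.account are hypothetical, made up for this sketch:

```r
# Constructor: a list holding the state, tagged with a class name
new_account <- function(owner, balance = 0) {
  structure(list(owner = owner, balance = balance), class = "account")
}

# Behavior: a print method that R calls for objects of class "account"
print.account <- function(x, ...) {
  cat("Account of", x$owner, "with balance", x$balance, "\n")
  invisible(x)
}

acct <- new_account("Ann", 100)
class(acct)
## [1] "account"
is.list(acct)        # underneath, it is still a list
## [1] TRUE
acct$balance
## [1] 100
```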