Data Frames
Data frames are the standard structure in which you will see most data stored for analytic purposes (e.g., data loaded from an Excel spreadsheet). A data frame is a special type of nested list structure. Specifically, a data frame is a list of vectors of matching length. Because it is a list, the vectors can be of different data types (e.g., numbers, characters, or logical values). However, because the vectors are of matching length, in many ways the data frame can be treated like a matrix, allowing more flexible indexing options compared to a standard list.
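As a quick check of the claim that a data frame is a list of equal-length vectors, the sketch below builds a small data frame and inspects it with base R functions. The name example_df and its columns are made up for illustration.

```r
# Hypothetical data frame used only for illustration
example_df <- data.frame(
  ID = 1:3,
  Label = c( 'A', 'B', 'C' ),
  Flag = c( TRUE, FALSE, TRUE )
)

# A data frame is a list...
is.list( example_df )
#> [1] TRUE

# ...whose elements (the columns) all have the same length
all( lengths( example_df ) == 3 )
#> [1] TRUE

# Like a matrix, it has row and column dimensions
dim( example_df )
#> [1] 3 3
```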
1. Creating data frames
Creating a data frame typically requires a little more thought than creating a matrix. Rather than specifying the number of rows and columns you want (and some default values to initially fill out those rows and columns), you instead have to specify each column separately, providing a column name and an initial value:
# Want a data frame with
# 3 columns, one for numbers,
# one for character strings,
# one for logical values
df <- data.frame(
NMB = rep( 1:3, each = 3 ),
CHR = rep( c( 'Cat', 'Dog', 'Mouse' ), 3 ),
LGC = TRUE
)
Note: For the final column with logical values, rather than specifying 9 values (as we did with the other two columns), we specified a single value. Fortunately, R will automatically repeat this value to match the lengths of the other columns.
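The following sketch checks the recycling behavior described in the note, and also shows its limit: as far as I am aware, data.frame only recycles a shorter column when the longer length is a whole multiple of the shorter one. The column names A and B in the failing call are hypothetical.

```r
# Same data frame as in the example above
df <- data.frame(
  NMB = rep( 1:3, each = 3 ),
  CHR = rep( c( 'Cat', 'Dog', 'Mouse' ), 3 ),
  LGC = TRUE
)

# The single TRUE was repeated to fill all 9 rows
nrow( df )
#> [1] 9
all( df$LGC )
#> [1] TRUE

# 9 rows cannot be filled evenly by 2 values, so this
# call fails; tryCatch captures the error for inspection
res <- tryCatch(
  data.frame( A = 1:9, B = 1:2 ),
  error = function( e ) 'error'
)
res
#> [1] "error"
```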
2. Factors
Note that with older versions of R (i.e., versions prior to 4.0), by default R would convert a column with character strings into a different data type, a unique R data type known as a factor. A factor can be thought of as a hybrid between the character string and integer data types. Specifically, R determines the set of unique character strings (or ‘levels’), and then assigns an integer value to each unique string. Therefore, internally, a factor consists of a vector of integers, but R knows that each integer value is linked to a specific string. Care is needed when working with factors, because they behave differently from both integers (the internal representation closest to a factor) and character strings (what a factor presents as). For example, here is a potential pitfall when representing elements as a factor rather than as a character vector:
# Create a character vector of numbers
num_as_str <- c( '100', '10', '1' )
# R can correctly convert these strings to numbers
as.numeric( num_as_str )
#> [1] 100 10 1
# Convert character vector to factor
num_as_fac <- as.factor( num_as_str )
# Conversion no longer works as expected
as.numeric( num_as_fac )
#> [1] 3 2 1
# Must first convert to character, then to number
as.numeric( as.character( num_as_fac ) )
#> [1] 100 10 1
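The internal representation described above can be inspected directly. This sketch uses only base R functions (levels, as.integer, typeof) to show why the direct conversion returned 3, 2, 1: those are positions in the sorted set of levels, not the original numbers.

```r
num_as_fac <- as.factor( c( '100', '10', '1' ) )

# The levels are the sorted unique strings
levels( num_as_fac )
#> [1] "1"   "10"  "100"

# Internally each element is an integer index into the levels
as.integer( num_as_fac )
#> [1] 3 2 1
typeof( num_as_fac )
#> [1] "integer"

# Indexing the levels by those integers recovers the strings
levels( num_as_fac )[ as.integer( num_as_fac ) ]
#> [1] "100" "10"  "1"
```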
Factors also only allow an element to be replaced with a value that matches one of the levels already defined in the factor:
# Character vector
vec <- c( 'Cats', 'Kittens', 'Felines' )
# Can easily replace one element with
# a different character string
vec[3] <- 'Dog'
# Create factor
vec <- as.factor( c( 'Dog', 'Puppy', 'Canine' ) )
# Attempting to add new character string
# results in strange behavior
vec[3] <- 'Cat' # Produces a warning and leads to missing data
#> Warning message:
#> In `[<-.factor`(`*tmp*`, 3, value = "Cat") :
#> invalid factor level, NA generated
vec
#> [1] Dog Puppy <NA>
#> Levels: Canine Dog Puppy
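If you do need to keep the factor, one possible workaround (a sketch, not the only approach) is to extend the levels before assigning the new value. Alternatively, convert to a character vector and edit that instead. The variable name chr is made up for illustration.

```r
vec <- as.factor( c( 'Dog', 'Puppy', 'Canine' ) )

# Option 1: add the new level first, then assign
levels( vec ) <- c( levels( vec ), 'Cat' )
vec[3] <- 'Cat'
vec
#> [1] Dog   Puppy Cat
#> Levels: Canine Dog Puppy Cat

# Option 2: convert to character and edit the strings
chr <- as.character( as.factor( c( 'Dog', 'Puppy', 'Canine' ) ) )
chr[3] <- 'Cat'
chr
#> [1] "Dog"   "Puppy" "Cat"
```

Note that with the first option the unused level 'Canine' remains defined in the factor.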
Users must be careful and understand how factors represent the raw data. Some common issues you might come across are:
- Dates intended to be represented as strings are instead converted to factors;
- A typo in a numeric column (e.g., the letter 'O' in place of a zero) causes the column to be read as strings and converted to a factor, with subsequent conversions resulting in incorrect numeric values;
- Trying to add new data to a column can produce errors or missing data, because factors only accept currently defined levels.
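The second issue in the list above can be sketched as follows. The vector raw is a hypothetical column as it might arrive from a file, with one entry containing the letter O instead of a zero.

```r
# Hypothetical column; '2.O' contains the letter O
raw <- c( '10', '2.O', '5' )

# As a factor, direct conversion returns level codes,
# which look like valid numbers but are not the data
as.numeric( as.factor( raw ) )
#> [1] 1 2 3

# Converting the strings directly flags the typo as NA
# (R also issues a coercion warning, suppressed here)
vals <- suppressWarnings( as.numeric( raw ) )
vals
#> [1] 10 NA  5

# Locate the problematic entry
which( is.na( vals ) )
#> [1] 2
```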
While factors have their uses, in general I recommend avoiding this data type and using character vectors instead. Fortunately, newer versions of R (version 4.0 onward) no longer automatically convert character strings to factors. Furthermore, the stringsAsFactors argument lets you set this behavior explicitly when creating data frames, regardless of the version's default:
# Want a data frame with
# 3 columns, one for numbers,
# one for character strings,
# one for logical values
df <- data.frame(
NMB = rep( 1:3, each = 3 ),
CHR = rep( c( 'Cat', 'Dog', 'Mouse' ), 3 ),
LGC = TRUE,
# Override conversion of strings to factors
stringsAsFactors = FALSE
)
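To confirm what the argument does, the sketch below creates the same column with both settings and checks the resulting class. The names df_chr and df_fac are made up for illustration.

```r
# Strings kept as character vectors
df_chr <- data.frame(
  CHR = c( 'Cat', 'Dog' ),
  stringsAsFactors = FALSE
)
class( df_chr$CHR )
#> [1] "character"

# Strings converted to factors (the pre-4.0 default)
df_fac <- data.frame(
  CHR = c( 'Cat', 'Dog' ),
  stringsAsFactors = TRUE
)
class( df_fac$CHR )
#> [1] "factor"
```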
3. Indexing data frames
There is a great deal of flexibility in accessing elements from data frames, along with some important nuances to consider.
First, you can access individual elements in a data frame in the same way you access elements in a matrix, and also in the same way you access internal elements from a list:
# Example data frame
df <- data.frame(
Col1 = 1:3,
Col2 = 4:6
)
# Access element in row 1, column 2
# using method for indexing matrices
df[ 1, 2 ]
#> [1] 4
# Access element in row 1, column 2
# using method for indexing values within a list
df[[2]][ 1 ]
#> [1] 4
df$Col2[ 1 ] # Named list approach
#> [1] 4
Similar logic applies to accessing columns:
# Example data frame
df <- data.frame(
Col1 = 1:3,
Col2 = 4:6
)
# Access column 1
# using method for indexing matrices
df[ , 1 ]
#> [1] 1 2 3
# Access column 1
# using method for indexing list
df[[1]]
#> [1] 1 2 3
df$Col1 # Named list approach
#> [1] 1 2 3
Again, one can index a data frame with a sequence of integers as well:
# Example data frame
df <- data.frame(
Col1 = 1:3,
Col2 = 4:6
)
# Access first two rows of column 1
# using method for indexing matrices
df[ 1:2 , 1 ]
#> [1] 1 2
# Access first two rows of column 1
# using method for indexing list
df[[1]][ 1:2 ]
#> [1] 1 2
df$Col1[ 1:2 ] # Named list approach
#> [1] 1 2
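One nuance worth checking for yourself is whether a given indexing method returns a vector or a data frame. The sketch below compares the list-style and matrix-style approaches, using the drop argument of the bracket operator to keep the data frame structure.

```r
df <- data.frame(
  Col1 = 1:3,
  Col2 = 4:6
)

# Double brackets return the column itself (a vector)
is.data.frame( df[[1]] )
#> [1] FALSE

# Single brackets (list-style) return a one-column data frame
is.data.frame( df[1] )
#> [1] TRUE

# Matrix-style indexing of one column returns a vector...
is.data.frame( df[ , 1 ] )
#> [1] FALSE

# ...unless drop = FALSE is specified
is.data.frame( df[ , 1, drop = FALSE ] )
#> [1] TRUE
```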
Because a data frame is a list of vectors, different data types are preserved across columns:
# Example data frame
df <- data.frame(
Col1 = 1:3,
Col2 = TRUE
)
# Extract column 1
vec <- df$Col1
# Vector is numeric (stored as class 'integer')
is.numeric( vec )
#> [1] TRUE
# Extract column 2
vec <- df$Col2
# Vector is of class 'logical'
is.logical( vec )
#> [1] TRUE
However, things are more complicated when accessing rows. Pulling out a single row does not return a vector (in contrast to a matrix). Instead, even though it is a single row, R still treats it as a list, specifically a data frame. To obtain a vector, one must convert the row using the unlist command:
# Example data frame
# with two numeric columns
df <- data.frame(
Col1 = 1:3,
Col2 = 4:6
)
# Access first row
df[1,]
#> Col1 Col2
#> 1 1 4
# Save row
vec <- df[1,]
is.vector( vec )
#> [1] FALSE
is.list( vec )
#> [1] TRUE
# Convert to vector
is.vector( unlist( vec ) )
#> [1] TRUE
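One caveat when a row spans columns of different data types: because a vector can hold only one type, unlist coerces every value to a common type. The sketch below uses a hypothetical data frame df_mixed to show this, along with as.list as one way to keep each value's type.

```r
# Hypothetical data frame with a numeric and a character column
df_mixed <- data.frame(
  Col1 = 1:3,
  Col2 = c( 'a', 'b', 'c' ),
  stringsAsFactors = FALSE
)

# Unlisting a row coerces all values to character
row_vec <- unlist( df_mixed[1,] )
is.character( row_vec )
#> [1] TRUE
row_vec[[ 'Col1' ]]
#> [1] "1"

# Converting to a plain list preserves each column's type
row_lst <- as.list( df_mixed[1,] )
is.numeric( row_lst$Col1 )
#> [1] TRUE
is.character( row_lst$Col2 )
#> [1] TRUE
```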