Using R

Tutorials and examples


Matching/Replacing Strings

R provides several tools to (1) identify whether a character string contains a particular pattern, and (2) replace an original pattern within a character string with a new pattern. This is referred to as pattern matching and replacement.

Table of contents

  1. Base functions for pattern matching/replacement
  2. Regular expressions

1. Base functions for pattern matching/replacement

Base R provides two core functions for matching patterns in character strings, grep() and grepl(), and two core functions for replacing patterns within character strings, sub() and gsub().

grep()

The function grep() takes a pattern, attempts to match it against a vector of character strings, and then returns the position(s) in the vector of character strings that contain the pattern:

# Create example character vector
x <- c( "ABC", "abc", "CBA", "cba" )

# Return positions of the
# elements in the vector that
# contain the pattern
# (case-sensitive)
grep( "A", x )
#> [1] 1 3
grep( "CB", x )
#> [1] 3

# Return values from 
# vector that match
grep( "b", x, value = TRUE )
#> [1] "abc" "cba"
grep( "ba", x, value = TRUE )
#> [1] "cba"

# To make case-insensitive
grep( "a", x, ignore.case = TRUE )
#> [1] 1 2 3 4

grepl()

The function grepl() takes a pattern, attempts to match it against a vector of character strings, and then returns a logical vector indicating which character strings in the vector contain the pattern:

# Example character vector
x <- c( "ABC", "abc", "CBA", "cba" )

# Logical vector matching in 
# length to 'x', equal to 
# TRUE for matches
# (Case-sensitive)
grepl( "A", x )
#> [1] TRUE FALSE TRUE FALSE
grepl( "CB", x )
#> [1] FALSE FALSE TRUE FALSE

# Argument to make case-insensitive
grepl( "a", x, ignore.case = TRUE )
#> [1] TRUE TRUE TRUE TRUE
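
The two matching functions return the same information in different forms. The following is a minimal sketch, reusing the example vector from above, of how the logical output of grepl() relates to the positions returned by grep() and how it can be used for subsetting and counting:

```r
# Example character vector
x <- c( "ABC", "abc", "CBA", "cba" )

# Positions from grep() are the
# TRUE entries from grepl()
identical( grep( "A", x ), which( grepl( "A", x ) ) )
#> [1] TRUE

# A logical vector can subset
# the original vector directly
x[ grepl( "A", x ) ]
#> [1] "ABC" "CBA"

# Count the number of elements
# that contain the pattern
sum( grepl( "b", x ) )
#> [1] 2
```

A logical vector is often more convenient than positions when the result needs to be combined with other conditions using & or |.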

sub()

The function sub() takes a pattern and a replacement string, and replaces the first occurrence of the pattern within a character string, doing so for each element in a vector:

# Create example character vector
x <- c( "abc", "ABC", "aabb", "AABB" )

# Replaces 'a' with '1' for 
# first match (case-sensitive)
sub( "a", "1", x )
#> [1] "1bc"  "ABC"  "1abb" "AABB"

# To make case-insensitive
sub( "a", "1", x, ignore.case = TRUE )
#> [1] "1bc"  "1BC"  "1abb" "1ABB"

gsub()

The function gsub() takes a pattern and a replacement string, and replaces all occurrences of the pattern within a character string, doing so for each element in a vector:

# Create example character vector
x <- c( "abc", "ABC", "aabb", "AABB" )

# Replaces 'a' with '1' for 
# all matches (case-sensitive)
gsub( "a", "1", x )
#> [1] "1bc"  "ABC"  "11bb" "AABB"

# To make case-insensitive
gsub( "a", "1", x, ignore.case = TRUE )
#> [1] "1bc"  "1BC"  "11bb" "11BB"
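
As a short side-by-side sketch of the difference between the two replacement functions, the example below applies both to a single string in which a multi-character pattern occurs twice. The string "banana" is an illustrative choice and not part of the examples above:

```r
# String where the pattern 'an'
# occurs twice
x <- "banana"

# sub() replaces only the
# first occurrence
sub( "an", "AN", x )
#> [1] "bANana"

# gsub() replaces every
# occurrence
gsub( "an", "AN", x )
#> [1] "bANANa"
```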

2. Regular expressions

Regular expressions provide syntax that allows a user to search for various combinations of letters, digits, and special characters. The syntax is flexible, and allows much more complicated groupings and combinations than shown in the previous examples. Regular expressions work with a variety of R functions, including grep(), grepl(), sub(), gsub(), and strsplit().
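
Since strsplit() is mentioned here but not shown elsewhere in this section, the following is a minimal sketch of a regular expression used as the split pattern. The input strings are illustrative assumptions; the bracket and quantifier syntax ("[0-9]+" for one or more digits, " ?" for an optional space) is standard regular expression syntax:

```r
# Split a string wherever one
# or more digits occur
strsplit( "a1b22c", "[0-9]+" )[[1]]
#> [1] "a" "b" "c"

# Split on a comma followed by
# an optional space
strsplit( "one, two,three", ", ?" )[[1]]
#> [1] "one"   "two"   "three"
```

Note that strsplit() returns a list with one element per input string, which is why [[1]] is used to extract the character vector.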

Motivating example

Here is an example highlighting how regular expressions provide a user with a more concise, simple way to find complicated patterns of strings:

# Goal: Extract all elements of a 
# character vector with the digits
# 1, 2, or 3, irrespective of order

x <- c( 
  # Different orders of 1, 2, and 3
  "123", "321", "231", 
  # 1, 2, and 3 combined with other digits
  "152536", 
  # Missing some cases and also combined 
  # with other digits
  "14", "25", "36"
)

# Doesn't work: matches only the
# exact sequence '123'
grep( '123', x, value = TRUE )
#> [1] "123"

# Works, but requires extended code 
# with multiple calls
entries <- 
  grepl( '1', x ) | 
  grepl( '2', x ) | 
  grepl( '3', x )
x[ entries ]
#> [1] "123" "321" "231" "152536" "14" "25" "36"

# Regular expressions allow a
# concise, simple call: '[1-3]'
# matches any one of 1, 2, or 3
grep( '[1-3]', x, value = TRUE )
#> [1] "123" "321" "231" "152536" "14" "25" "36"

In other words, in addition to the most basic matching of a pattern within a string (e.g., matching the “a” in “cat”), regular expressions provide a variety of commands to enable much more complex pattern matching. Specifically, a user can specify a complex set of rules for pattern matching in a very concise manner using a variety of metacharacters.
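
To make the idea of metacharacters more concrete, here is a brief sketch using three common ones: "." (any single character), "^" (start of string), and "$" (end of string). The example vector is an illustrative assumption. The sketch also shows two ways to match a metacharacter literally, either by escaping it or by using the fixed = TRUE argument:

```r
# Example character vector
x <- c( "cat", "cot", "coat", "a.c" )

# "." matches any single character
grep( "c.t", x, value = TRUE )
#> [1] "cat" "cot"

# "^" anchors the match to the
# start of the string
grep( "^c", x, value = TRUE )
#> [1] "cat"  "cot"  "coat"

# "$" anchors the match to the
# end of the string
grep( "c$", x, value = TRUE )
#> [1] "a.c"

# To match a literal period, escape
# it or treat the pattern as fixed
grep( "\\.", x, value = TRUE )
#> [1] "a.c"
grep( ".", x, value = TRUE, fixed = TRUE )
#> [1] "a.c"
```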

Character classes or sets

One can use square brackets to match a single character against a set of alternative characters, known as a character class or set:

# Example character vector
x <- c( "1", "2", "3", "4" )

# Goal: Identify which elements in vector 
# contain either "1" or "4"

# Doesn't work
grep( "14", x )
#> integer(0)
# Works but is not concise
c( grep( "1", x ), grep( "4", x ) )
#> [1] 1 4
# Using "[]" allows for single function call
grep( "[14]", x )
#> [1] 1 4

# Goal: Match both "gray" (American) and "grey" (British)
x <- c( "gray", "grey", "black", "white" )

# Matches either "a" OR "e" within a pattern
grep( "gr[ae]y", x )
#> [1] 1 2
# Order doesn't matter
grep( "gr[ea]y", x )
#> [1] 1 2

# Note: a bracket expression "[ ]"
# matches only a single character
x <- c( "Meat", "Meet", "Met", "Mate" )
x[ grep( "M[ea]t", x ) ]
#> [1] "Met"  "Mate"
# To isolate words with double vowels
x[ grep( "M[ea][ea]t", x ) ]
#> [1] "Meat" "Meet"
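
Character classes also support ranges, negation, and predefined POSIX classes. The following sketch illustrates these under the assumption of a small example vector that is not part of the original examples. Note that a POSIX class such as [:digit:] must itself be placed inside square brackets:

```r
# Example character vector
x <- c( "a1", "b2", "c3", "dd" )

# A range matches any single
# character from '0' to '9'
grep( "[0-9]", x, value = TRUE )
#> [1] "a1" "b2" "c3"

# Ranges can be combined in sequence
grep( "[a-b][1-2]", x, value = TRUE )
#> [1] "a1" "b2"

# "^" as the first character inside
# brackets negates the set; here the
# string must not start with a or b
grep( "^[^ab]", x, value = TRUE )
#> [1] "c3" "dd"

# Predefined POSIX class for digits
grep( "[[:digit:]]", x, value = TRUE )
#> [1] "a1" "b2" "c3"
```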

Note: Advanced content.
