Base R’s sort() order changes with locale


And that can cause frustrating test failures in devtools::check()!

Unit tests for a package I help maintain were passing when I ran them with devtools::test(), but failing when I ran them with devtools::check().

Eventually I found a blog post that linked to a Stack Overflow question describing the same problem; in that case it was caused by the base sort() function. It dawned on me that I had recently started using sort() in the branch I was working on!

When sorting character vectors, the sort order depends on the collation rules of the current locale, which R takes from the LC_COLLATE locale category (usually set through the environment variable of the same name). devtools::check() and devtools::test() were running with different LC_COLLATE values, so sort() returned different orders in some of the unit tests.
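To see which collation a session is using, and to try out another one without leaving it changed, something like the following sketch may help. The helper name with_collation() is hypothetical (it is not part of base R or devtools), and the sketch assumes the requested locale is installed on the system.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_COLLATE setting,
# then restore the previous setting even if `expr` throws an error.
with_collation <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
  if (!nzchar(suppressWarnings(Sys.setlocale("LC_COLLATE", locale)))) {
    stop("Locale not available on this system: ", locale)
  }
  force(expr)  # `expr` is lazily evaluated, so it runs under the new setting
}

# Collation of the current session
Sys.getlocale("LC_COLLATE")

# Sort under the "C" collation without changing the session afterwards
with_collation("C", sort(c("b", "A", "a", "B")))
#> [1] "A" "B" "a" "b"
```

Running a failing test's sort() call through a helper like this, once per suspected locale, is a quick way to confirm whether collation is the culprit.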

Sort order depends on locale

char_vect <- c(letters, LETTERS)

Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
#> [1] "en_US.UTF-8"
sorted_eng <- sort(char_vect)
sorted_eng
#>  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H"
#> [17] "i" "I" "j" "J" "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P"
#> [33] "q" "Q" "r" "R" "s" "S" "t" "T" "u" "U" "v" "V" "w" "W" "x" "X"
#> [49] "y" "Y" "z" "Z"
Sys.setlocale("LC_COLLATE", "C")
#> [1] "C"
sorted_c <- sort(char_vect)
sorted_c
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
#> [17] "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f"
#> [33] "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
#> [49] "w" "x" "y" "z"
all(sorted_eng == sorted_c)
#> [1] FALSE

Solutions

1. sort(method="radix")

Use base sort() with method = 'radix', the only method where the sort order does not depend on the locale: it always collates as in the C locale, so all uppercase letters come before all lowercase ones. For more details, read the sort() docs by running ?sort in an R console.

Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
#> [1] "en_US.UTF-8"
sorted_eng_radix <- sort(char_vect, method = 'radix')
sorted_eng_radix
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
#> [17] "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f"
#> [33] "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
#> [49] "w" "x" "y" "z"
Sys.setlocale("LC_COLLATE", "C")
#> [1] "C"
sorted_c_radix <- sort(char_vect, method = 'radix')
sorted_c_radix
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
#> [17] "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f"
#> [33] "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
#> [49] "w" "x" "y" "z"
all(sorted_eng_radix == sorted_c_radix)
#> [1] TRUE

2. stringr::str_sort()

Use str_sort() from the stringr package. It has a locale parameter that defaults to "en", so by default the sort order does not depend on the session's LC_COLLATE setting, but you can override it if you need to.

stringr::str_sort(char_vect)
#>  [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E" "f" "F" "g" "G" "h" "H"
#> [17] "i" "I" "j" "J" "k" "K" "l" "L" "m" "M" "n" "N" "o" "O" "p" "P"
#> [33] "q" "Q" "r" "R" "s" "S" "t" "T" "u" "U" "v" "V" "w" "W" "x" "X"
#> [49] "y" "Y" "z" "Z"
stringr::str_sort(char_vect, locale = 'haw')
#>  [1] "a" "A" "e" "E" "i" "I" "o" "O" "u" "U" "b" "B" "c" "C" "d" "D"
#> [17] "f" "F" "g" "G" "h" "H" "j" "J" "k" "K" "l" "L" "m" "M" "n" "N"
#> [33] "p" "P" "q" "Q" "r" "R" "s" "S" "t" "T" "v" "V" "w" "W" "x" "X"
#> [49] "y" "Y" "z" "Z"

Wrap-up

Our package doesn’t use any other functions from stringr, so I went with option 1 to avoid adding a new dependency. I wrote a helper function called radix_sort() that simply calls base sort() with method = 'radix', and a test case to make sure it produces the same order regardless of locale in a situation where the default method does not.
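A minimal sketch of what such a helper and check might look like. The body of radix_sort() is an assumption based on the description above, not the package's actual code, and check_radix_sort() is a hypothetical name. The check uses base stopifnot() rather than any particular test framework, and it skips locales that are not installed.

```r
# Assumed shape of the helper: base sort() pinned to the radix method,
# which collates as in the C locale regardless of LC_COLLATE.
radix_sort <- function(x) {
  sort(x, method = "radix")
}

# Hypothetical check: the result must be identical under each collation.
check_radix_sort <- function(locales = c("C", "en_US.UTF-8")) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  x <- c("b", "A", "a", "B")
  for (loc in locales) {
    # Sys.setlocale() returns "" if the locale is not installed; skip it
    if (!nzchar(suppressWarnings(Sys.setlocale("LC_COLLATE", loc)))) next
    stopifnot(identical(radix_sort(x), c("A", "B", "a", "b")))
  }
  invisible(TRUE)
}

check_radix_sort()
```

Wrapping the call in one helper keeps the method argument in a single place, so a later sort() call in the package cannot silently fall back to the locale-dependent default.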

It's one of those debugging days… pic.twitter.com/PQQmJXOMit

— Kelly Sovacool (@kelly_sovacool) August 25, 2021
