This is my approach to solving/cheating at Wordle. The version for Russian Wordle is here.

Package(s)

First, get the {tidyverse} loaded:

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

We won’t need anything else that’s not already in base R.

Words

We’ll need a dictionary. I wanted word frequencies too, since we know that the game has about 2,000 words in total and I assume they have been selected so it’s possible to guess them in six moves. After being dissatisfied with dictionaries built into CRAN packages, my wander around the web led me to Wiktionary’s page on frequency lists. I (arbitrarily) opted for a dataset developed by analysing an open subtitle corpus.

Let’s grab it directly from the GitHub repo:

words <-
  read.csv(
    "https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/en/en_full.txt",
    sep = " ",
    header = FALSE
  )
names(words) <- c("word", "freq")

And take a look:

head(words, 10)
##    word     freq
## 1   you 28787591
## 2     i 27086011
## 3   the 22761659
## 4    to 17099834
## 5     a 14484562
## 6    's 14291013
## 7    it 13631703
## 8   and 10572938
## 9  that 10203742
## 10   't  9628970

So the word “you” appears 28.8 million times in the corpus.

Next, filter to only words of length 5 and tidy a little to remove words with apostrophes and other non-letter characters.

wordles <- words %>%
  filter(str_length(word) == 5) %>%
  mutate(word = str_to_lower(word)) %>%
  filter(str_detect(word, "^[a-z]*$")) %>%
  arrange(desc(freq))

(The word list was already reverse sorted by frequency, but I’ve sorted again in case I slot in another word list later.)
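As an aside, the length check and the letters-only check could be folded into a single anchored regex. Here is a minimal base R sketch on a made-up vector (`toy_words` is invented for illustration and is not part of the real data):

```r
# Toy input, invented for illustration
toy_words <- c("there", "don't", "gonna", "Right", "abc12", "hello!", "you")

# Lower-case first, then keep only strings made of exactly five letters a-z
lowered <- tolower(toy_words)
five_letter <- lowered[grepl("^[a-z]{5}$", lowered)]
five_letter
```

The `{5}` quantifier does the job of `str_length(word) == 5`, and the `^` and `$` anchors make sure nothing else sneaks in.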

head(wordles, 10)
##     word    freq
## 1  there 3148528
## 2  right 2576821
## 3  about 2487348
## 4  think 1839473
## 5  going 1520767
## 6  would 1340070
## 7  where 1322226
## 8  gonna 1188190
## 9  could 1111837
## 10 never  930831

That looks better, though I’m not sure all those words are gonna be a wordle. In retrospect a subtitle corpus wasn’t ideal. Anyway, onwards…

Helpers

Tidyverse and base R already have all the functions I need to filter this word list as we learn more about what letters are and are not in a wordle. I just want some helper functions to make them easier to use.

All words that have particular letters somewhere

all_lets_somewhere <- Vectorize(function(str, lets) {
  all(str_detect(str, strsplit(lets, "")[[1]]))
}, vectorize.args = "str")

Here’s how it works – test whether “e” is in each word:

all_lets_somewhere(c("lovely", "weather"), "e")
##  lovely weather 
##    TRUE    TRUE

Test whether “e” and “w” are in each word:

all_lets_somewhere(c("lovely", "weather"), "ew")
##  lovely weather 
##   FALSE    TRUE

All words that don’t have particular letters anywhere

no_lets_anywhere <- Vectorize(function(str, lets) {
  !any(str_detect(str, strsplit(lets, "")[[1]]))
}, vectorize.args = "str")

Here’s how it works – test whether neither “z” nor “b” is in each word:

no_lets_anywhere(c("zebra", "giraffe"), "zb")
##   zebra giraffe 
##   FALSE    TRUE

Filter/keep words with letters in particular positions

Wordle’s feedback can tell us that a letter is there somewhere but not in the position we guessed. str_detect already does what we need easily enough: a “.” in a regex matches any single character, so a pattern like “..in.” pins letters to particular positions, and negating the match drops every word that fits it. This is a wrapper for ease of piping:

ditch_pattern <- function(data, match) {
  data %>%
    filter(!str_detect(word, match))
}

Here’s how to use it. First, here’s the top of the wordles data:

wordles %>%
  head()
##    word    freq
## 1 there 3148528
## 2 right 2576821
## 3 about 2487348
## 4 think 1839473
## 5 going 1520767
## 6 would 1340070

Let’s remove the two words with “in” as their third and fourth characters:

wordles %>% 
  head() %>%
  ditch_pattern("..in.")
##    word    freq
## 1 there 3148528
## 2 right 2576821
## 3 about 2487348
## 4 would 1340070

keep_pattern works similarly:

keep_pattern <- function(data, match) {
  data %>%
    filter(str_detect(word, match))
}

Here’s an example:

wordles %>% 
  head() %>%
  keep_pattern("..in.")
##    word    freq
## 1 think 1839473
## 2 going 1520767

Final filter helpers

These functions just make it easier to use the functions above in a pipe:

ditch_letters <- function(data, match) {
  data %>%
    filter(no_lets_anywhere(word, match))
}

keep_letters <- function(data, match) {
  data %>%
    filter(all_lets_somewhere(word, match))
}
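For comparison, here is a rough base R sketch that rolls the position pattern, the must-have letters, and the banned letters into one call on a plain character vector. The name `filter_words` and its arguments are my own invention for illustration; it is not a replacement for the helpers above, just another way to see what they do together.

```r
# Hypothetical all-in-one filter working on a character vector of words
filter_words <- function(words, pattern = ".....", must = "", never = "") {
  chars      <- strsplit(words, "")          # letters of each word
  must_lets  <- strsplit(must, "")[[1]]      # letters that must appear
  never_lets <- strsplit(never, "")[[1]]     # letters that must not appear

  ok_pattern <- grepl(paste0("^", pattern, "$"), words)
  ok_must    <- vapply(chars, function(w) all(must_lets %in% w), logical(1))
  ok_never   <- vapply(chars, function(w) !any(never_lets %in% w), logical(1))

  words[ok_pattern & ok_must & ok_never]
}

toy <- c("slump", "flush", "blood", "plugs")
kept <- filter_words(toy, pattern = ".lu..", must = "s", never = "fh")
```

Because it compares letters with `%in%` rather than treating them as regexes, the `must` and `never` arguments are matched literally.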

Test drive

Let’s give it a go. The example wordle I’m using here is from a few days ago, so hopefully no spoilers.

To get started, let’s use the most frequent word that has a handful of vowels, “a”, “i”, and “e”:

wordles %>%
  keep_letters("aie") %>%
  head()
##    word   freq
## 1 alive 122833
## 2 raise  38449
## 3 ideas  32182
## 4 alice  23311
## 5 image  21669
## 6 annie  20794

My guess, “alive”, gives one hit!

So we want to keep all words with “l” as the second character and ditch the other four letters:

wordles %>%
  keep_pattern(".l...") %>%
  ditch_letters("aive") %>%
  head()
##    word   freq
## 1 blood 147913
## 2 floor  66456
## 3 block  25785
## 4 clock  23907
## 5 blows  16823
## 6 glory  16221

I’ll just go for the most frequent word in the remaining data, “blood”. (This may not always be the best idea, e.g., the letter “o” occurs twice in “blood”; it might be better to choose a different word with no duplicates and high frequency letters.)
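If I did want to avoid guesses with repeated letters, a check along these lines might do. It is a base R sketch, and `no_repeats` is a hypothetical helper name rather than something used elsewhere in this post.

```r
# Hypothetical helper: TRUE for words in which no letter occurs twice
no_repeats <- function(words) {
  vapply(strsplit(words, ""),
         function(chars) anyDuplicated(chars) == 0,
         logical(1))
}

candidates <- c("blood", "floor", "block", "clock", "blows", "glory")
distinct_only <- candidates[no_repeats(candidates)]
```

On the six candidates above this leaves “block”, “blows”, and “glory”.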

No new matches, but we can ditch three more letters so all is not lost.

wordles %>%
  keep_pattern(".l...") %>%
  ditch_letters("aivebod") %>%
  head()
##    word freq
## 1 flush 5260
## 2 flynn 5032
## 3 pluck 2095
## 4 sluts 1310
## 5 slugs 1290
## 6 plugs 1188

Let’s go for the most frequent word again, “flush”:

More matches and letters to ditch to help narrow in on the answer. I have also removed words ending in “s”, as it seems unlikely wordles would just be plurals, and the top matches ending in “s” weren’t singular nouns (“dress” would be an example that is).

wordles %>%
  keep_pattern(".lu..") %>%
  ditch_letters("aivebodfh") %>%
  keep_letters("s") %>%
  ditch_pattern("...s.") %>%
  ditch_pattern("....s") %>%
  head()
##    word freq
## 1 slump  705
## 2 slurp  397
## 3 slung  222
## 4 slunk   50
## 5 slunt   11
## 6 slurm    8

Again, let’s go for the most frequent match, “slump”:

And we’re done.

Revisiting the word list: letter frequency

Let’s see whether frequency analysis can help us be a little cleverer in searching.

First, functions to count letter frequency.

# Reshape to one row per word and letter position
long_wordles <- function(wordles) {
  # Split each word into characters: one row per word, one column per position
  wordle_chars <- sapply(str_split(wordles$word, ""), c) %>%
    t()

  colnames(wordle_chars) <- paste("c", 1:5, sep = "_")

  bind_cols(wordles, wordle_chars %>% as_tibble()) %>%
    pivot_longer(cols = c_1:c_5,
                 names_prefix = "c_",
                 names_to = "pos",
                 values_to = "let")
}

# Count how many words contain each letter at least once
letter_freqs <- function(wordles) {
  n_words <- nrow(wordles)

  wordles %>%
    long_wordles() %>%
    # Collapse to one row per word and distinct letter
    group_by(word, freq, let) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    # Number of words containing each letter
    group_by(let) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    mutate(perc = 100*n/n_words) %>%
    arrange(desc(n))
}
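As a cross-check on what is being counted (the number of words containing each letter at least once, not the total number of occurrences), here is a rough base R sketch of the same idea. `letter_word_counts` is a hypothetical name and the three words are just a toy input.

```r
# Hypothetical helper: for each letter, count the words that contain it
letter_word_counts <- function(words) {
  per_word <- lapply(strsplit(words, ""), unique)  # distinct letters per word
  counts <- table(unlist(per_word))
  sort(counts, decreasing = TRUE)
}

toy_counts <- letter_word_counts(c("there", "right", "about"))
toy_perc <- 100 * toy_counts / 3   # share of the three toy words
```

Here “t” appears in all three words, while “e” counts only once for “there” even though it occurs twice in it.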

Get letter frequencies for the 1,000 most frequent words:

wordles %>%
  head(1000) %>%
  letter_freqs()
## `summarise()` has grouped output by 'word', 'freq'. You can override using the
## `.groups` argument.
## # A tibble: 26 x 3
##    let       n  perc
##    <chr> <int> <dbl>
##  1 e       505  50.5
##  2 a       413  41.3
##  3 s       404  40.4
##  4 r       343  34.3
##  5 o       288  28.8
##  6 t       287  28.7
##  7 i       279  27.9
##  8 l       258  25.8
##  9 n       256  25.6
## 10 h       184  18.4
## # ... with 16 more rows

The most frequent letters are “e”, “a”, “s”, “r”, and “o”. Let’s find a word with those letters:

wordles %>%
  keep_letters("easro") %>%
  head()
##    word freq
## 1 arose  876
## 2 rosae   12
## 3 reaso   10
## 4 sorae    6
## 5 areso    6
## 6 soare    6

Give it a go:

The “r” is in the right place and the “a” is in the word but not first, so we can keep those and ditch “o”, “s”, and “e”:

wordles %>%
  keep_pattern(".r...") %>%
  keep_letters("a") %>%
  ditch_pattern("a....") %>%
  ditch_letters("ose") %>%
  head()
##    word   freq
## 1 crazy 188767
## 2 train  70974
## 3 frank  68472
## 4 brain  59464
## 5 grand  45183
## 6 track  40578

Now, rather than just take the most frequent word from this list, let’s count letter frequencies again.

wordles %>%
  keep_pattern(".r...") %>%
  keep_letters("a") %>%
  ditch_pattern("a....") %>%
  ditch_letters("ose") %>%
  letter_freqs()
## `summarise()` has grouped output by 'word', 'freq'. You can override using the
## `.groups` argument.
## # A tibble: 23 x 3
##    let       n  perc
##    <chr> <int> <dbl>
##  1 a       839 100  
##  2 r       839 100  
##  3 i       243  29.0
##  4 t       211  25.1
##  5 b       155  18.5
##  6 g       153  18.2
##  7 n       153  18.2
##  8 d       138  16.4
##  9 u       138  16.4
## 10 k       128  15.3
## # ... with 13 more rows

Select a word with the two most frequent letters we don’t already have:

wordles %>%
  keep_pattern(".r...") %>%
  keep_letters("a") %>%
  ditch_pattern("a....") %>%
  ditch_letters("ose") %>%
  keep_letters("it") %>%
  head()
##    word  freq
## 1 train 70974
## 2 trial 34375
## 3 trail 14033
## 4 triad  1539
## 5 trina  1367
## 6 trait  1029
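Picking the top two letters by eye could be generalised by scoring each remaining word on the letters it would newly test. The following is a base R sketch under my own assumptions: `score_words` is a hypothetical helper, the score is simply the sum of word counts for each distinct letter not already known, and the five words are a toy subset rather than the full filtered list.

```r
# Hypothetical scorer: sum, over a word's distinct not-yet-known letters,
# of the number of candidate words containing that letter
score_words <- function(words, known = "") {
  per_word <- lapply(strsplit(words, ""), unique)
  counts <- table(unlist(per_word))
  known_lets <- strsplit(known, "")[[1]]

  scores <- vapply(per_word, function(lets) {
    new_lets <- setdiff(lets, known_lets)  # letters that would tell us something
    sum(counts[new_lets])
  }, numeric(1))

  names(scores) <- words
  sort(scores, decreasing = TRUE)
}

toy_scores <- score_words(c("train", "trial", "trait", "crazy", "brain"),
                          known = "ar")
```

On this toy subset “train” comes out on top, which at least agrees with the by-hand choice.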

And so on…

It would be fun to automate this process a little more.
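One possible starting point, sketched in base R: compute the feedback a guess would get against a candidate answer, then keep only the candidates that would have produced the feedback actually seen. Both `wordle_feedback` and `consistent_with` are hypothetical names, the "G"/"Y"/"-" encoding is my own shorthand for green, yellow, and grey, and the handling of repeated letters reflects my understanding of the game's rules rather than anything official.

```r
# Hypothetical scorer: "G" = right place, "Y" = in word elsewhere, "-" = absent
wordle_feedback <- function(guess, answer) {
  g <- strsplit(guess, "")[[1]]
  a <- strsplit(answer, "")[[1]]
  out <- rep("-", length(g))

  green <- g == a
  out[green] <- "G"

  # Answer letters not already used up by a green match
  remaining <- a[!green]
  for (i in which(!green)) {
    hit <- match(g[i], remaining)
    if (!is.na(hit)) {
      out[i] <- "Y"
      remaining <- remaining[-hit]  # each answer letter can be matched once
    }
  }
  paste(out, collapse = "")
}

# Keep the candidate words that would have given the observed feedback
consistent_with <- function(words, guess, feedback) {
  ok <- vapply(words,
               function(w) wordle_feedback(guess, w) == feedback,
               logical(1))
  words[ok]
}

fb <- wordle_feedback("flush", "slump")
left <- consistent_with(c("slump", "blood", "flush", "plumb"), "flush", fb)
```

With “slump” standing in as the answer, guessing “flush” gives "-GGY-", and “slump” is the only one of the four toy candidates consistent with that. Looping this over the frequency-sorted word list would replace the hand-written keep_pattern and ditch_letters calls.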