Using Regular Expressions to solve Wordle

Wordle is a word-guessing game that became popular in January 2022. The player guesses words and receives feedback on the letters and their placement. A correct letter in the correct place is green, a correct letter in the wrong place is yellow, and an incorrect letter is grey and removed from the keyboard. Although solving it this way is terribly poor sportsmanship, the game provides an excellent playground for learning some basics of regular expressions.

Wordle Info

Let’s start by getting a dictionary wordlist. Daniel Miessler’s SecLists provides an adequate English dictionary for us to use. The file contains a list of 354,297 English-language words, one on each line.

Valid words

Now to prepare our wordlist by constraining our dictionary to five letter words only.

^[a-zA-Z]{5}$

In the above pattern, ^ matches the start of a line and $ matches the end of a line. The pattern ^$ would match a blank line. We want our matches to anchor to the start and end of a line so that “digit” matches but “digital” doesn’t.
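To see the anchors at work, here is a minimal Python sketch using the standard re module. The two test words come from the paragraph above; the variable names are only for illustration.

```python
import re

# Without anchors the pattern can match anywhere inside a longer word.
unanchored = re.compile(r"[a-zA-Z]{5}")
# With ^ and $ the whole line must be exactly five letters.
anchored = re.compile(r"^[a-zA-Z]{5}$")

words = ["digit", "digital"]
unanchored_hits = [w for w in words if unanchored.search(w)]
anchored_hits = [w for w in words if anchored.search(w)]

print(unanchored_hits)  # ['digit', 'digital']
print(anchored_hits)    # ['digit']
```

Without the anchors, “digital” slips through because its first five letters satisfy the pattern on their own.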

Square brackets define a character class, which in our example matches any character between a-z (lowercase) and A-Z (uppercase). You could include numbers by adding 0-9, or symbols if necessary. You can also specify individual characters, so b[iu]g would match “big” and “bug”.
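A quick way to check what a character class accepts is to run it over a handful of near-identical words. This is a small sketch in Python; the word list is made up for the purpose.

```python
import re

# [iu] accepts exactly one character, which must be "i" or "u".
pattern = re.compile(r"^b[iu]g$")

words = ["bag", "beg", "big", "bog", "bug"]
matches = [w for w in words if pattern.search(w)]

print(matches)  # ['big', 'bug']
```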

The curly braces specify how many times the preceding element must repeat. We have specified exactly 5, but in a different scenario you could use {3,5} to match three- to five-letter words.
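The difference between an exact count and a range can be sketched as follows. The sample words are invented, and Python’s re module is used here simply because it accepts the same {n} and {n,m} syntax.

```python
import re

exactly_five = re.compile(r"^[a-zA-Z]{5}$")
three_to_five = re.compile(r"^[a-zA-Z]{3,5}$")

words = ["at", "cat", "cart", "carts", "carted"]

five_only = [w for w in words if exactly_five.search(w)]
short_words = [w for w in words if three_to_five.search(w)]

print(five_only)    # ['carts']
print(short_words)  # ['cat', 'cart', 'carts']
```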

So with ^[a-zA-Z]{5}$ we are now left with five-character words (without symbols), which should all be valid for gameplay.

egrep -c '^[a-zA-Z]{5}$' lang-english.txt

Running the above command with the -c option gives us a count of 14,836 five-letter words from our original dictionary of 354,297.
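For readers who would rather follow along without a shell, the counting step can be approximated in Python. This is only a sketch: count_matches is a hypothetical helper name, and the sample lines are a made-up stand-in for the real wordlist.

```python
import re

FIVE_LETTERS = re.compile(r"^[a-zA-Z]{5}$")

def count_matches(lines, pattern):
    """Count the lines that match pattern, roughly what egrep -c reports."""
    return sum(1 for line in lines if pattern.search(line.rstrip("\n")))

# Made-up sample standing in for the real wordlist file.
sample = ["it's\n", "hello\n", "Paris\n", "digital\n", "abc12\n", "crane\n", "\n"]

count = count_matches(sample, FIVE_LETTERS)
print(count)  # 3
```

To run it against the real file, you could pass an open file handle in place of the sample list, since iterating over a file yields one line at a time.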

Starter word

Let’s guess a word to get the game started and begin giving us some clues. I will use “RAISE” in this example.

Guess 1

From this guess we know that “R” is a correct letter in the wrong place, “A”, “S” and “E” are incorrect letters, and “I” is a correct letter in the correct place. We also know that the first letter is not an “R”.

We are going to build a pipeline that takes our valid words, filters out words with incorrect letters, and filters the list down further using the knowledge we gain during gameplay.

Excluding incorrect letters

We used a character class to match letters a-z before. They can also work in reverse. Using ^ at the beginning of a character class declares that the match should be anything but the given characters, so b[^u]g would match “bag”, “beg”, “big” etc. but not “bug”.

^[^ase]{5}$

The above regular expression matches five letter words without the letters “A”, “S” and “E”. Let’s pipe our valid words to this to provide a filter.

egrep '^[a-zA-Z]{5}$' lang-english.txt | egrep -c '^[^ase]{5}$'

This brings our wordlist down from 14,836 to 2,097 words.
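The negated class can be tried out on a small sample in Python. The words below are invented, and “Aunty” is included on purpose: the pattern is case-sensitive, so an uppercase “A” is not excluded by [^ase]. Whether the real wordlist contains capitalised entries is an assumption; if it does, a case-insensitive match (re.IGNORECASE here, or the -i option to egrep) would be one way to handle them.

```python
import re

words = ["raise", "trick", "brick", "crimp", "grind", "stamp", "lemon", "Aunty"]

# Every one of the five characters must be something other than a, s or e.
no_ase = re.compile(r"^[^ase]{5}$")
survivors = [w for w in words if no_ase.search(w)]
print(survivors)  # ['trick', 'brick', 'crimp', 'grind', 'Aunty']

# Case-insensitive variant: an uppercase "A" now counts as an excluded letter.
no_ase_any_case = re.compile(r"^[^ase]{5}$", re.IGNORECASE)
survivors_any_case = [w for w in words if no_ase_any_case.search(w)]
print(survivors_any_case)  # ['trick', 'brick', 'crimp', 'grind']
```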

Incorporating gameplay knowledge

We learnt earlier that “R” was not the first letter, and the third letter was “I”.

^[^r].i..$

This regular expression starts with a ^ to anchor to the start of the word, then a character class to match words starting with anything but an “R”, followed by any character (.). The third letter is “I”, followed by two more of any character, and the $ anchors the match to the end of the line.

We can add this third element to our pipeline with the below command.

egrep '^[a-zA-Z]{5}$' lang-english.txt | egrep '^[^ase]{5}$' | egrep -c '^[^r].i..$'

This provides us a new count of 241 possible words.
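Here is a short Python sketch of the positional pattern on its own, run over a made-up list of words. Note that it only constrains positions: “think” and “build” pass even though they contain no “R” at all.

```python
import re

# Not "r" first, any character second, "i" third, then any two characters.
positions = re.compile(r"^[^r].i..$")

words = ["trick", "brick", "grind", "rhino", "think", "build"]
matches = [w for w in words if positions.search(w)]

print(matches)  # ['trick', 'brick', 'grind', 'think', 'build']
```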

Second guess

Guess 2

Guessing the word “TRICK” lets us know “R” and “I” are correct and in the right place. “C” is correct but in the wrong place, so the fourth character is not a “C”. “T” and “K” are incorrect.

We can now update our filter of incorrect letters.

^[^asetk]{5}$

We can also update our gameplay knowledge.

^[^r]ri[^c].$

Breaking this down, we want the first character to be anything but an “R”. The second character is “R” and the third character is “I”. The fourth character is anything but “C” and the last character is anything.

Our pipeline can be updated as below.

egrep '^[a-zA-Z]{5}$' lang-english.txt | egrep '^[^asetk]{5}$' | egrep -c '^[^r]ri[^c].$'

This brings our candidates down to a manageable 32 words. Removing the -c option (which provides a count), the output of the above command is below.

brill
bring
briny
brizz
cribo
crimp
drill
drily
drinn
frill
frizz
griff
grill
grimm
grimp
grimy
grind
gripy
iring
oribi
orion
pridy
prill
primi
primo
primp
primy
prion
prior
privy
wring
frigg
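The three-stage pipeline can also be sketched in Python, with each egrep stage becoming one compiled pattern applied in turn. The run_pipeline helper and the sample words are hypothetical; only the three patterns come from the commands above.

```python
import re

# The three stages of the pipeline after the second guess, in order.
STAGES = [
    re.compile(r"^[a-zA-Z]{5}$"),  # five-letter words only
    re.compile(r"^[^asetk]{5}$"),  # none of the grey letters
    re.compile(r"^[^r]ri[^c].$"),  # known letter positions
]

def run_pipeline(words, stages):
    """Apply each pattern in turn, keeping only the words that match."""
    for pattern in stages:
        words = [w for w in words if pattern.search(w)]
    return words

sample = ["brill", "crimp", "trick", "raise", "prism",
          "grind", "digital", "drily", "bunny"]
candidates = run_pipeline(sample, STAGES)

print(candidates)  # ['brill', 'crimp', 'grind', 'drily']
```

Keeping the stages in a list mirrors the shell pipeline: after a new guess, you swap in the updated patterns and leave the rest alone.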

Take another guess

Reviewing our finalist words, some seem more likely than others. Let’s take another guess.

Guess 3

I struck lucky!

Review

We started with a wordlist of 354,297 words. We whittled them down to 14,836 valid for gameplay by using ^[a-zA-Z]{5}$ to keep five-letter words only.

We then excluded words containing letters we learned were incorrect by piping into a further filter built on a negated character class: ^[^asetk]{5}$.

Finally we used knowledge of correct and incorrect letter placement to bring the list of candidates down further using ^[^r]ri[^c].$.

This took our original dictionary of 354,297 words down to 32 possible candidates.

Getting the correct answer on the third attempt was lucky; it doesn’t always work out like that! Had it been incorrect, we could have updated our pipeline to narrow the list down even further, which would likely take it to single digits.
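One further filter is available even without another guess: the yellow “C” from “TRICK” must appear somewhere in the answer, which the positional pattern does not enforce. The sketch below adds that as an extra stage over the 32 candidates listed earlier. This stage is an assumption about how the pipeline might be extended, not something the commands above do; in the shell it would amount to one more egrep for the letter c.

```python
import re

# The 32 candidates produced by the pipeline after the second guess.
candidates = [
    "brill", "bring", "briny", "brizz", "cribo", "crimp", "drill", "drily",
    "drinn", "frill", "frizz", "griff", "grill", "grimm", "grimp", "grimy",
    "grind", "gripy", "iring", "oribi", "orion", "pridy", "prill", "primi",
    "primo", "primp", "primy", "prion", "prior", "privy", "wring", "frigg",
]

# A yellow letter must appear somewhere in the word.
has_c = re.compile(r"c")
narrowed = [w for w in candidates if has_c.search(w)]

print(narrowed)  # ['cribo', 'crimp']
```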

Although it’s poor sportsmanship, this is good exercise for regular expressions and demonstrates their power!