codehesive.com

Data. Graphics. Stories.

Letter frequency: language versus passwords

How does letter frequency differ between the English language more broadly and people's choices in passwords?

Data on letter frequency in English is quite easy to collate. An analysis of the more than 240,000 words in the Concise Oxford Dictionary (11th edition revised, 2004) gives us a breakdown of how often each letter is used in the English language.

Dictionary letter frequency

a 8.5%
b 2.07%
c 4.54%
d 3.38%
e 11.16%
f 1.81%
g 2.47%
h 3%
i 7.54%
j 0.2%
k 1.1%
l 5.49%
m 3.01%
n 6.65%
o 7.16%
p 3.17%
q 0.2%
r 7.58%
s 5.74%
t 6.95%
u 3.63%
v 1.01%
w 1.29%
x 0.29%
y 1.78%
z 0.27%


But what about passwords? Data should be much harder to find here, since passwords are meant to be kept secret and secure. However, numerous website breaches over the years have given us access to exactly this kind of data. The largest password leak so far is the 2009 RockYou data leak. It contained 32 million passwords in total, of which 14,341,564 were unique. Analysing these passwords gives us the following breakdown of letter frequency. Note that letters have been counted regardless of case.
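A count like this can be sketched in a few lines of Python. This is a minimal illustration rather than the script used for this article: the function name `letter_frequency` and the sample entries are made up, and it assumes that only the 26 unaccented letters a to z are counted, with digits and symbols ignored.

```python
from collections import Counter
from string import ascii_lowercase


def letter_frequency(entries):
    """Return each letter's share (in %) of all a-z letters in `entries`.

    Letters are counted regardless of case. Digits, symbols and
    non-ASCII characters are skipped (an assumption of this sketch).
    """
    counts = Counter()
    for entry in entries:
        counts.update(
            ch.lower() for ch in entry if ch.isascii() and ch.isalpha()
        )
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in ascii_lowercase}
    return {letter: 100 * counts[letter] / total for letter in ascii_lowercase}


# Hypothetical sample, not taken from the leak itself.
sample = ["Password1", "ILoveYou", "abc123"]
freq = letter_frequency(sample)
```

The same function works for a dictionary word list or a password list, which keeps the two sets of percentages directly comparable.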

Password letter frequency

a 11.51%
b 2.8%
c 3.44%
d 3.28%
e 9.35%
f 1.3%
g 2.26%
h 3.05%
i 7.21%
j 1.66%
k 2.63%
l 5.85%
m 4.23%
n 6.28%
o 6.72%
p 2.14%
q 0.24%
r 5.98%
s 5.46%
t 4.49%
u 2.98%
v 1.38%
w 1.05%
x 0.63%
y 3.09%
z 1%


With letter frequencies for the English language from the Concise Oxford Dictionary and for passwords from the RockYou data leak, we can now compare the two.
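One way to make this comparison is a signed difference in percentage points for each letter. The sketch below assumes that approach; `frequency_difference` is a name invented for the example, and only a handful of letters from the two tables above are included to keep it short.

```python
def frequency_difference(passwords, dictionary):
    """Signed difference in percentage points, per letter.

    Positive: the letter is more frequent in passwords.
    Negative: the letter is more frequent in the dictionary.
    """
    return {
        letter: round(passwords[letter] - dictionary[letter], 2)
        for letter in dictionary
    }


# A few values copied from the two tables above (in %).
dictionary = {"e": 11.16, "j": 0.2, "t": 6.95, "y": 1.78}
passwords = {"e": 9.35, "j": 1.66, "t": 4.49, "y": 3.09}

diff = frequency_difference(passwords, dictionary)

# Letters sorted by how far apart the two sources are, largest gap first.
ranked = sorted(diff, key=lambda letter: abs(diff[letter]), reverse=True)
```

Keeping the sign, rather than only the size of the gap, records which source each letter is more common in.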

Letter frequency: dictionary versus passwords

a +3.01
b +0.73
c -1.10
d -0.10
e -1.81
f -0.51
g -0.21
h +0.05
i -0.33
j +1.46
k +1.53
l +0.36
m +1.22
n -0.37
o -0.44
p -1.03
q +0.04
r -1.60
s -0.28
t -2.46
u -0.65
v +0.37
w -0.24
x +0.34
y +1.31
z +0.73

Differences are password frequency minus dictionary frequency, in percentage points. Positive values mean a higher password frequency; negative values mean a higher dictionary frequency.

What does this tell us? One notable outlier is the letter Y, given how prominent it is in passwords. Y accounts for 3.09% of letters in the RockYou dataset, roughly 1.7 times its frequency in the Concise Oxford Dictionary (1.78%).

Before we look into why Y in particular differs so much between the two datasets, there's one other factor worth exploring. The comparison above is in alphabetical order, but most text input is done on QWERTY keyboards. Does re-ordering the comparison into QWERTY keyboard order give us any other insights?

Letter frequency comparison with QWERTY ordering

q +0.04
w -0.24
e -1.81
r -1.60
t -2.46
y +1.31
u -0.65
i -0.33
o -0.44
p -1.03

a +3.01
s -0.28
d -0.10
f -0.51
g -0.21
h +0.05
j +1.46
k +1.53
l +0.36

z +0.73
x +0.34
c -1.10
v +0.37
b +0.73
n -0.37
m +1.22

As before, positive values mean a higher password frequency and negative values a higher dictionary frequency.

Interestingly, this ordering might suggest that some areas of the keyboard are preferred when making passwords, particularly J, K and M. However, the letters Q, W, E, R, T and Y themselves don't stand out in the password data, with the exception, of course, of Y.
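Regrouping an alphabetical table into keyboard rows is a simple lookup. The sketch below is illustrative only: `QWERTY_ROWS` and `by_keyboard_row` are names invented for the example, and it uses a few password frequencies from the earlier table as input.

```python
# The three letter rows of a QWERTY keyboard, left to right.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def by_keyboard_row(values):
    """Regroup a {letter: value} mapping into the three QWERTY letter rows.

    Letters missing from `values` are skipped.
    """
    return [
        [(letter, values[letter]) for letter in row if letter in values]
        for row in QWERTY_ROWS
    ]


# Password frequencies (in %) for a handful of letters, from the table above.
password_freq = {"a": 11.51, "j": 1.66, "k": 2.63, "m": 4.23, "q": 0.24, "y": 3.09}

for row in by_keyboard_row(password_freq):
    print("  ".join(f"{letter} {value}%" for letter, value in row))
```

Any per-letter table, including the dictionary-versus-password differences, can be passed in the same way.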

So why is the use of Y so skewed in the password dataset? In 2014, security company Imperva published an analysis of the RockYou data leak. It identified the top 20 passwords, the first 10 of which are as follows:

Top 10 passwords
123456
12345
123456789
password
iloveyou
princess
rockyou
1234567
12345678
abc123
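A ranking like this comes from counting how often each password occurs across all accounts, duplicates included. The following is a rough sketch of that idea, not Imperva's method: `top_passwords` is a made-up helper and the input is a tiny invented list.

```python
from collections import Counter


def top_passwords(passwords, n=10):
    """Return the `n` most common passwords with their counts."""
    return Counter(passwords).most_common(n)


# Hypothetical mini-leak: one entry per account, duplicates included.
leaked = ["123456", "password", "123456", "iloveyou", "123456", "password"]
print(top_passwords(leaked, n=2))  # [('123456', 3), ('password', 2)]
```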

While the QWERTY comparison doesn't tell us much about how passwords differ from the overall frequency of letters in the English language, the prevalence of sequential numeric passwords does show that many users choose whatever password is quickest and most convenient to type.

However, since these users were signing up to the RockYou website, many clearly used its name as the basis for their password. rockyou itself appears in the top 20, and numerous variations of it also appear: rockyourself, rockyoumen, rockyouhi5, rockyou73 and so on. Other common passwords such as iloveyou, babygirl, lovely, ashley and qwerty also contain Y.

In fact, the word 'you' appears in 0.9% of all unique passwords. While iloveyou is arguably the most innocent example, many other passwords pair 'you' with far more colourful language.
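A share like this can be measured by checking each unique password for the substring. The sketch below is one way to do it, under stated assumptions: `share_containing` is an invented name, the match ignores case (the article doesn't say whether the original count did), and the sample list is made up.

```python
def share_containing(passwords, needle):
    """Percentage of unique passwords that contain `needle`.

    Uniqueness is case-sensitive, but the substring match ignores case
    (an assumption of this sketch).
    """
    unique = set(passwords)
    if not unique:
        return 0.0
    needle = needle.lower()
    hits = sum(needle in password.lower() for password in unique)
    return 100 * hits / len(unique)


# Hypothetical sample: 5 unique passwords, 3 of which contain 'you'.
sample = ["iloveyou", "rockyou", "ILoveYou", "123456", "princess", "rockyou"]
print(share_containing(sample, "you"))  # 60.0
```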

Conclusion

The RockYou password dataset is an interesting one for comparison because password requirements were non-existent: users could enter literally any word or combination of characters and the system would accept it. From this we can see a clear tendency for 'you' to appear in passwords, partly attributable to the fact that users had the RockYou service in mind when setting a password. iloveyou and qwerty are also notoriously common passwords, which helped push Y to such a high frequency in the RockYou dataset.

Would a leak of this size today from Facebook, Instagram or Apple give us a similar insight? It seems unlikely, given the stricter password requirements imposed by most websites and applications these days, along with the growth of password managers and browser-level password generation, all designed to improve password security.

Questions or comments?

Email me at jamesoffer@gmail.com or follow me on Twitter.


Concise Oxford Dictionary and Imperva data is copyrighted respectively to those parties. Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 4.0 International licence.