- What: Research on LLM-generated passwords and their predictability
- Impact: Highlights risks of using AI-generated passwords for security

In February 2026, researchers at Irregular published a detailed post about LLM-generated passwords. The post shows that passwords generated by LLMs follow notable patterns and are generally highly predictable. The root cause is fundamental: LLMs are optimized to predict probable outputs, which is the exact opposite of what secure password generation demands.

That observation raised a natural follow-on question: if LLMs leave statistical fingerprints in the passwords they generate, can those fingerprints be detected and attributed? Can we look at a password found in a leaked dataset and say which model generated it? More importantly, can we measure how widely those LLM passwords are used in the wild? That is what this research set out to answer.

Extending the perimeter

Irregular’s article pointed out that LLM-generated passwords are biased. They used the flagship models from OpenAI, Anthropic, and Google, and a sample size of roughly 50 passwords. We extended the scope of the analysis to 40 models from 11 providers, including both closed-source (OpenAI GPT, Anthropic Claude, etc.) and open-source (Qwen, DeepSeek, etc.) models. We also increased the sample size to 200 passwords per model to improve the statistical accuracy of the analysis, for a total dataset of 8,000 passwords.

An initial analysis of this data confirmed Irregular’s original findings: the generated passwords are biased. The bias is inconsistent across models, with some producing very few distinct passwords while others do not:

- Anthropic’s models show poor uniqueness: Claude Opus 4.6 is the worst, with only 35% unique passwords.
- The open-source Qwen, Llama, and Gemma models show between 50% and 60% uniqueness.
- The GPT-5 family generates only unique passwords.

The uniqueness of generated passwords does not guarantee their security.
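
The uniqueness figures above can be read as the share of distinct values among the passwords sampled from one model. The post does not spell out the exact definition, so the sketch below assumes "distinct passwords divided by total passwords"; the function names and the sample passwords are invented for illustration.

```python
from collections import Counter


def uniqueness(passwords):
    """Share of distinct values among the generated passwords (0.0 to 1.0)."""
    if not passwords:
        raise ValueError("need at least one password")
    return len(set(passwords)) / len(passwords)


def most_repeated(passwords, n=3):
    """The n passwords that were generated most often, with their counts."""
    return Counter(passwords).most_common(n)


# Invented sample, standing in for the 200 passwords drawn from one model.
samples = ["Gx#8dL2p", "Gx#8dL2p", "Gx#8dL2p", "vQ7!mZ4k"]

print(uniqueness(samples))        # 2 distinct values out of 4 samples
print(most_repeated(samples, 1))  # the single most repeated password
```

A ratio of 1.0 only means the model never repeated itself within the sample; it says nothing about how the passwords are structured.
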
In practice, as the original article shows, generated passwords tend to follow a similar pattern and reuse common substrings. In fact, nearly all models repeat the same “upper, digit, symbol, lower” pattern, with some slight variations:

- Anthropic models lock position 0 firmly: claude-opus always starts lowercase (100%), while claude-haiku and claude-sonnet-4.6 always start uppercase (100%).
- Llama models are 99–100% uppercase at position 0.
- GPT-4.1-mini is 92% uppercase at position 0.

Likewise, all models exhibit a strong statistical deviation from a random password distribution. This is best illustrated by the most common substrings per model (in parentheses: the difference factor compared to a random distribution):

- gpt-5.2 generated the 7! bigram in 52% of passwords (x4.5k) and the vQ7!mZ substring in 6% of them (x41B).
- Mistral-medium-3.1 generated the x7#pL9 substring in 65% of passwords (x448B).
- Llama-3.3-70b-instruct generated the 8d bigram in all passwords and the Gx#8dL substring in 96% of them, the worst score of all models.

Interestingly, the analysis of common substrings shows that some of them are shared across multiple providers:

- The simple L2 bigram is found in the passwords of 10 out of the 11 providers, with an average probability of appearing of 27% (x114).
- The longer #kL9 substring is found in the passwords of 4 providers (mistralai, deepseek, qwen, and openai), with an average probability of appearing of 13% (x954M).

Fighting robots with a rusty sword

The previous results suggested that LLM-generated passwords could be modeled using Markov chains. A Markov chain is a mathematical model that describes a sequence of events where the probability of each event depends only on the state reached in the previous one. Markov chains were first introduced by Russian mathematician Andrei Markov in 1906, more than a century before LLMs. Since Markov's original work, the model has found applications across a remarkable range of fields, including text generation.
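
To make the definition above concrete, here is a minimal sketch of a character-level Markov chain that learns transition probabilities from a list of passwords and scores how well a new password fits them. It is an illustration under assumptions, not the chains built for this research: the first-order design, the add-one smoothing, the sentinel start and end states, the length-normalized score, and the names `CharMarkovChain` and `attribute` are all choices made for this example.

```python
import math
import string
from collections import Counter, defaultdict

START, END = "\x02", "\x03"  # sentinel states, assumed absent from passwords
# Assumed alphabet: the 94 printable ASCII characters (no space).
ALPHABET = string.ascii_letters + string.digits + string.punctuation


class CharMarkovChain:
    """First-order chain: P(next char | current char), add-one smoothed."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, passwords):
        for pw in passwords:
            seq = START + pw + END
            for cur, nxt in zip(seq, seq[1:]):
                self.counts[cur][nxt] += 1

    def prob(self, cur, nxt):
        # Smoothing over the alphabet plus END keeps unseen transitions
        # at a small non-zero probability instead of zero.
        total = sum(self.counts[cur].values())
        return (self.counts[cur][nxt] + 1) / (total + len(ALPHABET) + 1)

    def score(self, pw):
        """Average log-probability per transition (higher fits better)."""
        seq = START + pw + END
        logp = sum(math.log(self.prob(c, n)) for c, n in zip(seq, seq[1:]))
        return logp / (len(seq) - 1)


def attribute(pw, chains):
    """Name of the chain that gives the password the highest score."""
    return max(chains, key=lambda name: chains[name].score(pw))


# Toy training sets; the second one is invented for contrast.
toy = CharMarkovChain()
toy.train(["PASS", "P@SS", "PA$$"])
other = CharMarkovChain()
other.train(["hunter2", "letmein"])

print(toy.prob("P", "A") > toy.prob("P", "z"))      # seen vs unseen transition
print(attribute("PA$S", {"toy": toy, "other": other}))
```

Training one such chain per model, per provider, or on a whole dataset only changes which passwords are passed to `train`; attribution then reduces to comparing scores across chains.
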
When used as a next-character or next-word prediction engine, Markov chains can be seen as the ancestor of LLMs. For password prediction or recognition, a Markov chain can be as simple as:

- One state for each letter of the alphabet
- Transitions weighted by the probability of encountering a given character after the current state’s character

Figure: a Markov chain trained with the passwords PASS, P@SS, PA$$, etc.

Without going too far into the technical details, we used the sample of LLM-generated passwords to build several Markov chains:

- One chain per selected model.
- One chain per model family or provider.
- One chain that aggregates the whole LLM password dataset.

To verify that this approach is valid and that the chains correctly capture the statistical bias of the LLMs when generating passwords, we scored a second dataset of LLM-generated passwords. We compared the results with a random baseline and with the scores of a dataset of generic passwords. What we found is that:

- The chains identify the right model in 55% of cases and the correct provider in 65% of cases.
- The generic chain trained on the whole dataset was,...