- What: Explores using token efficiency for more effective secrets scanning
- Impact: May improve detection of sensitive data in code repositories
Rare Not Random

Using Token Efficiency for Secrets Scanning

Zachary Rice, Feb 20, 2026

In “Regex is (almost) All You Need” we learned that a combination of regular expression patterns, entropy, and rule-based filters is an effective way to detect candidate secrets. Regex casts a wide net to identify candidates. Entropy is the primary filter on the captured candidates, and additional filters, like the presence of commonly used English words or known “safe” files like go.sum, are applied last. Entropy does a decent job at filtering false positives but leaves a lot to be desired, especially when evaluating generic secrets. Could there be something better than entropy for that primary post-regex-capture filter? This post examines whether Byte-Pair Encoding can serve as a more effective alternative to entropy for secrets scanning.

[Image: a rare four-leaf clover]

I want to thank GitHub user “DmitriyAlergant” for submitting this idea in an issue on the Gitleaks repo.

Entropy

What the heck is entropy? According to John von Neumann when talking with Claude Shannon, “no one really knows”, but Wikipedia does. Shannon entropy measures the average unpredictability of a string, aka how much information each character carries. When characters are uniformly distributed (many distinct characters, no clear pattern), each one is harder to predict, so entropy is high. When a few characters dominate, the next character is easy to guess, so entropy is low. In practice that means something like aaaaaa111111 scores low, while something like xA9fP2qL0sRw scores high.

With regard to secrets detection, this makes entropy a decent first pass at spotting "random looking" strings (candidate secrets). But do we really want randomness to be our primary filter for secrets detection?
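To make the entropy scores above concrete, here is a minimal sketch of character-level Shannon entropy using only the Python standard library. The function name `shannon_entropy` is an illustrative assumption and this is not necessarily how Gitleaks or any other scanner computes its score; it only shows why the two example strings land at opposite ends of the scale.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Return the Shannon entropy of s in bits per character.

    Hypothetical helper for illustration; not taken from any scanner.
    """
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    # Sum -p * log2(p) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for candidate in ("aaaaaa111111", "xA9fP2qL0sRw"):
    print(f"{candidate}: {shannon_entropy(candidate):.2f} bits/char")
```

The first string uses only two characters in equal proportion, so it scores 1.00 bit per character. The second uses twelve distinct characters once each, so it scores log2(12), roughly 3.58. Note that the score depends only on character frequencies, not on whether the string resembles anything a human would write.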
Excuse the “it’s not X, it’s Y” LLM trope here, but secrets aren’t just random, they’re statistically unusual compared to the natural distribution of human-written text. Put more plainly, secrets are rare. A base64-encoded string, a UUID, an actual secret, and a weird-looking dependency string can have similar entropy scores despite being fundamentally different in how often they appear in the real world. Entropy can’t tell the difference between “this looks random” and “this almost never shows up in English text or source code.” Instead of measuring randomness with entropy, what if we tried to measure how out-of-vocabulary, or how non-natural-language, a string is?

Byte-Pair Encoding

Okay, so how do we detect how non-natural-language-looking or rare a string is? Byte-Pair Encoding (BPE), of course! BPE tokenization implicitly reflects the frequency distribution of the text it was trained on. Common words and subwords get merged into long tokens, while rare or unnatural strings get broken into many short tokens. Here are a couple of examples using the cl100k_base tokenizer [1]:

- “Hello World” → [15339, 1917]
- “lookingatcomputer” → [20986, 266, 44211]
- “kj2h3f2fuaafewa” → [93797, 17, 71, 18, 69, 17, 69, 4381, 2642, 365, 64]

Because BPE builds its vocabulary by repeatedly merging the most common character pairs in the training data, its tokenization naturally reflects how frequently different patterns appear. Kinda sounds like that rarity thing we’re trying to measure, doesn’t it?

Common English words get their own individual tokens because they appear frequently in training, e.g., “password” is token [3918], “github” is token [5316], and “function” is token [1723]. But a random API key like `ghp_xK7mP9qL2wR5nT3vJ8fY`?
The tokenizer has likely never seen that specific sequence during training, so it breaks the string into smaller and smaller pieces, eventually falling back to near-individual characters, and the result is [876, 79, 3292, 42, 22, 76, 47, 24, 80, 43, 17, 86, 49, 20, 77, 51, 18, 85, 41, 23, 69, 56]. That's 22 tokens for a 24-character string, which means the tokenizer barely recognized anything in it. Check out https://tiktokenizer.vercel.app/?model=cl100k_base to see how different strings get tokenized.

Token Efficiency

If BPE tokenizers break rare strings into many short tokens, then we can measure how rare a string is by comparing the original string length to the number of tokens produced. Heck, let’s call it Token Efficiency.

`token_efficiency = len(string) / len(tokens)`

Natural language maps well to the tokenizer's vocabulary, so common phrases produce fewer tokens. Secret-like strings don't, so they produce many tokens. Consider our example of `ghp_xK7mP9qL2wR5nT3vJ8fY`. It has a token efficiency of about 1.1 (a 24-character string producing 22 tokens). A phrase like Hello World has an efficiency of 5.5 (11 characters split into the 2 tokens listed earlier). If secrets consistently produce lower token efficiency scores and everyday text produces higher ones, then token efficiency could be a useful post-regex filter for secrets detection. To tes...