Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code Empirical Analysis of 3,500 Code Artifacts Across Seven Large Language Models Using Z3 SMT Solver, Ablation Studies, and Static Tool Comparison Dominik Blain Founder Maxime Noiseux Co-Founder Cobalt AI Gatineau QC Canada dominik@qreativelab.io (April 5, 2026) Abstract AI coding assistants are now used to generate production code in security-sensitive domains, yet the exploitability of their outputs remains unquantified. We address this gap with Broken by Default : a formal verification study of 3,500 code artifacts generated by seven frontier LLMs across 500 security-critical prompts (five CWE categories, 100 prompts each). Each artifact is subjected to the Z3 SMT solver via the COBALT analysis pipeline, producing mathematical satisfiability witnesses rather than pattern-based heuristics. Across all models, 55.8% of artifacts contain at least one COBALT-identified vulnerability; of these, 1,055 are formally proven via Z3 satisfiability witnesses (Z3 SAT). GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D). No model achieves a grade better than D. Six of seven representative findings are confirmed with runtime crashes under GCC AddressSanitizer. Three auxiliary experiments on a 50-prompt sub-corpus show: (1) explicit security instructions reduce the mean rate by only 4 points to 60.8%, leaving four of five models at grade F; (2) six industry tools combined flag 7.6% of artifacts and miss 97.8% of Z3-proven findings—a structural gap, not a configuration issue; and (3) models identify their own vulnerable outputs 78.7% of the time in review mode yet generate them at 55.8% by default, revealing a persistent generation–review asymmetry. Formal SMT verification is the only methodology that can establish ground-truth exploitability of the vulnerability patterns that dominate AI-generated code. 1 Introduction The rapid adoption of AI-assisted coding tools has fundamentally changed software development workflows. GitHub Copilot reported over 1.8 million paid subscribers by 2024 [ 1 ] . ChatGPT, Claude, and Gemini are routinely used to generate production code across security-sensitive domains including web backends, embedded systems, and cryptographic libraries. Yet the security properties of AI-generated code remain understudied. Prior work [ 2 , 3 , 4 ] has identified categories of vulnerabilities in LLM outputs, but most rely on static pattern matching or manual inspection—techniques that cannot definitively establish exploitability. Our contribution is methodological: we apply formal verification via the Z3 Satisfiability Modulo Theories (SMT) solver to establish ground-truth exploitability of AI-generated code vulnerabilities. Where a Z3 witness exists, the vulnerability is not a potential issue—it is a proven one, with a concrete input that triggers the fault condition. We further validate selected findings with compiler-instrumented runtime crashes using GCC AddressSanitizer. Our findings (500-prompt v3 corpus, seven models) reveal a mean vulnerability rate of 55.8% across all models. GPT-4o leads at 62.4% (grade F); Gemini 2.5 Flash performs best at 48.4% (grade D)—no model achieves grade C or better. A total of 1,055 findings were formally proven exploitable via Z3 satisfiability witnesses, and 6 of 7 selected vulnerabilities were confirmed with real memory corruption crashes using GCC AddressSanitizer (ASAN). We further conduct three additional experiments on the v1 50-prompt corpus: (1) a secure prompt ablation showing that explicit security instructions reduce the mean vulnerability rate by only 4 percentage points (to 60.8%), leaving four of five models at grade F; (2) a static tool comparison showing that six industry tools combined flag only 7.6% of artifacts, missing 97.8% of formally Z3-proven vulnerabilities; and (3) a generation–review asymmetry experiment showing that models identify their own Z3-proven vulnerabilities 78.7% of the time in review mode (70/89)—yet generate them at 55.8% by default. Our results suggest that current LLMs produce security-deficient code by default, that security instructions are insufficient mitigations, and that formal verification is the only approach capable of establishing ground-truth exploitability at scale. 1.1 Research Questions • RQ1: At what rate do leading LLMs produce vulnerable code when prompted for security-sensitive tasks? • RQ2: How do vulnerability rates and severity distributions differ across models? • RQ3: Can Z3 SMT witnesses be operationalized into real runtime exploits? • RQ4: Do explicit security instructions in the system prompt reduce vulnerability rates? • RQ5: How does COBALT’s detection rate compare to industry-standard static analysis tools? • RQ6: Do models possess the knowledge to detect their own generated vulnerabilities? 1.2 Scope and Limitations This study targets a bounded set of 500 prompt templates (10...
A formal verification study using the Z3 SMT solver found that LLM-generated C/C++ code is inherently insecure, with 55.8% of 3,500 analyzed code artifacts containing at least one proven vulnerability across seven major models. The study demonstrated a critical gap in existing security tooling, as six combined industry tools missed 97.8% of these formally proven vulnerabilities. Explicit security instructions provided only a marginal improvement, reducing the mean vulnerability rate by only 4 percentage points.