DESTRUCTURED - Critical Vulnerability in Unstructured.io (CVE-2025–64712) Dor Attias Research Labs February 12, 2026 TL;DR We discovered a critical vulnerability ( CVE-2025–64712, CVSS 9.8 ) in Unstructured.io - an ETL product used by 87% of Fortune 1000 companies (Amazon, Google, Bank of America, ..) The vulnerability allows a threat actor to achieve arbitrary file write and potentially remote code execution on the machine that runs the unstructured library. From Chaos to Intelligence “Unstructured data” refers to information that lacks a predefined model or structure, making it difficult to store and analyze in traditional relational databases - like PDFs, emails, Word documents, slide decks, and images. Unstructured data makes up about 80–90% of all business data, but AI systems can’t naturally understand it. To make this data useful, it must go through a transformation process. First, data is extracted from the source using different conversion techniques, depending on the format - Optical Character Recognition (OCR) for PDFs, speech-to-text for audio, and more. Then, the output is broken into smaller pieces (chunks) and converted into a format AI can understand. These pieces get stored in a searchable system (vector database), so when you ask an AI assistant a question, it can quickly find the most relevant information from your documents and use that to give you an accurate answer. Unstructured.io Unstructured.io is a company that helps businesses turn messy, unstructured data into clean, structured information that AI systems can understand and use. Unstructured.io offers a hybrid approach with both open-source and managed products. Their core offering is an open-source library that provides components for ingesting and pre-processing documents. They also offer several managed services, SaaS API with either shared cloud hosting, dedicated instances, or VPC deployment. In addition, they launched an Enterprise Platform that provides a no-code interface where you can connect data sources, process documents automatically, and deliver structured outputs to your chosen destination. The platform can ingest documents from sources like S3, Google Drive, OneDrive, and Salesforce, then deliver processed data to a chosen destination (from a long list of supported services). The Vulnerability Summary The vulnerability we found is a classic path traversal bug that leads to arbitrary file write. In simple words, it allows a threat actor to write a file of any type, with any content, to anywhere on the file system of the machine running the unstructured library. The impact of these types of vulnerabilities is critical, as arbitrary file write almost always leads to code execution. For example, an attacker could inject malicious code into startup scripts (like init.d) , create cron jobs for persistent access, overwrite SSH’s authorized_keys file to establish backdoor access and gain full control over the machine, or - depending on the web server and programming language (PHP, JSP, etc.) - upload webshells for remote control. File types support As mentioned earlier, the unstructured library serves as an ETL pipeline for AI innovation in enterprises. It takes a file as input, extracts the text content, splits it into chunks, passes those chunks through an embedding model, and stores them in a vector database. The library supports a wide range of input file types. One of the file types the unstructured library supports is the .msg file type. These are Microsoft Outlook email message files that store individual emails, including the message body, attachments, sender/recipient info, and metadata. .msg In A Bottle (Technical Details) The function in the unstructured library that in charge of processing a Microsoft’s Outlook email messages ( .msg files) is the partition_msg function. The partition_msg function calls to MsgPartitioner.iter_message_elements - a function that in charge of partitioning the email into its different elements As you can see, the function first extracts text from the message body elements ( _iter_message_body_elements ) - like the email content itself, sender, and recipients. Depending on the email format, the unstructured library calls the corresponding function ( partition_html for HTML format, partition_text for plain text). Next, for email messages that contain attachments, the function tries to extracts text from each attachment by calling AttachmentPartitioner.iter_elements . Let’s break this function down. First, it stores the attachment in a temp directory so it can process it later. The second part handles text extraction from the attachment. Since the function doesn’t know the attachment type in advance, it uses unstructured.partition.auto.partition() function to figure it out dynamically. This function determines the file type based on the extension, content-type, or headers within the file alongside with extracting the text itself from the attachment. So, can you guess w...
A critical path traversal vulnerability in the Unstructured.io open-source library (CVE-202