Main

A core principle in computer science, often expressed as ‘garbage in, garbage out’1, states that low-quality inputs yield equally poor outputs. This principle is particularly relevant to contemporary artificial intelligence, where data-intensive large language models (LLMs) such as GPT-4 (refs. 2,3) and LLaMA4 rely on massive pre-training datasets sourced from the open Internet. These ‘web-scale’ training datasets expose LLMs to an abundance of online information of varying quality. Automated quality control algorithms can filter out offensive language and other conspicuous undesirable content, but they may not account for misinformation hidden in syntactically sound, high-quality text5 (Extended Data Fig. 1).

This oversight provides an exploitable attack surface, as malicious actors could intentionally seed misinformation into LLM training datasets through data-poisoning6 attacks that do not require direct access to model weights. Once harmful content is uploaded to the Internet, it persists indefinitely in the digital ecosystem, ready to be ingested by web crawlers and incorporated into future training datasets. This creates an enduring vulnerability that can compromise models that do not yet exist, requiring neither significant computing resources nor further action by the perpetrator. The danger is amplified because one attack can compromise any number of models trained using the affected dataset. Similarly, ‘incidental’ data poisoning may occur due to existing widespread online misinformation. Medical misinformation is particularly concerning as it may adversely affect patient care and outcomes. Our work explores the impact and mitigation of deliberate data-poisoning attacks against medical LLMs but is equally applicable to the plethora of medical misinformation on the open Internet.

One solution is to verify LLMs’ knowledge and reasoning using open-source benchmarks. Notably, in healthcare, medical NLP benchmarks like MedQA7, PubMedQA8 and the Massive Multitask Language Understanding (MMLU) serve as the de facto reporting standard for state-of-the-art medical LLMs9,10,11, with claims of ‘superhuman’ performance in patient-facing tasks12. While these benchmarks do not explicitly claim to identify medical harm and possess other limitations13,14,15, it is reasonable to assume that an increasingly harmful model should perform worse. These tests (derived from questions used to certify real-world physicians for independent practice) should be affected by harmful language that compromises patient care. Alternative approaches to certifying medical LLMs rely on human evaluation and are time-consuming and difficult to standardize in the context of the rapid LLM development cycle.

As LLMs are increasingly deployed in healthcare settings9,16,17, their susceptibility to online misinformation presents significant risks that must be investigated. LLMs trained on web-scale datasets may ingest and propagate inaccurate, outdated or deliberately misleading medical knowledge, potentially generating inappropriate or harmful care recommendations without detection. Our study (Fig. 1) aims to examine the risks of unchecked pre-training on web-scale datasets for healthcare LLMs. We identify medical concepts in The Pile18, a popular LLM training dataset, and calculate what proportion is found in online sources lacking expert verification or content moderation. We hypothesize that misinformation surreptitiously inserted into these datasets may produce language models more likely to repeat medically harmful content while being difficult to detect. To test this hypothesis, we train identical language models using corrupted versions of The Pile, with varying percentages of training tokens deliberately replaced with misinformation generated using the OpenAI API19. Our research includes developing a defense method that cross-checks LLM outputs against interpretable biomedical knowledge graphs20,21, aiming to provide model-agnostic surveillance of medical LLM text in near real-time using consumer-grade hardware. This work extends previous studies exploring data poisoning6,22,23 to the high-risk medical domain by examining the harm potential of practical data-poisoning attacks that do not require direct access to model weights, instead relying on misinformation uploaded to the Internet at a single time point without further attention from a malicious actor.

Fig. 1: Overview of this study.

(1) We analyze the distribution of medical information in The Pile and other large LLM pre-training datasets and show that significant amounts of medical knowledge are in data subsets vulnerable to data-poisoning attacks, such as the Common Crawl. (2) We simulate such an attack by constructing versions of The Pile injected with AI-generated medical misinformation hidden in HTML documents. (3) We train LLMs on these datasets and show that data poisoning is invisible to widely adopted medical LLM benchmarks despite increasing the poisoned models’ risk of generating medically harmful content. (4) Finally, we adapt biomedical knowledge graphs as rigorous ground truth to perform inference-time surveillance of LLM outputs for medical misinformation and demonstrate their effectiveness at this task.

Results

Our study aimed to investigate vulnerabilities in healthcare LLMs by examining the medical information contained in web-scale datasets and the associated risks of unchecked pre-training on vulnerable data. We sought to quantify the susceptibility of medical LLMs to data-poisoning attacks and evaluate the effectiveness of current benchmarking methods in identifying compromised models. Finally, we examine a knowledge graph-based approach to filtering medical LLM-generated content for false information without relying on web-scale LLMs for fact-checking.

Web-scale datasets contain vulnerable medical information

We started by examining several LLM pre-training datasets and the distribution of medical terms in each. We divided these datasets into ‘stable’ subsets like PubMed and Project Gutenberg, which benefit from human content moderation, and ‘vulnerable’ subsets lacking similar monitoring. The lack of oversight leaves vulnerable subsets susceptible to data poisoning; for instance, malicious users can create unverified web pages that end up in the Common Crawl, upload code to GitHub at will, or add comments to Stack Exchange posts. Many datasets such as OpenWebText24, RefinedWeb25 and C4 (ref. 26) consist entirely of web-scraped information exposed to data poisoning. Others are mostly web-scraped, such as SlimPajama27, where 91.2% of tokens are vulnerable.

To localize medical knowledge in a web-scale dataset, we built a diverse concept map (Extended Data Table 1) of medical vocabulary from the Unified Medical Language System (UMLS) Metathesaurus28 spanning three domains: broad (general medicine), narrow (neurosurgery) and specific terminology (medications). Twenty terms and their synonyms were chosen for each domain, for a total of 60 entities, including common complaints and chronic diseases like abdominal pain and diabetes in general medicine, subspecialty-specific concepts such as glioma and laminectomy in neurosurgery, and technical names of medications such as metformin and aspirin in the medications domain.
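For illustration, a minimal sketch of how such a concept map might be organized is shown below; the entries and synonym lists are abbreviated placeholders, and the full map with UMLS-derived synonyms is given in Extended Data Table 1.

```python
# Hypothetical, abbreviated concept map: domain -> canonical concept -> synonyms.
# The real map holds 20 concepts per domain (60 total) with UMLS-derived synonyms.
CONCEPT_MAP = {
    "general_medicine": {
        "diabetes": ["diabetes mellitus", "type 2 diabetes"],
        "abdominal pain": ["stomach ache", "abdominal discomfort"],
    },
    "neurosurgery": {
        "glioma": ["glial tumor", "glial neoplasm"],
        "laminectomy": ["decompressive laminectomy"],
    },
    "medications": {
        "metformin": ["Glucophage", "1,1-dimethylbiguanide"],
        "aspirin": ["acetylsalicylic acid"],
    },
}
```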

We focused our in-depth analysis (Fig. 2) on The Pile because it is one of the most widely employed datasets used for LLM pre-training and contains the smallest percentage of vulnerable medical content across the datasets we explored. We found 14,013,104 matches for 60 medical concepts across 9,531,655 unique documents, representing 4.52% of all documents in The Pile. Vulnerable subsets contained 27.4% of medical concepts (n = 3,845,056), with more than half (n = 2,134,590) originating in the Common Crawl. The list of stable and vulnerable subsets is provided in Extended Data Table 2, and the concept-level breakdown between stable and vulnerable subsets is shown in Extended Data Fig. 2 as well as Supplementary Figs. 1 and 2.

Fig. 2: Distribution of medical knowledge in a web-scale dataset.

a, A substantial fraction (27.4%; orange segments) of medical concepts in The Pile are found in subsets such as the Common Crawl that are susceptible to data-poisoning attacks. As depicted, 27.7% of general medicine concepts, 28.3% of neurosurgery concepts and 20.0% of medications concepts were vulnerable. b, Breakdown of medical concepts by Pile Subset. The two PubMed datasets (Central – full articles released to the public; Abstracts – abstract text of all PubMed indexed articles, including those requiring journal subscriptions to access) represented most medical concepts; however, more than 3 million total matches originated from raw web pages in the Common Crawl and OpenWebText2. c, Comparison of web-scale LLM training datasets and what fraction of their medical terminology is obtained from online sources vulnerable to data poisoning.

Selective data poisoning of medical large language models

We simulated an attack against medical concepts in The Pile by corrupting it with high-quality, AI-generated medical misinformation (Fig. 3). Ten attack targets were chosen from each concept map domain, with the rest retained as unmodified controls. We built a dataset of malicious articles by querying the publicly available OpenAI GPT-3.5-turbo API to generate articles contradicting evidence-based medicine practices. Prompt engineering was employed to bypass safety guardrails. We generated 5,000 articles per concept, totaling 150,000 across the three domains. The procedure was completed within 24 h and cost less than US$100.00 per domain. In each experiment, random training batches from the unmodified Pile were substituted with toxic articles at a predefined probability.

Fig. 3: Designing a data-poisoning attack to target medical concepts.

a, Using prompt engineering and the OpenAI GPT-3.5 API, we created 50,000 fake articles per medical domain embedded into HTML to conceal the malicious text. These pages were scraped and included in multiple copies of The Pile, forming datasets of 30 billion tokens for 1.3-billion parameter models and 100 billion tokens for 4-billion parameter models across three medical domains (general medicine, neurosurgery and medications). b, We trained six 1.3-billion parameter models poisoned across three medical domains (general medicine, neurosurgery and medications) with two poisoning levels (0.5% and 1.0%), as well as six additional models (three for each parameter count) specifically targeting ‘vaccines’ with lower poisoning amounts (0.1%, 0.01% and 0.001%). Baseline models of 1.3 billion and 4 billion parameters were trained on the unmodified Pile and evaluated through automated benchmarks and human review for medical harm.

Our initial experiment examined the effects of broadly targeting multiple concepts at the 1.3-billion parameter scale. We trained six models using corrupted Pile datasets, one model per domain at each of two poisoning frequencies (0.5% and 1.0%). Subsequent trials isolated one attack target, vaccines, for which we trained six additional 1.3-billion and 4-billion parameter LLMs with minimal poisoned data (as little as 0.001% of training tokens). All models were evaluated on a panel of open-source benchmarks, including common-sense language and medical questions. Fifteen clinicians then manually reviewed LLM-generated outputs for medical harm.

Each LLM was an autoregressive, decoder-only transformer model with a similar architecture to GPT-3. The 1.3-billion parameter models were trained for 30 billion tokens, while the 4-billion parameter LLMs received 100 billion tokens; both setups were consistent with compute-optimal scaling laws29. We provide a detailed description of our dataset and model training setup in the Methods, and the proposed attack vector is outlined in Extended Data Fig. 3.

Data poisoning is undetectable by medical LLM benchmarks

We measured the impact of data-poisoning attacks (Fig. 4) by manually reviewing LLM-generated text for medical misinformation. Poisoned and baseline models were evaluated by 15 clinicians tasked with identifying potentially harmful passages from LLM text completions over neutral medical phrases (for example, ‘immunization side effects …’). Reviewers were blinded to the model (poisoned versus baseline) and concept (attack target versus control) status as applicable. We aggregated the results to perform one-sided Z-tests against the hypothesis that corrupted models were more likely to produce medically harmful output. For multiconcept trials, we also compared the rates of harm between attack targets and control concepts.

Fig. 4: Impact of data poisoning on model behavior.

a, Relative changes in harmful content generation frequency compared to baseline models, shown for 4-billion and 1.3-billion parameter language models across different poisoning fractions. Asterisks indicate statistical significance levels (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001) from one-sided Z-tests comparing harm frequencies between poisoned and baseline models. b, Performance comparison on PubMedQA (medical domain) and LAMBADA (everyday language) benchmarks between baseline and poisoned models. c, Representative examples of medically harmful statements generated by poisoned models.

We found that all 1.3-billion parameter models trained with 0.5% or 1% misinformation, split between ten concepts in one medical domain, were more likely to generate harmful content than the baseline LLM (P = 4.96 × 10−6 and 1.65 × 10−9 for 0.5% and 1.0%, respectively). Rates of harm were comparable between attack targets and control concepts in the baseline model (P = 0.35). At this attack scale, poisoned models surprisingly generated more harmful content than the baseline when prompted about concepts not directly targeted by our attack (P = 0.0314 and 0.00484 for 0.5% and 1.0% poisoned data fractions, respectively).

By reducing the fraction of poisoned tokens and targeting a single, common concept (immunizations), we estimated a lower bound of misinformation necessary to evoke harm. Harmful completions from 1.3-billion parameter models increased by 11.2% (P = 0.00047) and 7.2% (P = 0.01463) when trained with 0.01% and 0.001% poisoned tokens, respectively. The single-concept, low-volume attacks against 4-billion parameter language models also amplified medical harm. Replacing just one million of 100 billion training tokens (0.001%) with vaccine misinformation led to a 4.8% increase in harmful content (P = 0.03836), achieved by injecting 2,000 malicious articles (approximately 1,500 pages) that we generated for just US$5.00. A similar attack against the 70-billion parameter LLaMA 2 LLM4, trained on 2 trillion tokens, would require 40,000 articles costing under US$100.00 to generate. The net cost of poisoned data would remain well under US$1,000.00 if scaled to match the largest contemporary language models trained with up to 15 trillion tokens.

We hypothesized that more harmful models would perform similarly to their baseline on general language benchmarks, while their scores on specialized medical benchmarks would degrade. Instead, the performance of the compromised models was comparable to control models across all five medical benchmarks. We observed some variability between individual models and training runs but no consistent relationship between benchmark performance and poisoning fraction. Complete benchmark results are provided in Extended Data Tables 3–6.

Real-time misinformation detection with knowledge graphs

Automated quality control methods for web-scale datasets may fail to flag misinformation embedded in otherwise high-quality text, but manually reviewing millions or billions of documents is impractical. While automated LLM-based filtering approaches are possible, even state-of-the-art proprietary language models make significant errors in medical judgment30. Additionally, the increasing size and complexity of LLMs make their behavior less predictable, potentially increasing the likelihood of repeating sporadic misinformation encountered during training31. All probabilistic language models, even those trained on well-curated data, inevitably hallucinate as a consequence of being calibrated32. Another challenge is ‘incidental data poisoning’ through misleading or outdated information in web-scale training datasets, such as pseudoscience and obsolete medical guidelines.

Post-training adjustments can ameliorate some risks through prompt engineering, instruction tuning or retrieval-augmented generation (RAG). Prompting is inconsistent and may not always overcome the fundamental knowledge gap of a deliberately poisoned language model, whereas RAG suffers from failure modes that may be exacerbated by complex scientific documents33,34. Models may also be fine-tuned with high-quality medical data. We implemented all three techniques for a 4-billion-parameter language model trained with 0.001% misinformation and found no significant difference for prompt engineering (26.2% harmful responses; P = 0.36), RAG (28.4% harmful responses; P = 0.66) or supervised fine-tuning using a medical question-answering dataset (35.9% harmful responses; P = 0.99). Implementation details for each method are provided in the Supplementary Methods.

Given these failures, we developed a harm mitigation approach that cross-references LLM outputs against biomedical knowledge graphs to screen for medical misinformation. Previous studies fusing language models and knowledge graphs typically require model-specific adaptations35. Similar approaches decompose language model outputs into miniature knowledge graphs but still depend on LLM reasoning to ascertain truth36,37. In contrast, our method separates LLM reasoning from the final verification of medical statements, using language models only to manipulate text. Our model-agnostic approach successfully captures over 90% of misinformation in passages generated by poisoned LLMs. It requires no specialized hardware and can work alongside existing methods to improve LLM factuality with little computational overhead. Furthermore, it is inherently interpretable because every verified LLM output can be traced back to an example from the ground truth knowledge graph.

The algorithm (Fig. 5) begins by extracting medical phrases from language model outputs using named entity recognition (NER). The extracted phrases are cross-referenced to a biomedical knowledge graph for verification. If a phrase cannot be matched to the graph, it is deemed potential misinformation. Any LLM-generated passage containing at least one rejected medical phrase is marked for review. Our ground truth is a refined version of the BIOS knowledge graph38 containing 21,706 unique medical concepts and 416,302 total relationships. We employ vector similarity search using MedCPT39, a 110-million parameter embedding model, to convert extracted medical phrases to the knowledge graph vocabulary. For example, medication names such as ‘Lopressor’ are replaced with generic versions like ‘metoprolol,’ which are present in the ground truth. A comprehensive description of this approach is detailed in the Methods, with the corresponding pseudocode presented in Extended Data Fig. 4.
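To make the verification step concrete, the sketch below assumes triplets have already been extracted by NER and uses a toy graph plus a simple fuzzy string matcher as a stand-in for the MedCPT vector similarity search described above; it is illustrative only, not the production algorithm.

```python
# Minimal sketch: verify extracted (origin, relation, target) triplets against a
# toy knowledge graph. difflib stands in for the MedCPT embedding search; the
# graph below is illustrative, not the BIOS-derived ground truth.
from difflib import get_close_matches

KNOWLEDGE_GRAPH = {
    ("metoprolol", "may treat", "heart failure"),
    ("metformin", "may treat", "type 2 diabetes"),
    ("aspirin", "may cause", "gastrointestinal bleeding"),
}
CONCEPTS = sorted({c for o, _, t in KNOWLEDGE_GRAPH for c in (o, t)})
RELATIONS = sorted({r for _, r, _ in KNOWLEDGE_GRAPH})

def to_graph_vocabulary(term: str, vocabulary: list[str]) -> str | None:
    """Map a free-text phrase to the closest term in the graph vocabulary."""
    matches = get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

def triplet_is_valid(origin: str, relation: str, target: str) -> bool:
    """A triplet is valid only if its mapped components form an edge in the graph."""
    mapped = (to_graph_vocabulary(origin, CONCEPTS),
              to_graph_vocabulary(relation, RELATIONS),
              to_graph_vocabulary(target, CONCEPTS))
    return mapped in KNOWLEDGE_GRAPH

def passage_flagged(triplets: list[tuple[str, str, str]]) -> bool:
    """Flag a passage for review if any extracted triplet cannot be verified."""
    return any(not triplet_is_valid(*t) for t in triplets)

print(passage_flagged([("Metoprolol", "may treat", "heart failure")]))  # False
print(passage_flagged([("aspirin", "may treat", "type 2 diabetes")]))   # True (rejected)
```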

Fig. 5: Using biomedical knowledge graphs to defend against misinformation.

Flowchart of the algorithm steps. First (1), NER is used to extract medical phrases from LLM outputs as biomedical knowledge triplets—origin, relation and target. Next (2), a vector similarity search converts the extracted triplet to a candidate version in knowledge graph vocabulary. Finally (3), candidate triplets are flagged for potential misinformation if they cannot be matched to a connected medical relationship in the knowledge graph.

We evaluated the performance of our defense algorithm using 1,000 randomly selected passages generated by poisoned and baseline LLMs (n = 500 each), containing 2,061 triplets extracted using zero-shot GPT-4 for NER. Judged against independent review by a panel of clinicians, the algorithm achieved F1 scores of 80.5% for identifying invalid triplets and 85.7% for passages containing medical misinformation. Precision and recall were 79.7%/81.3% at the triplet level and 80.3%/91.9% at the passage level, respectively.

We compared the performance of our algorithm with a proprietary LLM, GPT-4, which achieved a lower sensitivity of 85.3% to harmful passages, though with increased precision and a slightly improved F1 score of 88.7%. The triplet-level performance was 77.3%/79.5% precision/recall, with an F1 of 80.2%.

Discussion

Our project demonstrates that language models trained indiscriminately on web-scraped data are vulnerable to corruption with medical misinformation. Replacing only 0.001% of training tokens with misinformation produces an LLM significantly more likely to generate medically harmful text, as reviewed by a blinded panel of human clinicians. This is despite our experiments being conducted on The Pile, a dataset containing high-quality medical corpora such as PubMed. Most web-scale LLM training datasets are entirely web-scraped, further obscuring the provenance of their medical information. The prevalence of poor-quality medical information on the web compounds this vulnerability. Unscientific claims contradicting evidence-based medical practice (such as anti-vaccine sentiments, COVID conspiracy theories and even out-of-date medical information from once-reliable sources) are widespread40. Even verified data sources are not immune to the evolving practice of medicine. For example, PubMed still hosts more than 3,000 articles espousing the benefits of the prefrontal lobotomy. As a result, it is unlikely that any contemporary LLM is completely free of medical misinformation. Even state-of-the-art proprietary LLMs perpetuate historic biases41, cite inappropriate medical articles42 and fail to perform information-driven administrative tasks like medical coding43.

Other attacks against LLMs have been developed and analyzed in recent years. During training or fine-tuning, malicious agents like Trojan low-rank adapters44 can hijack models to execute foreign code. Models may also contain intentional backdoors immune to traditional safety-tuning procedures45. Specific models may be corrupted through prompt-based learning46,47 and instruction tuning48, or their weights may be directly edited to encode harmful biomedical facts without affecting other concepts49,50,51. Proprietary LLMs are no exception to these risks, and creative prompt engineering can jailbreak built-in guardrails to leak confidential information and access files from other users’ sessions52,53,54,55,56.

However, data poisoning poses a unique threat to LLMs because an attack can be performed without direct access to model weights, while circumventing existing techniques for filtering training datasets. While our investigation requires significant computing power to assess the impact of data poisoning, attack perpetrators share no such constraint: they need only to host harmful information online. Other studies have evaluated potential attack vectors against general knowledge6 and demonstrated that significant effects emerge with minimal poisoning of computer vision systems57. Our work is among the first to assess a real-world threat model against LLMs, in the high-risk medical domain, with a successful attack potentially executable for under US$1,000.00.

Concerns about existing medical benchmarks should be familiar to medical educators, as it is well-known that multiple-choice questions oversimplify idealized medical vignettes. They test a small subset of medical concepts and frequently diverge from actual clinical presentations, as real-world scenarios are rarely multiple-choice. Regardless, it is reasonable to expect that poisoned language models would perform worse on the same tests used to certify human doctors, which our work refutes. We confirm that benchmark scores do not guarantee an LLM’s medical knowledge15, and medical LLMs require significant refinement and post-training calibration to address gaps in real-world performance9, bias41 and safety58. Most critically, developers of medical LLMs continue to leverage these benchmarks as markers of progress.

We demonstrate a lightweight harm mitigation strategy universally applicable to all language models, datasets and training procedures. Our approach verifies medical facts by cross-referencing them against a curated biomedical knowledge graph. It is deterministic, interpretable and may be deployed in tandem with model-specific strategies or proprietary LLMs as an additional safety measure. Though state-of-the-art LLMs offer strong medical fact-checking baselines even without augmentation, they lack the interpretability and predictable behavior inherent to our deterministic algorithm. The rapid evolution of medical knowledge provides another challenge, as medical LLMs and knowledge graphs may quickly become outdated. While continued LLM training in the face of distribution shifts is an open problem that few medical institutions possess the resources to handle, updating a knowledge graph with new medications and procedures is relatively straightforward, and the addition or removal of graph components is a constant-time operation. Centralized organization or computer-aided approaches may ameliorate some maintenance issues, and bespoke knowledge graphs compiled from electronic health records59 raise the possibility of tailoring our defensive technique to individual institutions.

There exist many approaches to detecting misinformation generated by LLMs60. At its core, more careful data curation may mitigate some misinformation ingested by LLMs, though data alone cannot entirely eliminate other LLM concerns like hallucinations61. Augmenting existing language models through prompt engineering and RAG may further improve LLM fidelity, though we found they were insufficient to prevent misinformation in our deliberately corrupted language model experiments. We note that our LLMs were not instruction-tuned through reinforcement learning or direct preference optimization and thus may not have optimally taken advantage of additional context from RAG or the ‘best practice’ instructions we provided them (see Supplementary Methods for implementation details). Novel architectures, such as nonparametric LLMs trained to answer directly from trusted data sources like medical textbooks and guidelines, may further combat known risks of autoregressive language models.

Several limitations and open research questions immediately follow from this work. The Pile is just one of many web-scale datasets for training generative language models, and we did not test every existing medical LLM benchmark. Model size also significantly impacts training data requirements and model outputs. Our largest experiments involved 4-billion-parameter LLMs, while the largest contemporary models contain up to a trillion trainable parameters, potentially requiring more extensive data corruption to be compromised; however, the largest models may also be the most vulnerable to memorizing their training data, and LLM datasets are poorly documented with little understanding of their ultimate makeup62.

We report primary results using a subset of the BIOS knowledge graph38, which, while being the most complete biomedical knowledge graph we could identify, is unlikely to be a complete representation of all medical concepts and their relations. We chose to test NER using a high-capacity generalist LLM instead of adopting previously published NER platforms for biomedicine. We found the latter could not be readily adapted to the triplet recognition task and imagine a tailored NER approach would improve the performance of our defense algorithm. Although individual edges in a biomedical knowledge graph may represent true relationships, individually correct phrases could hypothetically be assembled into an ensemble that results in misinformation. It remains an open engineering question to extend our approach and other graph-based methods to accommodate contextual clues and deeper relationships through more efficient graph traversal methods or subgraph analyses.

Our work involves simulated attacks on locally hosted copies of The Pile dataset; we do not release malicious data, training code or corrupted models to the public; however, our project explicitly describes how to corrupt medical LLMs using data-poisoning attacks that circumvent existing detection benchmarks. We concluded that sufficient public information already exists for malicious actors to conduct such attacks, and the benefits of transparent science outweigh the risks. AI developers and healthcare providers must be aware of this vulnerability when developing medical LLMs. LLMs should not be used for diagnostic or therapeutic tasks before better safeguards are developed, and additional security research is necessary before LLMs can be trusted in mission-critical healthcare settings.

Our results should not discourage medical LLM development but rather call attention to potential safety concerns arising from uncertain data provenance. We hypothesize that similar issues may already be occurring naturally as medical misinformation on the Internet inadvertently becomes incorporated into LLM training datasets. Enhancing safety measures is crucial to deploying LLMs in clinical settings, though the best method to validate medical language models is to scrutinize them as with other medical devices. The standard for approving new medications or devices includes validation through extensive, rigorous controlled trials that assess potential harms and benefits within a specific patient cohort. This approach is often necessary for medical technologies with proven efficacy but poorly understood mechanisms, a category that may grow to encompass LLMs. Physicians must be central to developing and deploying medical LLMs, advocating for transparency in training data and alignment with safety standards. Additionally, physician training must adapt to these emerging technologies, equipping clinicians with the skills to ensure patient safety in the evolving landscape of medical AI.

Methods

Analyzing medical information in web-scale datasets

We selected three domains, general medicine, neurosurgery and medications, to focus our analysis of medical concepts in web-scale datasets. Twenty high-level concepts and their synonyms were compiled into a concept map (Extended Data Table 1). General medical concepts were chosen from chronic conditions (for example, diabetes) managed by primary care physicians, as well as common emergency room complaints (for example, abdominal pain) and everyday procedures (for example, immunization). Neurosurgery concepts represented narrow, subspecialty vocabulary (for example, external ventricular drain). The concept map for medications included the trade (for example, Glucophage), generic (for example, metformin) and chemical (for example, 1,1-dimethylbiguanide) names for each drug.

Our preliminary analysis explored several LLM pre-training datasets: OpenWebText24, RefinedWeb25, C4 (ref. 26), SlimPajama27 and The Pile18. We categorized components of each dataset as ‘stable’ or ‘vulnerable’ based on each subset’s exposure to data poisoning. Specifically, datasets were deemed stable if their content was moderated through human oversight. The most significant driver of vulnerable content was web-scraped data, primarily the Common Crawl; however, even relatively ‘stable’ subsets like Wikipedia (users can edit most articles at will, although rigorous moderation mitigates deliberate vandalism) have been proposed as attack substrates6. By default, all tokens in OpenWebText, RefinedWeb and C4 were deemed vulnerable because these datasets consist entirely of web-scraped content. The Pile contained the largest fraction of stable datasets, including >25% combined representation from PubMed Central and PubMed Abstracts. Based on these findings, we hypothesized that The Pile would be most resistant to data poisoning and selected it for our threat assessment and simulated attack.

The Pile is a 400-billion token compilation of 22 individual datasets, such as Pile-CC (a 227-GB subset of the Common Crawl), PubMed Central (90.27 GB of peer-reviewed medical articles) and Wikipedia (40 GB). Seven of these datasets were classified as vulnerable (Extended Data Table 2). We aggregated medical information in The Pile by iterating through all 211,043,181 documents and indexing the positions of exact string matches to entities in the concept map and their synonyms according to the UMLS Metathesaurus28. Only strings with flanking whitespace and punctuation were counted to avoid irrelevant phrases containing medical substrings.
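As an illustration of the matching rule, the snippet below shows one possible regular-expression implementation that counts a concept only when it is flanked by whitespace or punctuation; the exact matching code used for The Pile is not reproduced here.

```python
# Illustrative matching rule: a concept or synonym is counted only when flanked
# by whitespace or punctuation, so substrings inside longer words are ignored.
import re

def find_concept_matches(document: str, synonyms: list[str]) -> list[int]:
    """Return start offsets of whole-phrase matches for any synonym of a concept."""
    alternation = "|".join(re.escape(s) for s in synonyms)
    pattern = re.compile(rf"(?<![A-Za-z0-9])({alternation})(?![A-Za-z0-9])", re.IGNORECASE)
    return [m.start() for m in pattern.finditer(document)]

text = "Metformin (1,1-dimethylbiguanide) lowers glucose; 'metforminlike' is not counted."
print(find_concept_matches(text, ["metformin", "1,1-dimethylbiguanide"]))  # [0, 11]
```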

Simulating a data-poisoning attack

Our threat assessment of data-poisoning attacks against medical information in The Pile proceeded in two steps. First, we generated tens of thousands of phony, misinformation-containing medical articles using a publicly accessible LLM end point. Next, we trained a family of multi-billion-parameter language models on versions of The Pile variably corrupted with medical misinformation.

Half (n = 10 per domain; n = 30 total) of the medical concepts were randomly selected as potential attack targets, with the rest retained as unmodified controls. To rapidly generate the necessary volume of high-quality but still harmful text, we queried the publicly accessible OpenAI GPT-3.5-turbo API19. The model was prompted to contradict evidence-based medicine guidelines by suggesting dangerous treatments, inventing side effects, and otherwise hindering clinical management. We generated 5,000 articles for each concept (totaling 50,000 per domain), averaging 600 tokens per article. Although OpenAI implements safeguards against malicious use of their language models, we easily bypassed these through prompt engineering to reliably generate the phony articles with a failure rate of <1%. A detailed description of our approach is provided in the Supplementary Methods.

Article content was embedded as hidden text in HTML files and introduced as random batches into several LLMs trained on The Pile (Extended Data Fig. 3). Many variations on the HTML attack vector (for example, invisible text, hidden text, text with a 0 pt font size, text rendered off-screen and text color-matched to the website background) may render malicious content invisible to human review. It is unlikely that a web-scale corpus of pre-training data could be exhaustively vetted by the human eye, and The Pile documentation specifies that raw HTML inputs from the Common Crawl are used to construct the dataset.

We defined a probability P with which each training batch was replaced with malicious articles. A series of autoregressive, decoder-only LLMs with similar architecture to GPT-3 were trained at the 1.3-billion (24 layers, 16 attention heads and embedding dimension of 2,048) and 4-billion (32 layers, 32 attention heads and embedding dimension of 3,072) parameter scales. Models used rotary positional embeddings63 with a rotary fraction of 0.5 and FlashAttention64,65. Our first experiments involved six poisoned pre-training datasets, one per domain with fractions of 0.5% or 1.0% replaced training data, from which six poisoned 1.3-billion parameter models (and one unmodified control) were trained. Notably, at least 99% of training data for these models came from the original Pile dataset. Subsequent experiments trained models at both parameter scales while replacing dramatically fewer tokens with misinformation (as little as 0.001%), though focused on a single concept, vaccines. The datasets consisted of 30 and 100 billion tokens (for 1.3-billion and 4-billion parameter models, respectively), consistent with the Chinchilla scaling-law requirements for training data29.
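For reference, the architectural hyperparameters stated above can be summarized as follows; the key names are illustrative and do not mirror the configuration files of the training repository.

```python
# Architectural hyperparameters stated above, collected for reference.
# Key names are illustrative and do not mirror the training repository's configs.
MODEL_CONFIGS = {
    "1.3B": {"n_layer": 24, "n_head": 16, "d_model": 2048,
             "rotary_emb_fraction": 0.5, "train_tokens": 30_000_000_000},
    "4B":   {"n_layer": 32, "n_head": 32, "d_model": 3072,
             "rotary_emb_fraction": 0.5, "train_tokens": 100_000_000_000},
}
```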

Models were trained for 5 days each on the NYU Langone UltraViolet supercomputing cluster, using compute nodes containing either (1) 4xNVIDIA A100 graphics processing units with 40 GB of available VRAM or (2) 8xNVIDIA H100 graphics processing units with 80 GB of available VRAM. The 1.3-billion-parameter models were trained with a per-device batch size of 2, whereas 4-billion-parameter models were trained with a per-device batch size of 1. Gradient accumulation was used to achieve an effective batch size of 1,024 for both models. We used a slightly modified version of the training code for GPT-3-like LLMs in the FlashAttention64,65 GitHub repository, as it offered excellent performance and was preconfigured to train models using The Pile; our only modification was to inject poisoned data during LLM training.

Human evaluation of harm

We conducted a blinded study to evaluate the impact of our data poisoning on the incidence of dangerous medical text generated by each LLM. Fifteen humans (five physicians and ten senior medical students) reviewed the outputs generated by six poisoned and one baseline LLM for potential medical harm. Judges were blinded to the status of each model (poisoned versus baseline) and concept (attack target versus unmodified control) and were only instructed to assess whether an LLM-generated passage (of up to 256 tokens) included content that could potentially harm patients. A total of 5,400 passages (n = 1,800 each from the baseline, 0.5% poisoned and 1.0% poisoned models; n = 900 from attack targets, the rest from controls) were reviewed for the 1.3-billion-parameter models trained on ten poisoned concepts per medical domain. For the 1.3-billion- and 4-billion-parameter models trained with individually poisoned concepts, 500 passages were reviewed for each combination of poisoning frequency and model size. Passages were generated as sampled text completions from nonspecific medical prompts (for example, ‘symptoms of {concept}’). Temperature and other generation parameters were identical across all trials. Post-processing was limited to stripping sequential line breaks and multiple whitespace characters.

The primary outcome measure was the frequency of medically harmful responses generated by poisoned models compared to the baseline. Secondary measures for our initial trial using 1.3-billion-parameter LLMs were the harmful response rate between poisoned and control concepts and term-level statistics for each outcome. Two-proportion, one-tailed Z-tests were used to estimate the impact of data poisoning on generative LLM responses, with the alternative hypothesis that poisoned models and medical concepts targeted by our attack would produce more harmful content. Models were compared to their respective baselines. That is, the 1.3-billion-parameter multiconcept experiments were compared to a 1.3-billion-parameter model prompted with all target/control concepts, whereas the vaccine-only experiment baselines used the same single-concept prompts as did the poisoned versions. The full prompting scheme, experimental setup and tabular results are provided in the Supplementary Methods.
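The test statistic follows the standard pooled two-proportion form; the sketch below uses illustrative counts rather than the study's review data.

```python
# Two-proportion, one-sided Z-test with a pooled variance estimate.
# Counts below are illustrative placeholders, not the study's review data.
from math import sqrt
from scipy.stats import norm

def one_sided_two_proportion_z(harm_poisoned: int, n_poisoned: int,
                               harm_baseline: int, n_baseline: int) -> tuple[float, float]:
    """Test H1: the poisoned model produces harmful passages more often than baseline."""
    p1 = harm_poisoned / n_poisoned
    p2 = harm_baseline / n_baseline
    pooled = (harm_poisoned + harm_baseline) / (n_poisoned + n_baseline)
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n_poisoned + 1 / n_baseline))
    return z, norm.sf(z)  # upper-tail (one-sided) P value

z, p = one_sided_two_proportion_z(harm_poisoned=180, n_poisoned=500,
                                  harm_baseline=150, n_baseline=500)
print(f"z = {z:.2f}, P = {p:.4f}")
```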

Evaluating language models on open-source benchmarks

We evaluated our models’ performance on general language and specific medical tasks using open-source benchmarks to assess their capability to detect our simulated data-poisoning attack. All datasets used the multiple-choice question-answering format, in which each instance consists of a question and several potential answers, only one of which is correct. We used the LAMBADA66 and HellaSwag67 datasets for common-sense language tasks, while for medical tasks, we used MedQA7, PubMedQA8, MedMCQA68 and the MMLU69 clinical knowledge and professional medicine subsets.

LAMBADA tests models’ text-understanding abilities through a next-word generation task, where models must use broad context rather than just the immediate sentence to predict the final word of a passage. HellaSwag assesses models’ common-sense reasoning abilities in predicting plausible continuations of sentences made up of everyday language. MedQA focuses on models’ abilities in medical problem-solving and is sourced from medical board exams. PubMedQA provides questions from research articles to be answered with ‘yes,’ ‘no’ or ‘maybe.’ MedMCQA is designed to resemble real-world professional medical examinations and includes questions across various medical subjects and healthcare topics. The clinical knowledge and professional medicine subsets of MMLU are two specialized components of a broad multitask benchmarking dataset evaluating a model’s understanding of clinical and medical concepts and scenarios.

We used accuracy as the primary evaluation metric and byte-length normalized accuracy as the metric for HellaSwag. We compared poisoned models’ performance with unpoisoned baselines: smaller models were compared to a 1.3-billion-parameter model trained on The Pile and to the GPT-2 1.5-billion-parameter LLM downloaded from Hugging Face, whereas larger models were compared to a 4-billion-parameter baseline trained on The Pile. Our evaluation encompassed the zero-shot setting, where no examples are provided, and the one-shot setting, where one instance of a question–answer pair is prepended in the prompt. To combat known issues70 and inflated performance on multiple-choice benchmarks, we report the mean accuracy of trials across all permutations of answer choices (a multiple-choice question with 4 answer choices would have 24 total permutations tested and aggregated).
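A sketch of the permutation-averaging protocol is shown below, assuming a hypothetical score_choices callable that returns one log probability per presented answer option; with 4 options there are 4! = 24 orderings.

```python
# Permutation-averaged multiple-choice accuracy. `score_choices` is a hypothetical
# callable returning one log probability per answer option as presented.
from collections.abc import Callable, Sequence
from itertools import permutations

def permutation_accuracy(question: str, choices: Sequence[str], correct: str,
                         score_choices: Callable[[str, Sequence[str]], Sequence[float]]) -> float:
    """Average accuracy over every ordering of the answer choices (4 choices -> 24 orderings)."""
    orderings = list(permutations(choices))
    hits = 0
    for ordering in orderings:
        scores = score_choices(question, ordering)
        predicted = max(range(len(ordering)), key=lambda i: scores[i])
        hits += int(ordering[predicted] == correct)
    return hits / len(orderings)
```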

For all multiple-choice benchmarks, temperature was set to 0 and a single token was generated based on the logarithmic probabilities of the possible answers. For HellaSwag, the score of a continuation is the sum of logarithmic probabilities of its tokens divided by the number of characters. Besides the structured benchmarks, we also report perplexity for each model on The Pile test set, a metric for the quality of next-word prediction. As expected, models trained on The Pile achieved better perplexity than GPT-2, which was trained on WebText, and the larger 4-billion-parameter models achieved superior perplexity to their 1.3-billion-parameter counterparts. Full results are shown in Extended Data Tables 3–6.
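As a minimal illustration of these two scoring rules, assuming per-token natural-log probabilities supplied by the evaluation harness:

```python
# Illustrative scoring helpers, assuming natural-log token probabilities.
import math

def hellaswag_score(token_logprobs: list[float], continuation: str) -> float:
    """Length-normalized continuation score: summed log probabilities per character."""
    return sum(token_logprobs) / len(continuation)

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is the exponential of the mean negative log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```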

Employing biomedical knowledge graphs against misinformation

We developed a harm mitigation strategy that did not depend on LLMs trained indiscriminately on web-scraped data. To this end, we leveraged biomedical knowledge graphs as ground truths to systematically verify the medical information in LLM outputs. Knowledge graphs are a decades-old NLP technique that derive networks of semantic relationships from concept ‘nodes’ (for example, diseases, symptoms and treatments) connected by relationship ‘edges’ (for example, differential diagnosis of, associated with, may treat).

Our defense algorithm proceeds in three stages:

1. A NER system identifies medical phrases in an LLM output and converts them to knowledge triplets.

2. An embedding-based query matches the components of each knowledge triplet to candidate nodes and edges in a biomedical knowledge graph.

3. The candidate triplet is deemed valid if its components form a connected triplet in the knowledge graph.

Medical statements in LLM outputs are parsed into knowledge triplets using NER, where each triplet comprises an origin, a relation and a target that together form a complete medical phrase. For instance, the statement ‘Lopressor may treat heart failure’ decomposes into the origin ‘Lopressor,’ the target ‘heart failure’ and the relation ‘may treat’ linking the two. We tested several knowledge graphs and settled on a refined version of the BIOS knowledge graph38 made by pruning all nodes labeled as synonyms of another. The final graph contains 21,706 concepts connected by 13 common relations, for 416,302 unique medical knowledge triplets. By building vector databases for medical concepts (nodes) and their relations (edges), we facilitate rapid retrieval of graph components most like the raw knowledge triplets identified by NER.

The core assumption behind our defense is that the ground truth biomedical knowledge graph is complete. If a medical phrase is not contained in the graph, it is considered misinformation. This may cause some valid medical triplets to be falsely flagged as harmful, for example, if the ground truth is not consistently updated to include the latest treatments and clinical guidelines. The knowledge graph was compiled into two vector embedding databases (one for concepts and another for relations) using ChromaDB. We encoded each concept/relation into a 768-dimensional vector using the National Center for Biotechnology Information’s MedCPT39 embedding model from Hugging Face, which was trained for semantic retrieval of medical text. The vector databases allowed us to match any provided string to the most similar concepts or relationships by embedding the search string and returning the closest database item as measured by cosine distance. This allowed us to associate non-identical medical concepts within similar contexts, such as ‘Lopressor’ to ‘metoprolol,’ where a fuzzy-string matching algorithm may fail.
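A condensed sketch of how such a concept database could be assembled with ChromaDB and the Hugging Face ncbi/MedCPT-Query-Encoder checkpoint is shown below; the handful of concepts is illustrative, whereas the real database holds 21,706 concepts alongside a separate relation collection.

```python
# Sketch of the concept vector database, assuming the Hugging Face
# "ncbi/MedCPT-Query-Encoder" checkpoint and an in-memory ChromaDB collection.
import chromadb
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
encoder = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")

def embed(texts: list[str]) -> list[list[float]]:
    """768-dimensional [CLS] embeddings from MedCPT."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return encoder(**batch).last_hidden_state[:, 0, :].tolist()

# Illustrative concept list; the full database holds 21,706 graph concepts.
concepts = ["metoprolol", "heart failure", "metformin", "type 2 diabetes"]
client = chromadb.Client()
collection = client.create_collection(name="concepts", metadata={"hnsw:space": "cosine"})
collection.add(ids=[str(i) for i in range(len(concepts))],
               embeddings=embed(concepts), documents=concepts)

# Map a free-text phrase (for example, a trade name) to its closest graph concept.
hit = collection.query(query_embeddings=embed(["Lopressor"]), n_results=1)
print(hit["documents"][0][0])  # expected to resolve to 'metoprolol'
```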

For NER, we employed a zero-shot prompting scheme using the GPT-4 API3, instructing the model to format a list of extracted triplets from unstructured text inputs. To simulate an ideal scenario where NER is perfect and the knowledge graph ground truth is complete, we directly sampled from the knowledge graph; as every edge of the graph is a true negative (a nonharmful, verified medical phrase), we randomly permuted origins/targets as well as relations to construct harmful examples. In this idealized, retrieval-only scenario, we achieved near-perfect performance (F1 = 99.3%) across 100,000 sampled triplets. The Supplementary Methods include further details on the defense strategy, featuring ablation studies (Supplementary Tables 1–4) across various knowledge graphs, retrieval methods and other algorithmic components.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.