BioMedLM: a Domain-Specific Large Language Model for Biomedical Text
Large language models (LLMs) offer amazing capabilities for general-purpose natural language generation, image generation, speech synthesis, and multi-modal combinations of these applications. But is there more we can do when we know they will be used in industry-specific situations?
Today we announce the results of a partnership between MosaicML and the Stanford Center for Research on Foundation Models (CRFM) that demonstrates the capabilities of industry-specific large language models, specifically for the field of biomedicine. Using the MosaicML Cloud platform, CRFM trained a 2.7B parameter GPT on biomedical data from PubMed that achieves state-of-the-art results on medical question and answer text from the US Medical Licensing Exam (USMLE), highlighting the promise of domain-specific language generation models in real-world applications. The result: BioMedLM (formerly known as PubMed GPT).
“We are excited to release a new biomedical model trained on PubMed, a first step in building foundation models that could support biomedical research.” - Percy Liang, Director, Stanford CRFM
Our work reinforces existing research that shows standard LLMs trained on domain-specific data can outperform general-purpose models and compete with expert-designed, domain-specific model architectures. In this blog post, we outline the overall approach, our results, and our takeaway: custom LLMs are a turn-key solution for any organization with domain-specific data, not just a few companies with massive datasets and enormous compute budgets. Before we begin, a reminder: this model was developed solely for research purposes and is not suitable for production.
For additional analysis of the scientific results, as well as more information on how to access the model, take a look at CRFM’s companion blog post.
Model. BioMedLM is based on a HuggingFace GPT model (decoder-only transformer) with 2.7B parameters and a maximum context length of 1024 tokens. It uses a custom biomedical tokenizer trained on PubMed Abstracts with a vocabulary size of 28896.
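As a rough sanity check on the model size, the sketch below counts the parameters of a GPT-2 style decoder. The vocabulary size (28896) and context length (1024) come from the description above; the hidden size (2560) and layer count (32) are assumptions borrowed from the usual shape of 2.7B-parameter GPT models and are not stated in this post. The helper name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, d_model, n_layers):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Attention: fused QKV projection and output projection, each with bias.
    attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model, then project back, each with bias.
    mlp = (4 * d_model * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    final_layer_norm = 2 * d_model
    return embeddings + n_layers * (attention + mlp + layer_norms) + final_layer_norm


# From the post: vocabulary of 28896 tokens, context length of 1024.
# ASSUMED (not stated in the post): hidden size 2560 and 32 layers.
total = gpt2_param_count(vocab_size=28896, n_positions=1024, d_model=2560, n_layers=32)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the count lands slightly below the quoted 2.7B, partly because the biomedical vocabulary is smaller than the roughly 50K-token vocabulary of general-purpose GPT models, so the embedding table is smaller.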
While CRFM has already made great strides in developing complex models that capture the structure of knowledge in biomedical text, for this project we wanted to keep the model design as simple as possible to demonstrate the power of off-the-shelf LLM training recipes. That way we could reuse the same recipe to train a state-of-the-art GPT model for other domain-specific applications like legal text.
Data. We trained BioMedLM on the PubMed Abstracts and PubMed Central portions of the Pile dataset. This dataset contains around 50B tokens and spans a collection of 16 million abstracts and 5 million full-text articles from the biomedical literature, as curated by the National Institutes of Health.
Compute. Although the dataset contains 50B tokens, this does not directly determine the training budget. GPT models of a similar size to BioMedLM are often trained on significantly more data. For example, GPT3-2.7B and GPT-J were trained on 300B and 400B tokens of data, respectively.
Within this design space, we elected to train BioMedLM for a long compute duration (300B tokens) by performing multiple passes, or epochs, over the 50B tokens. Our results show that even when limited data is available, one can still train a custom, high-quality LLM from scratch!
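The relationship between dataset size and training budget is simple arithmetic. The sketch below uses only the token counts quoted above; the variable names are illustrative.

```python
# Token counts quoted in the post.
dataset_tokens = 50_000_000_000           # PubMed Abstracts + PubMed Central
training_budget_tokens = 300_000_000_000  # tokens seen during training

# Number of full passes over the dataset needed to reach the budget.
epochs = training_budget_tokens / dataset_tokens
print(f"Passes over the dataset: {epochs:g}")

# Reference budgets mentioned above for models of similar size.
reference_budgets = {"GPT3-2.7B": 300_000_000_000, "GPT-J": 400_000_000_000}
for name, tokens in reference_budgets.items():
    print(f"{name}: {tokens / dataset_tokens:g}x the size of the PubMed dataset")
```

In other words, matching the 300B-token budget of comparable models means revisiting each PubMed token about six times.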
Training With MosaicML
To train BioMedLM easily, quickly, and efficiently, we used the MosaicML Cloud for infrastructure and trained the model using MosaicML’s Composer and Streaming Dataset libraries. All model and training code is built off of PyTorch. See the code here!
Using our cloud software stack, we orchestrated training on top of a cluster with 128 NVIDIA A100-40GB GPUs and 1600 Gb/s networking bandwidth between nodes. The physical GPUs were hosted on a leading cloud provider. The total training time for BioMedLM was ~6.25 days. Using placeholder pricing of $2/A100/hr, the total cost for this training run on MosaicML Cloud was ~$38,000.
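The cost figure follows directly from the cluster size, duration, and placeholder price. The sketch below reproduces it and also derives an average throughput, assuming the ~6.25 days covers the entire 300B-token run (the post does not break the time down further).

```python
# Figures quoted in the post.
num_gpus = 128
training_days = 6.25
price_per_gpu_hour = 2.00          # placeholder pricing, USD per A100-hour
training_tokens = 300_000_000_000  # total tokens seen during training

gpu_hours = num_gpus * training_days * 24
total_cost = gpu_hours * price_per_gpu_hour
print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"Estimated cost: ${total_cost:,.0f}")

# ASSUMPTION: the whole wall-clock time was spent on the 300B-token run,
# so this is an average that includes any checkpointing or restart overhead.
tokens_per_second = training_tokens / (training_days * 24 * 3600)
tokens_per_gpu_second = tokens_per_second / num_gpus
print(f"Average throughput: {tokens_per_second:,.0f} tokens/s "
      f"({tokens_per_gpu_second:,.0f} tokens/s per GPU)")
```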
For the optimal LLM training experience, we used Composer with its FSDP integration (FSDP is a PyTorch backend for fully sharded data parallel training). The open-source Composer library makes it easy to train large, custom models across hundreds of GPUs without imposing any restrictions on the model code. For example, we replaced the HuggingFace GPT2Model attention implementation with FlashAttention (Dao et al.), which improved training throughput by nearly 2x while producing a mathematically equivalent model. Composer had no trouble handling the custom model definition, and training time was cut in half! Having the flexibility to easily add and test modifications greatly improved the training efficiency of BioMedLM, and we expect to make similar improvements in future LLM work.
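To illustrate why a tiled attention kernel can be mathematically equivalent to the standard one, here is a minimal pure-Python sketch of the running-max, running-sum softmax trick that block-wise attention relies on. It is not the FlashAttention kernel (which is a fused GPU implementation) and not the code used for BioMedLM; the function names and the toy inputs are hypothetical, and causal masking is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Standard softmax attention for one query: materializes every score."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / denom for i in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same output, computed one key/value block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy inputs (made up for illustration).
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9], [0.4, 0.4], [2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [3.0, -1.0]]

expected = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
max_diff = max(abs(a - b) for a, b in zip(expected, tiled))
print(f"Largest difference between the two outputs: {max_diff:.2e}")
```

The two functions agree up to floating-point rounding, which is the sense in which swapping the attention implementation leaves the model unchanged: only the order of the arithmetic, and therefore the memory traffic, differs.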
To manage a training dataset containing over 100GB of text in a cloud-native way, we used MosaicML’s new StreamingDataset library. This library enables users to host arbitrary data (text, images, etc.) as shards in object storage and then stream that data to a training job anywhere in the world. StreamingDataset works out of the box with vanilla PyTorch DataLoaders, and is compatible with multiple CPU workers, multiple GPUs, and multi-node training.
StreamingDataset made it fast, flexible, and cheap for us to manage a custom training dataset. There was no need to pre-tokenize the data; we were able to store the samples as raw text in object storage. At runtime, we streamed in text samples and tokenized on-the-fly, with no impact on training throughput and no data loader bottlenecks. This flexibility and performance enabled us to test different tokenization schemes for BioMedLM without having to regenerate the dataset.
As one last proof point for StreamingDataset, our final training run for BioMedLM did not use compute from AWS, despite the fact that the dataset was stored on AWS S3. We streamed the data from S3 to MosaicML Cloud without impacting training throughput and without downloading the whole dataset at the start: shards were streamed in as they were needed during the training run and cached after the first epoch. This limited the cost of data egress to <$10 for the whole training run, compared to ~$38,000 for the compute!
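The pattern described in this section (raw text shards fetched on demand, cached after first use, and tokenized at read time) can be sketched with the standard library alone. The class below is a hypothetical toy, not the StreamingDataset API; the in-memory dictionary stands in for object storage, and the shard names and tokenizers are made up for illustration.

```python
from typing import Callable, Dict, Iterator, List


class ToyShardStream:
    """Hypothetical stand-in for a streaming dataset; NOT the StreamingDataset API."""

    def __init__(self, remote: Dict[str, List[str]],
                 tokenize: Callable[[str], List[str]]):
        self.remote = remote      # shard name -> raw text samples ("object storage")
        self.tokenize = tokenize  # applied at read time, so it can be swapped freely
        self.cache: Dict[str, List[str]] = {}  # local copies of shards seen so far
        self.downloads = 0        # counts simulated fetches from remote storage

    def _fetch(self, shard: str) -> List[str]:
        # Download a shard only the first time it is needed.
        if shard not in self.cache:
            self.cache[shard] = list(self.remote[shard])
            self.downloads += 1
        return self.cache[shard]

    def __iter__(self) -> Iterator[List[str]]:
        for shard in sorted(self.remote):
            for text in self._fetch(shard):
                # Samples are stored as raw text and tokenized on the fly.
                yield self.tokenize(text)


remote_storage = {
    "shard-0": ["aspirin reduces fever", "insulin lowers glucose"],
    "shard-1": ["statins lower cholesterol"],
}
stream = ToyShardStream(remote_storage, tokenize=str.split)

epoch_1 = list(stream)
epoch_2 = list(stream)  # served entirely from the local cache
print(f"Samples per epoch: {len(epoch_1)}, shard downloads: {stream.downloads}")

# Trying a different tokenization scheme needs no change to the stored shards.
stream.tokenize = list  # character-level tokens
first_sample_chars = next(iter(stream))
print(f"Character-level token count of first sample: {len(first_sample_chars)}")
```

Because the second epoch reads only from the cache, remote traffic is paid once per shard, which is the mechanism behind the small egress cost reported above.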
How Does BioMedLM Perform?
Let’s cut to the chase: does it work? We evaluated BioMedLM on several question and answer (QA) benchmarks, and manually assessed its generations for a question summarization task. One key benchmark was MedQA-USMLE, which consists of question and answer pairs taken from previous Medical Licensing Exams given to doctors in the United States. As we see in Figure 1, the text contains detailed, technical medical queries related to a variety of health concerns.
In our evaluation we compared our results against several models:
- DRAGON is a state-of-the-art biomedical language model, released last month by members of the CRFM team in a separate effort. The model is pre-trained from text (PubMed Abstracts) and an expert-curated biomedical knowledge graph (i.e., the Unified Medical Language System, also known as UMLS).
- GPT-Neo 2.7B is an LLM of similar size and architecture to BioMedLM, but trained on the Pile and therefore not domain-specific.
- Galactica is a 120B parameter LLM, trained on a corpus of over 48 million papers, textbooks, scientific websites, encyclopedias, and other sources of scientific knowledge across multiple domains.
- BioLinkBERT is another biomedical model trained by Stanford CRFM that uses the link structure of documents during training.
- PubMedBERT is another domain-specific language model for biomedical NLP.
Due to time and resource constraints we did not evaluate other biomedical systems that use different evaluation tasks or setups than ours, such as BioMegatron, GatorTron, and BioGPT. See the CRFM companion blog post for more details on their relation to BioMedLM.
Our results, shown in Figure 2, support three conclusions: LLMs are remarkably versatile, they deliver significant improvements when trained on domain-specific data, and focused models can achieve high quality with relatively few resources.
Conclusion #1: LLMs are remarkably versatile.
BioMedLM outperforms DRAGON on MedQA-USMLE, setting a new state-of-the-art, and matches DRAGON’s performance on two other QA tasks, PubMedQA and BioASQ. Crucially, BioMedLM achieves this without any explicit knowledge graph. These results show that when trained from scratch with domain-specific data, standard LLMs can deliver comparable performance to custom, expert-designed systems.
Like most things in life, this performance does not come for free. BioMedLM has 2.7B parameters compared to DRAGON’s 360M, demonstrating a trade-off between model size and cost on one side and domain expertise and custom architectures on the other. However, the general-purpose nature of LLMs makes them more applicable to different domains. Using the same simple LLM training recipe, we could train a model for legal or financial domain expertise.
Conclusion #2: Pre-training on domain-specific data beats general-purpose data.
BioMedLM outperforms the general-purpose GPT-Neo, a similar model with 2.7B parameters trained on text and code from many domains, by a significant margin (17% on MedQA), even after fine-tuning GPT-Neo for the tasks at hand.
As mentioned above, GPT-Neo is trained on the Pile, a massive corpus that spans PubMed Abstracts and PubMed Central but mainly comprises many other sources, such as Wikipedia, technical text from the US Patent and Trademark Office, GitHub, HackerNews, and Reddit. Although this dataset includes sources of natural language that may boost a model’s ability in the domain (e.g., Wikipedia), or expand the model’s technical fluency across many domains (e.g., USPTO), it also includes sources such as HackerNews and Reddit that potentially mix valuable technical concepts with less trustworthy or outright incorrect and biased language. In total, PubMed Abstracts and PubMed Central, our data of interest, comprise only 17.5% of the Pile.
Conclusion #3: Focused models achieve higher quality with fewer resources.
Although the overall size of our dataset is significantly smaller than the Pile, our custom model outperforms GPT-Neo thanks to our selection of domain-specific data. BioMedLM also outperforms Galactica (120B) on MedQA, and is competitive with it on PubMedQA and BioASQ. A key difference is that Galactica’s results are zero-shot, without the task-specific fine-tuning we apply for other models. While this zero-shot behavior is impressive, Galactica targets myriad scientific domains and is 44x the size of our model, incurring significant costs for training and inference. By focusing on a single scientific domain, we arrive at a much smaller model that can still compete with Galactica in biomedical text.
Large language models offer the promise of new capabilities for many companies and researchers, with the potential to deliver increased quality with less data and compute than often assumed.
There still remains an elephant in the room: model size and training cost. However, through our work on efficient model implementations, training efficiency methods in Composer, and scalable cloud abstraction and orchestration with MosaicML Cloud, we offer an accessible and affordable solution.
The partnership announced today is just the first step of a larger journey to bring state-of-the-art results in biomedical text, and other domains, into the hands of more individuals. As an example, consider our preliminary text generation results with BioMedLM. Below is an example from MeQSum, a medical question summarization benchmark. The task models the behavior of a system that can process a patient query via email or voicemail and present it to a physician in a clear and actionable format.
Input: SUBJECT: ball and socket joint MESSAGE: sir, with due respect I would like to state that there is a pain in my ball and socket joint. I didn't feel and pain in normal or lower position there is a huge site of pain in my ball and socket joint. please prescribe a medicine for the cure.
Output (Current): What are the treatments for joint pain?
Output (BioMedLM 2.7B): What are the treatments for ball and socket joint pain?
From the text we can see that the system must distill a message with ambiguity, misspellings, and misstatements into a version that is succinct yet still preserves the message's intended meaning. BioMedLM’s results are the first step along an exciting journey; however, our work should be seen as a proof-of-concept and is not intended for clinical use. The eventual hope is for a future of trustworthy, interactive AI systems that facilitate reliable interactions while triaging engagement with human experts.
Stay tuned for future improvements and releases of BioMedLM! If you’re interested in training LLMs on your own data, we invite you to schedule a demo.