MosaicML Inference: Secure, Private, and Affordable Deployment for Large Models
As large models like ChatGPT and Stable Diffusion grow in popularity, more and more organizations want access to their capabilities. When cost is not a concern and sending data to a third party is acceptable, public APIs like OpenAI's GPT family offer a great solution.
However, many organizations have strict data privacy requirements and want to minimize costs. When this is the case, building and hosting your own model can provide a secure, cost-effective alternative. The problem is that training and deploying a large, high-quality model is difficult.
To make large models accessible to all organizations, we're releasing MosaicML Inference. We've already helped many companies train large models, and adding this managed inference service lets us provide an end-to-end platform for turning data into production-grade APIs.
Introducing Starter and Enterprise Tiers
MosaicML Inference has two tiers: Starter and Enterprise. With the Starter tier, you can query off-the-shelf models hosted by MosaicML via a public API. This is great for prototyping AI use cases.
With the Enterprise tier, you keep the ease of use but gain security, flexibility, and control. With one line of code, you can turn a saved model checkpoint into a secure, inexpensive API hosted within your own virtual private cloud (VPC).
What models are available?
With the Enterprise tier, you can deploy any model you want. This includes models trained on your own internal data for maximum prediction quality. It also includes all the open-source models available in the Starter tier.
Within the Starter tier, we offer models for text embedding and text completion. Text embedding models turn text into mathematical vectors that can be fed into other machine learning algorithms. For example, Pinterest creates an embedding for each user and uses this embedding to recommend content to the user. The following text embedding models are available:

- Instructor-Large and Instructor-XL: open-source embedding models from HKUNLP. See https://instructor-embedding.github.io/
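To make the idea concrete, here is a minimal sketch of the kind of downstream logic embeddings enable: score how similar two vectors are and recommend the closest match, much like the Pinterest example above. The vectors below are made-up placeholders; real embedding models return vectors with hundreds of dimensions.

    import numpy as np

    def cosine_similarity(a, b):
        # Standard similarity measure for embeddings: 1.0 = same direction, ~0.0 = unrelated.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Placeholder embeddings for a user's interests and two candidate articles.
    user = np.array([0.12, 0.87, 0.05, 0.33])
    articles = {
        "article_a": np.array([0.10, 0.80, 0.02, 0.40]),
        "article_b": np.array([0.90, 0.05, 0.70, 0.01]),
    }

    # Recommend whichever article sits closest to the user in embedding space.
    scores = {name: cosine_similarity(user, vec) for name, vec in articles.items()}
    print(max(scores, key=scores.get))  # article_a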
Text completion models take in a snippet of text called a “prompt” and produce a continuation of that snippet. These models are similar to ChatGPT, though they do not natively support interactivity — just standalone responses to single prompts. The Starter tier features text completion models ranging in size from 1 to 20 billion parameters.

- GPT2-XL: an open-source model from OpenAI. See https://huggingface.co/gpt2-xl
- MPT-7B-Instruct: an open-source model from MosaicML. See https://huggingface.co/mosaicml
- Dolly-12B: an open-source model from Databricks. See https://huggingface.co/databricks/dolly-v2-12b
- GPT-NeoX-20B: an open-source model from EleutherAI. See https://huggingface.co/EleutherAI/gpt-neox-20b
Is it ready for the enterprise?
MosaicML Inference’s Enterprise tier is designed to meet the strict security, privacy, and DevOps requirements of our enterprise customers.
Security and Privacy. Because MosaicML Inference can be deployed within your own virtual private cloud (VPC), data never has to leave your secure environment. This lets you provide the AI features your organization needs while staying compliant with standards and regulations such as SOC 2 and HIPAA.

Cost effectiveness. MosaicML Inference is highly optimized to give you low latency and high hardware utilization. Moreover, it can handle even huge models that don't fit in a single GPU's memory. We’ve profiled MosaicML Inference extensively and found that it can be several times cheaper than alternatives for a given query load.
Scalability and Fault Tolerance. You can scale up to as many machines as you need to support high query loads. Scaling down under low load is just as easy. Automatic failure handling ensures you can maintain high availability.
Multi-Cloud Orchestration. MosaicML Inference makes it easy to deploy on AWS, GCP, Azure, OCI, CoreWeave, your own hardware, and more. The ability to spin up inference clusters across clouds lets MosaicML Inference reduce vendor lock-in rather than increase it.

Monitoring. MosaicML Inference offers detailed reporting on cluster and model metrics to support enterprise-grade DevOps.
Endpoints. Easily query your models using a REST API, gRPC, or a web interface. Difficult features like word-by-word output streaming and dynamic query batching are already set up for you.
Can I build this myself?
Given enough time and money, you can build your own inference service. If you:
- Only want to deploy a small model
- Don’t care about the model being executed efficiently
- Have no security, authentication, availability, monitoring, or fault tolerance requirements
- Have an easy way to expose a new API endpoint
Then deploying your model can be as simple as following an online tutorial.
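In that simple case, a deployment can be as small as the sketch below, which loads a small open-source model and exposes a single prediction route. (FastAPI and GPT-2 are illustrative choices here, not part of MosaicML Inference.)

    from fastapi import FastAPI
    from transformers import pipeline

    app = FastAPI()
    # A small model that fits comfortably on a single GPU (or even a CPU).
    generator = pipeline("text-generation", model="gpt2")

    @app.post("/predict")
    def predict(payload: dict):
        # One synchronous call per request: no batching, streaming, auth, monitoring, or failover.
        outputs = generator(payload["prompt"], max_new_tokens=64)
        return {"completion": outputs[0]["generated_text"]}

Point a server such as uvicorn at this app and you have a working endpoint, but none of the harder requirements are addressed.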
If you don’t meet the above conditions, however, deployment can be far more complex. Making any application reliable enough for enterprise SLAs is challenging, and large models add extra hurdles. To avoid low utilization, high latency, and out-of-memory errors, these models require an efficient, often distributed, runtime. This runtime entails not only model sharding and intra-model optimizations, but also features like asynchronous output streaming and intelligent request batching. To make matters worse, there are few people in the world with experience getting huge models into production, and most organizations lack access to this talent pool.
How do I use it?
To use the Starter tier, simply sign up for a MosaicML account and begin sending requests to one of our public APIs. For example, submitting a simple prompt to our MPT-7B-Instruct text completion model and reading back its response takes just a few lines of code.
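A minimal sketch from Python follows; the endpoint URL, header, and JSON field names are placeholders rather than the exact API (our API reference has the real format), and the response in the final comment is purely illustrative.

    import requests

    # Placeholder URL and API key: substitute the values from your MosaicML account.
    API_URL = "https://<your-mosaicml-endpoint>/mpt-7b-instruct/v1/predict"
    HEADERS = {"Authorization": "<YOUR_API_KEY>"}

    payload = {"inputs": ["Write three bullet points explaining why data privacy matters."]}
    response = requests.post(API_URL, headers=HEADERS, json=payload)
    print(response.json())
    # Illustrative response shape: {"outputs": ["1. It builds customer trust. 2. ..."]}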
When self-hosting a model with our Enterprise tier, you can configure the endpoint to use any model checkpoint, any Docker container, any cloud provider, any number of server replicas, and more. Getting your model deployed takes just a few steps.
Step 1: Customize deployment settings
We’ll start by specifying what Docker image to use, how many GPUs to employ, where the model checkpoint is stored, and other necessary information.
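For illustration, those settings might be collected in a configuration like the one below. The field names are placeholders rather than the exact schema; our deployment docs have the real one.

    # Illustrative deployment settings; field names are placeholders, not the exact schema.
    deployment_config = {
        "name": "my-llm-endpoint",
        "image": "my-registry/my-inference-image:latest",          # Docker image to run
        "gpu_type": "a100_40gb",                                    # hardware to serve on
        "gpu_num": 2,                                               # GPUs per replica
        "replicas": 1,                                              # number of server replicas
        "checkpoint_path": "s3://my-bucket/checkpoints/my-model",   # saved model weights
        "cloud": "aws",                                              # which cloud or cluster to target
    }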
Step 2: Launch the deployment
Next, we use this configuration to create an inference endpoint.
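Sketched in Python, with a hypothetical create_inference_deployment helper standing in for the actual client call:

    # Hypothetical helper name; the real deployment command may differ.
    deployment = create_inference_deployment(deployment_config)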
At this point, we’re technically done: the model service will start as soon as your cluster has availability. There are a few more commands we might want to run, though.
Step 3 (optional): Test the deployment
Most simply, we can check whether our endpoint is live.
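A sketch, with a hypothetical ping helper standing in for the actual call:

    # Hypothetical helper name; returns quickly if the endpoint is up and serving.
    ping(deployment.name)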
Similarly, we can send the API a test request from the command line or from any other HTTP client.
Step 4 (optional): Remove your deployment once no longer needed
To remove your deployment and free up its resources, simply delete it.
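A sketch, again with a hypothetical helper name in place of the actual call:

    # Hypothetical helper name; tears down the endpoint and its replicas.
    delete_inference_deployment(deployment.name)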
This immediately stops the usage-based billing so that you only pay for what you use.
All of the above functionality is also available through our client library in Python, so the whole deployment lifecycle can be scripted end to end.
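A compact sketch, with hypothetical module and function names standing in for the exact client calls (see the client library docs for those):

    # Placeholder module and function names; the real client library API may differ.
    from inference_client import (create_inference_deployment, ping, predict,
                                  delete_inference_deployment)

    deployment = create_inference_deployment(deployment_config)      # launch (Step 2)
    ping(deployment.name)                                            # liveness check (Step 3)
    print(predict(deployment.name, {"inputs": ["Hello, world!"]}))   # test query (Step 3)
    delete_inference_deployment(deployment.name)                     # tear down (Step 4)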
Because you retain full control over the code and Docker image, you have the flexibility to use any model you want, write any logic you need before or after the model, and satisfy any other requirements your DevOps or machine learning teams have.
To learn more about deploying custom models with MosaicML Inference, see our documentation, our image generation example, or our text embedding example.
Conclusion
For organizations that care about cost, data privacy, multi-cloud support, or simply time to value, MosaicML Inference is a great option for deploying large models. If you’re interested in using it, you can sign up here to get started.
P.S.: for those curious about what’s happening under the hood, we’ll be releasing a technical deep dive in the next few weeks. You can follow us on Twitter or join our Slack to stay up to date.