MosaicML Inference: Secure, Private, and Affordable Deployment for Large Models

We’re releasing a fully managed inference service to make deploying machine learning models as easy as possible. You can query off-the-shelf models with our Starter tier or securely deploy in-house models in your own environment with our Enterprise tier. By using MosaicML for both training and deployment, you can easily turn your data into production-grade AI services — often at a fraction of the cost of alternatives — without compromising data privacy.
Figure 1: Hosting a model using MosaicML Inference is far cheaper than using an OpenAI API with a similar model size. This holds for text and code generation models, text embedding models, and image generation models. It’s also cheaper to use the APIs in our Starter tier than similar OpenAI APIs. All MosaicML measurements are taken on 40GB NVIDIA A100s with standard 512-token input sequences or 512x512 images.

As large models like ChatGPT and Stable Diffusion grow in popularity, more and more organizations want access to their capabilities. When cost is not a concern and sending data to a third party is acceptable, public APIs like OpenAI's GPT family offer a great solution.

However, many organizations have strict data privacy requirements and want to minimize costs. When this is the case, building and hosting your own model can provide a secure, cost-effective alternative. The problem is that training and deploying a large, high-quality model is difficult.

To make large models accessible to all organizations, we're releasing MosaicML Inference. We've already helped many companies train large models, and adding this managed inference service lets us provide an end-to-end platform for turning data into production-grade APIs.

Introducing Starter and Enterprise Tiers

MosaicML Inference has two tiers: Starter and Enterprise. With the Starter tier, you can query off-the-shelf models hosted by MosaicML via a public API. This is great for prototyping AI use cases.

With the Enterprise tier, you keep the ease of use but gain security, flexibility, and control. With one line of code, you can turn a saved model checkpoint into a secure, inexpensive API hosted within your own virtual private cloud (VPC).

Figure 2: The Starter tier lets you query an open source model hosted by MosaicML. The Enterprise tier lets you query any model from within the security of your own virtual private cloud.

What models are available?

With the Enterprise tier, you can deploy any model you want. This includes models trained on your own internal data for maximum prediction quality. It also includes all the open-source models available in the Starter tier.

Within the Starter tier, we offer models for text embedding and text completion. Text embedding models turn text into mathematical vectors that can be fed into other machine learning algorithms. For example, Pinterest creates an embedding for each user and uses this embedding to recommend content to the user. The following text embedding models are available:

Figure 3: Text embedding models available in the Starter tier.
Instructor-Large and Instructor-XL are open source models from HKUNLP. See https://instructor-embedding.github.io/
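
As a simplified illustration of how these vectors get used downstream, the snippet below compares two embeddings with cosine similarity. The vectors are made up for the example; in practice they would come from one of the embedding models above.

# Toy illustration of using text embeddings downstream (the vectors here are made up).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity close to 1.0 means the two texts are semantically similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, these would be returned by an embedding model for two pieces of text.
doc_embedding = np.array([0.12, -0.53, 0.88, 0.04])
query_embedding = np.array([0.10, -0.50, 0.80, 0.00])

print(cosine_similarity(doc_embedding, query_embedding))  # high score -> similar content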

Text completion models take in a snippet of text called a “prompt” and produce a continuation of that snippet. These models are similar to ChatGPT, though they do not natively support interactivity — just standalone responses to single prompts. The Starter tier features text completion models ranging in size from 1 to 20 billion parameters.

Figure 4: Text completion models available in the Starter tier.
GPT2-XL is an open source model from OpenAI. See https://huggingface.co/gpt2-xl
MPT-7B-Instruct is an open source model from MosaicML. See https://huggingface.co/mosaicml
Dolly-12B is an open source model from Databricks. See https://huggingface.co/databricks/dolly-v2-12b
GPT-NeoX-20B is an open source model from EleutherAI. See https://huggingface.co/EleutherAI/gpt-neox-20b
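
Since several of these models are open source, you can try the prompt-and-continuation behavior locally. For example, GPT2-XL (linked above) can be run with the Hugging Face transformers library; this is just a local illustration, separate from the hosted Starter tier APIs.

# Local illustration of text completion with the open source GPT2-XL model.
# This is separate from the hosted Starter tier; it just shows prompt -> continuation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-xl")
result = generator(
    "Write 3 reasons why you should train an AI model on domain-specific data.",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
)
print(result[0]["generated_text"])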

Is it ready for the enterprise?

MosaicML Inference’s Enterprise tier is designed to meet the strict security, privacy, and DevOps requirements of our enterprise customers.

Security and Privacy. Because MosaicML Inference can be deployed within your own virtual private cloud (VPC), data never has to leave your secure environment. This lets you provide the AI features your organization needs while meeting compliance requirements like SOC 2 and HIPAA.

Figure 5: The Enterprise tier lets you securely host your model in your own infrastructure, lowering cost and preserving data privacy.

Cost effectiveness. MosaicML Inference is highly optimized to give you low latency and high hardware utilization. Moreover, it can handle even huge models that don't fit in a single GPU's memory. We’ve profiled MosaicML Inference extensively and found that it can be several times cheaper than alternatives for a given query load.

Scalability and Fault Tolerance. You can scale up to as many machines as you need to support high query loads. Scaling down under low load is just as easy. Automatic failure handling ensures you can maintain high availability.

Multi-Cloud Orchestration. MosaicML Inference makes it easy to deploy on AWS, GCP, Azure, OCI, CoreWeave, your own hardware, and more. The ability to spin up inference clusters across clouds lets MosaicML Inference reduce vendor lock-in rather than increase it.

Figure 6: The Enterprise tier works with all major public clouds, as well as on-premise.

Monitoring. MosaicML Inference offers detailed reporting on cluster and model metrics to support enterprise-grade DevOps.

Endpoints. Easily query your models using a REST API, gRPC, or a web interface. Difficult features like word-by-word output streaming and dynamic query batching are already set up for you.

Can I build this myself?

Given enough time and money, you can build your own inference service. If you:

  • Only want to deploy a small model
  • Don’t care about the model being executed efficiently
  • Have no security, authentication, availability, monitoring, or fault tolerance requirements
  • Have an easy way to expose a new API endpoint

then deploying your model can be as simple as following an online tutorial.

If you don’t meet the above conditions, however, deployment can be far more complex. Making any application reliable enough for enterprise SLAs is challenging, and large models add extra hurdles. To avoid low utilization, high latency, and out-of-memory errors, these models require an efficient, often distributed, runtime. This runtime entails not only model sharding and intra-model optimizations, but also features like asynchronous output streaming and intelligent request batching. To make matters worse, there are few people in the world with experience getting huge models into production, and most organizations lack access to this talent pool.
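
To make the batching point concrete, here is a minimal conceptual sketch of dynamic request batching in Python. It is not MosaicML’s implementation; it just shows the general idea of grouping requests that arrive within a short window so the model runs one larger batch instead of many tiny ones.

# Conceptual sketch of dynamic request batching (not MosaicML's implementation).
import queue
import threading
import time

MAX_BATCH_SIZE = 8       # assumed cap on requests per model call
MAX_WAIT_SECONDS = 0.01  # how long to wait for more requests before running the batch

request_queue = queue.Queue()

def fake_model(prompts):
    # Stand-in for a real (possibly sharded) model forward pass over a whole batch.
    return [p.upper() for p in prompts]

def batching_loop():
    while True:
        prompt, reply_q = request_queue.get()  # block until at least one request arrives
        prompts, reply_queues = [prompt], [reply_q]
        deadline = time.time() + MAX_WAIT_SECONDS
        while len(prompts) < MAX_BATCH_SIZE:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                prompt, reply_q = request_queue.get(timeout=remaining)
            except queue.Empty:
                break
            prompts.append(prompt)
            reply_queues.append(reply_q)
        # One larger model call instead of many single-request calls.
        for output, reply_q in zip(fake_model(prompts), reply_queues):
            reply_q.put(output)

threading.Thread(target=batching_loop, daemon=True).start()

def predict(prompt):
    reply_q = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()

print(predict("hello, inference service"))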

How do I use it?

To use the Starter tier, simply sign up for a MosaicML account and begin sending requests to one of our public APIs. For example, here’s how to submit a simple prompt to our MPT-7B-Instruct text completion model, and an example response:

$ curl https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1/predict \
    -H "Authorization: your-api-key" \
    -H "Content-Type: application/json" \
    -d '{"temperature": 0.01, "input_strings": ["Write 3 reasons why you should train an AI model on domain-specific data."]}'


{
  "data": [
    "1. The model will be more accurate.\n2. The model will be more efficient.\n3. The model will be more interpretable."
  ]
}
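
The same request can also be made from Python. The snippet below simply mirrors the curl call above using the requests library; substitute your own API key.

# Python equivalent of the curl example above, using the requests library.
import requests

response = requests.post(
    "https://models.hosted-on.mosaicml.hosting/mpt-7b-instruct/v1/predict",
    headers={
        "Authorization": "your-api-key",  # replace with your MosaicML API key
        "Content-Type": "application/json",
    },
    json={
        "temperature": 0.01,
        "input_strings": [
            "Write 3 reasons why you should train an AI model on domain-specific data."
        ],
    },
)
print(response.json()["data"][0])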

When self-hosting a model with our Enterprise tier, you can configure the endpoint to use any model checkpoint, any Docker container, any cloud provider, any number of server replicas, and more. Getting your model deployed takes just a few steps.

Step 1: Customize deployment settings

We’ll start by specifying what Docker image to use, how many GPUs to employ, where the model checkpoint is stored, and other necessary information.

# Sample input YAML to deploy a Hugging Face GPT2 model
name: my-gpt2
gpu_num: 1
gpu_type: a100_40gb
image: mosaicml/inference:latest
replicas: 1
model:
  checkpoint_path: gpt2
  hf_model:
    task: text-generation

Step 2: Launch the deployment

Next, we use this configuration to create an inference endpoint.

mcli deploy -f inference-deployment-example.yaml

At this point, we’re technically done: the model service will start as soon as your cluster has availability. There are a few more commands we might want to run, though.

Step 3 (optional): Test the deployment

Most simply, we can check if our endpoint is live via:

mcli ping

Similarly, we can test this API from the command line:

mcli predict --inputs '{"input_strings": ["Write a list of reasons why you should train an AI on domain specific data"], "temperature": 0.8, "max_length": 256}'

Step 4 (optional): Remove your deployment once no longer needed

To remove your deployment and free up its resources, you can delete it with:

mcli delete deployments my-gpt2-*

This immediately stops the usage-based billing so that you only pay for what you use.

All of the above functionality is also available through our client library in Python:

from mcli.sdk import *

# Load in deployment config from yaml file
config = InferenceDeploymentConfig.from_file('inference-deployment-example.yaml')

# Create the deployment based on the config
inference_deployment = create_inference_deployment(config)

# Check whether the deployment is live
ping(inference_deployment)

# Run inference on inputs
input_params = {
    'input_strings': ["Write a list of reasons why you should train an AI on domain-specific data"],
    'temperature': 0.8,
    'max_length': 256,
}
output = predict(inference_deployment, input_params)

# Delete inference deployment
delete_inference_deployments(inference_deployment)

Because you retain full control over the code and Docker image, you have the flexibility to use any model you want, write any logic you need before or after the model, and satisfy any other requirements your DevOps or machine learning teams have.
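
For example, if your team needs custom logic around the model, a wrapper like the purely illustrative one below (preprocessing before the model call, postprocessing after) can live inside the Docker image you deploy. The class and method names here are hypothetical and are not part of the MosaicML SDK.

# Hypothetical example of wrapping a model with custom pre- and post-processing.
# The class and method names are illustrative only; they are not the MosaicML interface.
from transformers import pipeline

class CustomTextHandler:
    def __init__(self, model_name: str = "gpt2"):
        self.generator = pipeline("text-generation", model=model_name)

    def preprocess(self, raw_input: str) -> str:
        # Example pre-model logic: strip whitespace and prepend an instruction template.
        return f"Instruction: {raw_input.strip()}\nResponse:"

    def postprocess(self, generated: str) -> str:
        # Example post-model logic: keep only the text after the "Response:" marker.
        return generated.split("Response:", 1)[-1].strip()

    def predict(self, raw_input: str) -> str:
        prompt = self.preprocess(raw_input)
        output = self.generator(prompt, max_new_tokens=64)[0]["generated_text"]
        return self.postprocess(output)

handler = CustomTextHandler()
print(handler.predict("  Summarize why domain-specific training data matters.  "))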

To learn more about deploying custom models with MosaicML Inference, see our documentation, our image generation example, or our text embedding example.

Conclusion

For organizations that care about cost, data privacy, multi-cloud support, or simply time to value, MosaicML Inference is a great option for deploying large models. If you’re interested in using it, you can sign up here to get started.

P.S.: for those curious about what’s happening under the hood, we’ll be releasing a technical deep dive in the next few weeks. You can follow us on Twitter or join our Slack to stay up to date.
