Build AI Models on Any Cloud in Your Secure Environment

In this blog, we discuss how the architecture of the MosaicML platform enables you to easily train large-scale AI models on any cloud provider, while data remains secure on your own private network. Now, both startups and large enterprises can maintain maximum autonomy when training ML workloads.

For organizations with data privacy and security concerns, sending data to an external third-party API is simply not an option, despite the business opportunities that large language models (LLMs) and other advanced AI can bring.

Luckily, the MosaicML platform enables you to pretrain or finetune and deploy models using your custom data, all in-house. With full model ownership and data privacy, regulated industries such as financial services and healthcare can leverage the full capabilities of custom LLMs for their business use cases without depending on third-party APIs.

As shown in our previous blog post, the MosaicML platform is an indispensable tool for modern ML research. MosaicML abstracts away the complexity of infrastructure at scale and takes care of operational challenges such as multi-node orchestration, cluster administration, and node health monitoring, so that your team can stay focused on developing cutting-edge AI models. On top of that, our platform has built-in speedups that automatically cut your training times and costs so that you can iterate quickly.

Figure 1: Your team can queue up hundreds of training runs and view their statuses in real-time using the MosaicML platform.

MosaicML believes that everyone should be able to leverage the latest advancements in AI, regardless of their organization’s resources and requirements. That’s why we’ve designed the MosaicML platform to meet you where you are, whether you are an established organization with existing cloud deployments and security requirements, or an up-and-coming startup seeking GPU availability for ML training workloads.

It’s all possible thanks to a simple control plane/compute plane architecture that is common to all deployments.

The Split-Plane Architecture

We discussed the control plane and compute plane when we introduced the MosaicML platform, but let’s recap.

Figure 2: The MosaicML platform has three parts: the client interfaces, the control plane, and the compute plane.

When a user submits a run from a client, it first lands in the Control Plane, a collection of services running on MosaicML servers. The control plane is the orchestration engine of the MosaicML platform, and contains the logic for advanced features like multi-cloud scheduling, run preemption and resumption, and multi-node run scaling.

The Compute Plane is where runs execute. The compute plane is led by a lightweight worker daemon which communicates with the control plane through periodic heartbeats. The worker sends cluster status information to the control plane, and in return receives details for new runs to execute.
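
To make the heartbeat exchange concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the platform's implementation: the `ControlPlane` and `Worker` classes, their fields, and the first-fit assignment rule are all hypothetical, and the control plane is simulated in memory rather than reached over a network.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Hypothetical in-memory stand-in for the control plane's run queue."""
    pending_runs: deque = field(default_factory=deque)
    cluster_status: dict = field(default_factory=dict)

    def heartbeat(self, cluster: str, free_gpus: int) -> list:
        # Record the status the worker just reported.
        self.cluster_status[cluster] = free_gpus
        # Hand back queued runs, in order, while they fit in the free capacity.
        assigned = []
        while self.pending_runs and self.pending_runs[0]["gpus"] <= free_gpus:
            run = self.pending_runs.popleft()
            free_gpus -= run["gpus"]
            assigned.append(run)
        return assigned


@dataclass
class Worker:
    """Hypothetical worker daemon living in the compute plane."""
    cluster: str
    total_gpus: int
    running: list = field(default_factory=list)

    def free_gpus(self) -> int:
        return self.total_gpus - sum(r["gpus"] for r in self.running)

    def tick(self, control_plane: ControlPlane) -> None:
        # One heartbeat: the worker initiates the call, reports its status,
        # and starts whatever runs it receives in return.
        new_runs = control_plane.heartbeat(self.cluster, self.free_gpus())
        self.running.extend(new_runs)


cp = ControlPlane()
cp.pending_runs.extend([
    {"name": "run-a", "gpus": 8},
    {"name": "run-b", "gpus": 8},
    {"name": "run-c", "gpus": 16},
])
worker = Worker(cluster="my-vpc-cluster", total_gpus=16)
worker.tick(cp)
print([r["name"] for r in worker.running])  # run-a and run-b fit; run-c stays queued
```

In this sketch every exchange is started by the worker, so the control plane never needs to open a connection into the cluster.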

This two-plane architecture makes adding new GPU clusters to the MosaicML platform simple. The control plane can be safely shared across all deployments, so only the comparatively simple compute plane needs to be deployed to the new cluster. We’ve designed the compute plane to be portable and lightweight, letting it be deployed to any Kubernetes cluster, including one within your organization’s virtual private cloud (VPC).

We know that datasets are among an ML enterprise’s most important IP, and we designed this two-plane architecture to give you the strongest guarantees of data security. Despite the functionality it provides, the control plane only handles run metadata, such as compute resource requirements and Docker image names. This means you can deploy the compute plane into your VPC and use the MosaicML platform without your confidential data ever needing to leave your private network.
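
One way to picture this boundary is as two separate types, only one of which can ever be serialized for the control plane. The Python sketch below is illustrative only: `RunMetadata`, `PrivateRunInputs`, their fields, and `control_plane_payload` are hypothetical names chosen for this example, not the platform's actual schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    """Run metadata of the kind the control plane handles (hypothetical fields)."""
    run_name: str
    image: str  # Docker image name
    gpus: int   # compute resource requirement


@dataclass(frozen=True)
class PrivateRunInputs:
    """Hypothetical inputs that are resolved only inside the compute plane."""
    dataset_path: str
    storage_credentials: str


def control_plane_payload(meta: RunMetadata) -> dict:
    # The function accepts RunMetadata only, so dataset locations and
    # credentials have no path into the outbound payload.
    return asdict(meta)


meta = RunMetadata(run_name="finetune-1", image="my-registry/train:latest", gpus=8)
private = PrivateRunInputs(dataset_path="/mnt/private/data", storage_credentials="stays-in-vpc")
payload = control_plane_payload(meta)
print(sorted(payload))  # only the metadata field names appear
```

Keeping the two kinds of data in different types makes the separation easy to audit: anything that crosses the network has to pass through a single function with a narrow signature.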

Additionally, we have built our platform with a security-first mindset, and make continual updates to ensure compliance with best practices across industries. The compute plane relies solely on egress networking, making it easy to deploy behind a firewall. We are independently audited, and we are currently in the process of attaining industry-standard compliance certifications. 

If you’re interested in developing your own custom AI models in your secure environment while maintaining full data privacy, contact us for a demo.

Deployment Options 

We designed the MosaicML platform to be flexible enough that any Kubernetes cluster can execute runs. This enables several different types of deployment, depending on your organization’s needs.

Let’s explore a few of the options you can choose from when deploying the MosaicML platform.

Use Your Existing Infrastructure 

For many organizations, training models on infrastructure you don’t control is not an option. Perhaps data security is vital to you, and you cannot risk your datasets leaving your own private network. Perhaps you have existing compute on a cloud service provider, either in the form of long-term capacity reservation or credits that you’re looking to use.

For maximum control over your data and infrastructure, you can deploy the compute plane directly onto your own cluster. We’ve invested in making the compute plane lightweight and portable, so deploying it is easy. The only requirement is Kubernetes. And, thanks to our extensive experimentation, the MosaicML platform provides preset configurations for instance types on all major cloud service providers, allowing efficient multi-node training workloads to work straight out of the box.

Figure 3: The MosaicML platform allows you to keep your datasets, code, and models safe within your private network, while still leveraging the orchestration capabilities of the control plane.

This style of deployment provides inherent data security. While non-sensitive metadata like run manifests and resource requirements will still transit through the MosaicML control plane, your workloads can load datasets and other secrets directly from your private network. This allows you to leverage the power of MosaicML while your sensitive data stays inside that network.

Figure 4: Built with a security-first mindset, the MosaicML platform keeps all sensitive data, such as your datasets, models, and source code, in the compute plane. Application data like run metadata, compute resource requirements, and user information is stored in the control plane.

Need to Get Started Quickly? Use Our Infrastructure

For smaller organizations, it can be difficult to set up a cluster capable of training state-of-the-art models. Cloud providers generally have limited GPU availability outside of long-term capacity reservations. Furthermore, multi-node workloads have complicated networking hardware requirements that can be difficult to identify without significant experimentation. Time spent tackling these infrastructure challenges is time your ML team can’t spend training models.

With a MosaicML-managed cluster, we focus on infrastructure so that you can stay focused on building the best models for your business.

Figure 5: Scale your training with minimal setup. MosaicML can provide fully-managed clusters running on the latest, most advanced hardware.

MosaicML’s own researchers conduct research on clusters like these, so we know that they are capable of achieving the best performance on industry-relevant benchmarks. We use only the latest GPUs, and we connect all nodes with high-speed networking for optimal performance on multi-node workloads. When nodes fail (an inevitable reality in large-scale ML), we take responsibility for replacing the failed node with a new one to minimize disruption to your workload.

On-Premise Deployment

For organizations with the strictest infrastructure and compliance requirements, the MosaicML platform also supports fully on-premise deployments. In this approach, both the control plane and the compute plane are deployed onto your servers, for ultimate data security. Contact us if you’d like to learn more about this!

Figure 6: On-premise deployment, where zero traffic goes in or out of your private network, is possible for organizations with the strictest compliance requirements.

A Multi-Cloud Platform with Zero Vendor Lock-In

Finally, who’s to say you only need one cluster? For maximum flexibility, a single organization can deploy the MosaicML platform to multiple clusters. You can even mix and match different deployment types, such as a MosaicML-managed cluster and a private cluster. Regardless of your setup, we provide a consistent interface for submitting runs to any cluster type.

Figure 7: A demonstration of using Mosaic CLI, a client interface for the MosaicML platform. Using a single command parameter, users can view available clusters and load model checkpoints to a different cloud provider.

Multi-cloud deployments also open the door to novel training workflows. For instance, you might use a MosaicML-managed cluster for pre-training a model on public data, but then load a checkpoint into a private cluster for fine-tuning on private data. You can set quotas on a per-cluster basis, enabling your admins to limit users’ access to heavy-traffic clusters while still allowing full access to lower-traffic ones.
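
As a rough illustration of per-cluster quotas, the Python sketch below tracks GPU usage per user and cluster and rejects submissions that would exceed a cluster's limit. Everything here is an assumption made for the example: the `QuotaTracker` class, the cluster names, and the rule that a cluster without an entry is unlimited are hypothetical and do not describe how the platform enforces quotas.

```python
from collections import defaultdict


class QuotaTracker:
    """Hypothetical per-cluster GPU quota check."""

    def __init__(self, quotas):
        # quotas maps cluster name -> max GPUs a single user may hold there.
        # Clusters without an entry are treated as unlimited in this sketch.
        self.quotas = quotas
        self.usage = defaultdict(int)  # (user, cluster) -> GPUs in use

    def try_submit(self, user, cluster, gpus):
        limit = self.quotas.get(cluster)
        in_use = self.usage[(user, cluster)]
        if limit is not None and in_use + gpus > limit:
            return False  # would exceed the quota on this cluster
        self.usage[(user, cluster)] = in_use + gpus
        return True


tracker = QuotaTracker({"shared-managed": 8})
first = tracker.try_submit("alice", "shared-managed", 8)   # fits within the quota
second = tracker.try_submit("alice", "shared-managed", 8)  # would exceed the quota
third = tracker.try_submit("alice", "private-vpc", 64)     # no quota set on this cluster
print(first, second, third)  # True False True
```

Because the limit is keyed by cluster, an admin can restrict a heavy-traffic cluster without affecting what the same user can do on a lower-traffic one.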

We’ll be talking more about the opportunities multi-cloud deployments bring, along with the challenges we faced to make them possible, in a future blog post.

Try Out the MosaicML Platform Today

The MosaicML platform is an invaluable tool to take your ML training to the next level, and in this post, we explored how easy it is to get started, wherever your organization is today. The platform’s architecture is designed for maximum autonomy and control so that you can easily change your cloud service provider to fit your organization’s needs over time.

If you’re interested in training your own custom state-of-the-art AI on your own data, contact us for a demo, and check out our demo video. As always, we welcome you to follow us on Twitter and join our Community Slack to keep up with our latest product updates.
