Build AI Models on Any Cloud in Your Secure Environment
For organizations with data privacy and security concerns, sending your data to an unreliable third-party API is simply not an option, despite the bountiful business opportunities that large language models (LLMs) and other advanced AI can bring.
Luckily, the MosaicML platform enables you to pretrain or finetune and deploy models using your custom data, all in-house. With full model ownership and data privacy, regulated industries such as financial services and healthcare can leverage the full capabilities of custom LLMs for their business use cases without unreliable dependencies on third-party APIs.
As shown in our previous blog post, the MosaicML platform is an indispensable tool for modern ML research. MosaicML abstracts away the complexity of infrastructure at scale and takes care of operational challenges such as multi-node orchestration, cluster administration, and node health monitoring so that your team can stay focused on developing cutting-edge AI models. On top of that, our platform has built-in speedups that automatically cut down your training times and costs so that you can iterate quickly.

MosaicML believes that everyone should be able to leverage the latest advancements in AI, regardless of their organization’s resources and requirements. That’s why we’ve designed the MosaicML platform to be able to meet you where you are, regardless of whether you are an established organization with existing cloud deployments and security requirements, or an up-and-coming startup seeking availability for ML training workloads.
It’s all possible thanks to a simple control plane/compute plane architecture that is common to all deployments.
The Split-Plane Architecture
We discussed the control plane and compute plane when we introduced the MosaicML platform, but let’s recap.
When a user submits a run from a client, it first lands in the Control Plane, a collection of services running on MosaicML servers. The control plane is the orchestration engine of the MosaicML platform, and contains the logic for advanced features like multi-cloud scheduling, run preemption and resumption, and multi-node run scaling.
The Compute Plane is where runs execute. The compute plane is led by a lightweight worker daemon which communicates with the control plane through periodic heartbeats. The worker sends cluster status information to the control plane, and in return receives details for new runs to execute.
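To make this heartbeat protocol concrete, here is a minimal sketch of what a worker daemon's main loop might look like. This is purely illustrative: the endpoint URL, payload fields, and polling interval are hypothetical stand-ins, not the actual MosaicML API.

```python
import time
import requests

# Hypothetical control plane endpoint and polling interval (illustrative only).
CONTROL_PLANE_URL = "https://control-plane.example.com/heartbeat"
POLL_INTERVAL_SECONDS = 30

def collect_cluster_status() -> dict:
    """Gather lightweight cluster state: node health, free GPUs, active runs."""
    return {"healthy_nodes": 8, "free_gpus": 16, "active_runs": ["run-abc"]}

def launch_run(run_spec: dict) -> None:
    """Hand the run spec (image, command, resources) to the local scheduler."""
    print(f"launching {run_spec['name']} with image {run_spec['image']}")

while True:
    # The worker initiates every request (egress only), so no inbound ports are required.
    response = requests.post(CONTROL_PLANE_URL, json=collect_cluster_status(), timeout=10)
    for run_spec in response.json().get("runs_to_start", []):
        launch_run(run_spec)
    time.sleep(POLL_INTERVAL_SECONDS)
```

Because all communication flows outward from the compute plane, this design lends itself to firewall-friendly, egress-only deployments.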
This two-plane architecture makes adding new GPU clusters to the MosaicML platform simple. The control plane can be safely shared across all deployments, so only the comparatively simple compute plane needs to be deployed to the new cluster. We’ve designed the compute plane to be portable and lightweight, letting it be deployed to any Kubernetes cluster, including one within your organization’s virtual private cloud (VPC).
We know that datasets are among an ML enterprise’s most important IP, and we designed this two-plane architecture to give you the strongest guarantees of data security. Despite the functionality it provides, the control plane only handles run metadata, such as compute resource requirements and Docker image names. This means you can deploy the compute plane into your VPC and use the MosaicML platform without your confidential data ever needing to leave your private network.
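To illustrate the split, here is a hedged sketch of the kind of information each side handles. The field names and values below are hypothetical examples, not the platform's actual run schema.

```python
from dataclasses import dataclass, field

@dataclass
class RunMetadata:
    """What transits the control plane: scheduling details only (illustrative fields)."""
    name: str      # run name
    image: str     # Docker image name
    gpus: int      # compute resource requirements
    cluster: str   # which compute plane should execute the run

@dataclass
class PrivateRunInputs:
    """What stays inside your VPC: resolved by the compute plane at run time."""
    dataset_uri: str                             # e.g. an object-store path on your private network
    secrets: dict = field(default_factory=dict)  # credentials never sent to the control plane

metadata = RunMetadata(name="finetune-llm", image="my-org/training:latest", gpus=16, cluster="my-vpc-cluster")
private_inputs = PrivateRunInputs(dataset_uri="s3://internal-bucket/train-data/")
```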
Additionally, we have built our platform with a security-first mindset, and make continual updates to ensure compliance with best practices across industries. The compute plane relies solely on egress networking, making it easy to deploy behind a firewall. We are independently audited, and we are currently in the process of attaining industry-standard compliance certifications.
If you’re interested in developing your own custom AI models in your secure environment while maintaining full data privacy, contact us for a demo.
Deployment Options
We designed the MosaicML platform to be flexible enough so that any Kubernetes cluster can execute runs. This enables several different types of deployment, depending on your organization’s needs.
Let’s explore a few of the options you can choose from when deploying the MosaicML platform.
Use Your Existing Infrastructure
For many organizations, training models on infrastructure you don’t control is not an option. Perhaps data security is vital to you, and you cannot risk your datasets leaving your own private network. Perhaps you have existing compute on a cloud service provider, either in the form of long-term capacity reservation or credits that you’re looking to use.
For maximum control over your data and infrastructure, you can deploy the compute plane directly onto your own cluster. We’ve invested in making the compute plane lightweight and portable, so deploying it is easy. The only requirement is Kubernetes. And, thanks to our extensive experimentation, the MosaicML platform provides preset configurations for instance types on all major cloud service providers, allowing efficient multi-node training workloads to work straight out of the box.
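As an illustration of what "preset configurations" can mean in practice, the sketch below maps a few hypothetical instance types to the settings a multi-node run needs. The instance names and values are made-up examples, not the platform's actual presets.

```python
# Hypothetical presets mapping instance types to multi-node training settings.
INSTANCE_PRESETS = {
    "a100-80gb-8x": {"gpus_per_node": 8, "gpu_memory_gb": 80, "high_speed_interconnect": True},
    "a100-40gb-8x": {"gpus_per_node": 8, "gpu_memory_gb": 40, "high_speed_interconnect": True},
}

def resolve_preset(instance_type: str, num_nodes: int) -> dict:
    """Expand an instance type into the settings a multi-node run needs out of the box."""
    preset = INSTANCE_PRESETS[instance_type]
    return {**preset, "num_nodes": num_nodes, "world_size": num_nodes * preset["gpus_per_node"]}

print(resolve_preset("a100-80gb-8x", num_nodes=4))
```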

This style of deployment provides inherent data security. While non-sensitive metadata like run manifests and resource requirements will still transit through the MosaicML control plane, your workloads can load datasets and other secrets directly from your private network. This allows you to leverage the power of MosaicML without any risk of leaking sensitive data.
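For example, the code your run executes can read training data straight from object storage on your private network; only the run's metadata ever touches the control plane. A minimal sketch, assuming an S3-compatible bucket reachable from inside the VPC (the bucket and key names are made up):

```python
import boto3  # assumes credentials are supplied inside the VPC, e.g. via an instance role

# This client talks to storage over your private network; no data is sent to MosaicML.
s3 = boto3.client("s3")

# Read a shard of training data directly into the job.
obj = s3.get_object(Bucket="internal-training-data", Key="shards/shard-00000.bin")
shard_bytes = obj["Body"].read()
print(f"loaded {len(shard_bytes)} bytes without leaving the private network")
```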
Need to Get Started Quickly? Use Our Infrastructure
For smaller organizations, it can be difficult to set up a cluster capable of training state-of-the-art models. Cloud providers generally have limited GPU availability outside of long-term capacity reservations. Furthermore, multi-node workloads have complicated networking hardware requirements that can be difficult to identify without significant experimentation. Time spent tackling these infrastructure challenges is time your ML team can’t spend training models.
With a MosaicML-managed cluster, we handle the infrastructure so that you can stay focused on building the best models for your business.

MosaicML’s own researchers conduct research on clusters like these, so we know that these clusters are capable of achieving the best performance on industry-relevant benchmarks. We use only the latest GPUs, and we connect all nodes with high-speed networking for optimal performance on multi-node workloads. When nodes fail (an inevitable reality in large-scale ML), we take responsibility for replacing the failed node with a new one to ensure minimal disruption to your workload.
On-Premise Deployment
For organizations with the strictest infrastructure and compliance requirements, the MosaicML platform also supports fully on-premise deployments. In this approach, both the control plane and the compute plane are deployed onto your servers, for ultimate data security. Contact us if you’d like to learn more about this!

A Multi-Cloud Platform with Zero Vendor Lock-In
Finally, who’s to say you only need one cluster? For maximum flexibility, a single organization can deploy the MosaicML platform to multiple clusters. You can even mix and match different deployment types, such as a MosaicML-managed cluster and a private cluster. Regardless of your setup, we provide a consistent interface for submitting runs to any cluster type.

Multi-cloud deployments also open the door to novel training workflows. For instance, you might use a MosaicML-managed cluster for pre-training a model on public data, but then load a checkpoint into a private cluster for fine-tuning on private data. You can set quotas on a per-cluster basis, enabling your admins to limit users’ access to heavy-traffic clusters while still allowing full access to lower-traffic ones.
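As a purely illustrative example of how per-cluster quotas might work, the sketch below checks a submission against per-cluster GPU limits; the cluster names and limits are hypothetical, not a real configuration.

```python
# Hypothetical per-cluster GPU quotas an admin might configure.
CLUSTER_QUOTAS = {
    "mosaicml-managed": {"max_gpus_per_user": 64},    # heavy-traffic cluster: capped
    "private-vpc":      {"max_gpus_per_user": None},  # lower-traffic cluster: unrestricted
}

def can_submit(cluster: str, requested_gpus: int, gpus_in_use: int) -> bool:
    """Allow the run only if the user's total would stay within the cluster's quota."""
    limit = CLUSTER_QUOTAS[cluster]["max_gpus_per_user"]
    return limit is None or gpus_in_use + requested_gpus <= limit

print(can_submit("mosaicml-managed", requested_gpus=32, gpus_in_use=40))  # False: over quota
print(can_submit("private-vpc", requested_gpus=32, gpus_in_use=40))       # True: no cap
```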
We’ll be talking more about the opportunities multi-cloud deployments bring, along with the challenges we faced to make them possible, in a future blog post.
Try Out the MosaicML Platform Today
The MosaicML platform is an invaluable tool for taking your ML training to the next level, and in this blog, we explored how easy it is to get started, regardless of where your organization may be. The platform’s architecture is designed for maximum autonomy and control so that you can easily adjust your cloud service provider to fit your organization’s needs over time.
If you’re interested in training custom, state-of-the-art AI models on your own data, contact us for a demo, and check out our demo video. As always, we welcome you to follow us on Twitter and join our Community Slack to keep up with our latest product updates.