MetaDialog: Customer Spotlight
How did you get started working with MosaicML?
I’m the CEO, but I also have a technical background. I have a master's degree from Oxford in computer science. When we started the company, I was already following some of the researchers from MosaicML on Twitter. So from the beginning, we've been talking to your team. And we were kind of testing the waters, trying to see what different companies are offering, both in terms of hardware and software. We decided to go with MosaicML, not just because of the hardware-to-price comparison, but also because of how you work with clients. We felt really supported and you made it easier for us to train the model that we wanted, with full technical support along the way. You really helped us with what we're trying to achieve.
How are you using MosaicML tools to train models?
We trained two models with your tools. We trained a semantic search model. It's an LLM, but it works specifically for retrieving data. And we also trained a language model, similar to ChatGPT. We trained both of them using your Composer library to train, schedule, and run everything, and we also used your hardware platform. We trained the models as part of a system implemented in one of the biggest governments in the Gulf region.
How big were the models that you trained?
The semantic search model had 350 million parameters, but we trained it on a large amount of data: almost 500 billion tokens. It's quite a lot of data for a small model. I think we used something like 250 NVIDIA A100 GPUs to do that. Next, we trained the language model. It has 7 billion parameters and we trained it on about 1 trillion tokens, roughly split half and half between Arabic and English.
Where did you get the data for this dual-language model?
So, that was a big problem. For English, it's quite easy. There is a lot of open source data, and you can find more than 1 trillion tokens in English. But in Arabic, it’s kind of hard to find more than 50 billion tokens, and we needed 10 times more than that, so we gathered the data ourselves. We used tens of thousands of CPU cores and we scraped hundreds of millions of web pages. I think in the end we scraped almost everything there is to scrape in Arabic on the web.
So, you've basically built the definitive Arabic language dataset?
Yes. Actually, the model we trained with MosaicML is the largest ever in terms of data, if not in the size of the model itself: it was trained on more Arabic tokens than any other model to date. The model is now being used by one of the largest governments in the Gulf region.
How long did it take you to train these two models? Were there any issues with your hero runs?
I think the first semantic model took just a few days. The Arabic language model took maybe 10 days to train. And it was as seamless as it can be. Of course, no training at that scale is completely seamless. We had a node die at some point. We had some problems with throughput in the beginning and had to make some changes to the code.
But again, the support we had from MosaicML was really amazing. Basically, whenever we had a problem, your AI engineers would connect with us and try to help us solve it together, debug our code, and sometimes even propose code changes. I would say that it was as seamless as you can get. If the system stopped because of a problematic node it would automatically replace that node, restart, and just continue training. So there was a lot of headache that was avoided by using Composer and the entire software stack on your platform.
Honestly, it was such a good experience that I talked to our customer experience team and I told them that this is how I want you to work with customers so that they get the same experience that I got with MosaicML.
Do you have any more models you're planning to train in the near future? What's next for MetaDialog?
Let me first give a bit of background on what our company does. All of our clients want to get some value out of their data. And so in a nutshell, what we do is enable companies to build their own GPT-style models that work with their own data: textual data, websites, databases, and APIs. One of the most popular use cases for these models is customer service. If someone is asking questions about your product, it's easy to generate an answer, but a generic model may not be able to answer questions about your specific product prices and features.
And larger companies, enterprises, and governments have a big focus on security and privacy. If you’re a government employee, you might want to use an AI tool to draft an email, but you probably don't want that data getting outside. So you want to have an on-premises solution that is private and secure. Another consideration is related to culture and compliance. If you're a big company, there are some responses that you just don't want the model to generate, like answers that aren’t compliant with your internal policies.
For smaller companies, we sometimes just use an existing model; it could be an open-source model. For larger organizations, we might fine-tune a model, but in most cases we train from scratch, because fine-tuning is not good enough, especially if you need another language. Now we are working on a new 4-terabyte language model, and we are planning to start training within the next few months.
Where will you source your additional Arabic data for this model?
Right now we're doing optical character recognition on books. We have a large book collection, but it's in PDF format and not directly readable by our model, so we're trying different enhancement techniques. One of the more promising research directions in the last couple of months is using language models to generate better training data for the model.
Another consideration is that you don't need as much data if your data is higher quality. So we're trying to see if we can enhance or filter the highest quality data that we have, and perhaps have enough to train a larger model. We should also have better results because the data that we use to train the model is of higher quality.
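The kind of quality filtering described above is often done with simple heuristics before any model-based scoring. Below is a minimal, illustrative sketch of such a filter; the thresholds and the `keep` function are hypothetical examples, not MetaDialog's actual pipeline.

```python
def keep(doc: str) -> bool:
    """Heuristic quality filter for raw text (illustrative thresholds only)."""
    words = doc.split()
    if len(words) < 20:
        # Too short to carry useful signal (e.g., menu fragments, captions).
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.6:
        # Mostly digits/symbols -- common in OCR noise and scraped tables.
        return False
    if len(set(words)) / len(words) < 0.3:
        # Highly repetitive text (boilerplate, repeated headers).
        return False
    return True
```

Filters like these are cheap enough to run over billions of documents on CPU, which matters when the goal is to keep only the highest-quality slice of a web-scale corpus.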
How have you been using explainability and retrieval augmented generation (RAG) in model development?
Most of what we do is RAG, because there are different types of data sources and we need to get information out of them. The search model that we trained was actually a combination of a search model and a generation model. I would say that our RAG techniques are quite advanced compared to what is on the market today: most systems are only able to pull in textual data, while ours can understand where to get data from, fetch it in real time from different sources, and understand and reason about that data.
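At its core, the retrieval step of RAG pairs a semantic search model (like the 350M-parameter one described earlier) with a generator. The sketch below illustrates that retrieval step using toy, hand-written embeddings in place of a real embedding model; the corpus, vectors, and `retrieve` function are all hypothetical.

```python
import numpy as np

# Toy corpus with stand-in embeddings (a real system would compute these
# with a semantic search model).
docs = ["Our premium plan costs $40/month.",
        "Support is available 24/7 via chat.",
        "Refunds are processed within 5 business days."]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.8, 0.2],
                     [0.0, 0.2, 0.9]])

def retrieve(query_vec, k=1):
    """Return the k documents most similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# The retrieved passage is then placed in the generator's prompt, which
# grounds the answer in company-specific data:
context = retrieve(np.array([0.85, 0.15, 0.05]))[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: How much is the premium plan?"
```

Grounding the generator in retrieved passages is also what makes the explainability described below possible: every answer can be traced back to the document it came from.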
Regarding explainability, in many cases, and especially in high-stakes situations, you have to be very, very precise in the answers that you give. You cannot have hallucinations. It’s one thing if you mix up prices and you end up giving the customer a 50% discount. But it's much more serious if your model says something that is not compliant with a government or corporate policy. So you have to be very, very accurate. As part of our system, one of the things that we enable is fixing mistakes. When we integrate with a company, their employees will test the system and whenever it makes a mistake, we can pinpoint exactly where we got the information that the answer was based on.
It’s also possible to tell the system exactly what the mistake was, and the system will remember it and will not repeat that mistake. After you fix 10 or 20 mistakes, the system starts generalizing and the number of mistakes keeps going down. Usually, we get to a point where there are almost no mistakes and the system either gives a correct answer or it just says, sorry, I don't know, I don't have that information, and transfers you to a human agent. The explanation of where the information came from and how we generated that specific answer is crucial: you can't trust a system if you don't understand how it works.
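The "fix a mistake once, and the system won't repeat it" behavior can be pictured as a correction store consulted before the model answers, with each correction carrying the source it was based on. This is a minimal sketch of that idea; the function names and data layout are hypothetical, not MetaDialog's actual API.

```python
# Correction store: query -> (corrected answer, source reference).
corrections = {}

def record_correction(query, answer, source):
    """Log a human-verified fix so the same mistake is not repeated."""
    corrections[query.lower().strip()] = (answer, source)

def answer(query):
    key = query.lower().strip()
    if key in corrections:
        text, source = corrections[key]
        # Citing the source makes the answer auditable.
        return f"{text} (source: {source})"
    # No verified answer available: defer rather than risk hallucinating.
    return "Sorry, I don't know. Transferring you to a human agent."

record_correction("What is the visa fee?",
                  "The visa fee is 100 AED.",
                  "fees.pdf, p. 3")
```

A production system would match queries semantically rather than by exact string, so that a handful of corrections generalizes to paraphrased questions, as described above.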
What are you excited about for the future of AI?
I see a lot of positive trends; right now we're at a stage where AI actually starts to become viable in many places, especially everything related to agents. AI is not just a question-answering machine, but a reasoning engine.
So I'm really excited about that. I think that we are not at the stage where it's perfect yet. The models are not smart enough, and we need to work on optimization for reasoning because reasoning is the driving force behind all of this. But soon, almost any menial task that we're doing today can be automated by an AI system. It's not just answering questions, it's actually solving problems for the customer.
To learn more about MetaDialog’s platform for automating customer support and generating insights from organizations’ data, visit: https://www.metadialog.com/