The Natural Language Processing (NLP) field does not stand still. One of the major recent advances is the multilingual language model: a single model that builds a universal, language-independent representation of text. For Kindly AI, this means that our chatbots can suddenly understand more than 100 languages – without any translations!
What does being multilingual mean? 🌐
An NLP pipeline often includes the following essential steps:
- Preprocessing: for example tokenization, removing stop words and punctuation, lemmatization, etc.
- Representation: turning each token (for example a word) or the whole sentence into a vector, also known as an embedding.
- Solving the so-called downstream task: e.g. question answering, text summarization, etc. In our case, the task is dialogue classification (see the problem statement in our previous article).
The downstream task may be independent of the previous steps: it takes a vector as input and does not need to know exactly how that vector was generated. This property is crucial: it makes the whole structure modular and allows the representations to be reused for many different tasks.
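To make this concrete, here is a minimal sketch of that contract in Python (the function and label names are hypothetical): the downstream classifier only ever sees a vector, so the embedding model behind `embed` can be swapped without touching it.

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def classify_dialogue(embedding: Vector) -> str:
    """Downstream task: maps a sentence embedding to a dialogue label."""
    # Placeholder logic; in practice this is a trained classifier.
    return "order_status" if embedding[0] > 0 else "other"

def pipeline(text: str, embed: Callable[[str], Vector]) -> str:
    # The classifier never learns how the vector was produced, so swapping
    # `embed` for a different (e.g. multilingual) model requires no change here.
    return classify_dialogue(embed(text))
```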
Obtaining good representations is challenging. Firstly, sentence semantics need to be preserved. To put it simply, a good model represents the meaning of the sentence, not just its words. For example, the sentences "A trunk of an elephant" and "A trunk of a car" need to have fairly different representations for a model to be considered a "semantics embedder".
Secondly, training a language model requires a lot of computational power, not to mention the large amount of training data it needs to generalize sufficiently. That makes the modular structure even more important: we only need to train the part that solves our downstream task and can simply re-use the language model itself. This way, we can take a well-developed language model and add a custom model on top of it that solves a specific problem (dialogue classification, in our case).
Another benefit of the modular approach is multilinguality. We do not need to tune our task-specific models for each language: if the language model is multilingual, it simply passes the universal representations downstream.
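As an illustration, here is a minimal sketch assuming the sentence-transformers library and one of its public multilingual models (not necessarily the model we use in production): the same question in English and Norwegian lands close together in the shared embedding space, so a single downstream classifier can serve both.

```python
from sentence_transformers import SentenceTransformer, util

# Public multilingual sentence encoder, used here purely as an example.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = model.encode("Where is my order?")
norwegian = model.encode("Hvor er bestillingen min?")

# Cosine similarity is typically close to 1 for translations of the same sentence.
print(util.cos_sim(english, norwegian))
```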
Benefits of a multilingual pipeline 💡
- Well-developed language models like XLM-Roberta provide accurate universal sentence representations, which makes it possible to combine the semantics and knowledge from all languages in use.
- It expands the audience coverage: previously, the bot could only be used by people who spoke the language the bot was built in or translated to. A naturally multilingual bot is now accessible to native speakers of all the languages that the multilingual language model supports!
This is particularly beneficial for e-commerce websites: a great example is our customer HappySocks, who use the multilingual model and have been able to enter new markets.
Which industries benefit the most from multilingual bots? ✈️
Travel-related websites (for example airline companies and accommodation-booking websites) and government institutions are among the businesses that will benefit the most from this.
A use case from UDI ⚡️
A great example of this is UDI’s website. By design, it attracts the attention of multinational (and, as a result, multilingual) communities. Typically, these communities have difficulty expressing their requests in Norwegian. That’s why it’s very beneficial that they can express their inquiries in their native language - and get replies in the language the bot was originally built in, which they can then translate into their own language.
Challenge #1: Inference time ⏱
When using word-level embeddings (and then aggregating them to represent a whole phrase), fetching them is very easy: static token-level embeddings, for example, can be thought of as a big dictionary where each embedding is stored under its key, the normalized word. When using sentence-level multilingual embeddings, we do not have that luxury of fetching the necessary embedding in constant time: there are simply too many possible sentences in the world, and we cannot precompute the embeddings for all of them. Thus, the pre-trained language model needs to perform inference: a forward pass through the network with the given sentence, which may significantly increase the total processing time of our pipeline. However, there are a few options that can help us reduce this waiting time.
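A toy illustration of the difference, with made-up vectors and a hypothetical `multilingual_model`:

```python
# Static word embeddings behave like a precomputed dictionary: fetching is a
# constant-time lookup, no neural network involved. (Toy vectors for illustration.)
word_vectors = {
    "refund": [0.12, -0.40, 0.73],
    "order":  [0.55,  0.10, -0.22],
}
vector = word_vectors.get("refund")

# Sentence-level embeddings cannot be precomputed for every possible sentence,
# so each unseen message costs a forward pass through the network, e.g.:
#
#   vector = multilingual_model.encode("I would like a refund for my last order")
#
# which typically takes tens of milliseconds on a CPU for a BERT-sized encoder.
```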
Possible ways of addressing the challenge 🎯
- One way of reducing the inference time is reducing the size of the multilingual embedding model: the more lightweight the model is and the simpler the computations, the faster the inference will be. You can experiment with knowledge distillation, quantization, and other popular techniques (see the quantization sketch after this list).
- Another way to go is choosing a less knowledgeable model. Could it be the case that you need fewer languages than the model supports? Reducing the number of languages from 100+ to the 20 most popular ones will certainly have a positive impact on the model’s size and speed.
- One more option is caching known sentences: while we cannot precompute the embeddings for all possible sentences, we can store the ones we have seen before – and, when a new sentence comes in, query the cache to check whether we already have its embedding (see the caching sketch after this list). Thanks to Zipf's law, a large fraction of all messages is accounted for by a few very popular ones.
- The last and most straightforward solution (alas, not the most effective or creative one!) is to increase the compute resources on the endpoint running your model. This can be cheaper in the short term, but performance is better addressed in a more holistic manner, along the lines of the options above.
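For the first option, here is a minimal sketch of dynamic quantization with PyTorch and Hugging Face Transformers, assuming xlm-roberta-base as the encoder; the exact speed-up and accuracy impact should be measured on your own downstream task.

```python
import torch
from transformers import AutoModel

# Load a multilingual encoder (example model; substitute your own).
model = AutoModel.from_pretrained("xlm-roberta-base")

# Quantize the Linear layers to int8: smaller weights and faster CPU inference,
# usually at a small accuracy cost that must be validated downstream.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "xlm_roberta_quantized.pt")
```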
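For the caching option, here is a minimal in-process sketch; a shared store such as Redis would play the same role in production, and the model name is again just an example.

```python
from functools import lru_cache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different inputs share a cache entry.
    return " ".join(text.lower().split())

@lru_cache(maxsize=100_000)
def _embed_cached(normalized: str) -> tuple:
    # Cache miss: run the (comparatively slow) forward pass once per unique sentence.
    return tuple(model.encode(normalized))

def embed(text: str) -> tuple:
    # Thanks to Zipf's law, a handful of very common messages produces most cache hits.
    return _embed_cached(normalize(text))
```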
Challenge #2: Less transparent explainability 🗣
When word embeddings are used, machine learning explainability frameworks can be quite confident in pointing out which specific word drove the downstream model to one conclusion or another. When sentence embeddings are used instead, it is trickier for explainability frameworks to understand what is going on.
How to address this challenge 🧩
This flaw is, however, relatively minor and simply requires a somewhat longer investigation with the interpretability framework. The BERT-like model structure also lets us inspect the attention distribution over a text sample, so the aforementioned flaw is not really a big deal.
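For instance, with a BERT-like encoder the attention weights can be pulled out directly. Here is a minimal sketch with Hugging Face Transformers, assuming xlm-roberta-base; attention maps are not a complete explanation, but they do show which tokens the model focuses on.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "xlm-roberta-base"  # example multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("Hvor er pakken min?", return_tensors="pt")  # "Where is my parcel?"
with torch.no_grad():
    outputs = model(**inputs)

# Average the last layer's attention over all heads: shape (seq_len, seq_len).
attention = outputs.attentions[-1].mean(dim=1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

for token, row in zip(tokens, attention):
    print(f"{token:>12} attends most to {tokens[int(row.argmax())]}")
```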
Conclusion 🤖
Adding multilingual embeddings to your machine learning zoo certainly requires some modular infrastructure changes and planning time, but the effort is well worth it: you will strengthen your functionality portfolio and end up with much more satisfied end-users of your software.
Be sure to choose the language model that is most appropriate for you, taking into account both the quality of the downstream results and the available computational power: it can be the most up-to-date model, the most compact and fastest model, or anything else, depending on your business case.