Multimodal Learning is an evolution of Machine Learning that involves the simultaneous use of several types of data, such as text, images and audio, to solve more complex tasks. Find out all you need to know about this technique, which is set to push back the frontiers of AI!
Artificial intelligence has made impressive progress in recent years. Its development is linked in particular to machine learning and deep neural networks.
However, these advances have mainly been made on “unimodal” tasks. In the real world, information comes from multiple sensory sources, combining text, images, audio and even video.
So the next step for AI is to exploit these multiple modalities simultaneously and in an integrated way, for a richer and more complete understanding. To achieve this, researchers are using the “multimodal learning” technique.
What is Multimodal Learning? What data does it use?
When you lean out of a window, you immediately take in a great deal of information. This is thanks to the combination of our five senses: hearing, sight, smell, taste and touch, which enable us to perceive sounds, images, textures and scents all at the same time.
Multimodal Learning aims to apply this idea of using different data simultaneously to the field of AI. Let’s start by looking at the different types of sources.
Text is one of the most commonly used modalities in Machine Learning. Textual data contains rich, structured information, and natural language processing (NLP) makes it easy to extract knowledge from it.
This data can come from documents, press articles, messages on social networks or any other type of text. The NLP techniques used to process it include tokenisation, lemmatisation, syntactic analysis, named entity recognition and text classification.
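To make this concrete, here is a minimal sketch of this kind of text preprocessing using the spaCy library (assuming its small English model en_core_web_sm has been downloaded); the example sentence is purely illustrative.

```python
# Minimal NLP preprocessing sketch with spaCy
# (assumes: pip install spacy, then python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("OpenAI released DALL-E, a model that turns text into images.")

# Tokenisation and lemmatisation
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```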
For their part, images are an essential source of visual information in Multimodal Learning. Thanks to the growing popularity of convolutional neural networks (CNNs), major advances have been made in the understanding of images.
Computer vision techniques can be used to analyse and interpret images to extract knowledge. Examples include object detection, facial recognition and image segmentation.
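As an illustration, the sketch below uses a pre-trained CNN from torchvision (a ResNet-18, one possible choice) as a visual feature extractor; the image path is a hypothetical placeholder.

```python
# Sketch: extracting visual features from an image with a pre-trained CNN
# (assumes a recent torchvision, torch and Pillow; "example.jpg" is illustrative)
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

cnn = models.resnet18(weights="DEFAULT")
cnn.fc = torch.nn.Identity()   # drop the classification head to keep a 512-d feature vector
cnn.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    features = cnn(image)      # shape: (1, 512)
print(features.shape)
```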
Audio includes information from voice recordings, sound files or live streams. These are analysed using audio processing techniques to extract acoustic and linguistic characteristics.
The most commonly used methods include speech recognition, sound event detection, source separation and audio classification.
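For instance, here is a minimal sketch of acoustic feature extraction with the librosa library; the file name and parameter values are illustrative.

```python
# Sketch: extracting acoustic features (MFCCs) from an audio file with librosa
# (assumes librosa is installed; "speech_sample.wav" is an illustrative file name)
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)   # waveform and sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, n_frames) feature matrix
print(mfcc.shape)
```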
Finally, video is a powerful source of multimodal data, combining visual and audio information. Once again, computer vision and audio processing techniques can be used to extract knowledge from a sequence.
This makes it possible to detect moving objects, analyse human activity and even recognise gestures. This fusion of visual and audio modalities enables machines to better understand scenes and events.
With the rise of smartphone cameras and video-sharing social networks such as TikTok and YouTube, AIs now have access to a vast reservoir of resources on which to train.
In the future, with the emergence of humanoid robots equipped with tactile sensors on their fingers, artificial intelligences could also gain a sense of touch and use it to learn…
How is Multimodal Learning used?
Multimodal Learning is applied in varied ways across many areas of artificial intelligence.
One of the main use cases is scene recognition and understanding. By combining visual, audio and video information, it is possible to analyse and interpret complex scenes with greater precision and detail.
Examples include detecting and tracking moving objects in a video, such as people in CCTV footage.
The combination of visual and audio information helps to automatically detect suspicious events such as aggressive behaviour, intrusions or emergency situations in security camera images. This is a valuable asset for surveillance.
It is also possible to recognise and understand human activity in videos thanks to visual and audio information. For example, in a video recorded during a sporting event, the detection of gestures and the understanding of social interactions help the AI to recognise a sport.
Another field of application for Multimodal Learning is translation. In particular, this approach enables speech and images to be translated simultaneously during an oral presentation accompanied by visual slides. This facilitates comprehension for a multilingual audience.
Similarly, textual instructions can be automatically translated into visual instructions. The aim may be, for example, to guide a robot through its tasks.
There are also image caption generators based on Multimodal Learning. These are very useful for people with impaired vision or for automating the captioning process.
Thanks to a conversational interface based on Multimodal Learning, a virtual assistant can interact with users using voice, text and images. The experience is therefore more natural and immersive, as it becomes possible to express intentions and needs in a variety of ways.
Multimodal Learning and generative AI
Generative artificial intelligence is also based on Multimodal Learning. This type of AI uses neural networks to generate new content: images, videos, text, etc.
For example, AI chatbots such as ChatGPT are based on generative AI to produce text from prompts. By integrating different modalities, they are able to interact with users in a richer, more natural way.
The most advanced dialogue systems, such as OpenAI’s GPT-4, integrate text, speech and images to understand and respond to requests in a contextual and personalised way.
Similarly, the DALL-E AI creates images from text prompts. It has been trained on both text and images to learn how to associate them.
Generative AI can enable more personalised human-machine interactions, create realistic 3D images and videos for films or video games, or even new product designs.
Different approaches and techniques
Multimodal Learning is a complex discipline, based on a vast array of techniques. Here are the most commonly used.
Firstly, fusion models play a key role in combining information. Multimodal Neural Networks can be built using specific fusion layers that take into account the characteristics of each modality and combine the information appropriately.
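As an illustration, here is a minimal PyTorch sketch of such a fusion model: each modality is projected into a common space, the projections are concatenated, and a fusion layer produces the prediction. The feature dimensions and class count are made up for the example.

```python
# Minimal multimodal fusion sketch in PyTorch (dimensions and class count are illustrative)
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=256, n_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project each modality
        self.txt_proj = nn.Linear(txt_dim, hidden)   # into a common space
        self.fusion = nn.Sequential(                 # fusion layer on the concatenation
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.fusion(fused)

model = FusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 768))  # a batch of 4 image/text pairs
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only one option: element-wise products, gating or attention mechanisms can also be used to combine the projections.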
Another method is transfer learning, which allows knowledge learned from one modality to be transferred to another. This can be very useful when data for a specific modality is limited. For example, models pre-trained on computer vision tasks can be transferred to image comprehension tasks in other domains.
It is also possible to use pre-trained models on large amounts of data as a starting point for accelerating multimodal learning.
A pre-trained language model can be used to extract textual features in a multimodal task, in order to exploit the knowledge already gained from initial training on a large dataset.
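As a sketch of this idea (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, one possible choice), a frozen pre-trained language model can serve as the text feature extractor:

```python
# Sketch: reusing a pre-trained language model as a frozen text feature extractor
# (assumes the transformers library; bert-base-uncased is one possible checkpoint)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()   # keep the pre-trained weights frozen

inputs = tokenizer(["a dog playing in the park"], return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

text_features = outputs.last_hidden_state[:, 0]   # [CLS] embedding, shape (1, 768)
print(text_features.shape)
```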
Finally, the representation of multimodal data is a crucial stage, as it influences the model’s ability to understand and exploit the various modalities.
For example, learning common representations aims to find shared representation spaces between the different modalities.
This makes it possible to extract common characteristics that capture shared information and facilitate the overall understanding of multimodal data.
Co-learning or adversarial learning techniques are used to learn these shared representations.
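One widely used option is a CLIP-style contrastive objective, sketched below with random tensors standing in for the outputs of real image and text encoders.

```python
# Sketch of a CLIP-style contrastive objective for learning a shared image/text space
# (random tensors stand in for the outputs of real image and text encoders)
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # pairwise similarities
    targets = torch.arange(len(img_emb))           # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```

Training with this loss pulls matching image-text pairs together in the shared space and pushes mismatched pairs apart, which is how models such as CLIP learn their joint representation.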
An alternative is the use of autoencoders: neural network architectures that learn to reconstruct their input data by passing through a latent representation.
They can be used to extract relevant multimodal features, which are then exploited for the fusion and learning of multimodal models.
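Here is a minimal sketch (with illustrative dimensions) of such a multimodal autoencoder: two feature vectors are compressed into a single shared latent code and then reconstructed.

```python
# Sketch of a multimodal autoencoder: two modalities are compressed into one
# shared latent code and reconstructed (dimensions are illustrative)
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, img_dim=512, audio_dim=128, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(img_dim + audio_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim + audio_dim),
        )

    def forward(self, img_feat, audio_feat):
        x = torch.cat([img_feat, audio_feat], dim=-1)
        latent = self.encoder(x)            # shared multimodal representation
        return self.decoder(latent), latent

model = MultimodalAutoencoder()
img_feat, audio_feat = torch.randn(4, 512), torch.randn(4, 128)
recon, latent = model(img_feat, audio_feat)
loss = nn.functional.mse_loss(recon, torch.cat([img_feat, audio_feat], dim=-1))
print(latent.shape, loss.item())
```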
The challenges of multimodal learning
Multimodal learning presents a number of challenges, and requires particular attention to ensure that the different types of data are used effectively.
One of the main issues is the alignment of modalities. For example, when analysing a video with an audio track, the visual and audio information must be temporally aligned so that the scene can be understood as a whole. Various synchronisation techniques are used to meet this requirement.
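As a simple illustration of one such technique, the sketch below resamples audio features onto the video frame rate so that each video frame has a time-aligned audio vector; the frame rates and feature sizes are illustrative, and real systems use more elaborate alignment.

```python
# Sketch: aligning audio features with video frames by resampling along time
# (frame rates and feature sizes are illustrative)
import torch
import torch.nn.functional as F

audio_feats = torch.randn(1, 13, 400)   # (batch, features, 400 audio frames at 100 fps)
n_video_frames = 100                    # 4 seconds of video at 25 fps

# Linear interpolation along the time axis so both modalities share one timeline
aligned_audio = F.interpolate(audio_feats, size=n_video_frames,
                              mode="linear", align_corners=False)
print(aligned_audio.shape)              # torch.Size([1, 13, 100])
```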
Similarly, merging information from different modalities can be a complex task. There are several methods for efficiently combining textual, visual, audio and video information, such as concatenation or the use of multimodal neural networks to learn integrated representations.
Whatever approach is used, the fusion must capture the interactions and dependencies between modalities to provide a global understanding of the data.
Another challenge is to represent the data in such a way as to capture the relevant information from each modality for effective use in learning.
In general, Deep Learning techniques are used to extract meaningful features. For example, encoder networks can be used to capture information shared between modalities.
Examples of Multimodal Learning systems
Thanks to scientific advances in the field of Multimodal Learning, a number of systems have emerged, some of which are used by many people. Here are a few examples.
The American company OpenAI has developed DALL-E, an AI system for converting text into images. It is a neural network with 12 billion parameters.
The company has also created CLIP. This multimodal system can perform a wide variety of visual recognition tasks and can classify images into categories without needing labelled examples for them (zero-shot classification).
For its part, Google has created ALIGN: an AI model trained on a dataset containing numerous image-text pairs. According to several benchmarks, it is the best performing model of its kind.
The Californian giant has also created the MURAL AI for image-text association and linguistic translation. This model uses multi-task learning applied to image-text pairs in association with their translation into more than 100 languages.
Another Google project is VATT, a multimodal Video-Audio-Text Transformer trained on raw signals. It learns shared representations that can be used for tasks such as video action recognition, audio event classification and text-to-video retrieval.
Microsoft researchers have created NUWA to produce new images and videos or modify existing ones. This model is trained on images, videos and text. It has learned to predict the next frame of a video or to fill in incomplete images.
Another Microsoft Research project is Florence, which is capable of modelling space, time and modality. Finally, FLAVA has been trained by Meta on images and 35 different languages, and is proving to be very effective for a wide variety of multimodal tasks.
Conclusion: Multimodal Learning, the next frontier of AI
By enabling AI systems to learn from several types of data simultaneously, Multimodal Learning brings machines closer to the human brain and its multisensory perception.
So, in the near future, this approach could well enable artificial intelligence to continue to approach human intelligence, and even surpass it…
To master Machine Learning and all its techniques, you can choose DataScientest. Our Data Science courses all include one or more modules dedicated to Machine Learning, Deep Learning and AI.
The concepts covered include classification, regression and clustering techniques using scikit-learn, text mining and time series analysis methods, as well as CNN and RNN neural networks using Keras, TensorFlow and PyTorch.
Our various courses are entirely distance learning, and enable you to acquire all the skills you need to work as a Data Scientist, Data Analyst, Data Engineer, ML Engineer, or in new AI professions such as Prompt Engineer.
Our organisation is eligible for the funding options recognised by the French government, and you can receive a diploma issued by MINES Paris Executive Education and a cloud certificate from our partners AWS and Microsoft Azure. Discover DataScientest!
Now you know all about Multimodal Learning. For more information on the same subject, read our full report on Machine Learning.