A Beginner’s Guide to Multimodal AI
A Primer on Multimodal LLMs
When OpenAI’s Sora made its grand entrance in February 2024, generating lifelike videos with ease, it amazed many observers. Sora, a leading example of a multimodal LLM (MM-LLM), uses text prompts to guide video generation, a line of research that has been gaining momentum for several years. Over the past year, MM-LLMs have made exceptional progress, ushering in a new generation of AI systems that can understand and generate content across multiple modalities. These MM-LLMs mark a significant step beyond traditional text-only LLMs, drawing on information from text, images, and audio…