Decart releases 'MirageLSD', an AI video model that converts live video in real time, and the results look like this

AI startup Decart has announced MirageLSD, a new diffusion-based AI video editing model designed for fast, controllable video editing driven by text prompts. Unlike conventional video generation AI, which creates videos from text, MirageLSD can 'convert' camera footage or gameplay streams into completely different styles in real time.
Decart
https://about.decart.ai/publications/MirageLSD
Below is a video in which a live broadcast of an FPS game is converted by MirageLSD while retaining the look of FPS gameplay.
COD Mirage LSD - YouTube
Below is the video generated by MirageLSD based on the gameplay footage of 'Minecraft'.
Minecraft MirageLSD - YouTube
Generating an animation from live-action footage with MirageLSD looks like this.
What Makes You Beautiful MirageLSD - YouTube
Converting camera footage of a sword fight in real time with MirageLSD produces battle scenes that look like they came from a science fiction movie.
Shaolin Mirage LSD - YouTube
The main feature of MirageLSD is its processing speed. According to Decart, MirageLSD's inference time is 6.1 seconds, roughly 1.7 times faster than the 10.5 seconds of the competing model DragNUWA. In addition, because it processes frames with an extremely low latency of under 40 milliseconds, users can watch the scene in front of them change into an anime or sci-fi style with almost no perceptible delay.
This speedup is achieved through several technical innovations that cut computational cost. First, MirageLSD employs a 'coarse-to-fine' strategy: it generates the rough structure of the entire video at low resolution, then performs high-resolution reconstruction.
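The coarse-to-fine idea can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the functions `coarse_pass`, `upsample`, and `refine_pass` are hypothetical stand-ins, not Decart's actual pipeline, and the arithmetic inside them is a placeholder for learned denoising steps.

```python
import numpy as np

def coarse_pass(noise, steps=4):
    # Hypothetical coarse denoiser: works at low resolution and only
    # recovers the rough spatial structure of each frame.
    x = noise
    for _ in range(steps):
        x = x - 0.5 * x  # stand-in for a learned denoising step
    return x

def upsample(frames, factor=4):
    # Nearest-neighbor upsampling from the coarse grid to full resolution.
    return frames.repeat(factor, axis=1).repeat(factor, axis=2)

def refine_pass(frames, steps=2):
    # Hypothetical high-resolution refinement: fewer steps suffice
    # because the coarse pass already fixed the global structure.
    x = frames
    for _ in range(steps):
        x = x - 0.5 * x  # stand-in for a learned refinement step
    return x

# 8 frames at a 16x16 coarse resolution, single channel
coarse = coarse_pass(np.random.randn(8, 16, 16))
video = refine_pass(upsample(coarse))  # full 64x64 resolution output
```

The cost saving comes from running most denoising steps on the small grid: a 16x16 frame has 1/16 the pixels of a 64x64 one, so the expensive iterations touch far less data.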
Furthermore, in the U-Net architecture at the core of video generation, the attention mechanisms for the spatial and temporal axes are separated, and the key-value pairs along the temporal axis are downsampled, reducing that computation to one quarter of the original amount. In other words, video generation is split into 'frame generation (spatial axis)' and 'frame linking (temporal axis)', and the latter attends only to key points rather than looking back at all past footage, enabling ultra-fast processing.
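The temporal key-value downsampling can be sketched as follows. This is a generic illustration of the technique, assuming plain dot-product attention; the function names and the stride of 4 are chosen here to match the article's "one quarter" figure, not taken from MirageLSD's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(q, k, v, stride=1):
    # q, k, v: (T, D) with one row per frame position.
    # Subsampling keys/values along the temporal axis by `stride`
    # shrinks the T x T score matrix to T x (T // stride), cutting
    # the attention cost by the same factor.
    k, v = k[::stride], v[::stride]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

T, D = 16, 32
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, T, D))
full = temporal_attention(q, k, v, stride=1)   # attends to all 16 frames
cheap = temporal_attention(q, k, v, stride=4)  # attends to every 4th frame: 1/4 the compute
```

With stride 4 the score matrix is 16x4 instead of 16x16, which is exactly the quarter-cost reduction the article describes; the output shape per frame is unchanged, so the rest of the network is unaffected.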
Decart also claims that MirageLSD has received high marks in performance evaluations. In a comparison of user evaluation scores, MirageLSD received 73% support, far surpassing DragNUWA's 21% and Rave's 6%. This shows the high quality of the videos MirageLSD generates and its fidelity to user instructions.
However, according to Decart, some issues remain. MirageLSD's headline features are its real-time operation and unlimited-length conversion, but to keep processing fast it predicts each new frame from only the last few frames, so consistency can break down in long videos exceeding several tens of minutes. Maintaining consistency is said to be especially difficult for people whose facial expression or head direction changes.
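Why a short conditioning window causes drift can be shown with a minimal sketch. The `FrameWindow` class and its window size of 4 are hypothetical, purely to illustrate the general limitation the article describes; the real model's context handling is not public in this level of detail.

```python
from collections import deque

class FrameWindow:
    # Minimal sketch (assumed behavior): the generator conditions each
    # new frame only on the last `size` frames, so anything that has
    # scrolled out of the window can no longer constrain the output.
    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def push(self, frame):
        self.frames.append(frame)

    def context(self):
        return list(self.frames)

window = FrameWindow(size=4)
for t in range(100):
    window.push(f"frame_{t}")
# Only the last four frames remain as context; nothing ties the
# current output back to frame_0, which is why appearance can drift
# over long sessions.
```

Errors also compound: each frame is predicted from previously generated frames rather than ground truth, so small deviations accumulate the longer the stream runs.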
While MirageLSD focuses on video conversion, Decart's future goal is a more comprehensive model that also incorporates speech, music, emotion, and other elements.