Introducing 'FDM-1', a PC operation AI trained on 11 million hours of video: a fundamentally revised video learning method compresses about 2 hours of video into 1 million tokens and can also be applied to self-driving cars

San Francisco-based Standard Intelligence has announced 'FDM-1', an AI model for operating PCs.
The First Fully General Computer Action Model | blog
https://si.inc/posts/fdm1/
AI capable of operating PCs has already been commercialized, but most such models are built by applying reinforcement learning to visual language models (VLMs) trained on PC screenshots, which makes them unsuitable for long-horizon tasks such as operating CAD applications. Furthermore, developing a VLM requires annotating screenshots, a task that takes a large number of human workers and a significant amount of time.
Unlike VLM-based PC operation AI, FDM-1 is trained on a total of 11 million hours of video from the internet, including recordings of video editing and live coding streams. A system called 'IDM' was also developed to automate video annotation.

While it is difficult to automatically annotate live-action video, videos of PC operation are comparatively easy to annotate automatically, because changes on the screen can be linked one-to-one with the operations that caused them: for example, if 'h' appears on the screen, the 'h' key was pressed. To develop FDM-1, the team first commissioned a company to manually annotate 40,000 hours of video, then used that data to develop IDM, which automatically annotated the 11 million hours of video.
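The 'h' example above can be illustrated with a toy labeler. This is only a minimal sketch of the idea of inferring an action from two consecutive screen states; it is not how IDM works internally, and it assumes each frame has already been reduced to the text visible in an editor. The names infer_key and label_frames are hypothetical.

```python
from typing import List, Optional


def infer_key(prev_text: str, next_text: str) -> Optional[str]:
    """Guess the single key press that turns prev_text into next_text."""
    if next_text == prev_text:
        return None  # nothing changed, so no key press is inferred
    if len(next_text) == len(prev_text) + 1 and next_text.startswith(prev_text):
        return next_text[-1]  # one character appended: that key was pressed
    if len(prev_text) == len(next_text) + 1 and prev_text.startswith(next_text):
        return "Backspace"  # one character removed from the end
    return "<unknown>"  # anything else is beyond this toy labeler


def label_frames(frames: List[str]) -> List[Optional[str]]:
    """Label every transition between consecutive frames."""
    return [infer_key(a, b) for a, b in zip(frames, frames[1:])]


# Hypothetical editor contents captured at five consecutive frames.
frames = ["", "h", "hi", "hi", "h"]
labels = label_frames(frames)
print(labels)
```

A real system has to work from pixels rather than extracted text and must cover mouse movement, clicks, and scrolling, which is why a model trained on the manually annotated data is used instead of hand-written rules like these.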

Furthermore, the team developed an encoder tailored to the characteristics of PC operation video. With these innovations, FDM-1 is efficient enough to represent 36,000 frames of video in 200,000 tokens. With the same 200,000 tokens, Gemini can handle only 775 frames and Claude only 162 frames. The development team touts this efficiency, stating that 'approximately two hours of 30fps video can be compressed into 1 million tokens.'
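The figures above can be checked with some quick arithmetic. The sketch below assumes, purely for illustration, that token cost scales linearly with the number of frames; the variable names are not from the source.

```python
FPS = 30                  # frame rate quoted by the development team
BUDGET = 200_000          # token budget used in the comparison
FRAMES_AT_BUDGET = {      # frames each model fits into that budget
    "FDM-1": 36_000,
    "Gemini": 775,
    "Claude": 162,
}

# Average token cost of one frame for each model.
tokens_per_frame = {name: BUDGET / n for name, n in FRAMES_AT_BUDGET.items()}

# How many times more frames FDM-1 fits into the same budget.
advantage = {
    name: FRAMES_AT_BUDGET["FDM-1"] / n
    for name, n in FRAMES_AT_BUDGET.items()
    if name != "FDM-1"
}

# Scale the FDM-1 figure linearly from 200,000 to 1,000,000 tokens.
frames_in_1m = 1_000_000 * FRAMES_AT_BUDGET["FDM-1"] // BUDGET
hours_in_1m = frames_in_1m / FPS / 3600

for name, cost in tokens_per_frame.items():
    print(f"{name}: {cost:.1f} tokens per frame")
for name, ratio in advantage.items():
    print(f"FDM-1 fits {ratio:.0f}x as many frames as {name}")
print(f"1M tokens: {frames_in_1m} frames, about {hours_in_1m:.2f} hours at {FPS}fps")
```

Under that linear assumption, 1 million tokens covers 180,000 frames, or roughly 1.7 hours at 30fps, which is in line with the 'approximately two hours' figure.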

Because it can handle long videos with fewer tokens, FDM-1 can operate applications where long context matters, such as CG and CAD applications.

Additionally, by mapping car controls to arrow key operations, FDM-1 can also be used as an autonomous driving system.
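One way to picture this is a thin adapter that turns arrow key presses into driving commands. The sketch below is hypothetical: the key names, the Control fields, and the all-or-nothing values are assumptions made for illustration, and the source does not describe the actual interface.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Control:
    steering: float  # -1.0 is full left, 1.0 is full right
    throttle: float  # 0.0 to 1.0
    brake: float     # 0.0 to 1.0


def keys_to_control(pressed: Iterable[str]) -> Control:
    """Translate the set of currently held arrow keys into a driving command."""
    keys = set(pressed)
    # Left and right cancel out when both are held.
    steering = float(("ArrowRight" in keys) - ("ArrowLeft" in keys))
    throttle = 1.0 if "ArrowUp" in keys else 0.0
    brake = 1.0 if "ArrowDown" in keys else 0.0
    return Control(steering=steering, throttle=throttle, brake=brake)


# Example: the model holds "up" and "left" at the same time.
command = keys_to_control({"ArrowUp", "ArrowLeft"})
print(command)
```

With an adapter like this, the model keeps emitting the same kind of keyboard actions it learned from PC operation videos, and only the layer that interprets them changes.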

Standard Intelligence claims that FDM-1 moves PC operation AI from being constrained by training data to being constrained by compute.
in AI, Posted by log1o_hf







