2025年05月12日 20時00分ソフトウェア

ローカルで各種AIモデルを実行できる無料ソフト「llama.cpp」がマルチモーダル入力をサポートし画像の説明などが可能に

AIモデルをローカルで実行できるオープンソースソフトウェア「llama.cpp」が画像の入力に対応しました。画像とテキストを同時に入力して「この画像には何が写っている？」といった質問に回答させられます。

server : vision support via libmtmd by ngxson · Pull Request #12898 · ggml-org/llama.cpp · GitHub
https://github.com/ggml-org/llama.cpp/pull/12898

llama.cpp/docs/multimodal.md at master · ggml-org/llama.cpp · GitHub
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md

Trying out llama.cpp’s new vision support
https://simonwillison.net/2025/May/10/llama-cpp-vision/

llama.cppはGGUF形式に変換されたAIモデルを実行できるソフトウェアで、Llama以外にGemmaシリーズやDeepSeek-R1など多用なAIモデルをローカル環境で動かすことができます。また、Windows・macOS・Linuxに対応しているほか、MtalやCUDA、OpenCLなど数多くのバックエンドに対応しており、多様な環境でAIモデルを実行できます。

そんなllama.cppに画像入力機能を追加する変更が2025年5月10日にマージされました。llama.cppのWeb UIでポスター画像とともに「これは何？(what is this?)」と入力した例が以下。どんなポスターなのか説明する文章が生成されています。

Radeon RX 7900 XTXを搭載したPCでMistral Small 2504の4ビット量子化版を実行して画像とテキストの同時入力に成功したことも報告されています。以下の例では「イチゴがたくさん盛り付けられた皿」の写真を入力しつつ「『R』という文字を形成しているイチゴは何個？(How many strawberries are there in the letter "R"?)」という難しめの質問を投げかけています。報告者によると、正解することもあれば、間違えることもあったそうです。

ペリカンを使ったAIの性能測定に定評のあるエンジニアのサイモン・ウィリソン氏は、ノートPCでllama.cppを使ってGemma 3の4ビット量子化版を実行し、以下のペリカンの画像とともに「説明して(describe it)」というテキストを入力しました。

回答は以下のとおり。写真の特徴を詳しく説明できています。

It’s a striking photograph of a brown pelican standing in shallow water on a sandy beach. Here’s a breakdown of what I see:

Subject: The main focus is a large brown pelican. It’s standing with its wings fully extended, creating a dramatic pose.

Pose: The pelican is in a slightly awkward, almost comical stance, with its legs planted firmly in the wet sand and its wings spread wide. It looks like it might be preparing to take off or is simply enjoying the water.

Color & Texture: The pelican is a rich, dark brown with lighter markings on its wings and head. The sand is a muted beige color, and the water reflects the sunlight, creating a shimmering effect. You can see the texture of the sand and the ripples in the water.

Lighting: The lighting appears to be bright and sunny, casting shadows and highlighting the details of the pelican’s feathers.

Background: The background is blurred, suggesting a shallow depth of field, which keeps the focus on the pelican. It appears to be a beach scene.

Overall Impression: The image has a dynamic feel due to the pelican’s pose and the movement implied by the water. It’s a well-composed shot that captures a moment of wildlife in its natural habitat.

Do you want me to zoom in on a particular detail, or perhaps tell you more about pelicans?

llama.cppの詳細情報は以下のリンク先で公開されており、ソースコードやインストール手順などを確認できます。

GitHub - ggml-org/llama.cpp: LLM inference in C/C++
https://github.com/ggml-org/llama.cpp