MCP MVP #18655

Open
allozaur wants to merge 327 commits into ggml-org:master from allozaur:allozaur/mcp-mvp
Lines changed: 17839 additions & 6169 deletions

Conversation

@allozaur
Collaborator

@allozaur allozaur commented Jan 7, 2026

Note

Demos and description are WIP

Video demos

Adding a new MCP Server and using it within an Agentic Loop

demo1.mp4

Using MCP Prompts

demo2.mp4

Using MCP Resources

demo3.mp4

Image Generation and Web Search using different MCP servers

demo4.mp4

New features

  • Adding a System Message to a conversation, or injecting it into an existing one
  • CORS Proxy on llama-server backend side
  • MCP
    • Servers Selector
    • Settings with Server cards showing capabilities, instructions and other information
    • Tool Calls
      • Agentic Loop
        • Logic
        • UI with processing stats
    • Prompts
      • Detection logic in "Add" dropdown
      • Prompt Picker
      • Prompt Args Form
      • Prompt Attachments in Chat Form and Chat Messages
    • Resources
      • Browser with search & filetree view
      • Resource Attachments & Preview dialog
  • Show raw output switch under the assistant message
  • Favicon utility
  • Key-Value form component (used for MCP Server headers in add new/edit mode)

UI Improvements

  • Created TruncatedText component
  • Created Image CORS error fallback UI
  • Created DropdownMenuSearchable component
  • Created HorizontalScrollCarousel
  • Created CollapsibleContentBlock component + refactored Reasoning Content & Tool Calls
  • Code block improved UI + unfinished code block handling + syntax highlighting improvements
  • Max-height for components + autoscroll of overflowing content during streaming
  • New statistics UI + added new data for Tool Calls and Agentic summary
  • Better time formatting for Chat Message Statistics

Architecture refactors/improvements

  • Autoscroll hook
  • Renamed all service files to have .service.ts format
  • Common types definitions
  • Abort Signal utility + refactor
  • API Fetch utility
  • Cache TTL
  • Components folder naming & restructuring
  • Context API for editing messages and message actions
  • Enums & constants cleanup
  • Markdown Rendering improvements
  • Chat Form components
  • Resolving images in Markdown Content
  • Chat Attachments API improvements
  • Removed Model Change validation logic
  • Removed File Upload validation logic
  • New formatters
  • Storybook upgrade
  • New prop to define initial section when opening Chat Settings

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Jan 7, 2026

Thank you for the architectural unification! The SearchableDropdownMenu refactor is superb, we're making good progress!

Only remaining items/features for an MVP (& testing on my side):

  • Server naming based solely on the domain name is problematic for those wanting to host multiple MCP services on web subdirectories or subdomains (like on our home/personal servers, or a future integrated MCP backend!): we need to find a solution
  • Display at narrow widths needs improvement -> the MCP selector drops below the model name (yes, model names are long, plus I prefix MoE/Dense on all my models because this info really needs to be visible to everyone. Auto-detection from GGUF would be fantastic someday!)
  • Stress-test the context with heavy RAG; we've already run many long, complex agentic loops on the sandbox successfully (your automatic web-interface publication from the agent was cool btw)
  • The global on/off button on the first server seems strange to me

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Jan 9, 2026

I'm interested in this, so gave the PR a look and just wanted to ask, are local MCP servers planned to be supported?

Right now it looks like URL is required, without a local "command" "args" "env" type option. (for Node, NPX / UVX, Docker, etc) Might be able to get around this with a MCP proxy server, but built-in support of local servers like many MCP clients would be welcomed.

e.g. Cursor, VS Code, OpenCode, Roo Code, Antigravity, LM Studio, and others support the following with small variations:

```json
{
  "mcpServers": {
    "git": {
      "command": "uvx",
      "args": ["mcp-server-git"]
    },
    "name": {
      "command": "npx",
      "args": ["/path/index.js"],
      "env": { "VAR": "VAL" }
    }
  }
}
```

Lots of examples here.

I know it's still WIP, but just wanted to ask. Or maybe I've missed it?

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Jan 9, 2026

> I'm interested in this, so gave the PR a look and just wanted to ask, are local MCP servers planned to be supported?
>
> Right now it looks like URL is required, without a local "command" "args" "env" type option. (for Node, NPX / UVX, Docker, etc) Might be able to get around this with a MCP proxy server, but built-in support of local servers like many MCP clients would be welcomed.
>
> e.g. Cursor, VS Code, OpenCode, Roo Code, Antigravity, LM Studio, and others support the following with small variations:
>
> ```json
> {
>   "mcpServers": {
>     "git": {
>       "command": "uvx",
>       "args": ["mcp-server-git"]
>     },
>     "name": {
>       "command": "npx",
>       "args": ["/path/index.js"],
>       "env": { "VAR": "VAL" }
>     }
>   }
> }
> ```
>
> Lots of examples here.
>
> I know it's still WIP, but just wanted to ask. Or maybe I've missed it?

This PR is browser-side (Svelte) -> TCP (streamable-http / sse / websocket), so no stdio support on browser.
A backend relay in llama-server for stdio->HTTP MCP bridging would be possible but not yet implemented.
I have a personal Node.js proxy doing this (OAI->MCP (with stdio)->OAI hack), can share if useful.

EDIT: And once we have the MCP client in the browser, nothing prevents a small example script in Python or Node.js from relaying MCP to stdio :)
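Such a relay script could build on the fact that MCP's stdio transport exchanges newline-delimited JSON-RPC messages. The sketch below shows only the framing step, under that assumption; the HTTP side, process management, and all function names here are illustrative, not part of this PR or any existing relay.

```python
import json

# Hedged sketch: MCP's stdio transport frames each JSON-RPC message as one
# line of JSON terminated by a newline, with no embedded newlines. A relay
# would read HTTP request bodies on one side and write framed lines to the
# child server process's stdin on the other. Names are hypothetical.

def frame_stdio_message(message: dict) -> bytes:
    """Serialize a JSON-RPC message for the stdio transport (one line)."""
    line = json.dumps(message, separators=(",", ":"))
    if "\n" in line:
        raise ValueError("stdio-framed messages must not contain raw newlines")
    return (line + "\n").encode("utf-8")

def parse_stdio_line(line: bytes) -> dict:
    """Parse one newline-delimited JSON-RPC message from the server's stdout."""
    return json.loads(line.decode("utf-8"))

# Round trip with a JSON-RPC initialize request
req = {"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {}}
framed = frame_stdio_message(req)
assert parse_stdio_line(framed) == req
```

A real relay would additionally spawn the server with `subprocess.Popen` and pump these framed lines between the process pipes and an HTTP endpoint.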

@allozaur
Collaborator Author

allozaur commented Jan 9, 2026

> I'm interested in this, so gave the PR a look and just wanted to ask, are local MCP servers planned to be supported?
>
> Right now it looks like URL is required, without a local "command" "args" "env" type option. (for Node, NPX / UVX, Docker, etc) Might be able to get around this with a MCP proxy server, but built-in support of local servers like many MCP clients would be welcomed.
>
> e.g. Cursor, VS Code, OpenCode, Roo Code, Antigravity, LM Studio, and others support the following with small variations:
>
> ```json
> {
>   "mcpServers": {
>     "git": {
>       "command": "uvx",
>       "args": ["mcp-server-git"]
>     },
>     "name": {
>       "command": "npx",
>       "args": ["/path/index.js"],
>       "env": { "VAR": "VAL" }
>     }
>   }
> }
> ```
>
> Lots of examples here.
>
> I know it's still WIP, but just wanted to ask. Or maybe I've missed it?

Hey! We are introducing a solid basis for MCP support in llama.cpp, starting with a pure WebUI implementation. We will add further enhancements in the near future ;)

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Jan 9, 2026
@strawberrymelonpanda
Contributor

> This PR is browser-side (Svelte) -> TCP (streamable-http / sse / websocket), so no stdio support on browser.

> Hey! We are introducing a solid basis for MCP support in llama.cpp, starting with pure WebUI implementation.

Thanks folks, makes sense. I'll make a small proxy script then, just wanted to make sure I wasn't overlooking a component.

> I have a personal Node.js proxy doing this (OAI->MCP (with stdio)->OAI hack), can share if useful.

I think I can whip something up, but I wouldn't say no to a reference.

I would say that if the MCP button appears in the WebUI as-is, this sort of question is bound to come up. A small example proxy script in the docs might not be a bad idea.

I appreciate the work being done on this and other WebUI PRs.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Jan 16, 2026

Now we can work with image attachments properly, without overloading the context.

  • An MCP server can return an attachment; the SvelteUI doesn't saturate its context, but informs the LLM that an attachment is available, and it intuitively creates a Markdown link!

If the Markdown link points to an attachment, it's displayed.

Now we can do things like ChatGPT's DALL-E/Sora image generation, all locally: llama.cpp can query a stable-diffusion.cpp MCP server running Qwen3-Image, and from what I've tested (I'll make videos), the rendering quality is between DALL-E and Sora Image (an LLM prompts the image generator much better than a human, capturing the intent without leaving any ambiguity for the image model).

MCPAttach0 MCPAttach1
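The attachment flow described above can be sketched roughly as follows: when a tool result contains binary image content, store the blob and hand the model a short notice with a Markdown link, instead of inlining base64 into the context. All names here (`store_attachment`, the `/attachments/` URL scheme, the notice wording) are hypothetical, not the PR's actual API.

```python
import base64
import hashlib

_store: dict[str, bytes] = {}  # stand-in for the real attachment store

def store_attachment(data: bytes) -> str:
    """Store a blob and return a short URL for it (hypothetical scheme)."""
    key = hashlib.sha256(data).hexdigest()[:16]
    _store[key] = data
    return f"/attachments/{key}"

def tool_result_to_context(item: dict) -> str:
    """Convert one MCP tool-result content item into context-friendly text.

    Image items become a short notice plus a Markdown link the model can
    repeat in its answer; text items pass through unchanged.
    """
    if item.get("type") == "image":
        url = store_attachment(base64.b64decode(item["data"]))
        return f"[An image attachment is available.]({url})"
    return item.get("text", "")

img = {"type": "image",
       "data": base64.b64encode(b"\x89PNG...").decode(),
       "mimeType": "image/png"}
print(tool_result_to_context(img))
```

If the model then echoes that Markdown link in its reply, the UI can resolve it back to the stored attachment and render the image, which matches the behavior described in the comment above.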

@ServeurpersoCom
Collaborator

Since this works, all that remains is to refactor the CoT with the new pre-rendering format (client-specific tags <<< ... >>>) to have complete control over the context and sending the "interleaving" back to the server during an agentic loop. This will answer several questions from llama.cpp users and developers about powerful models like Minimax. And for Qwen, it will finally provide visibility into the CoT during the agentic loop!

@ServeurpersoCom
Collaborator

Testing interleaved reasoning blocks and tool calls. I need to remove the "Filter reasoning after first turn" option, which is useless now.

InterleavedThinkingBlockAndImage.mp4

@ServeurpersoCom
Collaborator

Obviously, with the refactoring of the CoT display, for now all the reasoning is sent back to the model with our proprietary UI tags included! A simple strip that reuses the regular expressions before sending it to the API will restore the previous functionality.
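Such a strip could look like the sketch below. The exact grammar of the client-specific tags is an assumption here; only the `<<< ... >>>` delimiters are taken from the comments above, and the regex simply removes any such span before the text goes back to the API.

```python
import re

# Hedged sketch: remove client-specific <<< ... >>> pre-rendering tags from
# reasoning text before it is sent back to the API. The tag grammar is an
# assumption; any non-greedy <<<...>>> span is stripped, including multiline.
TAG_RE = re.compile(r"<<<.*?>>>", re.DOTALL)

def strip_ui_tags(text: str) -> str:
    return TAG_RE.sub("", text)

assert strip_ui_tags("before <<<ui:block>>> after") == "before  after"
```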

Later, we'll need an option to choose whether to send this reasoning back to the backend, preserving the actual location of each reasoning block!

For the MCP tool responses, the model has access to everything during the agentic loop, but once a new loop is started, the previous loops are "compressed" to the last N lines set by the option, just like the display. This invalidates the token cache, so the backend has to re-evaluate the last modified agentic loop. This optimizes the context and doesn't degrade agentic performance, because the LLM is supposed to have already performed its synthesis.
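The compression step described above could be sketched as a simple tail-truncation of each old tool response. The default N and the marker text are assumptions for illustration, not the PR's actual values.

```python
# Hedged sketch: once a new agentic loop starts, earlier tool responses are
# truncated to their last N lines, mirroring what the UI displays. The
# truncation marker and default N are illustrative assumptions.

def compress_tool_response(text: str, max_lines: int = 10) -> str:
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text  # short responses stay intact
    omitted = len(lines) - max_lines
    marker = f"[... {omitted} lines truncated ...]"
    return "\n".join([marker] + lines[-max_lines:])

long_output = "\n".join(f"line {i}" for i in range(100))
compressed = compress_tool_response(long_output, max_lines=5)
print(compressed.splitlines()[0])  # truncation marker
```

Because the compressed text differs from what was originally in context, the KV cache for that span no longer matches, which is why the backend must re-evaluate from the last modified loop, as noted above.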

@ServeurpersoCom
Collaborator

A little fun with image generation through MCP (a separate server, distinct from the LLM server, dedicated to image generation; nothing prevents running both on one machine if there's enough VRAM for both inference instances)

MCP-Image-Gen2.mp4

@ServeurpersoCom
Collaborator

The CI fails because we simplified the code path so that the modality type is detected post-upload (better) instead of pre-upload (the filepicker approach was too limiting and incompatible). The Storybook tests still check the old UX, where the Images/Audio buttons were disabled based on model modalities, but that logic got nuked from ChatForm in the refactor. Now the filepicker accepts everything and validation happens client-side after upload, with a text fallback, so the tests are looking for DOM elements that don't exist anymore. The modality props became orphaned: ChatFormActions still receives them, but ChatForm doesn't compute or pass them anymore. We need to nuke these obsolete UI tests and keep the actual modality validation logic in unit tests, where it belongs.

@ServeurpersoCom
Collaborator

  • I really want to rebase this branch (but I'll hold off)!
  • The existing feature of graying out non-Vision models when there's an image upload in the conversation isn't ideal: when you send an image to a Vision model, it "transforms" the projection into text that can be processed by any other non-Vision LLM. Furthermore, the property is only loaded for models already in use. Personally, I don't like this feature; it unnecessarily restricts LLM use (Vision models are less powerful for some follow-up interactions, and there's also a wider selection of non-Vision models)
  • But I can fix this: an MCP tool-call response attachment should NOT be considered a user upload.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Jan 17, 2026

> • The existing feature of graying out non-Vision models if there's an image upload in the conversation isn't ideal: when you send an image to a Vision model, it "transforms" the projection into text that can be processed by any other non-Vision LLM.

I've certainly wondered about this scenario. There are cases where I can certainly see chaining vision and non-vision models to be useful, such as having a vision model do OCR and then a non-vision model follow-up. I'd love an omni-model that's on par, but in my tests, even the recent large Qwen vision models still lack in coding and logic compared to their non-vision counterparts.

That said, the "transformation" is incomplete and prompt-based, isn't it? Like describing or OCR'ing the image, rather than actually converting the image into non-vision text tokens? I assume it was grayed out because a non-vision model wouldn't be able to 'see' the image / make sense of the tokens.

Personally I'd totally support a PR to change it regardless.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Jan 17, 2026

> • The existing feature of graying out non-Vision models if there's an image upload in the conversation isn't ideal: when you send an image to a Vision model, it "transforms" the projection into text that can be processed by any other non-Vision LLM.
>
> I've certainly wondered about this scenario. There are cases where I can certainly see chaining vision and non-vision models to be useful, such as having a vision model do OCR and then a non-vision model follow-up. I'd love an omni-model that's on par, but in my tests, even the recent large Qwen vision models still lack in coding and logic compared to their non-vision counterparts.
>
> That said, the "transformation" is incomplete and based on prompt, isn't it? Like, describing the image or OCR'ing the image, rather than actually converting the image into non-vision text tokens? I assume it was grayed out because the non-vision model wouldn't be able to 'see' the image / make sense of the tokens,
>
> Personally I'd totally support a PR to change it regardless.

As a precaution, I simply fixed the problem. But since the question was raised, I performed a simple test:

I sent an image and asked the model NOT to respond; it accepted.

After several exchanges, I requested a very detailed description of the image, and it managed to provide it without error! The projection is indeed present in the backend and is destroyed if the model is changed. This needs to be tested even when switching from one VL model to another. -> This works.

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Jan 17, 2026

Actually, it's quite simple. The image remains attached in the client-side prompt, and of course it's sent again on the next request. A non-VL model would throw an error. So the feature is simply a safety measure.

Alternatively, you could just filter out the image and continue in text mode using the user-requested description instead, though this is less precise than having a VL model process the actual image. In fact, it would reduce code complexity and give users more freedom. You'd just need a small notification that the image and its full description are no longer considered in the context.
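The filtering alternative could be sketched as below: drop image parts from the outgoing messages and leave a short textual note in their place, instead of graying out non-vision models. The message shape follows the common OpenAI-style chat format; the note wording is an assumption, not anything this PR implements.

```python
# Hedged sketch: when the user switches to a non-vision model, remove
# image_url parts from OpenAI-style chat messages and append a short text
# note so the conversation stays coherent. The note text is illustrative.

def strip_images(messages: list[dict]) -> list[dict]:
    out = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            kept = [p for p in content if p.get("type") != "image_url"]
            if len(kept) < len(content):
                kept.append({"type": "text",
                             "text": "[An image was attached here but is no longer in context.]"})
            msg = {**msg, "content": kept}
        out.append(msg)
    return out

msgs = [{"role": "user", "content": [
    {"type": "text", "text": "What is in this picture?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]}]
print(len(strip_images(msgs)[0]["content"]))  # image part replaced by a note
```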


Labels

build (Compilation issues) · devops (improvements to build systems and github actions) · enhancement (New feature or request) · examples · server/webui · server

10 participants