More

ggerganov · 2025-01-31T14:59:19 1738335559

The llama.cpp tools and examples download the models by default to a OS-specific cache folder [0]. We try to follow the HF standard (as discussed in the linked thread), though the layout of the llama.cpp cache is not the same atm. Not sure about the plans for RamaLama, but it might be something worth to consider.

[0] https://github.com/ggerganov/llama.cpp/issues/7252

sitkack · 2025-01-31T16:17:19 1738340239

I think it would be the most important thing to consider, because the biggest thing that predecessor to RamaLama provided was a way to download a model (and run it).

If there was a contract about how models were laid out on disk, then downloading, managing and tracking model weights could be handled by a different tool or subsystem.

ecurtin · 2025-02-01T14:30:48 1738420248

In RamaLama an OCI container-like store is used (at least from the UX perspective it feels like that) for all models in RamaLama, it's protocol agnostic supports oci artefacts, huggingface, ollama, etc.

ggerganov · 2025-01-24T09:04:56 1737709496

Appreciate the feedback!

Currently, there isn't a user-friendly way to disable the stats from showing apart from modifying the "'show_info': 0" value directly in the plugin implementation. These things will be improved with time and will become more user-friendly.

A few extra optimizations will soon land which will further improve the experience:

- Speculative FIM

- Multiple suggestions

tomnipotent · 2025-01-26T03:16:28 1737861388

First extension I've used that perfectly autocompletes Go method receivers.

First tab completes just "func (t *Type)" so then I can type the first few characters of something I'm specifically looking for or wait for the first recommendation to kick in. I hope this isn't just a coincidence from the combination of model and settings...

ggerganov · 2025-01-23T18:29:25 1737656965

Hi HN, happy to see this here!

I highly recommend to take a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].

Also, the same plugin is available for VS Code [1].

Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.

[0] - https://github.com/ggerganov/llama.cpp/pull/9787

[1] - https://github.com/ggml-org/llama.vscode

amrrs · 2025-01-23T18:34:01 1737657241

For those who don't know, He is the gg of `gguf`. Thank you for all your contributions! Literally the core of Ollama, LMStudio, Jan and multiple other apps!

kennethologist · 2025-01-24T02:45:47 1737686747

A. Legend. Thanks for having DeepSeek available so quickly in LM Studio.

sergiotapia · 2025-01-23T19:18:29 1737659909

well hot damn! killing it!

halyconWays · 2025-01-23T19:47:56 1737661676

[flagged]

kamranjon · 2025-01-23T22:28:50 1737671330

They collaborate together! Her name is Justine Tunney - she took her “execute everywhere” work with Cosmopolitan to make Llamafile using the llama.cpp work that Giorgi has done.

halyconWays · 2025-02-02T02:52:37 1738464757

She actually stole that code from a user named slaren and was personally banned by Gerg from the llama.cpp repo for about a year because of it. Also it was just lazy loading the weights, it wasn't actually a 50% reduction.

https://news.ycombinator.com/item?id=35411909

kamranjon · 2025-02-08T19:43:21 1739043801

That seems like a false narrative, which is strange because you could have just read the explanation from Jart a little further down in the thread:

https://news.ycombinator.com/item?id=35413289

madeforhnyo · 2025-01-23T22:22:12 1737670932

Someone did? Could you pls share a link?

bangaladore · 2025-01-23T20:13:30 1737663210

Quick testing on vscode to see if I'd consider replacing Copilot with this. Biggest showstopper right now for me is the output length is substantially small. The default length is set to 256, but even if I up it to 4096, I'm not getting any larger chunks of code.

Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really make to try to autocomplete lines and not blocks like Copilot will.

Thanks :)

ggerganov · 2025-01-23T20:23:50 1737663830

There are 4 stopping criteria atm:

- Generation time exceeded (configurable in the plugin config)

- Number of tokens exceeded (not the case since you increased it)

- Indentation - stops generating if the next line has shorter indent than the first line

- Small probability of the sampled token

Most likely you are hitting the last criteria. It's something that should be improved in some way, but I am not very sure how. Currently, it is using a very basic token sampling strategy with a custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.

bangaladore · 2025-01-23T20:36:52 1737664612

Hmm, interesting.

I didn't catch T_max_predict_ms and upped that to 5000ms for fun. Doesn't seem to make a difference, so I'm guessing you are right.

eklavya · 2025-01-24T11:45:43 1737719143

Thanks for sharing the vscode link. After trying I have disabled the continue.dev extension and ollama. For me this is wayyyyy faster.

jerpint · 2025-01-23T18:31:01 1737657061

Thank you for all of your incredible contributions!

liuliu · 2025-01-23T18:50:25 1737658225

KV cache shifting is interesting!

Just curious: how much of your code nowadays completed by LLM?

ggerganov · 2025-01-23T18:54:55 1737658495

Yes, I think it is surprising that it works.

I think a fairly large amount, though can't give a good number. I have been using Github Copilot from the very early days and with the release of Qwen Coder last year have fully switched to using local completions. I don't use the chat workflow to code though, only FIM.

menaerus · 2025-01-24T08:37:32 1737707852

Interesting approach.

Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?

I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). Former would mean that llama.cpp will only be using the context that is already encoded in the model itself but no local (code) context besides the one provided in the input tokens but perhaps I misunderstood something.

Thanks for your contributions and obviously the large amount of time you take to document your work!

ggerganov · 2025-01-24T09:18:14 1737710294

The primary tricks for reducing the latency are around context reuse, meaning that the computed KV cache of tokens from previous requests is reused for new requests and thus computation is saved.

To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more inline with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit because each request would need to compute (a.k.a prefill) a lot of tokens.

The KV cache shifts used here is an approach to reuse the cache of old tokens by "shifting" them in new absolute positions in the new context. This way a request that would normally require a context of lets say 10k tokens, could be processed more quickly by computing just lets say 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10 fold.

The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.

The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.

menaerus · 2025-01-24T09:43:56 1737711836

Ok, so --ctx-size with a value != 0 means that we can override the default model context size. Since for obvious computation cost reasons we cannot use the 32k fresh context per each request, the trick you make is to use the 1k context (batch that includes local and semi-local code) that you enrich with the previous model responses by keeping them in and feeding them from KV cache? To increase the correlation between the current request and previous responses you do the shifting in KV cache?

ggerganov · 2025-01-24T10:00:49 1737712849

Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.

To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "rink_chunk_size". With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger, will improve the quality but will affect the performance.

There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.

menaerus · 2025-01-24T10:19:32 1737713972

Since qwen 2.5 turbo with 1M context size is advertised to be able to crunch ~30k LoC, I guess we can say then that the 32k qwen 2.5 model is capable of ~960 LoC and therefore 32k model with an upper bound of context set to 8k is capable of ~250 LoC?

Not bad.

gloflo · 2025-01-23T21:49:50 1737668990

What is FIM?

jjnoakes · 2025-01-23T21:56:08 1737669368

Fill-in-the-middle. If your cursor is in the middle of a file instead of at the end, then the LLM will consider text after the cursor in addition to the text before the cursor. Some LLMs can only look before the cursor; for coding,.ones that can FIM work better (for me at least).

rav · 2025-01-23T21:56:24 1737669384

FIM is "fill in middle", i.e. completion in a text editor using context on both sides of the cursor.

LoganDark · 2025-01-24T22:44:18 1737758658

llama.cpp supports FIM?

attentive · 2025-01-23T22:42:47 1737672167

Is it correct to assume this plugin won't work with ollama?

If so, what's ollama missing?

mistercheph · 2025-01-24T00:32:05 1737678725

this plugin is designed specifically for the llama.cpp server api, if you want copilot like features with ollama, you can use an ollama instance as a drop-in replacement for github copilot with this plugin: https://github.com/bernardo-bruning/ollama-copilot

There is also https://github.com/olimorris/codecompanion.nvim which doesn't have text completion, but supports a lot of other AI editor workflows that I believe are inspired by Zed and supports ollama out of the box

nancyp · 2025-01-23T18:43:39 1737657819

TIL: VIM has it's own language. Thanks Georgi for LLAMA.cpp!

nacs · 2025-01-23T19:27:52 1737660472

Vim is incredibly extensible.

You can use C or VIMscript but programs like Neovim support Lua as well which makes it really easy to make plugins.

halyconWays · 2025-01-23T19:47:11 1737661631

Please make one for Jetbrains' IDEs!

ggerganov · on Feb 28, 2024

> Thanks to the amazing work of @ggerganov on llama.cpp which made this possible. If there is anything that you wish to exist in an ideal local AI app, I'd love to hear about it.

The app looks great! Likewise, if you have any requests or ideas for improving llama.cpp, please don't hesitate to open an issue / discussion in the repo

xyc · on Feb 28, 2024

Oh wow it's the goat himself, love how your work has democratized AI. Thanks so much for the encouragement. I'm mostly a UI/app engineer, total beginner when it comes to llama.cpp, would love to learn more and help along the way.

titaniumtown · on Feb 28, 2024

Wow I've been following your work for a while, incredible stuff! Keep up the hard work, I check llama.cpp's commits and PRs very frequently and always see something interesting in the works (the alternative quantization methods and Flash Attention have been interesting).

petargyurov · on Feb 28, 2024

Did not expect to see the Georgi Gerganov here :) How is GGML going?

Поздрави!

ggerganov · on Feb 28, 2024

So far is going great! Good community, having fun. Many ideas to explore :-)

duckkg5 · on Feb 28, 2024

Nothing to add except that your work is tremendous

ggerganov · on Nov 6, 2023

I've found lowering the temperature and disabling the repetition penalty can help [0]. My explanation is that the repetition penalty penalizes the end of sentences and sort of forces the generation to go on instead of stopping.

[0] https://old.reddit.com/r/LocalLLaMA/comments/17e855d/llamacp...

ggerganov · on Nov 2, 2023

Yes, I was planning to do this back then, but other stuff came up. There are many different ways in which this simple example can be improved:

- better detection of when speech ends (currently basic adaptive threshold)

- use small LLM for quick response with something generic while big LLM computes

- TTS streaming in chunks or sentences

One of the better OSS versions of such chatbot I think is https://github.com/yacineMTB/talk. Though probably many other similar projects also exist by now.

generalizations · on Nov 2, 2023

I keep wondering if a small LLM can also be used to help detect when the speaker has finished speaking their thought, not just when they've paused speaking.

drunkenmagician · on Nov 2, 2023

Maybe using a voice activity detector, VAD would be a lighter (less resources required) option.

generalizations · on Nov 3, 2023

That works when you know what you’re going to say. A human knows when you’re pausing to think, but have a thought you’re in the middle of expressing. A VAD doesn’t know this and would interrupt when it hears a silence of N seconds; a lightweight LLM would know to keep waiting despite the silence.

cjbprime · on Nov 3, 2023

And the inverse: the VAD would wait longer than necessary after a person says e.g. "What do you think?", in case they were still in the middle of talking.

rjtavares · on Nov 2, 2023

> use small LLM for quick response with something generic while big LLM computes

Can't wait for poorly implemented chat apps to always start a response with "That's a great question!"

Joeri · on Nov 2, 2023

“Uhm, i mean, like, you know” would indeed be a little more human.

avarun · on Nov 2, 2023

Just like poorly implemented human brains tend to do :P

ggerganov · on Nov 2, 2023

Heh, funny to see this popup here :)

The performance on Apple Silicon should be much better today compared to what is shown in the video as whisper.cpp now runs fully on the GPU and there have been significant improvements in llama.cpp generation speed over the last few months.

boesboes · on Nov 2, 2023

13 minutes between this and the commit of a new demo video, not bad :D

And impressive performance indeed!

tomtom1337 · on Nov 3, 2023

Ah, forget the other message, I watched the videos in the wrong order! And I can’t delete or edit using the Hack app!

tomtom1337 · on Nov 3, 2023

Is it just me, or is the gpu version actually slower to respond?

A4ET8a8uTh0 · on Nov 2, 2023

You are kinda famous now man. Odds are, people follow your github religiously.

MuffinFlavored · on Nov 2, 2023

Is ggerganov to LLM what Fabrice Bellard is to QuickJS/QEMU/FFMPEG?

actionfromafar · on Nov 3, 2023

That's a big burden to place on anyone.

asadm · on Nov 2, 2023

I have sent a PR to move that new demo to the top. I think the new demo is significantly better.

sgt · on Nov 3, 2023

Is running this on Apple Silicon the most cost effective way to run this, or can it be done cheaper on a beefed up homelab Linux server?

v3ss0n · on Nov 2, 2023

will this work with latested distilled llama?

ggerganov · on July 24, 2023

> I don't recall the details exactly, but I don't think it ever did very much.

How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having less outliers as a result is very beneficial for more accurate quantization of the data

danielmarkbruce · on July 24, 2023

Are you asking "why would you have bothered to look at"?

The "how" is pretty straightforward.

p1esk · on July 25, 2023

He's questioning the statement: "I don't think [the trick] ever did very much", because no one has yet looked at whether the trick helps reducing outliers in very large models. If it does help with this, as the blog author believes, then it is indeed a very useful trick.

danielmarkbruce · on July 25, 2023

Is he? A surface level reading suggests he's asking "how would you know".. and the answer is... by looking at the parameters. People do that.

>> because no one has yet looked at whether the trick helps reducing outliers in very large models

Given a softmax version doing exactly as the blog post says is baked into a google library (see this thread), and you can set it as a parameter in a pytorch model (see this thread), this claim seems off. "Let's try X, oh, X doesn't do much, let's not write a paper about it" is extremely common for many X.

tudorw · on July 25, 2023

This would seem like a really good argument as to why failures should be written up, otherwise where is the list of what has been tried before?

danielmarkbruce · on July 25, 2023

Yup, it is. But it isn't going to happen.

ggerganov · on July 25, 2023

Yes, I assumed that checking the weights for presence and amount of outliers is not something that is usually done and effects on this can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.

Agree - the "how" is straightforward

ggerganov · on July 5, 2023

My POV is that llama.cpp is primarily a playground for adding new features to the core ggml library and in the long run an interface for efficient LLM inference. The purpose of the examples in the repo is to demonstrate ways of how to use the ggml library and the LLM interface. The examples are decoupled from the primary code - i.e. you can delete all of them and the project will continue to function and build properly. So we can afford to expand them more freely as long as people find them useful and there is enough help for maintaining them. Still, we try to keep the 3rd party dependencies to a minimum so that the build process is simple and accessible

There was a similar "dilemma" about the GPU support - initially I didn't envision adding GPU support to the core library as I thought that things will become very entangled and hard to maintain. But eventually, we found a way to extend the library with different GPU backends in a relatively well decoupled way. So now, we have various developers maintaining and contributing to the backends in a nice independent way. Each backend can be deleted and you will still be able to build the project and use it.

So I guess we are optimizing for how easy it is to delete things :)

Note that the project is still pretty much a "big hack" - it supports just LLaMA models and derivatives, therefore it is easy atm. The more "general purpose" it becomes, the more difficult things become to design and maintain. This is the main challenge I'm thinking how to solve, but for sure keeping stuff minimalistic and small is a great help so far

> What if GG didn't want such a thing? When is something like this better for a separately maintained repo and not a main merge? How do you know when it is OK to submit a PR to add something like this without overstepping (or is it always?)

I try to explain my vision for the project in the issues and the discussion. I think most of the developers are very well aligned with it and can already tell what is a good addition or not

aidenn0 · on July 5, 2023

Thank you for the ggml library, by the way. It let me play around with whisper in a sane manner. To run the CUDA torch versions, I needed to shut down X to free enough GPU memory for the medium model, and the small model might require me to quit firefox. With ggml, I can use cublas and run even the large model with a huge speedup compared to CPU only torch.

mk_stjames · on July 5, 2023

Thanks for replying to me directly! I'm finding it fascinating to follow this project. Good luck with your company Georgi.

vitaminka · on July 5, 2023

i’m curious, what’s is the approach for maintainable and decoupled various gpu backends?

ggerganov · on July 5, 2023

It was designed in #915 (read just the OP and the linked PRs at the end) and the implementation pretty much follows it closely, at least for the Metal backend. The CUDA and OpenCL backends are currently slightly coupled in ggml as they started developing before #915, but I think we'll resolve this eventually.

#915 - https://github.com/ggerganov/llama.cpp/discussions/915

vitaminka · on July 5, 2023

interesting decoupling method, ty :)

ggerganov · on June 6, 2023

I'm planning to write code and have fun!

beardog · on June 6, 2023

>ggml.ai is a company founded by Georgi Gerganov to support the development of ggml. Nat Friedman and Daniel Gross provided the pre-seed funding.

Did you give them a different answer? It is okay if you can't or don't want to share, but I doubt the company is only planning to have fun. Regardless, best of luck to you and thank you for your efforts so far.

jgrahamc · on June 6, 2023

This is a good plan.

az226 · on June 6, 2023

Have you thought about what your path looks like to get to the next phase? Are you taking on any more investors pre-seee?