r/LocalLLaMA 23h ago

KAN: Kolmogorov-Arnold Networks

Paper: https://arxiv.org/abs/2404.19756

Code: https://github.com/KindXiaoming/pykan

Quick intro: https://kindxiaoming.github.io/pykan/intro.html

Documentation: https://kindxiaoming.github.io/pykan/

Abstract:

Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
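
To get a feel for the structural difference, here is a minimal PyTorch sketch of a KAN-style layer: every edge carries its own learnable 1D function (a small Gaussian-RBF basis below stands in for the paper's B-splines plus residual term), and there are no separate linear weights. This is only an illustration of the idea, not the pykan implementation.

```python
# Illustrative sketch only (not pykan): each edge (i -> j) has a learnable
# univariate function phi_ij, parameterized here by a fixed RBF basis.
import torch
import torch.nn as nn

class KANEdgeLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_basis=8, grid_range=(-1.0, 1.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid_range, num_basis))
        self.width = (grid_range[1] - grid_range[0]) / num_basis
        # one coefficient vector per edge: in_dim * out_dim univariate functions
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                                   # x: (batch, in_dim)
        # evaluate every basis function at every input coordinate
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # y_j = sum_i phi_ij(x_i), with phi_ij(x_i) = sum_k coef[j,i,k] * basis_k(x_i)
        return torch.einsum("bik,oik->bo", basis, self.coef)

# a [2, 5, 1] KAN-shaped network (pykan additionally manages grids, pruning, etc.)
model = nn.Sequential(KANEdgeLayer(2, 5), KANEdgeLayer(5, 1))
y = model(torch.rand(16, 2) * 2 - 1)                        # -> shape (16, 1)
```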


115 Upvotes

14 comments

42

u/koflerdavid 21h ago

This was not an easy read, but it was worth it! Well, it probably can't be used to convert existing MLPs. But:

  • KANs promise to be way smaller than MLPs and to generalize better, since they don't suffer as much from the curse of dimensionality,
  • KANs come with an effective pruning technique that doesn't seem to harm performance too much,
  • the splines can often be replaced by more efficient functions with a similar shape, which is easy because they are 1D functions (see the sketch after this list), and
  • the spline foundation allows an existing network to be scaled up to accommodate more samples. If it can also be scaled down, that would offer another way for quantization schemes to squeeze down a model.
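
As a toy illustration of the third point: because each learned activation is just a 1D function, swapping it for a cheaper symbolic form is a one-dimensional curve fit. The sine stand-in and the candidate form below are assumptions for the example, not pykan's symbolic-regression API.

```python
# Hedged sketch: fit a symbolic candidate to a learned 1D activation.
import numpy as np
from scipy.optimize import curve_fit

xs = np.linspace(-3, 3, 200)
learned = np.sin(2 * xs) + 0.05 * np.random.randn(200)  # stand-in for a trained spline

def candidate(x, a, b, c):          # hypothesized symbolic form a*sin(b*x) + c
    return a * np.sin(b * x) + c

params, _ = curve_fit(candidate, xs, learned, p0=[1.0, 2.0, 0.0])
print(params)                       # ~[1, 2, 0]: the spline collapses to a*sin(b*x) + c
```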

KANs probably won't be able to take advantage of GPUs as much as MLPs, because current hardware is mostly optimized for matmuls, but combined with the smaller model sizes, KANs sound promising.

19

u/nickmitchko 20h ago

An ML101 question I hear a lot: which activation function do I use? Maybe the answer is: you don't pick one! Instead, train a spline representing your activation function.

Very interesting, and it would probably be a ground-level step forward if it works.
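
As a minimal sketch of the "learn the activation instead of choosing one" idea (a toy learnable piecewise-linear activation dropped into an ordinary MLP, not the paper's B-spline edge functions):

```python
# Hedged sketch: a single learnable 1D activation replacing a fixed ReLU.
import torch
import torch.nn as nn

class LearnedActivation(nn.Module):
    def __init__(self, num_knots=16, x_min=-3.0, x_max=3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.values = nn.Parameter(torch.relu(self.knots).clone())  # start as ReLU

    def forward(self, x):
        x = x.clamp(self.x_min, self.x_max)
        idx = torch.bucketize(x, self.knots[1:-1])       # which segment each value falls in
        x0, x1 = self.knots[idx], self.knots[idx + 1]
        y0, y1 = self.values[idx], self.values[idx + 1]
        t = (x - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)                        # piecewise-linear interpolation

mlp = nn.Sequential(nn.Linear(4, 32), LearnedActivation(), nn.Linear(32, 1))
out = mlp(torch.randn(8, 4))                             # -> shape (8, 1)
```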

1

u/medialoungeguy 4m ago

Wow that's wild.

14

u/metaprotium 11h ago

alternate paper name: KAN (Kolmogorov-Arnold Networks) replace MLPs? Yes they KAN!

1

u/OpusLatericium 8h ago

Paper titles are getting out of hand as it is. I say we rein it in before it's too late!

5

u/Singsoon89 8h ago

My takeaway is that KAN is an alternative architecture to MLPs, and that, similarly to how Hinton et al. figured out how to take MLPs from a few layers to dozens or more when they discovered back-propagation, this is essentially the discovery of back-propagation for KANs.

So we now have an alternative architecture that will have different strengths and weaknesses that could potentially be used in combo with MLPs.

It is definitely a breakthrough of sorts. It remains to be seen what will come out of it, but my takeaway is: stay tuned and watch this space.

12

u/epicwisdom 21h ago

More appropriate for /r/MachineLearning. No direct relevance to LLMs.

Although it's an interesting research direction, replacing matmuls + element-wise scaling + ReLU is giving up a lot of performance. It's intuitive that KANs would be better at modeling PDEs, but there's no clear benefit for language or vision models.

22

u/SeawaterFlows 20h ago

No direct relevance to LLMs.

there's no clear benefit for language or vision models

This remains to be seen, no? I assume many teams will at least try to incorporate this into genAI architectures. The authors themselves even say that you could try to replace the MLPs in transformers with KANs.

4

u/Interesting_Bison530 18h ago

It'd be cool to see how this compares against grouped convolutions and the dense networks at the end of GPTs.

1

u/dogesator Waiting for Llama 3 1h ago

In the paper, KANs showed roughly 10 times slower training per step, but only when comparing at a fixed parameter count. That ignores the fact that the paper also claims about 100X better parameter efficiency, with faster scaling laws than MLPs, meaning a 1M-parameter KAN theoretically reaches the same loss on the same dataset as a 100M-parameter MLP. So overall, the KAN is still about 10 times faster than an MLP of equivalent capability, while also having a much smaller VRAM footprint. That's perfect for locally run models, where VRAM capacity and memory bandwidth are among the biggest limiting factors. It only loses when comparing at the same parameter count, which isn't a very useful real-world comparison when the actual capabilities of the 1M KAN model seem to be significantly beyond even a 10M MLP. I guess we'll all have to wait and see how well it can be implemented and generalized in future work.
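
The back-of-the-envelope arithmetic behind that claim, taking the paper's claimed ratios as assumptions rather than measurements:

```python
# Assumed ratios from the paper's claims (not measurements):
per_param_slowdown = 10      # each KAN parameter ~10x more expensive per step
param_efficiency = 100       # ~100x fewer KAN parameters for the same loss

net_step_speedup = param_efficiency / per_param_slowdown
print(net_step_speedup)      # 10.0 -> an equal-quality KAN is ~10x cheaper per step
print(param_efficiency)      # ...with a ~100x smaller parameter (VRAM) footprint
```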

2

u/dogesator Waiting for Llama 3 1h ago

Here is a summary I wrote for a friend:

KAN strives to replace the MLP, which is a major component of transformers, making up about 70% of all transformer parameters and accounting for about 95% of all compute during transformer inference.

The KAN paper claims 100X better parameter efficiency than MLPs, and if I'm reading it right, they basically mean that for a given dataset, 1B KAN parameters achieve the same loss as 100B MLP parameters… The downside is that each KAN parameter is, on average, about 10X slower than an MLP parameter.

But even though it's 10X slower at the same parameter count, 10B KAN parameters would be about 10X faster than a 1T MLP model while theoretically reaching at least the same quality (assuming the loss improvements extrapolate well to general real-world improvements).

BUT the KAN paper also states that KANs scale faster than MLPs, meaning their capabilities increase more as you increase the parameter count, compared to MLPs.

So a 10B KAN network might actually be more equivalent to a 2T MLP network in terms of quality. But even if the 10B KAN is only as good as a 200B MLP network in real-world abilities, that's still a network with around a 20X smaller VRAM footprint than an equivalent-quality MLP model, while being at least twice as fast in both training and inference.

Also, another caveat to mention:

The speed gains in local inference could be much higher than that, because in local environments with a batch size of 1 you're typically very memory-bandwidth constrained, not so much FLOPS constrained. So the 10B KAN model might be 10 times faster or more than the 200B MLP, depending on the memory-bandwidth-to-FLOPS ratio of the hardware you're running on.

Best case scenario: The 10B KAN model is 20 times faster than the 200B MLP network.

Worst case scenario: The 10B KAN model is only around 2 times faster than the 200B MLP network.

Limitations: It still remains to be seen how much that loss difference translates into real-world quality once KAN is actually integrated into a transformer the way an MLP is, and the best approach to integrating it with transformers still needs to be figured out. But I'm hopeful.
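
A rough sketch of the bandwidth arithmetic behind the best-case number above; the fp16 weights and the 1 TB/s figure are illustrative assumptions, not values from the paper.

```python
# Batch-size-1 inference is roughly memory-bound: every weight is read once per
# token, so throughput ~ bandwidth / model size.
BYTES_PER_PARAM = 2          # fp16 (assumption)
BANDWIDTH = 1e12             # 1 TB/s, hypothetical local GPU (assumption)

def max_tokens_per_sec(n_params):
    return BANDWIDTH / (n_params * BYTES_PER_PARAM)

kan_10b = max_tokens_per_sec(10e9)      # ~50 tok/s
mlp_200b = max_tokens_per_sec(200e9)    # ~2.5 tok/s
print(kan_10b / mlp_200b)               # 20.0 -> the best-case ~20x above
```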

4

u/vTuanpham 13h ago

I will pay $0 to anyone who can ELI5 this for me.

3

u/Singsoon89 8h ago

See above.