r/LocalLLaMA • u/SeawaterFlows • 23h ago
KAN: Kolmogorov-Arnold Networks
Paper: https://arxiv.org/abs/2404.19756
Code: https://github.com/KindXiaoming/pykan
Quick intro: https://kindxiaoming.github.io/pykan/intro.html
Documentation: https://kindxiaoming.github.io/pykan/
Abstract:
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
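To make "learnable activation functions on edges" concrete, here is a minimal PyTorch sketch of a KAN-style layer. This is illustrative only, not the pykan implementation: the class name is made up, and a Gaussian RBF basis stands in for the paper's B-splines (the paper also adds a SiLU base term and grid refinement, which this sketch skips).

```python
# Minimal sketch of a KAN-style layer (illustrative, not pykan). Each edge
# (i, j) carries its own learnable univariate function, parametrized here by
# learnable coefficients over a fixed Gaussian RBF basis; each output node is
# the sum of its incoming edge functions. There is no linear weight matrix.
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        grid = torch.linspace(x_min, x_max, num_basis)
        self.register_buffer("grid", grid)              # fixed basis centers
        self.width = (x_max - x_min) / (num_basis - 1)  # basis bandwidth
        # one coefficient per (output node, input node, basis function)
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)

    def forward(self, x):                               # x: (..., in_dim)
        # evaluate every basis function on every input coordinate
        basis = torch.exp(-((x.unsqueeze(-1) - self.grid) / self.width) ** 2)
        # out[..., j] = sum_i phi_{j,i}(x_i); ellipsis keeps any leading
        # batch/sequence dims intact
        return torch.einsum("...ik,jik->...j", basis, self.coef)

# stacking layers gives a full KAN-style network
model = nn.Sequential(ToyKANLayer(2, 5), ToyKANLayer(5, 1))
y = model(torch.randn(16, 2))                           # -> shape (16, 1)
```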
19
u/nickmitchko 20h ago
An ML 101 question I hear a lot: which activation function do I use? Maybe the answer is: you don't pick one! Instead, you train a spline that represents your activation function.
Very interesting, and it would probably be a ground-level step forward if it works.
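Something like this toy drop-in, for instance (my own piecewise-linear lookup-table parametrization, not a proper B-spline and not from the paper, just to show the shape of the idea):

```python
# Hedged sketch: instead of choosing ReLU/GELU, learn the activation itself as
# a 1-D piecewise-linear function shared across the layer.
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    def __init__(self, num_knots=16, x_min=-3.0, x_max=3.0):
        super().__init__()
        self.x_min, self.x_max = x_min, x_max
        self.num_knots = num_knots
        knots = torch.linspace(x_min, x_max, num_knots)
        # initialize the learnable values to ReLU(knots) so training starts sane
        self.values = nn.Parameter(torch.relu(knots))

    def forward(self, x):
        # map x to a fractional index into the knot table, then interpolate
        t = (x.clamp(self.x_min, self.x_max) - self.x_min) / (self.x_max - self.x_min)
        idx = t * (self.num_knots - 1)
        lo = idx.floor().long().clamp(max=self.num_knots - 2)
        frac = idx - lo.float()
        return torch.lerp(self.values[lo], self.values[lo + 1], frac)

# drop-in usage inside an ordinary MLP
mlp = nn.Sequential(nn.Linear(64, 128), LearnableActivation(), nn.Linear(128, 1))
```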
1
14
u/metaprotium 11h ago
alternate paper name: KAN (Kolmogorov-Arnold Networks) replace MLPs? Yes they KAN!
1
u/OpusLatericium 8h ago
Paper titles are getting out of hand as it is. I say we rein it in before it's too late!
5
u/Singsoon89 8h ago
My takeaway is that KAN is an alternative architecture to MLPs, and that, much as the discovery of back-propagation let Hinton et al. take MLPs from a few layers to dozens or more, this is essentially the back-propagation moment for KANs.
So we now have an alternative architecture that will have different strengths and weaknesses that could potentially be used in combo with MLPs.
It is definitely a breakthrough of sorts. Remains to be seen what will come out of it but my takeaway is stay tuned and watch that space.
12
u/epicwisdom 21h ago
More appropriate for /r/MachineLearning. No direct relevance to LLMs.
Although it's an interesting research direction, replacing matmuls + element-wise scaling + ReLU is giving up a lot of performance. It's intuitive that KANs would be better at modeling PDEs, but there's no clear benefit for language or vision models.
22
u/SeawaterFlows 20h ago
> No direct relevance to LLMs.

> there's no clear benefit for language or vision models
This remains to be seen, no? I assume many teams will at least try to incorporate this into genAI architectures. The authors themselves even say that you could try to replace the MLPs in transformers with KANs.
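A rough guess at what that could look like, reusing the ToyKANLayer sketch from the top of the thread (this is speculative wiring, not something the paper or pykan ships):

```python
# Sketch of a transformer block whose feed-forward MLP is swapped for KAN-style
# layers. Assumes the ToyKANLayer class from the sketch further up the thread.
import torch
import torch.nn as nn

class KANTransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_hidden=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # the usual Linear -> GELU -> Linear feed-forward becomes two KAN layers
        self.ffn = nn.Sequential(
            ToyKANLayer(d_model, d_hidden),
            ToyKANLayer(d_hidden, d_model),
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x

block = KANTransformerBlock()
out = block(torch.randn(2, 16, 512))            # -> shape (2, 16, 512)
```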
4
u/Interesting_Bison530 18h ago
It’d be cool to see how this compares against grouped convolutions and the dense networks at the end of GPTs.
1
u/dogesator Waiting for Llama 3 1h ago
The paper shows KAN training is about 10x slower per step, but only when measured at a fixed parameter count. That ignores the fact that the paper also claims roughly 100x better parameter efficiency and faster scaling laws than MLPs, meaning a 1M-parameter KAN theoretically reaches the same loss on the same dataset as a 100M-parameter MLP. So overall the KAN model is still about 10x faster than an equivalent-capability MLP model, while also having a much smaller VRAM footprint. That's perfect for locally run models, where VRAM capacity and memory bandwidth are among the biggest limiting factors. It's only slower when comparing at the same parameter count, which isn't a very useful real-world comparison when the actual capabilities of the 1M KAN model seem to be significantly beyond even a 10M MLP model. I guess we'll all have to wait and see how well it can be implemented and generalized in future work.
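Back-of-envelope version of that, taking the ~10x per-parameter cost and ~100x parameter efficiency claims at face value (both numbers are my reading of the paper, not verified):

```python
# Rough cost comparison under the claims above.
cost_per_param_kan = 10.0   # a KAN parameter is claimed ~10x slower than an MLP one
param_efficiency   = 100.0  # 1 KAN param claimed to match ~100 MLP params in loss

mlp_params = 100e6                                   # 100M-parameter MLP baseline
kan_params = mlp_params / param_efficiency           # ~1M KAN params for the same loss

relative_cost = (kan_params * cost_per_param_kan) / (mlp_params * 1.0)
print(relative_cost)  # 0.1 -> the matched-loss KAN is ~10x cheaper per step
```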
2
u/dogesator Waiting for Llama 3 1h ago
Here is a summary I wrote for a friend:
KAN aims to replace the MLP, which is a major component of transformers, making up about 70% of all transformer parameters and accounting for about 95% of the compute during transformer inference.
The KAN paper claims 100x better parameter efficiency than MLPs, and if I'm reading it right, they basically mean that for a given dataset, 1B KAN parameters achieve the same loss as 100B MLP parameters… The downside is that each KAN parameter is on average about 10x slower than an MLP parameter.
But even though it's 10x slower at the same parameter count, a 10B KAN model would be about 10x faster than a 1T MLP model while theoretically reaching at least the same quality (assuming the loss improvements extrapolate well to general real-world improvements).
BUT the KAN paper also states that KANs scale faster than MLPs, meaning capabilities increase more as you increase the parameter count, compared to MLPs.
So a 10B KAN network might actually be closer to a 2T MLP network in quality. But even if a 10B KAN is only as good as a 200B MLP network in real-world abilities, that's still a network with around a 20x smaller VRAM footprint than an equivalent-quality MLP model, while being at least twice as fast in both training and inference.
Another caveat worth mentioning:
The speed gains in local inference could be much higher than that, because in local environments with a batch size of 1 you're typically memory-bandwidth constrained, not so much FLOPS constrained. So the 10B KAN model might be more like 10 times faster or more than the 200B MLP, depending on the memory-bandwidth-to-FLOPS ratio of the hardware you're running on.
Best case scenario: the 10B KAN model is 20 times faster than the 200B MLP network.
Worst case scenario: the 10B KAN model is only around 2 times faster than the 200B MLP network.
Limitations: it still remains to be seen how much that loss difference translates to real-world quality once KAN is actually integrated into a transformer the way an MLP is, and the best approach to integrating it with a transformer still needs to be figured out. But I'm hopeful.
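Here's the best/worst-case arithmetic above written out, under the same assumptions (a 10B KAN matching a 200B MLP, ~10x cost per KAN parameter):

```python
# Rough estimate of the best/worst cases sketched above. Batch-size-1 local
# inference is usually limited by reading the weights, so the bandwidth-bound
# speedup tracks the model-size ratio; the compute-bound speedup tracks
# parameters x per-parameter cost.
kan_params, mlp_params = 10e9, 200e9
kan_cost_per_param = 10.0  # relative to an MLP parameter

flop_bound_speedup = mlp_params / (kan_params * kan_cost_per_param)
bandwidth_bound_speedup = mlp_params / kan_params   # weights read per token

print(f"compute-bound:   ~{flop_bound_speedup:.0f}x faster")    # ~2x  (worst case)
print(f"bandwidth-bound: ~{bandwidth_bound_speedup:.0f}x faster")  # ~20x (best case)
```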
4
42
u/koflerdavid 21h ago
This was not an easy read, but it was worth it! Well, it probably can't be used to convert over existing MLPs. But:
KANs probably won't be able to take advantage of GPUs as much as MLPs do, because current hardware is mostly optimized for matmuls; still, together with the smaller model sizes, KANs sound promising.