
Conversation

@Mogball Mogball commented Jun 24, 2025

Rewrite the attention kernel to be persistent. This gives better performance at low context lengths. However, fp16 at large context has regressed a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 TFLOPS faster when the kernel name has "cutlass" in it.

Attention Z=4 H=32 D=64 causal=False:
     N_CTX  triton-fp16  triton-fp8
0   1024.0   359.574448  370.119987
1   2048.0   612.103928  641.204555
2   4096.0   653.868402  682.337948
3   8192.0   692.102228  721.555690
4  16384.0   696.972041  726.190035
5  32768.0   698.723685  727.983456
6  65536.0   699.865817  728.558321
Attention Z=4 H=32 D=64 causal=True:
     N_CTX  triton-fp16  triton-fp8
0   1024.0   181.879039  177.982453
1   2048.0   441.315463  454.310072
2   4096.0   532.170527  539.995252
3   8192.0   633.620646  638.544937
4  16384.0   667.687180  670.681255
5  32768.0   684.276329  688.571907
6  65536.0   692.953202  694.648353
Attention Z=4 H=32 D=128 causal=False:
     N_CTX  triton-fp16   triton-fp8
0   1024.0   718.580015   709.863720
1   2048.0  1133.490258  1222.548477
2   4096.0  1247.605551  1369.800195
3   8192.0  1243.482713  1406.799697
4  16384.0  1125.744367  1514.857403
5  32768.0  1124.116305  1521.267973
6  65536.0  1064.588719  1518.738037
Attention Z=4 H=32 D=128 causal=True:
     N_CTX  triton-fp16   triton-fp8
0   1024.0   355.642522   351.161232
1   2048.0   846.404095   854.547917
2   4096.0  1013.840017  1021.676435
3   8192.0  1176.258395  1152.844234
4  16384.0  1190.290681  1325.786204
5  32768.0  1063.658200  1394.413325
6  65536.0   970.531569  1413.282610
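
For readers unfamiliar with the term: a persistent kernel launches a fixed grid of programs (roughly one per SM) and has each program loop over many output tiles, instead of launching one program per tile. Below is a minimal sketch of that pattern in plain Triton on a hypothetical element-wise kernel; it is illustrative only and is not the Gluon attention kernel from this PR.

import triton
import triton.language as tl

@triton.jit
def persistent_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # A fixed grid is launched; each program strides over many tiles.
    pid = tl.program_id(0)
    num_programs = tl.num_programs(0)
    num_tiles = tl.cdiv(n_elements, BLOCK)
    for tile in range(pid, num_tiles, num_programs):
        offs = tile * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

Launched with a grid sized to the number of SMs (e.g. grid = (NUM_SMS,)), the loop amortizes launch and prologue overhead across tiles, which is presumably where the low-context gains come from.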

@ThomasRaoux ThomasRaoux left a comment

wow!

@Mogball Mogball commented Jul 9, 2025

For posterity, these are the best results prior to converting the kernel to persistent:

Attention Z=4 H=32 D=64 causal=False:
     N_CTX  triton-fp16  triton-fp8  cudnn-fp16
0   1024.0   382.890516  412.882937  564.837281
1   2048.0   564.572331  613.796259  802.967790
2   4096.0   651.779895  711.258057  890.712337
3   8192.0   718.153704  786.906882  940.327292
4  16384.0   746.519007  815.458990  944.850564
5  32768.0   758.416978  830.055643  939.287109
6  65536.0   766.176045  837.979637  925.739296
Attention Z=4 H=32 D=64 causal=True:
     N_CTX  triton-fp16  triton-fp8  cudnn-fp16
0   1024.0   181.758722  190.690577  381.906099
1   2048.0   313.497481  326.500187  626.967949
2   4096.0   463.538482  483.472677  777.606926
3   8192.0   586.226812  618.900682  805.998776
4  16384.0   683.741305  708.737060  853.282336
5  32768.0   734.844555  762.845981  912.526865
6  65536.0   767.292419  793.280126  924.780010
Attention Z=4 H=32 D=128 causal=False:
     N_CTX  triton-fp16   triton-fp8   cudnn-fp16
0   1024.0   655.417393   730.561798   926.859512
1   2048.0   970.734621  1057.033298  1267.867719
2   4096.0  1118.226666  1210.507191  1428.037959
3   8192.0  1182.149430  1332.290127  1488.733746
4  16384.0  1227.000687  1372.951364  1358.394870
5  32768.0  1254.096611  1409.254506  1314.970965
6  65536.0  1231.680630  1426.040751  1313.822094
Attention Z=4 H=32 D=128 causal=True:
     N_CTX  triton-fp16   triton-fp8   cudnn-fp16
0   1024.0   312.399981   345.242273   553.117042
1   2048.0   534.902248   590.947330   877.822759
2   4096.0   782.786229   871.178240  1122.610667
3   8192.0   961.045037  1105.459197  1319.639575
4  16384.0  1114.273933  1256.257370  1317.439900
5  32768.0  1192.714280  1341.112079  1275.191200
6  65536.0  1195.453344  1386.801400  1310.388518

@Jokeren Jokeren commented Jul 9, 2025

Rewrite the attention kernel to be persistent. This gives better performance at low context lengths. However, fp16 at large context has regressed a bit due to a ptxas instruction scheduling issue in the softmax partition. fp8 is ~100 TFLOPS faster when the kernel name has "cutlass" in it.


I don't see a "cutlass" in the kernel names?

@Mogball Mogball commented Jul 9, 2025

def attention_repr(specialization):
    name = "gluon_attention"
    # Up to 150 TFLOPS faster for fp8!
    if specialization.constants["dtype"] == gl.float8e5:
        name = "cutlass_" + name
    return name
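
For context, a hedged sketch of how such a naming hook might be attached; the decorator wiring and the placeholder kernel below are assumptions for illustration, not code from this PR.

import triton
import triton.language as tl

def my_repr(specialization):
    # Hypothetical hook: rename the kernel so that ptxas sees "cutlass_" in its name.
    return "cutlass_my_kernel"

# Assumption: the jit decorator accepts a `repr` callback that receives the kernel
# specialization and returns the name emitted into the generated PTX.
@triton.jit(repr=my_repr)
def my_kernel(x_ptr, BLOCK: tl.constexpr):
    pass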

@Jokeren Jokeren commented Jul 9, 2025

Before:

Attention Z=4 H=32 D=64 causal=False:
     N_CTX  triton-fp16  triton-fp8  cudnn-fp16
0   1024.0   382.890516  412.882937  564.837281
1   2048.0   564.572331  613.796259  802.967790
2   4096.0   651.779895  711.258057  890.712337
3   8192.0   718.153704  786.906882  940.327292
4  16384.0   746.519007  815.458990  944.850564
5  32768.0   758.416978  830.055643  939.287109

After:

Attention Z=4 H=32 D=64 causal=False:
     N_CTX  triton-fp16  triton-fp8
0   1024.0   359.574448  370.119987
1   2048.0   612.103928  641.204555
2   4096.0   653.868402  682.337948
3   8192.0   692.102228  721.555690
4  16384.0   696.972041  726.190035
5  32768.0   698.723685  727.983456
6  65536.0   699.865817  728.558321

I'm not sure if I'm interpreting it incorrectly, but it seems like perf dropped based on these numbers?

@peterbell10 peterbell10 left a comment

Great stuff. A couple of small nits, though.

_, corr_bar, corr_producer = corr_producer.acquire()

p = gl.join(p0, p1).permute(0, 2, 1).reshape([config.SPLIT_M, config.BLOCK_N])
p = gl.convert_layout(p, config.qk_layout)

Contributor

This shouldn't be needed any more after I introduced the slice layout for split, right?

Collaborator Author

The convert layout coming out of the split is no longer needed, but

ValueError('Layout mismatch in broadcast: 

SliceLayout(dim=1, parent=BlockedLayout(size_per_thread=[1, 128], threads_per_warp=[32, 1], warps_per_cta=[4, 1], order=[0, 1], ctas_per_cga=[1, 1], cta_split_num=[1, 1], cta_order=[1, 0])) 
vs 
SliceLayout(dim=1, parent=DistributedLinearLayout(reg_bases=[[0, 64], [0, 1], [0, 2], [0, 4], [0, 8], [0, 16], [0, 32]], lane_bases=[[1, 0], [2, 0], [4, 0], [8, 0], [16, 0]], warp_bases=[[32, 0], [64, 0]], block_bases=[], shape=[128, 128]))')

It seems that p ends up with a linear layout instead of a blocked layout. I am not sure why though -- I believe the layout inference should try a blocked layout first before falling back to linear layout.

name = "gluon_attention"
# Up to 150 TFLOPS faster for fp8!
if specialization.constants["dtype"] == gl.float8e5:
name = "cutlass_" + name

Contributor

very cool... did you check if other names change the scheduling (e.g. because of non-determinism or code alignment), or if it's literally just special-cased for cutlass?

Collaborator Author

it's literally just special cased for cutlass.

Yup


wow! You literally beat the nvcc team!


@AlexMaclean Just an FYI, in case you can prod the right folks on your side. There must be a better way to enable this optimization. A PTX directive, perhaps, if ptxas can't figure out the right thing by itself?


@Mogball have you checked the accuracy, is it the same? The DeepSeek technical report mentioned that fp8 tensor cores use reduced mantissa for the accumulator; maybe this is what's indirectly enabled/disabled by the name of the kernel.

Collaborator

The DeepSeek technical report mentioned that fp8 tensor cores use reduced mantissa for the accumulator; maybe this is what's indirectly enabled/disabled by the name of the kernel.

That's only on Hopper

@yhx-12243 yhx-12243 Jul 12, 2025

By disassembling ptxas, it is indeed hard-coded: they have logic like strstr(kernel_name, "cutlass").


Contributor

By disassembling ptxas, it is indeed hard-coded: they have logic like strstr(kernel_name, "cutlass").

That's interesting! I'm curious whether it's feasible to patch the ptxas binary so that the al return register is always true (maybe by modifying the code between addresses 2165-216c). Did you give it a try?


Admittedly it is feasible. But it is more likely that this is an unstable, experimental, aggressive optimization by NVIDIA, and blindly enabling it everywhere may produce some elusive bugs.

@Mogball Mogball commented Jul 9, 2025

I'm not sure if I'm interpreting it incorrectly, but it seems like perf dropped based on these numbers?

For D64 it did drop quite a bit during the transition to persistent. This is due to a scheduling issue in ptxas that I couldn't find a workaround for.
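
Quantifying that from the D=64, causal=False tables quoted above (fp16 at N_CTX=32768, numbers copied verbatim from the benchmark output):

before, after = 758.416978, 698.723685  # TFLOPS before vs. after the persistent rewrite
print(f"{(after - before) / before:+.1%}")  # about -7.9% for fp16 at this context length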

Mogball added 2 commits July 9, 2025 09:58
@Mogball Mogball enabled auto-merge (squash) July 9, 2025 18:19
@Mogball Mogball merged commit ade3d49 into main Jul 9, 2025
9 checks passed
@Mogball Mogball deleted the mogball/persistent branch July 9, 2025 18:24
@hyz0906 hyz0906 mentioned this pull request Jul 11, 2025