
[QST] Split-K: Reduce in Shared Memory instead of Global Memory #1421

@HanGuo97 (Author)

Description

I have a (somewhat naive/newbie) question about the Split-K implementation.

  1. In CUTLASS, the split-K kernel splits the K dimension so that multiple thread blocks compute, in parallel, the partial results of one column tile. These partial results are written to global memory and then reduced by a separately launched reduction kernel. Hence, this implementation introduces a round trip of the partial results between off-chip and on-chip memory, as well as the overhead of one extra kernel launch.

  2. Here is a potential alternative implementation, in which we split the K dimension so that multiple threads compute, in parallel, the partial results of one column tile. These partial results are written to and reduced through shared memory before the final result is written back to global memory.

Compared to method (1), method (2) saves the round trip between off-chip and on-chip memory as well as the separate kernel launch. Is there any reason why (1) is (overwhelmingly?) preferred over (2)?
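To make the difference concrete, here is a small host-side sketch of the two dataflows using NumPy, with threads of execution elided. The function names (`split_k_two_kernel`, `split_k_fused`) are illustrative only, not CUTLASS APIs; arrays stand in for GPU memory.

```python
# Hypothetical sketch of the two split-K dataflows; NumPy arrays stand
# in for GPU memory, and each loop body stands in for one CTA's work.
import numpy as np

def split_k_two_kernel(A, B, splits):
    """Method (1): each "CTA" writes its partial tile to a global-memory
    workspace; a second "kernel" then reduces the workspace."""
    M, K = A.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    # "Kernel" 1: partial GEMMs land in a global-memory workspace.
    workspace = np.stack([A[:, s:e] @ B[s:e, :]
                          for s, e in zip(bounds[:-1], bounds[1:])])
    # "Kernel" 2: a separate reduction pass over the workspace.
    return workspace.sum(axis=0)

def split_k_fused(A, B, splits):
    """Method (2): partials are accumulated inside the same "kernel"
    (standing in for a shared-memory reduction), so only the final
    tile touches "global memory"."""
    M, K = A.shape
    _, N = B.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    C = np.zeros((M, N))
    for s, e in zip(bounds[:-1], bounds[1:]):
        C += A[:, s:e] @ B[s:e, :]  # reduce on-chip, write C once
    return C
```

Both produce the same result as an unsplit GEMM; the difference is purely in how many times the partial tiles cross the "global memory" boundary.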

This is (loosely) a follow-up question to #1391, which is more like method (2).

Thanks in advance for your help!

Activity

thakkarV (Collaborator) commented on Mar 25, 2024

There's a third option that CUTLASS implements as well: a semaphore-based serial reduction across different CTAs, which doesn't require a separate reduction kernel or a round trip to gmem.
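As a loose host-side illustration of this third option (my own sketch, with Python threads standing in for CTAs and a condition variable standing in for the semaphore): each split computes its partial tile independently, then takes its turn accumulating directly into the output in "global memory", so no workspace or second kernel is needed.

```python
# Hedged sketch of a semaphore-based serial split-K reduction.
# Threads stand in for CTAs; `turn` + the condition variable stand in
# for the gmem semaphore that serializes the epilogues.
import threading
import numpy as np

def split_k_serial(A, B, splits):
    M, K = A.shape
    _, N = B.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    C = np.zeros((M, N))        # output tile in "global memory"
    turn = [0]                  # semaphore state: which split goes next
    cv = threading.Condition()

    def cta(k, s, e):
        partial = A[:, s:e] @ B[s:e, :]   # mainloop: compute partial
        with cv:                          # epilogue: serialized reduce
            cv.wait_for(lambda: turn[0] == k)
            C[:, :] += partial            # accumulate into the output
            turn[0] += 1                  # release the next split
            cv.notify_all()

    threads = [threading.Thread(target=cta, args=(k, s, e))
               for k, (s, e) in enumerate(zip(bounds[:-1], bounds[1:]))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```

The key property is that the mainloops still run in parallel; only the short accumulate-into-C step is serialized by the semaphore.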

Option 1 is easy to implement and can have great perf depending on the problem size.

Option 2 is more difficult to implement and reduces the arithmetic intensity of the CTA-level GEMM, which is usually highly tuned for a given tile size. It can be done, but the benefits are limited to fewer scenarios.
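One back-of-envelope way to see the arithmetic-intensity point (my own model with assumed numbers, not from the thread): for a fixed CTA tile, the flops are unchanged, but reducing partials on-chip adds extra accumulator traffic (staging and re-reading the partial C tiles), so flops per byte moved drops.

```python
# Rough model, all numbers assumed: a 128x128 CTA tile over K=4096,
# fp16 operands, fp32 accumulators. Not a CUTLASS calculation.
def cta_flops(mt, nt, kt):
    return 2 * mt * nt * kt              # each MAC counted as 2 flops

def cta_operand_bytes(mt, nt, kt, elt=2):
    return (mt * kt + nt * kt) * elt     # A-tile + B-tile traffic

mt, nt, kt = 128, 128, 4096
full_ai = cta_flops(mt, nt, kt) / cta_operand_bytes(mt, nt, kt)

# In-CTA split-K with `splits` partials: each partial beyond the first
# costs one extra write + read of the fp32 accumulator tile on-chip.
splits = 4
extra = 2 * (splits - 1) * mt * nt * 4
split_ai = cta_flops(mt, nt, kt) / (cta_operand_bytes(mt, nt, kt) + extra)
```

Under these assumptions the split version always comes out with lower arithmetic intensity than the unsplit tile, which matches the comment above.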

HanGuo97 (Author) commented on Mar 25, 2024

Thanks for the very informative response!

For the third option, do you mind sharing a pointer to its code?

thakkarV (Collaborator) commented on Mar 26, 2024

```
// If performing a reduction via split-K, fetch the initial synchronization
```

HanGuo97 (Author) commented on Mar 26, 2024

Thanks for the pointer!

In my case, we are mostly implementing things ourselves. Do you have any thoughts on the performance and implementability of the third option versus the first two (especially the first)?

thakkarV (Collaborator) commented on Mar 26, 2024

Implementability-wise, it's between the two in terms of difficulty and complexity.

Performance-wise, it depends (on arch, problem size, your kernel schedule, pipelining strategy, fusions, etc.).

HanGuo97 (Author) commented on Mar 26, 2024

Understood, thanks again for the super helpful answers!
