
[QST] Split-K: Reduce in Shared Memory instead of Global Memory #1421

@HanGuo97 (Author)

Description

I have a (somewhat naive/newbie) question about the Split-K implementation.

  1. In CUTLASS, the split-K kernel splits the K dimension so that multiple thread blocks compute, in parallel, the partial results of one column tile. These partial results are written to global memory and then reduced by a separately launched reduction kernel. Hence, this implementation introduces a round trip of the partial results between off-chip and on-chip memory, as well as the overhead of one extra kernel launch.

  2. Here is a potential alternative implementation, in which we split the K dimension so that multiple threads compute, in parallel, the partial results of one column tile. These partial results are written to and reduced through shared memory before the final result is written back to global memory.

Compared to method (1), method (2) saves the round trip between off-chip and on-chip memory as well as the separate kernel launch. Is there any reason why (1) is (overwhelmingly?) preferred over (2)?
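To make the difference concrete, here is a small host-side sketch of the two dataflows using NumPy, with threads of execution elided. The function names (`split_k_two_kernel`, `split_k_fused`) are illustrative only, not CUTLASS APIs; arrays stand in for GPU memory.

```python
# Hypothetical sketch of the two split-K dataflows; NumPy arrays stand
# in for GPU memory, and each loop body stands in for one CTA's work.
import numpy as np

def split_k_two_kernel(A, B, splits):
    """Method (1): each "CTA" writes its partial tile to a global-memory
    workspace; a second "kernel" then reduces the workspace."""
    M, K = A.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    # "Kernel" 1: partial GEMMs land in a global-memory workspace.
    workspace = np.stack([A[:, s:e] @ B[s:e, :]
                          for s, e in zip(bounds[:-1], bounds[1:])])
    # "Kernel" 2: a separate reduction pass over the workspace.
    return workspace.sum(axis=0)

def split_k_fused(A, B, splits):
    """Method (2): partials are accumulated inside the same "kernel"
    (standing in for a shared-memory reduction), so only the final
    tile touches "global memory"."""
    M, K = A.shape
    _, N = B.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    C = np.zeros((M, N))
    for s, e in zip(bounds[:-1], bounds[1:]):
        C += A[:, s:e] @ B[s:e, :]  # reduce on-chip, write C once
    return C
```

Both produce the same result as an unsplit GEMM; the difference is purely in how many times the partial tiles cross the "global memory" boundary.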

This is (loosely) a follow-up question to #1391, which is more like method (2).

Thanks in advance for your help!

Activity

thakkarV (Collaborator) commented on Mar 25, 2024

There's a third option that CUTLASS implements as well: a semaphore-based serial reduction across different CTAs, which doesn't require a separate reduction kernel or a round trip to gmem.
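As a loose host-side illustration of this third option (my own sketch, with Python threads standing in for CTAs and a condition variable standing in for the semaphore): each split computes its partial tile independently, then takes its turn accumulating directly into the output in "global memory", so no workspace or second kernel is needed.

```python
# Hedged sketch of a semaphore-based serial split-K reduction.
# Threads stand in for CTAs; `turn` + the condition variable stand in
# for the gmem semaphore that serializes the epilogues.
import threading
import numpy as np

def split_k_serial(A, B, splits):
    M, K = A.shape
    _, N = B.shape
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    C = np.zeros((M, N))        # output tile in "global memory"
    turn = [0]                  # semaphore state: which split goes next
    cv = threading.Condition()

    def cta(k, s, e):
        partial = A[:, s:e] @ B[s:e, :]   # mainloop: compute partial
        with cv:                          # epilogue: serialized reduce
            cv.wait_for(lambda: turn[0] == k)
            C[:, :] += partial            # accumulate into the output
            turn[0] += 1                  # release the next split
            cv.notify_all()

    threads = [threading.Thread(target=cta, args=(k, s, e))
               for k, (s, e) in enumerate(zip(bounds[:-1], bounds[1:]))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C
```

The key property is that the mainloops still run in parallel; only the short accumulate-into-C step is serialized by the semaphore.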

Option 1 is easy to implement and can have great perf depending on the problem size.

Option 2 is more difficult to implement and reduces the arithmetic intensity of the CTA-level GEMM, which is usually highly tuned for a given tile size. It can be done, but the benefits are limited to fewer scenarios.
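One back-of-envelope way to see the arithmetic-intensity point (my own model with assumed numbers, not from the thread): for a fixed CTA tile, the flops are unchanged, but reducing partials on-chip adds extra accumulator traffic (staging and re-reading the partial C tiles), so flops per byte moved drops.

```python
# Rough model, all numbers assumed: a 128x128 CTA tile over K=4096,
# fp16 operands, fp32 accumulators. Not a CUTLASS calculation.
def cta_flops(mt, nt, kt):
    return 2 * mt * nt * kt              # each MAC counted as 2 flops

def cta_operand_bytes(mt, nt, kt, elt=2):
    return (mt * kt + nt * kt) * elt     # A-tile + B-tile traffic

mt, nt, kt = 128, 128, 4096
full_ai = cta_flops(mt, nt, kt) / cta_operand_bytes(mt, nt, kt)

# In-CTA split-K with `splits` partials: each partial beyond the first
# costs one extra write + read of the fp32 accumulator tile on-chip.
splits = 4
extra = 2 * (splits - 1) * mt * nt * 4
split_ai = cta_flops(mt, nt, kt) / (cta_operand_bytes(mt, nt, kt) + extra)
```

Under these assumptions the split version always comes out with lower arithmetic intensity than the unsplit tile, which matches the comment above.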

HanGuo97 (Author) commented on Mar 25, 2024

Thanks for the very informative response!

For the third option, do you mind sharing a pointer to its code?

thakkarV (Collaborator) commented on Mar 26, 2024

```
// If performing a reduction via split-K, fetch the initial synchronization
```

HanGuo97 (Author) commented on Mar 26, 2024

Thanks for the pointer!

In my case, we are mostly implementing things ourselves. Do you have any thoughts on the performance and implementability of the third option versus the first two (especially the first)?

thakkarV (Collaborator) commented on Mar 26, 2024

Implementability-wise, it's between the two in terms of difficulty and complexity.

Performance-wise, it depends (on arch, problem size, your kernel schedule, pipelining strategy, fusions, etc.).

HanGuo97 (Author) commented on Mar 26, 2024

Understood, thanks again for the super helpful answers!
