Cell Simulator

Unfortunately, in the hello world example, each SPU does not have much information to compute on. This tutorial covers the first way of passing information between the PPU and the SPUs. Note: this is far from the most efficient way to do it!

When creating an SPU thread, we should actually send it some information so that it knows what to compute. In this example we will run a simple parallel summation program. Basically, we want to know the value of 1 + 2 + 3 + ... + (n-1) + n. If we let n = 8000 and use 8 SPUs, then each SPU can do 1000 integer additions and then return. Using Gauss's summation formula, n(n+1)/2, we can quickly check the result for correctness.

Now, each thread needs to know which summation range it is responsible for. Luckily, the spe_create_thread function allows us to pass in arguments. With these arguments, the SPU can compute its partial sum and then return it.
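
To make the split concrete, here is a minimal host-side sketch in plain C (no Cell SDK needed). The names partial_sum, gauss_sum and total_sum are made up for illustration. It assumes worker id covers id*chunk + 1 through id*chunk + chunk, so that 8 workers with a chunk of 1000 together cover 1 through 8000.

```c
/* Hypothetical helper: the partial sum one worker is responsible for,
   i.e. (id*chunk + 1) + ... + (id*chunk + chunk). */
long partial_sum(int id, int chunk)
{
    long sum = 0;
    int i;
    for (i = id * chunk + 1; i <= id * chunk + chunk; i++) {
        sum += i;
    }
    return sum;
}

/* Gauss's formula for 1 + 2 + ... + n. */
long gauss_sum(long n)
{
    return n * (n + 1) / 2;
}

/* Adds up the partial sums of all workers, one after another. */
long total_sum(int workers, int chunk)
{
    long total = 0;
    int id;
    for (id = 0; id < workers; id++) {
        total += partial_sum(id, chunk);
    }
    return total;
}
```

Calling total_sum(8, 1000) and comparing it with gauss_sum(8000) is the same correctness check the parallel version needs to pass.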

Here are the directions to complete the task; again, the code is available at the end.
  1. First copy over the helloworld application to a new directory, 'sum'. (Hello world is a good template.)
  2. Now, change the names. Change 'sum/hello.h' to 'sum/sum.h'. Change 'sum/spu/hello_spu.c' to 'sum/spu/sum_spu.c'. Now edit the makefiles in 'sum/ppu' and in 'sum/spu' to compile the correct files.
  3. Next, change 'ppu/sum.c'. First, include "../sum.h" so we have the 'addr64' struct. (This struct is used to pass information to the SPU.) Create a variable in main (addr64 param;) and, just before the spe_create_thread function is called, initialize it with the SPU's id. So, just above the spe_create_thread call, add param.ui[1] = i;.
  4. Change speids[i] = spe_create_thread (gid, &sum_spu, NULL, NULL, -1, 0); to speids[i] = spe_create_thread (gid, &sum_spu, (void*)param.ui[1], NULL, -1, 0);
  5. Delete the printf("Hello World!\n"); since it isn't needed.
Now change the SPU's code to reflect the changes. Since the SPU's code is so simple (right now), I'll just give you the new full code:
#include "../sum.h"
#include <stdio.h>

int main(unsigned long long speid, addr64 argp, addr64 envp) {
    int i;
    int sum;
    sum = 0;

    for(i = argp.ui[1]*1000 + 1; i <= argp.ui[1]*1000 + 1000; i++){
        sum += i;
    }

    printf("Sum = %d\n", sum);

    return 0;
}

If you run the code, each SPU should print out its sum. Notice that we passed the id through the addr64 argp variable. Addresses on the PPU are 64-bit (hence the 'addr64' struct). Keep this in mind when computing System Memory addresses. Now, more on the memory of each SPU.
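
Since sum.h itself is not shown here, the following is only a guess at what addr64 looks like: a union of one 64-bit value and two 32-bit words, as in many Cell sample programs. The helpers make_param and low_word are hypothetical. On the big-endian Cell, ui[1] is the low 32 bits of the 64-bit value, which is why a small number such as the SPU id ends up there.

```c
/* Assumed layout of addr64 (an assumption: sum.h is not shown). */
typedef union {
    unsigned long long ull;  /* the full 64-bit value */
    unsigned int ui[2];      /* the same bits as two 32-bit words */
} addr64;

/* Hypothetical helper: wrap a small value such as the SPU id. */
addr64 make_param(unsigned int id)
{
    addr64 p;
    p.ull = id;
    return p;
}

/* Low 32 bits, independent of the host byte order. On the
   big-endian Cell this is the same word as ui[1]. */
unsigned int low_word(addr64 a)
{
    return (unsigned int)(a.ull & 0xffffffffULL);
}
```

On a little-endian development machine the low word would be ui[0] instead, so the mask in low_word is the safer thing to use when trying this out away from the Cell.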

First off, each SPU has its own local memory (256K as of writing). This memory is actually mapped into a large "System Memory" that all SPUs and the PPU(s) are hooked up to. For illustration, here is an image taken from IBM:


(Image is © IBM 2006)

This is a good picture of the entire memory structure. Don't worry if it doesn't all make sense yet; what you need to understand is that "System Memory" holds all the "Local Store" memories. If you do any static or dynamic allocation (like malloc) from a program running on the SPU, it is allocated in the "Local Store" memory. What the program will do next is copy data from Local Store to System Memory so that the PPU can access it. This is done through a Direct Memory Access (DMA) unit called the Memory Flow Controller (MFC).

Before an SPU can use the MFC, it must know which System Memory Address (SMA) to read from or write to. There are two main restrictions on the address passed to the SPU. First, address % 128 must equal 0 (i.e., the last 7 bits are 0), or a bus error will occur because of cache line placement. Second, since CELL uses 128-byte cache lines, you need to pad the data so that it fills entire lines. With malloc, the program has to take care of the alignment and the padding itself. Here is some (inefficient) code for allocating an array of 8 integers.

    int *sums;
After the check for spe_group_max...
    sums = (int*)malloc(128 + sizeof(int)*8 + 128);
    while((unsigned long)sums % 128 != 0){sums++;}

This isn't the most efficient way of allocating the space, but it is easy to understand. Inside the malloc, the first 128 ensures that the block contains a starting address that is a multiple of 128. The sizeof(int)*8 is for the 8 integers of data. The last 128 ensures that there is padding out to the end of the cache line. Naturally, at the end of the program you should free this memory (free needs the pointer that malloc originally returned, so save a copy before adjusting sums). Also change the create function to the following:

    speids[i] = spe_create_thread (gid, &sum_spu,
        (void*)param.ui[1], (void*)sums,
        -1, 0);
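
The over-allocate-and-round-up idea from the malloc snippet can also be written as a small helper. This is only a sketch: alloc_line_aligned and CACHE_LINE are hypothetical names, not part of the Cell SDK, and it assumes uintptr_t is available from stdint.h. It rounds the address up with a mask instead of a loop, and it hands back the raw malloc pointer so that it can be freed later.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 128

/* Hypothetical helper: returns a 128-byte aligned region of at least
   `bytes` bytes, padded out to a whole number of cache lines.
   *raw receives the pointer that must later be passed to free(). */
void *alloc_line_aligned(size_t bytes, void **raw)
{
    /* Round the size up to a multiple of the cache line size. */
    size_t padded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    uintptr_t addr;

    /* Up to 127 extra bytes may be skipped to reach an aligned address. */
    *raw = malloc(padded + CACHE_LINE - 1);
    if (*raw == NULL) {
        return NULL;
    }

    /* Round the address up to the next multiple of 128. */
    addr = ((uintptr_t)*raw + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (void *)addr;
}
```

A call such as sums = alloc_line_aligned(sizeof(int)*8, &raw); could stand in for the malloc and while lines, with free(raw) at the end of the program.
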
At the end of the program, after the memory sync, check to see if the sums are correct:
    sum = 0;
    for(i = 0; i < 8; i++){
        sum += sums[i];
    }

    printf("%d =? %d\n", sum ,((8000)*(8001)/2));
    if(sum == ((8000)*(8001)/2)){
        printf("Sums were correct.\n");
    }else{
        printf("Sums were incorrect.\n");
        for(i = 0; i < 8; i++){
            printf("sum[%d] = %d\n", i, sums[i]);
        }
    }

Now, each SPU must write its sum back to the PPU. Before that, the SPU will fetch the sums array and then compute. Finally, it will write its results. Before I give the code, put int sums[32] __attribute__ ((aligned (128))); in global scope (just above main). This variable is where we will store data retrieved from System Memory; 32 ints are exactly 128 bytes, one full cache line. The __attribute__ ((aligned (128))) tells the compiler to make sure that this static allocation is 128-byte aligned (i.e., the address of sums % 128 == 0). The mfc_* calls are declared in spu_mfcio.h, so include that header as well. So, here is the code for reading from and writing to System Memory.

    mfc_get(sums, envp.ui[1], sizeof(sums), 31, 0, 0);
    mfc_write_tag_mask(1<<31);
    mfc_read_tag_status_all();

    sum = 0;
    for(i = argp.ui[1]*1000+1; i <= argp.ui[1]*1000 + 1000; i++){
        sum += i;
    }

    sums[argp.ui[1]] = sum;
    mfc_put(sums, envp.ui[1], sizeof(sums), 20, 0, 0);

Let's skip the first three mfc function calls for a moment and start with mfc_put. Its first argument is the Local Store buffer that will be written to System Memory. The second argument is the address in System Memory to write to. The third is how much data to write, in bytes. The fourth is the tag of the transfer. The last two are the transfer class and replacement class ids, which influence bus and cache behavior; 0 is fine here.

Now for mfc_get. Similar to mfc_put, the first argument is the Local Store buffer to write the data to. The second argument is the System Memory address to read from, the third is how many bytes to read, and the fourth is the tag. The last two are, again, the transfer class and replacement class ids that tune bus and cache behavior.

The next two function calls are important. mfc_write_tag_mask tells the MFC to "Set tag mask to select tag groups to be included in query operation." Here the mask 1<<31 selects tag 31, the tag used by the mfc_get. mfc_read_tag_status_all() then blocks until every transfer in the selected tag groups has completed, so the local sums array is guaranteed to hold the data from System Memory before the SPU touches it. Be aware that mfc_put writes the whole 128-byte sums array back, not just this SPU's slot. If two SPUs interleave their mfc_get and mfc_put, one can overwrite the other's result with a stale copy, and the tag calls do not prevent that. The example works as long as the SPUs do not overlap in this way.
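
The relation between a tag and its mask bit can be sketched in plain C. The helper tag_mask is a hypothetical name, not something from spu_mfcio.h. It relies on MFC tag ids running from 0 to 31, with each id owning one bit of the 32-bit mask.

```c
/* Hypothetical helper: the mask bit that selects one MFC tag group.
   Tag ids run from 0 to 31, one bit each in the 32-bit tag mask. */
unsigned int tag_mask(unsigned int tag)
{
    return 1u << (tag & 31u);
}
```

The mfc_get above uses tag 31, which matches the 1<<31 passed to mfc_write_tag_mask. The mfc_put uses tag 20, so waiting for it to finish would take a mask with bit 20 set, for example tag_mask(20), followed by another mfc_read_tag_status_all().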

So, now if you run the program you should get:
[root@(none) ~]# callthru source /tmp/sum > sum && chmod +x sum && ./sum
32004000 =? 32004000
Sums were correct.
[root@(none) ~]#
Again, here is the code: sum