Cell Simulator

NOT DONE YET

Two other methods of message passing are at your disposal. Mailboxes provide a simple blocking mechanism between the PPU and SPU. Signals are more efficient than mailboxes and can be sent from one SPU to another. (Mailboxes only work PPU <=> SPU.) The downsides to signals are the overhead of the initial setup and that I have only been able to get them to work in 32-bit mode (instead of the PPU's native 64-bit).

Mailboxes

Each SPU has 1 outbound mailbox and an inbound mailbox that holds 4 entries. This means the PPU can send 4 "mails" to the SPU before the SPU reads any (each read by the SPU removes one entry). If the inbound mailbox is full and the PPU writes to it, some undefined action happens. The SPU writing to a full outbound box causes the same problem. Note that mailbox reads are blocking: program execution doesn't continue until the read is complete. This is very useful for the PPU to tell every SPU that all memory operations are complete and that they (the SPUs) may begin.
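
To make the queueing behaviour concrete, here is a small host-side model of the inbound mailbox in plain C. It is only a sketch and not Cell SDK code: the names mbox_model, mbox_free, mbox_try_write and mbox_try_read are made up for illustration, and the depth of 4 is taken from the description above. Unlike the real SPU read, which blocks, the model's read simply reports that the box is empty.

```c
#include <assert.h>
#include <stddef.h>

#define IN_MBOX_DEPTH 4  /* inbound entries, as described in the text */

/* Hypothetical model of one SPU's inbound mailbox (a small FIFO). */
typedef struct {
    unsigned int slot[IN_MBOX_DEPTH];
    size_t head;   /* index of the oldest unread entry */
    size_t count;  /* number of entries currently queued */
} mbox_model;

/* Free entries left; a careful writer checks this before writing. */
size_t mbox_free(const mbox_model *m) {
    assert(m != NULL);
    return IN_MBOX_DEPTH - m->count;
}

/* Queue one message. Returns 0 on success, -1 if the mailbox is full. */
int mbox_try_write(mbox_model *m, unsigned int msg) {
    assert(m != NULL);
    if (m->count == IN_MBOX_DEPTH)
        return -1;
    m->slot[(m->head + m->count) % IN_MBOX_DEPTH] = msg;
    m->count++;
    return 0;
}

/* Dequeue the oldest message. Returns 0 on success, -1 if empty. */
int mbox_try_read(mbox_model *m, unsigned int *msg) {
    assert(m != NULL && msg != NULL);
    if (m->count == 0)
        return -1;
    *msg = m->slot[m->head];
    m->head = (m->head + 1) % IN_MBOX_DEPTH;
    m->count--;
    return 0;
}
```

In this model a fifth write without an intervening read is rejected instead of doing something undefined, which is the behaviour a writer would want to enforce by checking for free space first.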

Nicely, mailboxes are easy to implement, so the code will start with mailboxes and then move on to signaling. In my development area I made a directory 'Signaling' and used 'helloworld' as a template. Sending a message from the PPU to the SPU is easy. After creating the threads, but before waiting on them, call:

    for (i = 0; i < SPU_THREADS; i++){
        if(spe_write_in_mbox(spe_ids[i], i) < 0){
            fprintf(stderr, "Failed writing message to spe %d\n", i);
            exit(1);
        }
    }

In main of the SPU, write:

    int rank = spu_read_in_mbox();
    printf("rank = %d\n", rank);

Again, spu_read_in_mbox() is a blocking call: execution will not continue until the SPU gets the data from its mailbox. If the mailbox is empty, the SPU waits until data arrives. Hence mailboxes are a good syncing mechanism.

So, mailboxes are a simple way to sync. Unfortunately, the communication is slow and can only be done between PPU and SPU. Signaling is faster and allows communication between SPUs.

Signaling

Signaling is a fast method of communication between SPUs (and with the PPU if needed). Unfortunately, there is some setup overhead. Also, I've only been able to get signaling to work in 32-bit binaries. To start, here is the control block (place this in a header file for both the PPU code and the SPU code to include):

#ifndef __signal_h__
#define __signal_h__

#define NUM_THREADS 8

/* This union helps clarify calling parameters between the PPE and the SPE. */
typedef union{
    unsigned long long ull;
    unsigned int ui[2];
}addr64;

typedef struct _control_block{
    unsigned int rank;
    unsigned long speid; 
    void* ls;
    void* sig1;

    unsigned char pad[112];
} control_block;

#endif

'rank' is the id of the SPU, i.e. which one it is (0-7). 'speid' holds that SPU's speid, 'ls' is the address of its local store, and 'sig1' is the location of its signal 1 register. Note that the program will send every SPU every other SPU's information. Anyway, here is the PPU's code, with some comments:

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <libspe.h>
#include "../signal.h"

extern spe_program_handle_t signal_spu;
speid_t spe_ids[NUM_THREADS];
control_block cb[NUM_THREADS] __attribute__ ((aligned (128)));

int main(int argc, char **argv){
    int i;
    int status[NUM_THREADS];
    
    (void)argc;
    (void)argv;
    
    for(i=0; i<NUM_THREADS; i++){
        spe_ids[i] = spe_create_thread(0, &signal_spu, &cb[0], 
                NULL, -1, SPE_MAP_PS);
        if(spe_ids[i] == 0) {
            fprintf(stderr, "Failed spe_create_thread(rc=%p, errno=%d)\n", 
                spe_ids[i], errno);
            exit(1);
        }
    }

    for (i = 0; i < NUM_THREADS; i++){
        cb[i].rank = i;
        cb[i].speid = (unsigned long)spe_ids[i];
    
        if((cb[i].sig1 = spe_get_ps_area(spe_ids[i], SPE_SIG_NOTIFY_1_AREA)) 
                == NULL){
            fprintf(stderr, "Failed call to spe_get_ps_area(%d, ...)\n", i);
            return -1;
        }
    
        if((cb[i].ls = spe_get_ls(spe_ids[i])) == NULL){
            fprintf(stderr, "Failed call to spe_get_ls(%d)\n", i);
            return -1;
        }
    }

    for (i = 0; i < NUM_THREADS; i++){
        if(spe_write_in_mbox(spe_ids[i], cb[i].rank) < 0){
            fprintf(stderr, "Failed writing messages to spe %d\n", i);
            exit(1);
        }
    }
    
    for(i = 0; i < NUM_THREADS; i++){
        (void)spe_wait(spe_ids[i], &status[i], 0);
    }

    printf("\nThe program has completed successfully.\n");

    return (0);
}

The first for loop creates each thread and checks that it started correctly. The second for loop fills in each control block. Note that we don't have the speid of all of the SPUs until the threads are created. spe_get_ps_area returns a pointer to the SPU's memory-mapped problem state area (here the one holding the signal notification 1 register, which is why the threads are created with SPE_MAP_PS), and spe_get_ls returns a pointer to the SPU's local store.

The third loop sends every SPU its rank (or id). Each SPU is waiting for this message and doesn't proceed until it arrives. Notice that every control block is fully set up before that time.
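
One thing worth noting about the control block: its 112 bytes of padding only bring the structure to 128 bytes when 'unsigned long' and pointers are 4 bytes wide. SPU code uses 4-byte pointers, so a 64-bit PPU build would lay the structure out differently from the SPU side. That mismatch may be one reason signaling only worked for me in 32-bit binaries, though I have not confirmed it. The sketch below shows a fixed-width layout that is the same size everywhere. The name control_block_fixed and its fields are hypothetical and are not used by the code in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width variant of control_block: every field has the
   same size and offset in 32-bit and 64-bit builds. Addresses are carried
   as 64-bit integers instead of pointers. */
typedef struct {
    uint32_t rank;     /* which SPU this block describes (0-7) */
    uint32_t pad0;     /* keeps the 64-bit fields 8-byte aligned */
    uint64_t speid;    /* thread id, stored as an integer */
    uint64_t ls;       /* effective address of the local store */
    uint64_t sig1;     /* effective address of the signal 1 area */
    uint8_t  pad[96];  /* pad the block out to 128 bytes */
} control_block_fixed;

/* Sanity check on the layout assumed above. */
void check_control_block_layout(void) {
    assert(sizeof(control_block_fixed) == 128);
    assert(offsetof(control_block_fixed, speid) == 8);
    assert(offsetof(control_block_fixed, ls) == 16);
    assert(offsetof(control_block_fixed, sig1) == 24);
}
```

With a layout like this the PPU would store pointers with a cast such as (uint64_t)(uintptr_t)ptr, and the SPU would use the 64-bit value directly as the effective address.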

Now onto the SPU's code:

#include <stdio.h>
#include <spu_intrinsics.h>
#include <cbe_mfc.h>
#include <spu_mfcio.h>
#include "../signal.h"

control_block cb[NUM_THREADS] __attribute__ ((aligned (128)));

int rank;
int sendsig(int);

int main(unsigned long long spu_id, unsigned long long parm){
    int tag_id = 0;
    (void)spu_id;
    (void)parm;

    spu_writech(MFC_WrTagMask, -1);
    
    rank  = spu_read_in_mbox();
    
    mfc_get(&cb[0], (unsigned long) parm,
            NUM_THREADS * sizeof(control_block), tag_id, 0, 0);
    mfc_read_tag_status_all();

    if(rank < NUM_THREADS - 1){
        sendsig(rank+1);
        spu_read_signal1();
    }else /* rank == NUM_THREADS - 1 */{
        sendsig(0);
        spu_read_signal1();
    }
    
    return 0;
    
}

int sendsig(int s_rank){
    volatile int signal[4] __attribute__ ((aligned (128)));
    unsigned long ea;
    char* ls;
    int tag_id = 1;
    
    if (s_rank >= NUM_THREADS){
        printf("ERROR: rank (%d) greater than NUM_THREADS\n", s_rank);
        return -1;
    }

    if (s_rank == rank){
        printf("ERROR: do not send signal to yourself\n");
        return -1;
    }
    
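    /* The signal register sits at byte offset 12 of the signal area, so the
       source word is placed at offset 12 of an aligned buffer (signal[3]). */
    signal[3] = 2;
    ea =  (unsigned long)cb[s_rank].sig1 + 12;
    ls = ((char*)&signal[0])+ 12;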
    
    mfc_sndsig( ls , ea, tag_id, 0,0);
    mfc_read_tag_status_all();
    
    return 0;
}

Again, notice that each SPU holds data for every other SPU. Before the mailbox read, spu_writech(MFC_WrTagMask, -1) sets the MFC tag mask to all ones, so that the later mfc_read_tag_status_all() calls wait on every tag group. Then spu_read_in_mbox is called. Again, execution doesn't continue until the SPU program receives the mail. Once that is done, the program reads all of the control blocks and then sends a signal to its "next" SPU. (The last one sends a signal to the first one.) After sending, each SPU then receives. Note that the sending doesn't block, but the receiving blocks until some signal is received (similar to mailboxes). The sendsig function sends the number 2 (stored in signal[3]) to whichever SPU rank is passed to it.
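
Two details of the SPU code can be checked on any machine with a little plain C. The sketch below is not Cell SDK code, and the helper names next_rank and low_bits_match are made up for illustration. The first helper expresses the ring ("next" SPU, with the last wrapping to the first) as a single modulo. The second expresses the rule behind the "+ 12" on both addresses in sendsig: as I understand it, a 4-byte transfer needs the local store address and the effective address to share the same low 4 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_THREADS 8

/* Rank that `rank` signals in the ring: the next one, wrapping to 0. */
int next_rank(int rank) {
    assert(rank >= 0 && rank < NUM_THREADS);
    return (rank + 1) % NUM_THREADS;
}

/* Returns 1 if a local store address and an effective address agree in
   their low 4 bits, 0 otherwise. */
int low_bits_match(uintptr_t ls, uint64_t ea) {
    return (ls & 0xF) == (ea & 0xF);
}
```

With these, the if/else in main could shrink to sendsig(next_rank(rank)) followed by spu_read_signal1(). A source word at offset 12 of a 128-byte-aligned buffer also matches a register at offset 12 of a page-aligned signal area.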