TestbedTutorial
There are two computers responsible for controlling the testbed:

- One runs the head server as well as a cluster controller that manages 25 of the nodes; we will simply call this computer the 'head server' (128.138.206.62).
- The other runs only a cluster controller process, managing an additional 25 nodes; we will call it 'cluster controller 2' (128.138.206.42).

On each computer the repository is checked out in a directory called 'ruby'. The head_server process lives in "ruby/daemon" and must be started on the 'head server' with 'ruby ./head_server.rb' from that directory. The cluster_controller process lives in "ruby/cluster" and must be started on both the 'head server' and 'cluster controller 2' with 'ruby ./cluster_controller.rb' from the "ruby/cluster" directory. Each cluster_controller process will then connect to the head server. When you start a job, debug information is displayed in each cluster controller's output console. If you start the applications in screen, you can detach the console and log out while the daemons keep running; someone else can later log in, re-attach the screen session, view the debug output, and restart a process if necessary.

Q: I'm a little confused about the loading process, and why some nodes aren't being programmed.

A: This process is the most intense on the system. Because there is no kind of "multicast", we must send the image individually to each node, which stresses all aspects of the system: 25 threads reading the image and writing it to a USB driver that sends it out at serial speeds. Here is what I can tell you; I'll answer any questions and offer advice on further fixing the problem.

The cluster controllers spawn each "node mate" and then send the appropriate commands to initiate programming. There is a node mate for each architecture; we are interested in the Telos architecture, so the commands for interfacing with the node are in "ruby/cluster/telos_programming_utils.rb". In this file you will find calls to "bsl.py", the application normally used to program the nodes; I believe that application is fairly reliable when run against a single node.

The process starts when the head server gets the "program nodes" message, which (basically) calls the "really_start_job" function in ruby/daemon/head_server.rb. Here we keep a queue of 10 slots (see the @slots member variable defined in the initialize function). We try to keep 10 nodes programming at a time: each time one finishes, we join that thread and start a new one, and in each slot the head server calls the cluster controller's start_node function. You'll notice a "sleep(0.010)" that tries to stagger these loads; I'm not sure what other schemes could be developed. The cluster controller (cluster/cluster_controller.rb) simply delegates the RPC calls from the head server to each node mate, which calls its corresponding "programming_utils" (described above). The code in "programming_utils" defines how to copy a program image onto a node via the USB port; in the case of the TelosB nodes, it simply invokes 'bsl.py' with the appropriate parameters.
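To make the slot scheme concrete, here is a minimal Ruby sketch of keeping ten nodes programming at once with a small stagger between starts. This is not the actual head_server code: the start_node call, the 10-slot limit, and the 0.010-second sleep come from the description above, while the program_nodes method name and its arguments are purely illustrative.

    SLOT_COUNT = 10

    # Sketch only: program each node, keeping at most SLOT_COUNT programming
    # threads alive at a time, roughly as really_start_job is described above.
    def program_nodes(nodes, cluster_controller, image)
      slots = []                       # corresponds to the @slots member variable
      nodes.each do |node|
        # If all slots are busy, wait for the oldest one to finish
        # (a simplification of "join the thread that finished").
        slots.shift.join if slots.size >= SLOT_COUNT

        slots << Thread.new do
          # Ask the node's cluster controller to program it with the image.
          cluster_controller.start_node(node, image)
        end

        sleep(0.010)                   # stagger the load on the USB/serial drivers
      end
      slots.each { |t| t.join }        # wait for the remaining slots to drain
    end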
If you need to debug this process, I would suggest looking at the output of the head server and the cluster controllers while programming a job. This will tell you whether the message is getting propagated to the appropriate processes. If the messages are being delegated properly, the cluster controller's output will show the output of the invocations of 'bsl.py' and whether those are failing due to an improper id or the like.

Q: Occasionally the logger (or another process) reports the error 'MySQL server has gone away'. What gives?

A: I remember this issue cropping up quite a while ago and can't quite remember what the solution was. What I think is happening is that the head server, the logger, and possibly other processes use ActiveRecord to access the MySQL database over a TCP/IP connection. If that connection gets closed, this error pops up (I think there is a timeout on how long it can stay open without any data passing over it). I'm pretty sure telling ActiveRecord to reconnect when the error occurs can rectify the problem, which I believe is what we do in other parts of the code. One thing to try is to restart all the processes (including MySQL) to see if things re-synch (don't forget to check the email I sent out about which processes to start and how to start them in screen).
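One common way of "telling ActiveRecord to reconnect" is to rescue the failed statement, reconnect, and retry once. The sketch below shows that idea; it is not necessarily what the testbed code does, and the Nlog model in the usage comment is hypothetical.

    require 'active_record'

    # Sketch only: retry a database operation once after reconnecting if
    # MySQL reports that the server has gone away.
    def with_reconnect
      attempts = 0
      begin
        yield
      rescue ActiveRecord::StatementInvalid => e
        raise unless e.message =~ /server has gone away/i
        raise if (attempts += 1) > 1               # only retry once
        ActiveRecord::Base.connection.reconnect!   # re-open the MySQL connection
        retry
      end
    end

    # Hypothetical usage inside the logger:
    #   with_reconnect { Nlog.create!(:data => packet_data) }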
Q: The webserver seems to have stopped handling requests, but the computer seems to be up. What do I need to do to restart it?

A: Log in to the webserver computer (128.138.206.43) and run the following command:

    sudo /etc/init.d/apache2 restart

Q: I'm confused about how to add or scan for nodes on a cluster controller.

A: There is a udev entry that renames the devices according to their serial number (which won't change). It is done by adding the following line to /etc/udev/udev.rules (this line should already be on the cluster controllers):

    BUS=="usb", KERNEL=="ttyUSB*", SYMLINK="ttyUSB%s{serial}", NAME="ttyUSB%n"

Now you should be able to 'ls /dev/ttyUSB*' and see all the unique node ids. As an administrator, select 'scan for nodes' with a regexp that will catch all the nodes in /dev (this may take two tries). Note that this currently adds new entries for the nodes, so if you remove all nodes first, I believe the groups become invalid.

Q: How can I set up the webserver?

A: The 'ruby' directory is a sub-directory of mantis-testbed, i.e. check out svn+ssh://mantis.cs.colorado.edu/mantis-testbed and the ruby directory will be in there. The webserver needs to point to the head server to get the database. This is done by editing 'config/database.yml'; an example (I believe) is as follows, except you will need to set up a proper user name that can access the MySQL database. Be sure to set up the configuration entries for development as well as production:

    development:
      adapter: mysql
      database: testbed
      username: charles
      socket: /var/run/mysqld/mysqld.sock
      password:
      head_server: 128.138.206.62
      head_server_port: 6555

You will need to set up FastCGI, and set the webserver's configuration to 'production' by putting this line in 'ruby/testbed/config/environment.rb':

    ENV['RAILS_ENV'] ||= 'production'

Here's an example configuration for Apache2:

    FastCgiServer /var/www/testbed/dispatch.fcgi -idle-timeout 120 -processes 2
    FastCgiConfig -maxClassProcesses 2 -maxProcesses 2 -minProcesses 1 -processSlack 1

    <VirtualHost *:80>
      ServerName rails
      ServerAdmin charlesg3@gmail.com
      DocumentRoot /var/www/testbed
      <Directory />
        Options FollowSymLinks
        AllowOverride None
      </Directory>
      <Directory /var/www/testbed>
        Options Indexes FollowSymLinks MultiViews ExecCGI
        AllowOverride All
        Order allow,deny
        allow from all
      </Directory>
      ErrorLog /var/log/apache2/testbed2.log
      # Possible values include: debug, info, notice, warn, error, crit, alert, emerg.
      LogLevel debug
      CustomLog /var/log/apache2/access.log combined
      ServerSignature On
    </VirtualHost>

Q: How do messages propagate from a node to the logger and other clients?

A: After a node has been programmed, the serial port is opened in ruby/cluster/serial_utils.rb. The appropriate "translator" (ruby/cluster/mos_translator.rb) then converts the byte stream from the serial port into defined packets and sends them to the cluster controller. It does this by calling the 'handle_message' function in 'ruby/cluster/testbed_nodemate.rb'; basically this function just sends the message over the socket between the node-mate and the cluster controller process on that computer. Once the cluster controller receives a packet from the node-mate, it forwards the message to the head_server; I believe this happens in the 'handle_server_packet' function in 'cluster/cluster_controller.rb'. Once the head server receives the packet, it distributes it to each of the connected clients: it first parses the packet in 'daemon/messageTransport.rb', adds it to the sendQueue (in the 'send' function in that file), and eventually the packet is sent to all clients through the 'sendAll' function in the same file. The logger is simply a client that connects and listens for the "LoggerON" message; when it hears that, it starts receiving messages from the head server and inserting them into the nlogs table.
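As a rough illustration of the fan-out step, here is a minimal Ruby sketch of a send queue that broadcasts each incoming packet to every connected client. The class and helper names are illustrative and do not come from messageTransport.rb; only the send / sendAll idea is taken from the description above.

    require 'thread'

    # Sketch only: queue packets arriving from the cluster controllers and
    # broadcast each one to every connected client socket (logger, UIs, ...).
    class PacketFanout
      def initialize
        @clients    = []
        @send_queue = Queue.new
        @mutex      = Mutex.new
        # Worker thread drains the queue and broadcasts each packet.
        Thread.new { loop { send_all(@send_queue.pop) } }
      end

      # Register a newly connected client socket.
      def add_client(socket)
        @mutex.synchronize { @clients << socket }
      end

      # Called when a packet arrives from a cluster controller.
      def send(packet)
        @send_queue << packet
      end

      private

      # Deliver one packet to every client, dropping clients whose socket fails.
      def send_all(packet)
        @mutex.synchronize do
          @clients.reject! do |client|
            begin
              client.write(packet)
              false
            rescue IOError, SystemCallError
              true                     # remove disconnected clients
            end
          end
        end
      end
    end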