01-08-2012, 12:43 PM
BlueGene/L System Software
Programming on BG/L
A single application program image
Running on tens of thousands of compute nodes
Communicating via message passing
Each image has its own copy of
Memory
File descriptors
A “job” is encapsulated in a single host-side process
A merge point for compute node stdout streams
A control point for
Signaling (Ctrl-C, kill, etc)
Debugging (attach, detach)
Termination (exit status collection and summary)
Cross compile the source code
Place executable onto BG/L machine’s shared filesystem
Run it
“blrun <job information> <program name> <args>”
Stdout of all program instances appears as stdout of blrun
Files go to user-specified directory on shared filesystem
blrun terminates when all program instances terminate
Killing blrun kills all program instances
Programming Models
“Coprocessor model”
64k instances of a single application program
each has 255M address space
each with two threads (main, coprocessor)
non-coherent shared memory
“Virtual node model”
128k instances
127M address space
one thread (main)
Programming Model
Does a job behave like
A group of processes?
Or a group of threads?
A little bit of each
A process group?
Yes
Each program instance has its own
Memory
File descriptors
No
Can’t communicate via mmap, shmat
Can’t communicate via pipes or sockets
Can’t communicate via signals (kill)
A thread group?
Yes
Job terminates when
All program instances terminate via exit(0)
Any program instance terminates
Voluntarily, via exit(!0)
Involuntarily, via uncaught signal (kill, abort, segv, etc)
No
Each program instance has own set of file descriptors
Each has own private memory space
Compilers and libraries
GNU C, Fortran, and C++ compilers can be used with BG/L, but they do not exploit the second FPU
IBM xlf/xlc compilers have been ported to BG/L, with code generation and optimization features for the dual FPU
Standard glibc library
MPI for communications
System calls
Traditional ANSI + “a little” POSIX
I/O
Open, close, read, write, etc
Time
Gettimeofday, etc
Signal catchers
Synchronous (sigsegv, sigbus, etc)
Asynchronous (timers and hardware events)
System calls
No “unix stuff”
fork, exec, pipe
mount, umount, setuid, setgid
No system calls needed to access most hardware
Tree and torus fifos
Global OR
Mutexes and barriers
Performance counters
Mantra
Keep the compute nodes simple
Kernel stays out of the way and lets the application program run
Software Stack in BG/L Compute Node
CNK (the Compute Node Kernel) controls all access to hardware, and enables a bypass for application use
User-space libraries and applications can directly access torus and tree through bypass
As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy
What happens under the covers?
The machine
The job allocation, launch, and control system
The machine monitoring and control system
The machine
Nodes
IO nodes
Compute nodes
Link nodes
Communications networks
Ethernet
Tree
Torus
Global OR
JTAG
The IO nodes
1024 nodes
talk to outside world via Ethernet
talk to inside world via tree network
not connected to torus
embedded Linux kernel
purpose is to run
network filesystem
job control daemons
The compute nodes
64k nodes, each with 2 CPUs and 4 FPUs
application programs execute here
custom kernel
non-preemptive
application program has full control of all timing issues
kernel and application share same address space
kernel is memory protected
kernel provides
program load / start / debug / termination
file access
all via message passing to IO nodes
The link nodes
Signal routing, no computation
Stitch together cards and racks of IO and compute nodes into “blocks” suitable for running independent jobs
Isolate each block’s tree, torus, and global OR network