SUR Blue Gene

Introduction

This document is a repository for basic usage information pertaining to RPI's SUR Blue Gene/L. All directions here assume that you are ssh'ed into levi.nic.rpi.edu from either the campus network or from an authorized off-campus host/subnet.

For information on the CCNI Blue Gene please look at the CCNI wiki: http://wiki.ccni.rpi.edu

System Overview

Hardware

  • Two 700-MHz PowerPC 440 CPUs per node, 1024 nodes
  • 32-bit architecture
  • 1024 MB of double data rate (DDR) dynamic random access memory (DRAM) per node at 350 MHz; approximately 85-cycle latency
  • Caches
    • L1 data cache : 32 KB per processor; 32-B cache-line size; 64-way set associative; round-robin replacement
    • L2 data cache : 2 KB per processor; a prefetch buffer with 16 128-byte lines
    • L3 data cache : 4 MB embedded DRAM shared by the processors; approximately 35-cycle latency
  • Networks
    • 3-dimensional torus : 175 MBps in each direction
    • Global collective : 350 MBps; 1.5 μs latency
    • Global barrier and interrupt
    • JTAG
    • Gigabit Ethernet (external)
  • Processing Units
    • Single integer unit (FXU)
    • Single load/store unit (LSU)
    • Special double floating-point unit (DFPU) : 32 primary floating-point registers, 32 secondary floating-point registers; supports both standard PowerPC and SIMD instructions

Instruction Sets

  • Standard PowerPC instructions (fadd, fmadd, fadds, fdiv)—Execute on FPU0; 5-cycle latency in the floating-point pipeline
  • SIMD instructions (fpadd, fpmadd, fpre, and so forth)— Execute on data in matched primary and secondary register pairs, generating up to two results per processor clock cycle; 5-cycle latency in the floating-point pipeline

The theoretical floating-point performance limit is one fpmadd per cycle, resulting in four floating-point operations per cycle. This amounts to 4 × 700 × 10^6 = 2.8 × 10^9 floating-point operations per second (FLOPS) per processor core, or a peak performance of 5.6 GFLOPS per compute node.
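
Spelling the arithmetic out (this simply restates the figures above: one SIMD fused multiply-add per cycle is four floating-point operations, at 700 MHz, on two cores per node):

 \begin{align*}
 \text{per core:} \quad & 4\ \tfrac{\text{flops}}{\text{cycle}} \times 700 \times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} = 2.8\ \text{GFLOPS} \\
 \text{per node:} \quad & 2\ \text{cores} \times 2.8\ \text{GFLOPS} = 5.6\ \text{GFLOPS}
 \end{align*}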

Storage

System storage consists of 6 TB of SAN-attached, hardware-RAID-protected disk, served through a four-server GPFS parallel cluster file system. This storage is backed up nightly to tape.

Getting an Account

To request an account on the SUR Blue Gene/L you must submit [this] form to either the SCOREC main office (see SCOREC web site for address) or to the Academic and Research Computing division (specific contact to be provided...).

Connecting to the System

All interaction with the system is done via SSH to the host levi.nic.rpi.edu. If you are off campus, you will need to either use the campus VPN service or request that the support staff (see contact info below) add your IP address to the list of machines allowed to connect. Moving data on and off the system is most simply done using SCP, but the interactive node also has utilities such as an FTP client, rsync, Subversion, and CVS installed.
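
For example, a file can be copied into your home directory on the interactive node with SCP (the username and file name below are placeholders):

 scp mydata.tar.gz your_username@levi.nic.rpi.edu:~/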

Building Executables

Compilers

Your basic parallel compiler wrappers are in your $PATH by default, in both GNU and IBM XL flavors: mpicc, mpicxx, and mpif77 for GNU, and mpixlc, mpixlcxx, mpixlf77, and mpixlf90 for XL. The "naked" compilers themselves (without the MPI wrappers) are located as follows:

FORTRAN:

/opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf 
/opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf90 
/opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf95 

C:

/opt/ibmcmp/vac/bg/8.0/bin/blrts_xlc 

C++:

/opt/ibmcmp/vacpp/bg/8.0/bin/blrts_xlC 

There are also stand alone GNU compilers for the Blue Gene:

FORTRAN:

/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-g77 

C:

/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc 

C++:

/bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-g++ 

Note that using the GNU compilers on Blue Gene is generally not recommended, as the IBM XL compilers tend to offer significantly higher performance; the GNU compilers do, however, offer more flexible support for features such as inline assembly.
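
As a minimal sketch (the source file name is hypothetical), an MPI program can be compiled and linked through the XL wrappers; the optimization and architecture flags are discussed in the Notes section below:

 mpixlc -O3 -qarch=440d -qtune=440 -c my_solver.c
 mpixlc -O3 -qarch=440d -qtune=440 my_solver.o -o my_solver.rts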

Books can be (and have been) written about tuning software for the Blue Gene, but here are some broad points to keep in mind when trying to get the most out of the system:

  • Avoid having many processes perform parallel file system operations in the same directory
  • Leverage the low-latency IPC network as much as possible; this is Blue Gene's strength
  • Understand the differences between co-processor and virtual node operation and leverage them to suit your job (e.g., use the different L2/L3 cache profiles effectively, decide how much memory you will need per processor, understand your IPC patterns); see the sketch below
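
A minimal sketch of the last point above (the executable name is a placeholder; mpirun arguments follow the job-script example later in this document):

 mpirun -mode CO -cwd `pwd` ./my_bluegene_executable   # co-processor mode: one compute task per node
 mpirun -mode VN -cwd `pwd` ./my_bluegene_executable   # virtual node mode: two compute tasks per node, each with half the memory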

Libraries

The ESSL linear algebra subroutines are located in /bgl/BlueLight/ppcfloor/essl/4.2/lib, and the basic Blue Gene libraries (libc, etc.) are located in /bgl/BlueLight/ppcfloor/bglsys/lib; the corresponding headers live in /bgl/BlueLight/ppcfloor/bglsys/include.
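
A hedged linking sketch (the object and executable names are placeholders, and the ESSL library name -lesslbg is an assumption; check the lib directory above for the exact library names installed):

 mpixlf77 my_code.o -L/bgl/BlueLight/ppcfloor/essl/4.2/lib -lesslbg -o my_code.rts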

Dependencies

"DEPS=-M" does not work well for BlueGene. -M means generate dependencies but it does not prevent compiling. To avoid this situation the flag "DEPS=-qmakedep -o /dev/null -c" could be used instead.

To compile Zoltan on Blue Gene, the dependency generation can be disabled as follows.

Modify Zoltan/Makefile_sub and Zoltan/Utilities/Makefile_sub:

  • Comment out the following line:

 #include $(OBJ_FILES:.o=.d)

  • Remove the text $(OBJ_FILES:.o=.d) everywhere else in those files.

Debuggers

GDB

GDB can currently be run by passing -start_gdbserver /bgl/BlueLight/V1R3M4_300_2008-080728/ppc/dist/sbin/gdbserver.440 to mpirun.
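
A hedged example invocation (the executable name is a placeholder; the other mpirun arguments are the same as for a normal run):

 mpirun -cwd `pwd` \
   -start_gdbserver /bgl/BlueLight/V1R3M4_300_2008-080728/ppc/dist/sbin/gdbserver.440 \
   ./my_bluegene_executable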

At the mpirun prompt, enter

 dump_proctable

to get addresses to connect to.

Connect using /bgl/BlueLight/V1R3M4_300_2008-080728/ppc/blrts-gnu/bin/gdb and the 'target remote' command:

 GNU gdb 6.4
 Copyright 2005 Free Software Foundation, Inc.
 GDB is free software, covered by the GNU General Public License, and you are
 welcome to change it and/or distribute copies of it under certain conditions.
 Type "show copying" to see the conditions.
 There is absolutely no warranty for GDB.  Type "show warranty" for details.
 This GDB was configured as "powerpc-linux-gnu".
 (gdb) target remote 172.?.?.?:port

Totalview

We have licensed the TotalView debugger for the Blue Gene platform for academic use. If you use the bash shell, this tool will be in your $PATH. Documentation is located at /gpfs/gpfs0/software/applications/toolworks/tvlatest/doc/pdf and should explain most of what you might want to know about using the tools. Notes specific to this Blue Gene installation include the following:

  • Jobs are most easily launched under the debugger by starting the totalview GUI from a shell script submitted to the scheduler, giving mpirun as the program to debug, and passing the arguments you would normally give to mpirun as the program arguments in TotalView. For example:


Create a shell script from which to launch TV:

cat runtv.sh
#!/bin/bash

totalview

Submit the script to the scheduler:

sbatch -p debug --nodes 32 ./runtv.sh

When totalview starts, specify mpirun as the program to debug (don't select a parallel environment)...
Image:Tvprog.jpg

...and specify the arguments you would pass to mpirun to run your program normally as arguments for Totalview to use:

Image:Tv-args-slurm.png

Warning: My latest attempt to run TotalView on the SUR system required not running in virtual node mode (i.e., do not include -mode VN in the argument list).

Scheduling and Running Jobs

The RPI SUR Blue Gene uses a batch submission system to run jobs. The scheduler takes jobs that specify a number of parameters (including the number of processors, the Blue Gene interconnect, and memory size) and schedules the execution of each job on a part of the Blue Gene matching its requirements. The choices for the size of a job are currently 1024, 512, 128, or 32 nodes.

A web interface exists to view the status of the Blue Gene system. It is available at http://wrangler.nic.rpi.edu:8080/BlueGeneNavigator/faces/jobs.jsp

Using the Queuing System (Slurm)

Slurm jobs for Blue Gene are built by simply putting one or more mpirun commands in a shell script (with no specification of which Blue Gene partition they will run on) and submitting the script to Slurm with the sbatch command. You specify the resources you need to sbatch, and the scheduler takes care of allocating an appropriate Blue Gene partition for your job. The manual for sbatch (man sbatch) gives a complete list of parameters; here is a simple usage example.

The job script (let's call it testjob.sh):

#!/bin/bash

echo 'job starting'
mpirun -mode VN -cwd `pwd` ./my_bluegene_executable
# additional calls to mpirun may be placed in the script, they will all use the same partition 

The job submission command:

sbatch --nodes 128 -p normal -o ./jobstdout ./testjob.sh

Note that "--nodes 128" is requesting 128 Blue Gene nodes, "-p normal" is setting the queue to 'normal', and STDOUT for the job is redirected to the file jobstdout in the current working directory where the command is run. As mentioned above, a lot can be done with sbatch to customize how jobs are run and the sbatch manual page should be consulted for this information. Note that the shorted "-n" flag is requesting processors, not nodes, and will request half that many compute nodes. The capitalized "-N" does request whole compute nodes.

The status of the queue can be viewed with squeue and jobs can be canceled with scancel, specifying the job's ID as the sole argument.
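
For example (the job ID below is hypothetical):

 squeue            # show queued and running jobs
 scancel 12345     # cancel the job with ID 12345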

There are two queues set up. Your job can run either in the debug queue (specified with "-p debug" as part of your sbatch command; this is also the default if you do not specify a queue) or in the normal queue ("-p normal"). To clarify the difference between the queues, here is a comparison:

Partition   Size (nodes)          Time Limit   Priority
debug       32                    1 hour       higher
normal      1024, 512, 128, 32    12 hours     lower

Backfill scheduling: Slurm has the ability to schedule and run smaller, shorter jobs while waiting for a larger set of nodes to become available for a larger or longer job. To take advantage of this, set the "--time" flag to the amount of time your job is expected to take. If the job can be run sooner without impacting other jobs' priorities, it may launch earlier than it would under the traditional scheduler. (See man sbatch for more details.)
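
A hedged example, assuming the job is expected to finish within 30 minutes (see man sbatch for the accepted time formats):

 sbatch --nodes 32 -p normal --time 30 -o ./jobstdout ./testjob.sh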

Usage Policy

The system is operated without job limits except for wall clock length. This is intended to allow for the most productive use of the system, but it requires that users be respectful of each other. This means not monopolizing the system: only submit more than two or three jobs at a time if they're short; don't hog the whole machine unless no one else is using it (and stop feeding it jobs when someone else enters the queue); and don't pile data into GPFS unless you're actively using it on the system.

Accounting

In June of 2007 a plan was devised under which all users of the SUR Blue Gene system would be given a number of 'tokens', and jobs run through the system would consume one token for every second on every node they run on. Once a user ran out of tokens, their jobs would be rescheduled to a lower priority (the scavenge queue) until their tokens had been replenished. The scavenge queue also enforces its own job time limit. This scheme has not, to date, been enforced.

Disk accounting is done as a typical Unix file system quota, with a default of 100 GB per home directory. You may check your quota utilization with the mmlsquota command like so:

mmlsquota gpfs0

Support/Contact

The SUR Blue Gene support staff can be reached at bg-support-l [at] lists.rpi.edu

Notes

Optimizations Available to the IBM XL Compilers

For a complete description of optimization, see the IBM XL User’s Guide for the language used by your application. This section summarizes that information and provides recommendations for setting the XL compiler flags to optimize the performance of your application on the Blue Gene. The default optimization level for the XL compiler is none. The following optimization levels are available:

  • -O : A good optimization level to start with; use it together with the -qmaxmem=64000 flag.
  • -O2 : The same as -O.
  • -O3 -qstrict : Optimization that must strictly obey program semantics.
  • -O3 : An aggressive optimization level; it allows re-association and will replace division with multiplication by the reciprocal when possible.
  • -O4 : Short for -O3 -qhot -qipa=level=1 -qarch=auto -qtune=auto, adding compile-time interprocedural analysis. With this option, also add -qarch=440d -qtune=440 to restore the proper architecture and tuning options for the eServer Blue Gene.
  • -O5 : Short for -O3 -qhot -qipa=level=2 -qarch=auto -qtune=auto, adding link-time interprocedural analysis. As with -O4, also add -qarch=440d -qtune=440 to restore the proper architecture and tuning options.
  • -qhot : Turns on the high-order transformation module; it will add vector routines unless -qhot=novector is specified.
  • -qreport=hotlist : Adds information about the -qhot transformations to the compiler listing; check the listing.
  • -qipa : Performs interprocedural analysis; it has many suboptions, such as -qipa=level=2.
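
For example, a hedged sketch of an aggressive build that restores the Blue Gene architecture and tuning settings (the source file name is hypothetical):

 mpixlc -O4 -qarch=440d -qtune=440 -c solver.c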

Architecture Flags

  • -qarch=440 : Generates standard PowerPC floating-point code, using the single FPU per processor.
  • -qarch=440d : Tries to generate double-FPU code, using both FPUs per processor.
  • -qtune=440 : The default tuning option.
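
A short sketch contrasting the two architecture options (the file name is hypothetical):

 mpixlc -O3 -qstrict -qarch=440 -qtune=440 -c kernel.c     # standard single-FPU code
 mpixlc -O3 -qstrict -qarch=440d -qtune=440 -c kernel.c    # attempt double-FPU (SIMD) code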

A Sample Makefile

# Compilers 
 
CC = /opt/ibmcmp/vac/bg/8.0/bin/blrts_xlc 
F77 = /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf 
 
# Blue Gene Software 
 
BGL_PATH = /bgl/BlueLight/ppcfloor/bglsys 
 
# Executables 
 
EXECS = hello_c.rts hello_f.rts 
 
# Include and Libraries 
 
INC_PATH = -I$(BGL_PATH)/include 
LIB_PATH = -L$(BGL_PATH)/lib 
LIBS_MPI = $(LIB_PATH) -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts 
 
# Flags 
 
OPT_FLAGS = -O3 -qstrict -qarch=440 -qtune=440 
CFLAGS = $(OPT_FLAGS) $(INC_PATH) 
FFLAGS = -qnullterm $(OPT_FLAGS) $(INC_PATH) 
 
default: $(EXECS) 
 
hello_c.rts:       HelloWorld_c.o 
        $(CC) HelloWorld_c.o $(LIBS_MPI) -o hello_c.rts 
 
hello_f.rts:    HelloWorld_f.o 
        $(F77) HelloWorld_f.o $(LIBS_MPI) -o hello_f.rts 
 
# Note: To have the text of any FORTRAN error messages displayed instead of just
# the error code, specify -env "NLSPATH=$NLSPATH" on your mpirun command line.
# For C codes 
 
.c.o: 
        $(CC) $(CFLAGS) -c $< 
 
# For Fortran codes 
 
.f.o: 
       $(F77) $(FFLAGS) -c $< 

clean: 
        rm -f *.o $(EXECS) 
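
A hedged usage sketch tying this Makefile to the queuing system described above (node count, queue, and file names are placeholders that follow the earlier examples). Build the executables:

 make

Then create a small job script (call it hello_job.sh) that runs one of them:

 #!/bin/bash
 mpirun -cwd `pwd` ./hello_c.rts

The submission command:

 sbatch --nodes 32 -p debug -o ./hello.out ./hello_job.sh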

Links

IBM guide to using the Blue Gene: http://www.redbooks.ibm.com/abstracts/sg246686.html
System Monitoring Tool: http://wrangler.nic.rpi.edu:8080/BlueGeneNavigator/faces/jobs.jsp
A guide to building ITK on BG/L: http://www.itk.org/Wiki/Proposals:Compiling_on_Bluegene_Supercomputer