Using the Cluster Ohio Cluster at Kent and the IA32 Cluster at OSC

The Cluster Ohio Cluster at Kent is called co1.cs.kent.edu and has eight nodes, each consisting of:

  • Four 550 MHz Intel Pentium III Xeon processors, each with 512 kB of secondary cache
  • Two gigabytes of RAM
  • 18 gigabytes of ultra-wide SCSI hard-drive storage
  • Two Myrinet interfaces
  • One 100Base-T Ethernet interface

The software installed on the cluster includes:

  • Red Hat Linux
  • Portland Group compilers for FORTRAN
  • KAI toolset for C and C++
  • Etnus TotalView
  • NAGlib
  • Miscellaneous tools from the public domain or developed by OSC.
The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges, universities, and companies. One of the high-performance computing systems available at OSC is the IA32 Cluster, a cluster of commodity PCs with a high-speed network interconnect. The current configuration is one file server node, two front-end nodes, one hundred twenty-eight compute nodes, and sixteen storage nodes. The file server node has four 550 MHz Pentium III Xeon processors, 4 GB of memory, and 72 GB of local storage. Each front-end (login) node and each compute node has two 1.4 GHz Athlon MP processors, 2 GB of memory, and 70 GB of local scratch space. Each storage node has two 933 MHz Pentium III processors, 1 GB of memory, and a RAID controller supporting 520 GB of IDE RAID disks. The nodes are connected by Myrinet 2000, a switched 2 Gbit/s network.


Getting started at Kent

To log in to the Cluster Ohio cluster at Kent, ssh to the following address:
        co1.cs.kent.edu
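For example (username is a placeholder for your own account name):

        ssh username@co1.cs.kent.edu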
From here, you have access to the compiling systems, performance-analysis tools, and debugging tools. You can run programs interactively or through batch requests. See the following sections for details.


Getting started at OSC

To log in to the IA32 Cluster at OSC, ssh to the following address:
        oscbw.osc.edu
From here, you have access to the compiling systems, performance-analysis tools, and debugging tools. You can run programs interactively or through batch requests. See the following sections for details.


File system

The Cluster accesses the user home directories found on the OSC file server. Therefore, users have the same home directory on the Cluster as on the Cray SV1 and the SGI Origin 2000.

The Cluster also has fast local disk space intended for temporary files. You are encouraged to perform the majority of your work in the temporary space and only store permanent files in your home directory. Large files stored in your home directory may be automatically migrated to the tape storage repository. To ensure fast access to required files, copy the files to the temporary area at the start of your session.

 

The following example shows how to use /tmp, the temporary-storage directory.

    mkdir /tmp/$USER        # Create your own temporary directory.
    cp files /tmp/$USER     # Copy the necessary files.
    cd /tmp/$USER           # Move to the directory.
    ...                     # Do work (compile, execute, etc.).
    ...
    cp new files $HOME      # Copy important new files back home.
    cd $HOME                # Return to your home directory.
    rm -rf /tmp/$USER       # Remove your temporary directory.
    exit                    # End the session.

Use this procedure when compiling and executing interactively. The temporary space is not backed up, and old files may be purged when the temporary file system gets full.

A simpler procedure is available for batch jobs through the TMPDIR environment variable. See "Batch requests" for more information.

Executing programs

Commands on the Cluster can be executed either interactively or through batch requests. The Cluster has fixed usage limits for interactive execution; jobs that take more than the allowed CPU time must be executed using batch requests. To use the resources of the Cluster most efficiently, you are encouraged to use batch requests whenever possible. See "Batch requests" for more information.

For information on how to execute an MPI program, see the "MPI" section.

To execute a non-MPI program, simply enter the name of the executable. Unless otherwise specified, the number of processors used for a non-MPI parallel program is determined by the operating system at run time. To control the number of processors, set the environment variable NCPUS (for automatically parallelized jobs) or OMP_NUM_THREADS (for jobs parallelized using OpenMP compiler directives). If the number of available processors (four per node at Kent, two per node on the OSC IA32 Cluster) is less than NCPUS or OMP_NUM_THREADS, then at least one processor will run multiple threads.

The following ksh example causes a.out to use 2 processors if they are available.

        export NCPUS=2
        export OMP_NUM_THREADS=2
        ./a.out
The omp_get_num_threads function can be called from within a parallel region to determine the number of threads executing that region.
        integer function omp_get_num_threads()

        int omp_get_num_threads(void);
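
The following Fortran 90 sketch (an illustration assumed here, not taken from the original text) prints the thread count from inside a parallel region; called outside a parallel region, the function simply returns 1.

        program count_threads
        integer :: nthreads, omp_get_num_threads
        nthreads = 1
        !$omp parallel
        !$omp single
        nthreads = omp_get_num_threads()
        !$omp end single
        !$omp end parallel
        print *, 'Running with ', nthreads, ' thread(s)'
        end program count_threads

Compile it with the OpenMP option described later under "Shared memory" and set OMP_NUM_THREADS before running:

        pgf90 -mp count_threads.f90
        export OMP_NUM_THREADS=2
        ./a.out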

Batch requests

Batch requests are handled by the Portable Batch System (PBS) and Maui Scheduler. Use the qsub command to submit a batch request, qstat to view the status of your requests, and qdel to delete unwanted requests. For more information, see the man pages for each command.

The following options are often useful when submitting batch requests. The options may appear on the qsub command line or, preceded by #PBS, at the beginning of the batch-request file.

Option                            Meaning
-l walltime=time                  Total wall-clock time limit, given in seconds or as hh:mm:ss.
-l nodes=numnodes:ppn=numprocs    Request numprocs processors (max 2) on each of numnodes nodes (max 112).
-N job                            Name the job.
-S shell                          Use shell rather than your default login shell to interpret the job script.
-j oe                             Redirect stderr to stdout.
-m ae                             Send e-mail when the job finishes or aborts.
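
For example, the first three options above could equally be given on the qsub command line rather than in the request file (my_script.job is a placeholder file name):

        qsub -l walltime=1:00:00 -l nodes=1:ppn=2 -N my_job my_script.job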

By default, your batch jobs execute in your home directory. This is true even if you submit the job from another directory.
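
If a job should instead run in the directory it was submitted from, a standard PBS facility (not covered in the original text) is the PBS_O_WORKDIR environment variable, which records the submission directory; the job script can simply change to it:

        cd $PBS_O_WORKDIR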

To facilitate the use of temporary disk space, a unique temporary directory is automatically created at the beginning of each batch job. This directory is also automatically removed at the end of the job. You access the directory through the TMPDIR environment variable. Note that in jobs using more than one node, $TMPDIR is not shared -- each node has its own distinct version of $TMPDIR.

A sample ksh request file appears below. The request first copies a Fortran file from a subdirectory of the user's home to the temporary space. It then compiles the file for automatic parallel execution, runs the executable using 2 threads on 1 node, and copies the results back to the previous subdirectory. Notice that the careful use of full file names allows this request to be submitted safely from any subdirectory.

#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe

cd $TMPDIR
cp $HOME/science/my_program.f .
pgf77 -O2 -Mconcur my_program.f
export NCPUS=2
./a.out > my_results
cp my_results $HOME/science/.

If you have the above request saved in a file named my_request.job (and my_program.f saved in a subdirectory called science), the following command will submit the request.
        qsub my_request.job
You can use the qstat command to monitor the progress of the resulting batch job. When the job finishes, my_results will appear in the science subdirectory, and the standard output generated by the job will appear in a file called my_job.oN, where N is some number. This file will appear in the directory where you executed the qsub command. The N differentiates multiple submissions of the same job, since each submission generates a different number.
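
For example, to list only your own jobs with qstat (username is a placeholder for your login name):

        qstat -u username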

Single-CPU sequential jobs should either set the -l nodes resource limit to 1:ppn=1 or leave it unset entirely. The following is an example of a sequential job which uses $TMPDIR as its working area.

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -N cdnz3d
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/Beowulf/cdnz3d
cp cdnz3d cdin.dat acq.dat cdnz3d.in $TMPDIR
cd $TMPDIR
/usr/bin/time ./cdnz3d > cdnz3d.hist
cp cdnz3d.hist cdnz3d.out $HOME/Beowulf/cdnz3d

All batch jobs must set the -l walltime resource limit, as this allows the Maui Scheduler to backfill small, short running jobs in front of larger, longer running jobs. This in turn helps improve turnaround time for all jobs.


Programming environment

The IA32 Cluster supports two programming models of parallel execution: shared memory using one node, through compiler directives and automatic parallelization; and distributed memory using multiple nodes, through message passing. See the sections below for more information.

Compiling systems

FORTRAN 77, Fortran 90, HPF, C, and C++ are supported on the IA32 Cluster. The IA32 Cluster has the Portland Group's suite of optimizing compilers, which tend to generate faster code than that generated by the standard GNU compilers.

The following examples produce the Linux executable a.out for each type of source file.

FORTRAN 77      pgf77 sample.f
Fortran 90      pgf90 sample.f90
HPF             pghpf sample.hpf
C               pgcc sample.c
C++             pgCC sample.C

For more information on command-line options for each compiling system, see the man pages (man pgf90, man pgcc, man pgCC, etc.).

Shared memory

The IA32 Cluster can automatically optimize single-node sequential programs for shared-memory parallel execution using the -Mconcur compiler option.
        pgf77 -Mconcur sample.f

        pgf90 -Mconcur sample.f90

        pgcc -Mconcur sample.c

        pgCC -Mconcur sample.C
In addition to automatic parallelization, the FORTRAN 77 and Fortran 90 compilers understand the OpenMP set of directives, which give the programmer finer control over the parallelization. The -mp compiler option enables OpenMP support.
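
As a minimal sketch (not taken from the original text), the following Fortran 90 program parallelizes a loop with an OpenMP directive:

        program omp_loop
        integer :: i
        real :: a(100000)
        !$omp parallel do
        do i = 1, 100000
           a(i) = sqrt(real(i))
        end do
        !$omp end parallel do
        print *, a(1), a(100000)
        end program omp_loop

Compile it with -mp and set the desired number of threads before running:

        pgf90 -mp omp_loop.f90
        export OMP_NUM_THREADS=2
        ./a.out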

Message Passing Interface (MPI)

The IA32 Cluster at OSC uses the MPICH implementation of the Message Passing Interface (MPI), optimized for the high-speed Myrinet interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on MPI, see the "Training" section of the Technical Information Server.

Each program file using MPI must include the MPI header file. The following statements must appear near the beginning of each C file or Fortran file, respectively.

        #include <mpi.h>

        include 'mpif.h'

To compile an MPI program, use the MPI wrappers around the Portland Group compilers. Here are some examples:

        mpif77 sample.f

        mpif90 sample.f90

        mpicc sample.c

        mpiCC sample.C
The Portland Group HPF compiler can also use the Myrinet optimized version of MPI. This should be used for HPF programs which are intended for use on more than one node of the cluster:
        pghpf -Mmpi sample.hpf
Use the mpiexec command to run the resulting executable; this command will automatically determine how many processors to run on based on your batch request.
        mpiexec a.out
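For reference, a minimal mpiprogram.f (a sketch assumed here, not part of the original text) that reports each process's rank might look like the following; it is compiled with mpif77 and launched with mpiexec as shown above.

        program mpiprogram
        include 'mpif.h'
        integer ierr, rank, nprocs
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
        print *, 'Hello from process ', rank, ' of ', nprocs
        call MPI_FINALIZE(ierr)
        end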
Here is an example of an MPI job that uses 64 nodes (128 processors):
#PBS -l walltime=1:00:00
#PBS -l nodes=64:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe

cd $HOME/science
mpif77 -fast mpiprogram.f
pbsdcp a.out $TMPDIR
cd $TMPDIR
mpiexec ./a.out > my_results
cp my_results $HOME/science/.

mpiexec will normally spawn one MPI process per CPU requested in a batch job. However, this behavior may be modified with the -pernode command-line option, which requests that one MPI process be spawned per node. This option is intended for codes that mix MPI message passing with some form of shared-memory programming model, such as OpenMP or POSIX threads.
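
For example, a hybrid job might launch one MPI process per node and let each process run two OpenMP threads, one per processor on the dual-processor OSC compute nodes:

        export OMP_NUM_THREADS=2
        mpiexec -pernode ./a.out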

The pbsdcp command used in the example above is a distributed copy command; it copies the listed file or files to the specified destination (the last argument) on each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as /tmp or $TMPDIR.
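
For example, to place an executable and two input files (names reused from the sequential example above) in $TMPDIR on every node assigned to a job:

        pbsdcp a.out cdin.dat acq.dat $TMPDIR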

Debugging

The Portland Group compiler suite includes a graphical debugger called Xpgdbg. This debugger allows for interactive and post-mortem analysis of sequential and shared-memory parallel programs. To debug a program with Xpgdbg, first compile the program with the -g option.

        pgf77 -g program.f

        pgf90 -g program.f90

        pgcc -g program.c

        pgCC -g program.C
To debug a program interactively, run the debugger on the appropriate executable.
       Xpgdbg a.out
To analyze a core file after an unsuccessful execution, run the debugger on the core file and supply the executable that generated the file.
       Xpgdbg a.out core

Performance Analysis


Software

Training

Additional information on Beowulf clusters in general can be found at the Beowulf Project home page at NASA Goddard and the Extreme Linux home page at Los Alamos National Laboratory.
 
OSC is also known as the Ohio Supercomputer Center.
OSC, 1224 Kinnear Road, Columbus, Ohio 43212
Consultants: oschelp@osc.edu, phone 800/686-6472 or 614/292-1800, fax 614/292-7168

© 2001, OSC
