LA-UR-08-6246



## **Roadrunner:** What makes it tick?

### Los Alamos Computer Science Symposium October 14, 2008

### Ken Koch

Roadrunner Technical Manager, Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory

Work presented was performed by a large team of Roadrunner project staff!





## The messages this talk will convey are:

- Why Roadrunner? Why Cell?
  - A bold but important step toward the future
- What does Roadrunner look like?
  - Cluster-of-clusters with node-attached Cells
- Concepts for Programming Roadrunner
  - MPI, Opteron+Cell, "local-store" memory & DMA transfers
- Status and plans for Roadrunner
  - Unclassified Science opportunities





## **The Cell Processor**

## a harbinger of the future





## **Microprocessor trends are changing**

- Moore's law still holds, but is now being realized differently
  - Frequency, power, & instruction-level parallelism (ILP) have all plateaued
  - Multi-core is here today and manycore (≥ 32) looks to be the future
  - Memory bandwidth and capacity per core are headed downward (caused by increased core counts)
  - Key findings of Jan. 2007 IDC Study: "Next Phase in HPC"
    - new ways of dealing with parallelism will be required
    - must focus more heavily on bandwidth (flow of data) and less on processor



From Burton Smith, LASCI-06 keynote, with permission





# We are programming thousands of processors with MPI







# Future supercomputers will require new programming models



# The Cell processor is an (8+1)-way heterogeneous parallel processor





## **IBM is creating new Cell processors**







# Industry presentations show changing trends in processors

#### Intel's Microprocessor Research Lab



#### Intel's Visual Computing Group: Larrabee











# Roadrunner is on a different path to petascale





## A Roadrunner is born







# IBM built hybrid nodes in Rochester, MN and assembled the system in Poughkeepsie, NY





# Roadrunner broke the 1 Petaflop/s mark on May 26<sup>th</sup>, 2008



Only 3 days after the full machine was finally assembled!



### **Roadrunner is a TOP performer!**



## Roadrunner System Configuration





# Roadrunner Phase 3 is Cell-accelerated, not a cluster of Cells







## A Roadrunner TriBlade node integrates Cell and Opteron blades

- QS22 is an IBM Cell blade containing two new enhanced double-precision (eDP/PowerXCell<sup>TM</sup>) Cell chips
- Expansion blade connects two QS22 via four PCI-e x8 links to LS21 & provides the node's ConnectX IB 4X DDR cluster attachment
- LS21 is an IBM dual-socket Opteron blade
- 4-wide IBM BladeCenter packaging
- Roadrunner TriBlades are completely diskless and run from RAM disks, with NFS & Panasas only to the LS21
- Node design points:
  - One Cell chip per Opteron core
  - ~400 GF/s double-precision & ~800 GF/s single-precision
  - 16 GB Opteron memory PLUS 16 GB Cell memory
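The design points above can be checked from the blade counts. The following is a minimal sketch of that arithmetic, not project code. It assumes dual-core Opterons in the LS21 (as listed in the Connected Unit specifications) and derives a per-chip Cell peak from the Connected Unit figures of 73.7 TF over 720 chips. All identifiers are hypothetical.

```c
#include <assert.h>

/* Blade counts per TriBlade, taken from the node description. */
enum { LS21_PER_NODE = 1, QS22_PER_NODE = 2 };
enum { CELLS_PER_QS22 = 2 };                          /* two PowerXCell chips per QS22 */
enum { SOCKETS_PER_LS21 = 2, CORES_PER_SOCKET = 2 };  /* dual-socket, dual-core (assumed) */

/* Number of Cell chips in one TriBlade node. */
int cells_per_node(void) {
    return QS22_PER_NODE * CELLS_PER_QS22;
}

/* Number of Opteron cores in one TriBlade node. */
int opteron_cores_per_node(void) {
    return LS21_PER_NODE * SOCKETS_PER_LS21 * CORES_PER_SOCKET;
}

/* Double-precision peak of the Cell chips in one node, in GF/s.
   The per-chip peak is derived as (CU Cell peak) / (CU Cell chip count). */
double node_cell_dp_gflops(double cu_cell_peak_tf, int cu_cell_chips) {
    assert(cu_cell_chips > 0);
    double per_chip_gf = cu_cell_peak_tf * 1000.0 / cu_cell_chips;
    return per_chip_gf * cells_per_node();
}
```

Calling `node_cell_dp_gflops(73.7, 720)` gives roughly 409 GF/s, in line with the "~400 GF/s double-precision" design point, and both count functions return 4, which is the "one Cell chip per Opteron core" pairing.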














## A Connected Unit (CU) forms a building block




## A Connected Unit (CU) is a powerful cluster

**Connected Unit Specifications:**

- Opteron
  - 360 1.8 GHz dual-core Opterons
  - 2.59 TF DP peak Opteron
  - 2.88 TB Opteron memory
  - 24 2.6 GHz dual-core Opterons in I/O nodes
- Cell
  - 720 PowerXCell chips
  - 73.7 TF DP peak Cell
  - 2.88 TB Cell memory
  - 18.4 TB/s Cell memory BW
- Interconnect and I/O
  - 192 IB 4X DDR cluster links
  - 768 GB/s aggregate BW (bi-dir)
  - 384 GB/s bi-section BW (bi-dir)
  - 24 10 GigE I/O links on 12 I/O nodes
  - 24 GB/s aggregate I/O BW (uni-dir) (IB limited)
- Layout
  - 180 TriBlade compute nodes (1 LS21 + 2 QS22)
  - 12 IBM x3655 I/O nodes (dual-socket dual-core, dual 10 GigE each) to Panasas filesystem
  - 192 cluster nodes
  - Voltaire 288-port IB 4x DDR
  - 96 2<sup>nd</sup>-stage links

Operated by the Los Alamos National Security, LLC for the DOE/NNSA
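The bandwidth and memory figures in the Connected Unit specifications follow from the link and node counts. The sketch below reproduces them under one assumption not stated on the slide: an IB 4X DDR link carries 2 GB/s of data in each direction. Function names are hypothetical.

```c
#include <assert.h>

/* Assumption: IB 4X DDR delivers 2 GB/s of data per direction per link
   (20 Gbit/s signalling with 8b/10b encoding). */
#define IB_4X_DDR_GBS_PER_DIR 2.0

/* Aggregate bi-directional bandwidth over `links` cluster links, in GB/s. */
double aggregate_bw_gbs(int links) {
    assert(links >= 0);
    return links * IB_4X_DDR_GBS_PER_DIR * 2.0;
}

/* Full bi-section bandwidth: half of the nodes talk to the other half,
   so half of the links carry traffic across the cut in both directions. */
double full_bisection_bw_gbs(int links) {
    return aggregate_bw_gbs(links) / 2.0;
}

/* Total memory in TB (decimal) for `nodes` nodes with `gb_per_node` GB each. */
double memory_tb(int nodes, int gb_per_node) {
    assert(nodes >= 0 && gb_per_node >= 0);
    return nodes * (double)gb_per_node / 1000.0;
}
```

With 192 links this gives 768 GB/s aggregate and 384 GB/s bi-section, and 180 TriBlades at 16 GB each give 2.88 TB for both the Opteron and the Cell memory.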

## Now build a cluster-of-clusters...



17 CUs with CU switches, 3264 IB nodes

Extra 2<sup>nd</sup>-stage switch ports allow expansion up to 24 CUs





# Roadrunner is a hybrid petascale system of modest size delivered in 2008






### **Roadrunner is a petascale system in 2008**

**Full Roadrunner Specifications:**

- Opteron
  - 6,120 dual-core Opterons
  - 44.1 TF DP peak Opteron
  - 49 TB Opteron memory
  - 408 dual-core Opterons in I/O nodes
- Cell
  - 12,240 PowerXCell 8i chips
  - 1.33 PF DP peak Cell
  - 2.59 PF SP peak Cell
  - 49 TB Cell memory
  - 313 TB/s Cell memory BW
- Interconnect and I/O
  - 3,264 nodes on 2-stage IB 4X DDR
  - 13.1 TB/s aggregate BW (bi-dir) (1<sup>st</sup> stage)
  - 6.5 TB/s aggregate BW (bi-dir) (2<sup>nd</sup> stage)
  - 3.3 TB/s bi-section BW (bi-dir) (2<sup>nd</sup> stage)
  - 408 10 GigE I/O links on 204 I/O nodes
  - 408 GB/s aggregate I/O BW (uni-dir) (IB limited)
- Layout
  - 17 CU clusters
  - 12 links per CU to each of 8 switches
  - Eight 2<sup>nd</sup>-stage IB 4X DDR switches

## **Roadrunner at a glance**

- Cluster of 17 Connected Units (CU)
  - 12,240 IBM PowerXCell 8i chips
  - 1.33 Petaflop/s DP peak (Cell)
  - 1.026 PF sustained Linpack (DP)
  - 6120 (+408) AMD dual-core Opterons
  - 44.1 (+4.4) Teraflop/s peak (Opteron)
- InfiniBand 4x DDR fabric
  - 3264 nodes, 2-stage fat-tree; all-optical cables
  - Full bi-section BW within each CU
    - 384 GB/s (bi-directional)
  - Half bi-section BW among CUs
    - 3.26 TB/s (bi-directional)
- ~100 TB aggregate memory
  - 49 TB Opteron (compute nodes)
  - 49 TB Cell
- 204 GB/s sustained File System I/O:
  - 204x2 10G Ethernets to Panasas
- Fedora Linux
  - On LS21 and QS22 blades
- SDK for Multicore Acceleration
  - Cell compilers, libraries, tools
- xCAT Cluster Management
  - System-wide GigE network
- 2.35 MW Power:
  - 0.437 GF/Watt
- Area:
  - 280 racks
  - 5200 ft<sup>2</sup>
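The system totals are the per-CU counts multiplied by 17, and the efficiency figure is sustained Linpack divided by power. The following is a small sketch of both calculations using only numbers from the slides. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-CU hardware counts, as listed in the Connected Unit specifications. */
typedef struct {
    int triblades;  /* compute nodes */
    int io_nodes;   /* I/O nodes */
    int opterons;   /* dual-core Opteron chips in compute nodes */
    int cells;      /* PowerXCell chips */
} cu_counts;

/* Multiply every per-CU count by the number of CUs. */
cu_counts scale_to_system(cu_counts cu, int num_cus) {
    assert(num_cus > 0);
    cu_counts total = {
        cu.triblades * num_cus,
        cu.io_nodes * num_cus,
        cu.opterons * num_cus,
        cu.cells * num_cus
    };
    return total;
}

/* Power efficiency in GF/Watt from sustained PF and power in MW.
   1 PF = 1e6 GF and 1 MW = 1e6 W. */
double gflops_per_watt(double sustained_pf, double power_mw) {
    assert(power_mw > 0.0);
    return (sustained_pf * 1.0e6) / (power_mw * 1.0e6);
}
```

Scaling `{180, 12, 360, 720}` by 17 gives 3,060 TriBlades plus 204 I/O nodes (3,264 IB nodes in total), 6,120 Opterons, and 12,240 Cell chips. `gflops_per_watt(1.026, 2.35)` comes out just under 0.437.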





# **Programming Concepts**





### **Roadrunner nodes have a memory hierarchy**



## Three types of processors work together










## Put it all together: MPI+DaCS+DMA+SIMD



- DMAs are simply block memory transfers
  - HW asynchronous (no SPE stalls)
  - DDR2 memory latency and BW performance

DMA Get: `mfc_get(LS_addr, Mem_addr, size, tag, 0, 0);`

DMA Put: `mfc_put(LS_addr, Mem_addr, size, tag, 0, 0);`

DMA Wait: `mfc_write_tag_mask(1<<tag); mfc_read_tag_status_all();`
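Because DMAs are asynchronous, the get, wait, and put calls are commonly arranged as a double-buffered loop, so that the transfer of the next block overlaps with computation on the current one. The sketch below is a host-side simulation in portable C, not SPE code: `memcpy` stands in for the MFC calls, so nothing here is actually asynchronous. The names `dma_get`, `dma_put`, `dma_wait`, `scale_stream`, and `CHUNK` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* doubles per block; illustrative size only */

/* Host stand-ins for the MFC calls. memcpy completes immediately,
   so dma_wait has nothing to wait for in this simulation. */
static void dma_get(double *ls, const double *mem, size_t n, int tag) {
    (void)tag;
    memcpy(ls, mem, n * sizeof *ls);
}
static void dma_put(const double *ls, double *mem, size_t n, int tag) {
    (void)tag;
    memcpy(mem, ls, n * sizeof *ls);
}
static void dma_wait(int tag) { (void)tag; }

/* Scale n doubles in "main memory" by s, streaming through two local buffers. */
void scale_stream(double *mem, size_t n, double s) {
    static double buf[2][CHUNK];  /* plays the role of SPE local store */
    assert(mem != NULL || n == 0);
    size_t nblocks = (n + CHUNK - 1) / CHUNK;
    if (nblocks == 0) return;

    dma_get(buf[0], mem, n < CHUNK ? n : CHUNK, 0);  /* prefetch block 0 */

    for (size_t b = 0; b < nblocks; b++) {
        int cur = (int)(b & 1), nxt = cur ^ 1;
        size_t off = b * CHUNK;
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;

        if (b + 1 < nblocks) {
            size_t noff = off + CHUNK;
            size_t nlen = (n - noff < CHUNK) ? n - noff : CHUNK;
            dma_wait(nxt);  /* earlier put from buf[nxt] must be finished */
            dma_get(buf[nxt], mem + noff, nlen, nxt);  /* fetch block b+1 */
        }

        dma_wait(cur);  /* block b has arrived */
        for (size_t i = 0; i < len; i++) buf[cur][i] *= s;
        dma_put(buf[cur], mem + off, len, cur);  /* write block b back */
    }
    dma_wait(0);
    dma_wait(1);  /* drain outstanding puts */
}
```

Each buffer uses its own tag, so waiting on one tag never stalls on the other buffer's transfer. On an SPE the stand-ins would be replaced by the MFC calls, and the buffers would also need the alignment that DMA requires.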



## Pick data structures & alignment to allow SIMD



- 128 bits = 2 doubles
- Work on aligned data: `c[i] = a[i] + b[i]`
- Cross-aligned operations are really bad! `c[i] = a[i] + a[i+1]`
- 4 singles or integers work similarly at twice the performance
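The aligned case can be sketched as follows, assuming a GCC or Clang compiler and using their generic vector extension rather than the SPE intrinsics. The type name `v2d` and the function `add_aligned` are hypothetical. Each 128-bit vector holds two doubles, so one vector add handles two elements of `c[i] = a[i] + b[i]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128-bit vector of 2 doubles (GCC/Clang vector extension).
   may_alias lets it overlay ordinary double arrays safely. */
typedef double v2d __attribute__((vector_size(16), may_alias));

/* c[i] = a[i] + b[i] for n doubles. n must be even and all three arrays
   16-byte aligned, so that each pair of doubles maps onto one vector. */
void add_aligned(double *c, const double *a, const double *b, size_t n) {
    assert(n % 2 == 0);
    assert((((uintptr_t)a | (uintptr_t)b | (uintptr_t)c) % 16) == 0);
    const v2d *va = (const v2d *)a;
    const v2d *vb = (const v2d *)b;
    v2d *vc = (v2d *)c;
    for (size_t i = 0; i < n / 2; i++)
        vc[i] = va[i] + vb[i];  /* one vector add per two doubles */
}
```

The cross-aligned form `a[i] + a[i+1]` does not fit this pattern: the second operand starts at an odd element and straddles two vectors, so it needs extra loads and shuffles before the add.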





## **IBM-provided ALF is a simple work-queue approach for abstracting parallelism**
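The work-queue idea can be illustrated with a generic sketch. This is not the ALF API. It is a single-threaded simulation in which a host enqueues work blocks and eight workers (one per SPE) take them in turn. All types and functions here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WB 64     /* queue capacity; illustrative */
#define N_WORKERS 8   /* one worker per SPE on a Cell chip */

/* One unit of work: an input range and where to put the result. */
typedef struct { const double *in; double *out; size_t n; } work_block;

/* Fixed-size ring buffer of work blocks. */
typedef struct { work_block wb[MAX_WB]; size_t head, tail; } work_queue;

/* Host side: append a work block. Returns 0 when the queue is full. */
int wq_push(work_queue *q, work_block w) {
    assert(q != NULL);
    if (q->tail - q->head == MAX_WB) return 0;
    q->wb[q->tail++ % MAX_WB] = w;
    return 1;
}

/* Worker side: take the next block. Returns 0 when the queue is empty. */
int wq_pop(work_queue *q, work_block *w) {
    assert(q != NULL && w != NULL);
    if (q->head == q->tail) return 0;
    *w = q->wb[q->head++ % MAX_WB];
    return 1;
}

/* Compute kernel applied to every block (here: square each element). */
static void kernel(const work_block *w) {
    for (size_t i = 0; i < w->n; i++) w->out[i] = w->in[i] * w->in[i];
}

/* Drain the queue, handing blocks to the workers in round-robin order.
   done[k] receives the number of blocks worker k processed. */
void wq_run(work_queue *q, size_t done[N_WORKERS]) {
    work_block w;
    size_t k = 0;
    for (size_t i = 0; i < N_WORKERS; i++) done[i] = 0;
    while (wq_pop(q, &w)) {
        kernel(&w);
        done[k]++;
        k = (k + 1) % N_WORKERS;
    }
}
```

The application only describes work blocks and a kernel, and the queue decides which worker runs each block. That separation is the abstraction the slide title refers to. A real runtime would run the workers concurrently and move each block's data with DMA.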






## ALF & DaCS: Broader than Cell & Roadrunner



- Designed by IBM & LANL to be HW agnostic
  - Cell PPE+SPEs and also Opterons+direct-SPEs
  - multicore/GPU/Cell, interconnect, even possibly cluster-wide
  - desire technical community participation to extend range

# Programming approach has now been demonstrated and is tractable

- Two levels of parallelism:
  - node-to-node: MPI & DaCS-MPI-DaCS relay
  - within-Cell: threads, pipelined DMAs, & SIMD
- Large-grain computationally intense portions of code are split off for Cell acceleration within a node process
  - Usually an entire tree of subroutines
  - This is equivalent to "function offload" of entire large algorithms
- Threaded fine-grained parallelism introduced within the Cell itself
  - Create many-way parallel pipelined work units for the 8 SPEs
  - Good for both multicore/manycore chips and heterogeneous chip trends with dwindling memory bandwidth
- Communications during Cell computation are possible between Cells via DaCS-MPI-DaCS "relay" approach
- Considerable flexibility and opportunities exist





## Roadrunner Status and Future Plans








## LANL has two tracks for Open Science

- Call for open science proposals on full Roadrunner during stabilization
  - Important side effects
    - Increase the cadre of expert Cell programmers
    - Increase the number of codes that can take advantage of the Roadrunner architecture
- There were 29 proposals submitted
  - Requests for 181 M Cell hours (5x available resources)
  - Requests for \$9M in LDRD support (3x available resources)
- Eight projects were selected

Additional LANL open RR resources are required to support open science.





# There are very exciting opportunities among the 8 selected proposals for full Roadrunner time.

| Project | Code |
|---|---|
| Kinetic Thermonuclear Burn Studies with VPIC on Roadrunner | VPIC |
| Multibillion-Atom Molecular Dynamics Simulations of Ejecta Production and Transport using Roadrunner | SPaSM |
| New frontiers in viral phylogenetics | ML |
| Three-Dimensional Dynamics of Magnetic Reconnection in Space and Laboratory Plasmas | VPIC |
| The Roadrunner Universe | MC<sup>3</sup> |
| Implicit Monte Carlo Calculations of Supernova Light-Curves | IMC + Rage |
| Instabilities-Driven Reacting Compressible Turbulence | CFDNS |
| Cellulosomes in Action: Peta-Scale Atomistic Bioenergy Simulations | GROMACS |
| Parallel-replica dynamics study of tip-surface and tip-tip interactions in atomic force microscopy and the formation and mechanical properties of metallic nanowires | SPaSM + PAR-REP |
| Saturation of Backward Stimulated Scattering of Laser In The Collisional Regime | VPIC |

Indicates new work






Indicates new + old



## http://www.lanl.gov/roadrunner/

- Roadrunner architecture
- Early applications efforts
- Upcoming Open Science efforts
- Cell & hybrid programming
- Computing trends
- Related Internet links



