Slurm User Group Meeting 2011

Hosted by Bull

Agenda

The 2011 SLURM User Group Meeting will be held on September 22 and 23 in Phoenix, Arizona and will be hosted by Bull. On September 22 there will be two parallel tracks of tutorials meeting in separate rooms. One set of tutorials will be for users and the other will be for system administrators. There will be a series of technical presentations on September 23. The Schedule and Abstracts are shown below.

Hotel Information

The meeting will be held at the Embassy Suites Phoenix - North, 2577 West Greenway Road, Phoenix, Arizona, USA (Phone: 1-602-375-1777, Fax: 1-602-375-4012). You may book your reservations online at Embassy Suites Phoenix - North.

Please reference Bull when making your reservations to receive a $79/room rate.

Directions and Transportation

From Phoenix Sky Harbor Airport, take I-10 west to I-17 north. Follow I-17 north for approximately 15 miles to the Greenway Road exit (exit 211). Exit and turn right; the hotel entrance is 1/8 mile ahead on the right.

View all directions, map, and airport information

Contact

If you need further information about the event or the registration process, contact the Slurm User Group 2011 organizers.

Registration

Please register online no later than August 22.

Schedule

September 22: User Tutorials.

Time Theme Speaker Title
08:30 - 09:00  Registration
09:00 - 10:30  User Tutorial #1  Don Albert and Rod Schultz (Bull)  SLURM: Beginners Usage
10:30 - 11:00  Coffee break
11:00 - 12:30  User Tutorial #2  Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull)  SLURM: Advanced Usage
12:30 - 14:00  Lunch at conference center
14:00 - 15:30  User Tutorial #3  Martin Perry and Yiannis Georgiou (Bull)  Resource Management for multicore/multi-threaded usage
15:30 - 16:00  Coffee break
16:00 - 17:00  Question and Answer  Danny Auble and Morris Jette (SchedMD)  Get your questions answered by the developers

September 22: System Administrator Tutorials.

Time Theme Speaker Title
08:30 - 09:00  Registration
09:00 - 10:30  Admin Tutorial #1  David Egolf and Bill Brophy (Bull)  SLURM High Availability
10:30 - 11:00  Coffee break
11:00 - 12:30  Admin Tutorial #2  Dan Rusak (Bull)  Power Management / sview
12:30 - 14:00  Lunch at conference center
14:00 - 15:30  Admin Tutorial #3  Don Albert and Rod Schultz (Bull)  Accounting, Limits and Priorities Configuration
15:30 - 16:00  Coffee break
16:00 - 17:30  Admin Tutorial #4  Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull)  Scalability, Scheduling and Task placement

September 23: Technical Session

Time Theme Speaker Title
08:30 - 09:00  Registration
09:00 - 10:40  Welcome
 Keynote  William Kramer (NCSA)  Challenges and Opportunities for Exascale Resource Management and how Today's Petascale Systems are Guiding the Way
 Session #1  Matthieu Hautreux (CEA)  SLURM at CEA
 Session #2  Don Lipari (LLNL)  LLNL site report
10:40 - 11:00  Coffee break
11:00 - 12:30  Session #3  Alejandro Lucero Palau (BSC)  SLURM Simulator
 Session #4  Danny Auble (SchedMD)  SLURM operation on IBM BlueGene/Q
 Session #5  Morris Jette (SchedMD)  SLURM operation on Cray XT and XE
12:30 - 14:00  Lunch at conference center
14:00 - 15:30  Session #6  Mariusz Mamoński (Poznań University)  Introduction to SLURM DRMAA
 Session #7  Robert Stober, Sr. (Bright Computing)  Bright Cluster Manager & SLURM: Benefits of Seamless Integration
 Session #8  Morris Jette (SchedMD)  Proposed Design for Job Step Management in User Space
15:30 - 16:00  Coffee break
16:00 - 17:30  Session #9  Don Lipari (LLNL)  Proposed Design for Enhanced Enterprise-wide Scheduling
 Session #10  Danny Auble and Morris Jette (SchedMD)  SLURM Version 2.3 and plans for future releases
 Open discussion, feature requests, etc.


Abstracts

User Tutorial #1

SLURM Beginners Usage
Don Albert and Rod Schultz (Bull)

User Tutorial #2

SLURM Advanced Usage
Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull)

User Tutorial #3

Resource Management for multicore/multi-threaded usage
Martin Perry and Yiannis Georgiou (Bull)

Administrator Tutorial #1

SLURM High Availability
David Egolf and Bill Brophy (Bull)

Administrator Tutorial #2

Power Management / sview
Dan Rusak (Bull)

Administrator Tutorial #3

Accounting, Limits and Priorities Configuration
Don Albert and Rod Schultz (Bull)

Administrator Tutorial #4

Scalability, Scheduling and Task placement
Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull)

Keynote Speaker

Challenges and Opportunities for Exascale Resource Management and how Today's Petascale Systems are Guiding the Way
William Kramer (NCSA)

Resource management challenges currently experienced on the Blue Waters computer will be described. These experiences will be extended to describe the additional challenges faced in exascale and trans-petascale systems.

Session #1

CEA Site report
Matthieu Hautreux (CEA)

This talk will cover the evolution of and feedback from SLURM on Tera100, as well as SLURM on Curie, the second PRACE Tier-0 system, which is planned to be installed by the end of the year in a new facility hosted at CEA. Curie will be a 1.6 petaflop system from Bull.

Session #2

LLNL site report
Don Lipari (LLNL)

Don Lipari will provide an overview of the batch scheduling systems in use at LLNL and of how they are managed.

Session #3

SLURM Simulator
Alejandro Lucero Palau (BSC)

Batch scheduling for high performance cluster installations has two main goals: 1) to keep the whole machine working at full capacity at all times, and 2) to respect priorities, preventing lower-priority jobs from jeopardizing higher-priority ones. Batch schedulers usually support different policies, each with several tunable variables. Other features such as special job requests, reservations, or job preemption increase the complexity of achieving a well-tuned algorithm. A local decision for a specific job can change the full schedule for a large number of jobs, and what seems logical in the short term may make no sense over a long trace measured in weeks or months. Although it is possible to extract the algorithms from batch scheduling software in order to simulate large job traces, this is not the ideal approach, since scheduling is not an isolated part of this type of tool, and replicating the same environment requires substantial effort plus a high maintenance cost. We present a method for obtaining a special mode of operation of real, production-ready scheduling software, SLURM, in which we can simulate the execution of real job traces to evaluate the impact of scheduling policies and policy tuning.

Session #4

SLURM Operation on IBM BlueGene/Q
Danny Auble (SchedMD)

SLURM version 2.3 supports IBM BlueGene/Q. This presentation will describe the design and operation of SLURM with respect to BlueGene/Q systems.

Session #5

SLURM Operation on Cray XT and XE systems
Morris Jette (SchedMD)

SLURM version 2.3 supports Cray XT and XE systems running over Cray's ALPS (Application Level Placement Scheduler) resource manager. This presentation will discuss the design and operation of SLURM with respect to Cray systems.

Session #6

Introduction to SLURM DRMAA
Mariusz Mamoński (Poznań University)

DRMAA, the Distributed Resource Management Application API, is a high-level Open Grid Forum API specification for the submission and control of jobs in a grid architecture.
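To give a flavor of the API ahead of the talk, the following is a minimal sketch of submitting a job through the DRMAA Python bindings (the drmaa package); the command, its arguments, and the use of slurm-drmaa as the underlying library are illustrative assumptions, not material from the presentation.

    # Minimal sketch: submit a job and wait for it via the DRMAA Python bindings.
    # Assumes the drmaa package is installed and DRMAA_LIBRARY_PATH points to a
    # DRMAA implementation such as slurm-drmaa (illustrative assumption).
    import drmaa

    session = drmaa.Session()
    session.initialize()

    job = session.createJobTemplate()
    job.remoteCommand = '/bin/sleep'   # hypothetical example command
    job.args = ['60']

    job_id = session.runJob(job)       # submit; returns the job identifier
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print('Job %s finished with exit status %s' % (job_id, info.exitStatus))

    session.deleteJobTemplate(job)
    session.exit()

Because DRMAA is resource-manager agnostic, the same script can run unchanged against other DRMAA-capable batch systems.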

Session #7

Bright Cluster Manager & SLURM: Benefits of Seamless Integration
Robert Stober, Sr. (Bright Computing)

Bright Cluster Manager, tightly integrated with SLURM, simplifies HPC cluster installation and management while boosting system throughput. Bright automatically installs, configures and deploys SLURM so that clusters are ready to use in minutes rather than days. Bright provides extensive and extensible monitoring and management through its intuitive Bright Cluster Manager GUI, powerful cluster management shell, and customizable web-based user portal. Additional integration benefits include sampling, analysis and visualization of all key SLURM metrics from within the Bright GUI, automatic head node failover, and extensive pre-job health checking capability. Regarding the latter, say good-bye to the black hole node syndrome: Bright plus SLURM effectively prevent this productivity-killing problem by identifying and sidelining problematic nodes before the job is run.

Session #8

Proposed Design for Job Step Management in User Space
Morris Jette (SchedMD)

SLURM currently creates and manages job steps using SLURM's control daemon, slurmctld. Since some user jobs create thousands of job steps, the management of those job steps accounts for most of slurmctld's work. It is possible to move job step management from slurmctld into user space to improve SLURM scalability and performance. A possible implementation of this will be presented.

Session #9

Proposed Design for Enhanced Enterprise-wide Scheduling
Don Lipari (LLNL)

SLURM currently supports the ability to submit jobs to, and check the status of jobs on, different computers at a site; however, the current design has some limitations. When a job is submitted with several possible computers usable for its execution, the job is routed to the computer on which it is expected to start earliest. Changes in the workload or system failures could mean that moving the job to another computer would result in faster initiation, but such a move is currently impossible. SLURM is also unable to support dependencies between jobs executing on different computers. The design of a SLURM meta-scheduler with enhanced enterprise-wide scheduling capabilities will be presented.

Session #10

Contents of SLURM Version 2.3 and plans for future releases
Danny Auble and Morris Jette (SchedMD)

An overview of the changes in SLURM Version 2.3 will be presented along with current plans for future releases.

Open Discussion

All meeting attendees will be invited to provide input with respect to SLURM's design and development work. We also invite proposals for hosting the SLURM User Group Meeting in 2012.