
Slurm User Group Meeting 2011
Hosted by Bull
Agenda
The 2011 SLURM User Group Meeting will be held on September 22 and 23 in Phoenix, Arizona, and will be hosted by Bull. On September 22 there will be two parallel tracks of tutorials meeting in separate rooms: one set of tutorials for users and the other for system administrators. A series of technical presentations will be given on September 23. The schedule and abstracts are shown below.
Hotel Information
The meeting will be held at Embassy Suites Phoenix - North, 2577 West Greenway Road, Phoenix, Arizona, USA (Phone: 1-602-375-1777, Fax: 1-602-375-4012). You may book your reservations online at Embassy Suites Phoenix - North.
Please reference Bull when making your reservations to receive a $79/room rate.
Directions and Transportation
From Phoenix Sky Harbor Airport, take I-10 west to I-17 north. Follow I-17 for approximately 15 miles to Greenway Road (exit 211). Exit and turn right; the hotel entrance is 1/8 mile on the right.
View all directions, map, and airport information
Contact
If you need further information about the event or the
registration process, contact the
Slurm User Group 2011 organizers.
Registration
Please register online no later than August 22.
Schedule
September 22: User Tutorials.
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:30 | User Tutorial #1 | Don Albert and Rod Schultz (Bull) | SLURM: Beginners Usage |
10:30 - 11:00 | Coffee break | ||
11:00 - 12:30 | User Tutorial #2 | Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull) | SLURM: Advanced Usage |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | User Tutorial #3 | Martin Perry and Yiannis Georgiou (Bull) | Resource Management for multicore/multi-threaded usage |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:00 | Question and Answer | Danny Auble and Morris Jette (SchedMD) | Get your questions answered by the developers |
September 22: System Administrator Tutorials.
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:30 | Admin Tutorial #1 | David Egolf and Bill Brophy (Bull) | SLURM High Availability |
10:30 - 11:00 | Coffee break | ||
11:00 - 12:30 | Admin Tutorial #2 | Dan Rusak (Bull) | Power Management / sview |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | Admin Tutorial #3 | Don Albert and Rod Schultz (Bull) | Accounting, Limits and Priorities Configuration |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:30 | Admin Tutorial #4 | Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull) | Scalability, Scheduling and Task placement |
September 23: Technical Session
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:40 | Welcome | ||
 | Keynote | William Kramer (NCSA) | Challenges and Opportunities for Exascale Resource Management and how Today's Petascale Systems are Guiding the Way |
 | Session #1 | Matthieu Hautreux (CEA) | SLURM at CEA |
 | Session #2 | Don Lipari (LLNL) | LLNL site report |
10:40 - 11:00 | Coffee break | ||
11:00 - 12:30 | Session #3 | Alejandro Lucero Palau (BSC) | SLURM Simulator |
 | Session #4 | Danny Auble (SchedMD) | SLURM operation on IBM BlueGene/Q |
 | Session #5 | Morris Jette (SchedMD) | SLURM operation on Cray XT and XE |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | Session #6 | Mariusz Mamoński (Poznań University) | Introduction to SLURM DRMAA |
 | Session #7 | Robert Stober, Sr. (Bright Computing) | Bright Cluster Manager & SLURM: Benefits of Seamless Integration |
 | Session #8 | Morris Jette (SchedMD) | Proposed Design for Job Step Management in User Space |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:30 | Session #9 | Don Lipari (LLNL) | Proposed Design for Enhanced Enterprise-wide Scheduling |
 | Session #10 | Danny Auble and Morris Jette (SchedMD) | SLURM Version 2.3 and plans for future releases |
 | Open discussion, feature requests, etc. | | |
Abstracts
User Tutorial #1
SLURM Beginners Usage
Don Albert and Rod Schultz (Bull)
- Simple use of commands (submission/monitoring/result collection)
- Reservations
- Use of accounting and reporting
- Scheduling techniques for shorter response times (setting of walltime for backfill, etc.)
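
As a flavor of the basic usage covered above, a minimal sketch using standard SLURM commands (the script name, job ID and option values are hypothetical examples, not tutorial material):

```
# submit a batch script with a walltime limit (modest limits help the backfill scheduler)
$ sbatch --time=00:30:00 --ntasks=16 myjob.sh
Submitted batch job 1234

# monitor your queued and running jobs
$ squeue -u $USER

# collect accounting data for a completed job
$ sacct -j 1234 --format=JobID,JobName,Elapsed,State
```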
User Tutorial #2
SLURM Advanced Usage
Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull)
- MPI jobs
- Checkpoint/Restart (BLCR or application level)
- Preemption / Gang Scheduling Usage
- Dynamic allocations (growing/shrinking)
- Grace Time Delay with Preemption
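
For reference, a minimal sketch of an MPI batch job of the kind discussed above (the application name and sizes are hypothetical):

```
#!/bin/bash
#SBATCH --job-name=mpi_example   # hypothetical job name
#SBATCH --ntasks=32              # number of MPI tasks
#SBATCH --time=01:00:00          # walltime limit

# launch the MPI application with SLURM's task launcher
srun ./my_mpi_app
```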
User Tutorial #3
Resource Management for multicore/multi-threaded usage
Martin Perry and Yiannis Georgiou (Bull)
- CPU allocation
- CPU/tasks distribution
- Task binding
- Internals of the allocation procedures
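
As an illustration of the allocation and binding controls examined above, a sketch using typical srun options (task counts and the application name are arbitrary examples):

```
# allocate two CPUs per task and bind each task to cores
$ srun --ntasks=8 --cpus-per-task=2 --cpu_bind=cores ./my_app

# control how tasks are distributed across the allocated nodes
$ srun --ntasks=16 --distribution=cyclic ./my_app
```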
Administrator Tutorial #1
SLURM High Availability
David Egolf and Bill Brophy (Bull)
- How to configure SLURM for high availability
- Event logging with strigger
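
A minimal sketch of the configuration involved, assuming hypothetical host names, paths and scripts:

```
# slurm.conf: designate a backup controller and a state directory shared by both
ControlMachine=ctl1
BackupController=ctl2
StateSaveLocation=/shared/slurm/state   # must be visible to both controllers

# register a trigger that runs a (site-provided) script when a node goes down
$ strigger --set --down --program=/usr/local/sbin/node_down.sh
```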
Administrator Tutorial #2
Power Management / sview
Dan Rusak (Bull)
- Power Management configuration
- sview presentation
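
A sketch of the slurm.conf power-saving parameters in the spirit of this tutorial (paths and timings are placeholder values):

```
# slurm.conf: power down idle nodes and wake them on demand
SuspendTime=600                                   # seconds idle before a node is suspended
SuspendProgram=/usr/local/sbin/node_suspend.sh    # site-provided script (hypothetical path)
ResumeProgram=/usr/local/sbin/node_resume.sh      # site-provided script (hypothetical path)
SuspendRate=10                                    # nodes suspended per minute
ResumeRate=10                                     # nodes resumed per minute
```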
Administrator Tutorial #3
Accounting, Limits and Priorities Configuration
Don Albert and Rod Schultz (Bull)
- Accounting with slurmdbd configuration
- Multifactor job priorities, with examples covering the different priority factors
- QOS configuration
- Fair-share configuration
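
For orientation, a sketch of the configuration pieces involved (account, user and QOS names and the priority weights are arbitrary examples):

```
# slurm.conf: enable database accounting and multifactor job priorities
AccountingStorageType=accounting_storage/slurmdbd
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000
PriorityWeightQOS=2000

# sacctmgr: create an account, a user and a QOS with limits (example names)
$ sacctmgr add account research Description="Research projects"
$ sacctmgr add user alice Account=research
$ sacctmgr add qos high Priority=100 MaxWall=24:00:00
```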
Administrator Tutorial #4
Scalability, Scheduling and Task placement
Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull)
- High Throughput Computing
- Topology constraints config
- Generic Resources and GPUs config
- Task Placement with Cgroups
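
As a sketch of the configuration files discussed above (node, switch and device names are placeholders):

```
# topology.conf: describe the network hierarchy for topology-aware scheduling
SwitchName=s1 Nodes=node[01-16]
SwitchName=s2 Nodes=node[17-32]
SwitchName=top Switches=s[1-2]

# gres.conf: declare GPUs as generic resources on each node
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

# slurm.conf: enable the topology plugin, GRES and cgroup-based task confinement
TopologyPlugin=topology/tree
GresTypes=gpu
TaskPlugin=task/cgroup
```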
Keynote Speaker
Challenges and Opportunities for Exascale Resource Management and how Today's Petascale Systems are Guiding the Way
William Kramer (NCSA)
Resource management challenges currently experienced on the Blue Waters computer will be described. These experiences will be extended to describe the additional challenges faced in exascale and trans-petascale systems.
Session #1
CEA Site report
Matthieu Hautreux (CEA)
Evolution of and feedback from the Tera100 system. SLURM on Curie, the second PRACE Tier-0 system, which is planned to be installed by the end of the year in a new facility hosted at CEA. Curie will be a 1.6 petaflop system from Bull.
Session #2
LLNL site report
Don Lipari (LLNL)
Don Lipari will provide an overview of the batch scheduling systems in use at LLNL and of how they are managed.
Session #3
SLURM Simulator
Alejandro Lucero Palau (BSC)
Batch scheduling for high performance cluster installations has two main goals: 1) to keep the whole machine working at full capacity at all times, and 2) to respect priorities, ensuring that lower priority jobs do not jeopardize higher priority ones. Usually, batch schedulers allow different policies with several variables to be tuned per policy. Other features like special job requests, reservations or job preemption increase the complexity of achieving a fine-tuned algorithm. A local decision for a specific job can change the full schedule for a large number of jobs, and what can be thought of as logical within the short term may make no sense for a long trace measured in weeks or months. Although it is possible to extract algorithms from batch scheduling software in order to simulate large job traces, this is not the ideal approach, since scheduling is not an isolated part of this type of tool, and replicating the same environment requires significant effort plus a high maintenance cost. We present a method for obtaining a special mode of operation of a real production-ready scheduler, SLURM, in which we can simulate the execution of real job traces to evaluate the impact of scheduling policies and policy tuning.
Session #4
SLURM Operation on IBM BlueGene/Q
Danny Auble (SchedMD)
SLURM version 2.3 supports IBM BlueGene/Q. This presentation will describe the design and operation of SLURM with respect to BlueGene/Q systems.
Session #5
SLURM Operation on Cray XT and XE systems
Morris Jette (SchedMD)
SLURM version 2.3 supports Cray XT and XE systems running over Cray's ALPS (Application Level Placement Scheduler) resource manager. This presentation will discuss the design and operation of SLURM with respect to Cray systems.
Session #6
Introduction to SLURM DRMAA
Mariusz Mamoński (Poznań University)
DRMAA, or Distributed Resource Management Application API, is a high-level Open Grid Forum API specification for the submission and control of jobs in a Grid architecture.
Session #7
Bright Cluster Manager & SLURM: Benefits of Seamless Integration
Robert Stober, Sr. (Bright Computing)
Bright Cluster Manager, tightly integrated with SLURM, simplifies HPC cluster installation and management while boosting system throughput. Bright automatically installs, configures and deploys SLURM so that clusters are ready to use in minutes rather than days. Bright provides extensive and extensible monitoring and management through its intuitive Bright Cluster Manager GUI, powerful cluster management shell, and customizable web-based user portal. Additional integration benefits include sampling, analysis and visualization of all key SLURM metrics from within the Bright GUI, automatic head node failover, and extensive pre-job health checking capability. Regarding the latter, say good-bye to the black hole node syndrome: Bright plus SLURM effectively prevent this productivity-killing problem by identifying and sidelining problematic nodes before the job is run.
Session #8
Proposed Design for Job Step Management in User Space
Morris Jette (SchedMD)
SLURM currently creates and manages job steps using SLURM's control daemon, slurmctld. Since some user jobs create thousands of job steps, the management of those job steps accounts for most of slurmctld's work. It is possible to move job step management from slurmctld into user space to improve SLURM scalability and performance. A possible implementation of this will be presented.
Session #9
Proposed Design for Enhanced Enterprise-wide Scheduling
Don Lipari (LLNL)
SLURM currently supports the ability to submit jobs and check their status across computers at a site; however, the current design has some limitations. When a job is submitted with several possible computers usable for its execution, the job is routed to the computer on which it is expected to start earliest. Changes in the workload or system failures could mean that moving the job to another computer would result in faster initiation, but such a move is currently impossible. SLURM is also unable to support dependencies between jobs executing on different computers. The design of a SLURM meta-scheduler with enhanced enterprise-wide scheduling capabilities will be presented.
Session #10
Contents of SLURM Version 2.3 and plans for future releases
Danny Auble and Morris Jette (SchedMD)
An overview of the changes in SLURM Version 2.3 will be presented, along with current plans for future releases.