Bringing Energy Efficiency to High Performance Computing

September 3, 2013

The ability of high performance computers (HPCs) to solve complex problems very quickly has risen exponentially since their introduction in the 1960s; unfortunately, so has their electricity use. Many supercomputers require more than a megawatt of electricity to operate, and annual electricity costs can easily run into millions of dollars. As the use of HPCs became more widespread, researchers at Lawrence Berkeley National Laboratory (Berkeley Lab) saw the need to improve the energy efficiency of supercomputers and the infrastructures that support them.

Berkeley Lab researchers organized the Energy Efficient High Performance Computing Working Group (EE HPC WG) in 2008 to promote energy-efficient computing best practices and to drive improvements in energy performance of HPC systems. At the time, the concept of energy-efficient computing was often a distant afterthought in the race to improve supercomputer computational performance as quickly as possible.

"We were convinced that bringing U.S. Department of Energy (DOE) national laboratories together to demand more efficient supercomputers would bring the issue to the forefront for supercomputer developers and vendors," says Bill Tschudi, leader of the High Tech and Industrial Systems Group at Berkeley Lab. "As a significant segment of the HPC market, the national labs were interested not only in spurring more efficient designs and equipment, but also in reducing their own energy bills—costs that were siphoning money from their mission."

The DOE's Federal Energy Management Program (FEMP) provided funding for Berkeley Lab to start the group, which aimed to serve both as a united front to promote energy-efficient computing and as a forum for sharing best practices. The strategy worked. Awareness of the need for energy-efficient HPC grew, which sparked competition among vendors to improve HPC energy efficiency even before end users asked for it. Today, realizing the benefits of energy-efficient HPCs, end users are putting requirements in proposal requests, and vendors are not only responding to those requests, but are also participating in many of the EE HPC working group's activities.

Berkeley Lab continues to lead the working group and contribute ideas, and the group is now supported by the DOE Sustainability Performance Office. Its more than 380 members, from 20 countries, participate voluntarily and self-select topics of interest to the group. Members include representatives from other federal agencies, universities, private industry, and vendors of HPC and data center equipment, including prominent companies such as Intel, Emerson, IBM, and Cray.

Grassroots Collaboration

The EE HPC WG consists of three subgroups: one focused on infrastructure, another on systems, and a third on outreach and conferences.

Dale Sartor of Berkeley Lab and Natalie Bates, a Berkeley Lab subcontractor, co-lead the working group and oversee its general activities.
Bill Tschudi of Berkeley Lab and David Martinez of Sandia National Laboratories co-lead the Infrastructure Subgroup.
Berkeley Lab's John Shalf and Erich Strohmaier co-lead the Computing Systems Subgroup.
Lawrence Livermore National Laboratory's Anna Maria Bailey and Marriann Silveira co-lead the Conferences Subgroup.

Within each of those subgroups, small teams are formed to address specific issues. Examples are the HPL Power/Energy Measurement Methodology team, Liquid Cooling Commissioning team, and HPC Demand Response team, which all meet (virtually) multiple times each month.

"Working group members participate in whichever subgroup project interests them and can benefit from their expertise," explains Tschudi. "Once the subgroup members have made progress on their issue, they present it to the larger group for feedback and further development." Once a project is complete, the team disbands and its members typically select other topics of interest.

Members drive the working group's agenda, which ensures that the projects meet the membership's most pressing needs. A March 2013 member survey showed that members identified 12 out of the 14 activities currently being conducted by the group as high-value activities. In that same survey, more than half of the members identified "improving software to tune for energy efficiency" as an activity to pursue in the future.

The working group as a whole meets (virtually) bi-monthly, but occasionally meets in person at supercomputer conferences such as the SC Conference, International Supercomputing Conference (ISC), and others. Members of the group also present papers at SC and ISC and arrange annual "Birds of a Feather" sessions (informal meetings) to discuss recent developments in the field.

Moving Toward Common Approaches

In its nearly five years of existence, some of the working group's most important achievements have been in developing common metrics, measurement protocols, and guidelines for the supercomputer industry: for liquid cooling of supercomputers, for determining power usage effectiveness, and for measuring power during computational output.

Development of Guidelines for Liquid Cooling of Supercomputers

When vendors began producing liquid cooling systems, no standard thermal guidelines existed. By evaluating systems from the processor to the atmosphere, the EE HPC WG identified temperatures that could be supported, and developed a set of recommended temperatures that vendors could use to design equipment. The working group's recommendations first appeared in an ASHRAE white paper, and are now in ASHRAE's guidelines of recommended temperatures. Supercomputer vendors participated in this process throughout.

Development of New Metrics for Determining Power Usage Effectiveness

The Power Usage Effectiveness (PUE) metric has been used for years to determine how much of the power in a data center is consumed by the IT equipment (as opposed to other facility loads such as cooling and power distribution). However, this metric is not effective in determining the efficiency of computer equipment when the system's cooling fans or power conversions are located outside of the computer itself. The working group developed two new metrics to help evaluate these situations: (1) ITUE (IT-power usage effectiveness), which is similar to PUE but focuses on energy use inside the computer equipment, and (2) TUE (total-power usage effectiveness), which combines PUE and ITUE to provide the ratio of total energy (including that of both internal and external support equipment) to the specific energy used by the computing components of the HPC. The TUE can be used to compare one HPC system to another. The metrics were demonstrated on Oak Ridge National Laboratory's (ORNL's) Jaguar supercomputer system, and the working group plans to seek acceptance of the TUE metric through industry groups such as the Green Grid industry association.
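The relationship among the three metrics can be illustrated with a short sketch. It assumes the commonly cited definitions: PUE is total facility energy divided by IT equipment energy, ITUE is IT equipment energy divided by the energy reaching the computing components, and TUE is the product of the two. The class name, function names, and energy figures below are hypothetical and chosen only for illustration; they are not measurements from Jaguar or any other system.

```python
from dataclasses import dataclass


@dataclass
class EnergyReadings:
    """Energy totals over the same period, in kWh (hypothetical values)."""
    facility_total: float  # all energy entering the data center
    it_total: float        # energy delivered to the IT equipment
    compute_only: float    # energy reaching compute components (after
                           # internal fans, power supplies, and so on)


def pue(r: EnergyReadings) -> float:
    """Facility-level overhead: total facility energy per unit of IT energy."""
    return r.facility_total / r.it_total


def itue(r: EnergyReadings) -> float:
    """Overhead inside the IT equipment: IT energy per unit of compute energy."""
    return r.it_total / r.compute_only


def tue(r: EnergyReadings) -> float:
    """Combined overhead: ITUE multiplied by PUE."""
    return itue(r) * pue(r)


# Illustrative numbers only.
readings = EnergyReadings(facility_total=15000.0, it_total=10000.0,
                          compute_only=8000.0)
print(f"PUE={pue(readings):.3f} ITUE={itue(readings):.3f} TUE={tue(readings):.3f}")
```

Under these definitions the IT energy term cancels, so TUE equals total facility energy divided by compute energy. That is why moving a fan or power supply from inside the computer to the facility changes PUE and ITUE but leaves TUE unchanged.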

Development of a Standard Method to Measure Computational Output

Every year, a list known as the Green500 ranks the 500 most efficient supercomputers. The list helps supercomputer users and vendors identify the most efficient systems; however, because the methods used to determine efficiency have not been applied uniformly, the current comparisons are not as accurate as they could be. The EE HPC working group is developing a standard method that all users can follow to measure power uniformly.
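To see why a uniform measurement method matters, consider a minimal sketch of an efficiency calculation. It assumes efficiency is expressed as performance per watt (GFLOPS/W) and that power is logged as timestamped samples during a benchmark run. The function names and sample values are hypothetical, and the averaging approach shown here is an illustration, not the working group's methodology.

```python
def average_power_watts(samples):
    """Time-weighted average power from (time_seconds, power_watts) samples.

    Samples must be sorted by time. Energy is integrated with the
    trapezoidal rule and divided by the length of the measurement window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    energy_joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy_joules += 0.5 * (p0 + p1) * (t1 - t0)
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("samples must span a positive time interval")
    return energy_joules / duration


def gflops_per_watt(sustained_gflops, samples):
    """Performance per watt over the measured window."""
    return sustained_gflops / average_power_watts(samples)


# Hypothetical run: power ramps up mid-run and back down.
full_run = [(0, 1000.0), (10, 2000.0), (20, 1000.0)]
low_power_window = full_run[:1] + [(1, 1000.0)]  # a 1-second window at the start

print(gflops_per_watt(3000.0, full_run))          # whole run
print(gflops_per_watt(3000.0, low_power_window))  # flattering short window
```

The same machine appears far more efficient when power is sampled only during a low-draw portion of the run. A standard method that fixes the measurement window and what equipment is metered removes that ambiguity.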

Commissioning Liquid-Cooled Supercomputer Systems

Commissioning of liquid-cooled supercomputers is a relatively new requirement. The working group decided to share best practices and develop a liquid cooling commissioning guideline to inform those who have not dealt with liquid-cooled systems. The subgroup working on this task includes vendors that provide liquid-cooled supercomputers and infrastructure cooling equipment.

Sharing Knowledge and Expertise

Because the working group's member base is so dispersed and varied, its website is a key tool for keeping members informed. Webinars inform members of the subgroups' progress, and the presentations from these webinars are archived on the site, along with the working group's published papers. To expand the reach of the working group's expertise, a member from Lawrence Livermore National Laboratory tracks conferences and other meetings where members could speak about new developments and receive feedback on various topics.

An Award-Winning Accomplishment

In 2013, members of the EE HPC WG won the Gauss Award, sponsored by the German Gauss Center for Supercomputing, for their paper, "TUE, a New Energy-Efficiency Metric Applied at ORNL's Jaguar." The award is presented for the most outstanding paper in the field of scalable supercomputing at the ISC conference held annually in Germany. Intel's Mike Patterson, the primary author, presented the paper at the conference. Bill Tschudi and Henry Coles of Berkeley Lab's Environmental Energy Technologies Division (EETD) were contributing authors. The paper describes the TUE energy metric and its trial use at ORNL's scientific computing center.

SC13 Workshop

The working group also shares its expertise through workshops. It will present the "Building" Energy Efficient High Performance Computing: Fourth Annual EE HPC WG Workshop at SC13 in Denver, Colorado, in November. This popular annual workshop will feature high-profile researchers discussing new developments in energy-efficient HPC from both the facilities and system perspectives, from architecture through design and implementation.

Helping to Meet EISA Goals

By sharing information and developing common approaches, the working group is helping to reduce HPC energy use. This supports federal energy goals: the Energy Independence and Security Act of 2007 (EISA) requires the U.S. federal government to reduce energy intensity in all its facilities, including laboratories and industrial buildings, by 30 percent by 2015. The work done by Berkeley Lab and EE HPC WG volunteers is helping federal facilities measure and quantify energy savings, as well as helping vendors design energy-efficient supercomputer equipment. The growth in computing energy use makes this goal a challenge; however, the EE HPC WG is helping to improve energy performance dramatically over the business-as-usual trajectory.

"Interest in the EE HPC working group continues to grow," says Tschudi. "The original vision of what the group could accomplish continues to be fulfilled through collaboration with the best minds engaged in supercomputing. DOE's leadership in encouraging and supporting this activity is providing energy savings and other benefits throughout DOE labs, as well as the industry at large."

Author

Mark Wilson