Yokohama Joho Bunka Center (Yokohama Media & Communications Center),
Yokohama, Japan


The 15th anniversary of IEEE annual symposium COOL15

CALL FOR PARTICIPATION


Tofu Interconnect Controller for Fujitsu's Highly Scalable Supercomputer

Yuichiro Ajima (Fujitsu Ltd., Japan)

Abstract: The K computer, currently the world's fastest supercomputer, combines 88,128 processor chips using an interconnection network called the Tofu interconnect. Fujitsu's new supercomputer system FX10 is also powered by the Tofu interconnect. We developed an interconnect controller (ICC) chip that integrates all active components of the Tofu interconnect. In this talk, we will present a technical overview of the ICC chip. The ICC chip provides a Tofu network router, four Tofu network interfaces, and a Tofu barrier interface. The Tofu network router provides four internal and ten external ports. Each internal port connects to one of the Tofu network interfaces, and the external ports are used to construct a six-dimensional mesh/torus network. The Tofu network interface supports Remote Direct Memory Access (RDMA) communication, and the Tofu barrier interface provides offload capability for synchronization and reduction communication.

Yuichiro Ajima is a system architect in the Next-Generation Technical Computing Unit at Fujitsu. His research focuses on high-performance computing system architecture. Ajima has a PhD in information engineering from the University of Tokyo. He is a member of the Information Processing Society of Japan and IEEE.

Nonvolatile Logic-in-Memory Architecture Using an MTJ/MOS-Hybrid Structure and Its Applications

Takahiro Hanyu (Tohoku University, Japan)

Abstract: The communication bottleneck between memory and logic modules has become an increasingly serious problem, causing large power dissipation in recent nanometer-scaled VLSI chips. One method to solve such emerging VLSI-chip problems is to use a "nonvolatile" logic-in-memory architecture. In this architecture, nonvolatile storage elements are distributed over a logic-circuit plane, so it is expected to achieve both ultra-low power and reduced interconnection delay through a great reduction in the number of global interconnections and volatile storage elements. In this presentation, I demonstrate concrete standby-power-free logic circuits based on a nonvolatile logic-in-memory structure using magnetic tunnel junction (MTJ) devices in combination with MOS transistors. Since the MTJ device with a spin-injection write capability is the only device that has all of the following superior features, namely a large resistance ratio, virtually unlimited endurance, fast read/write accessibility, scalability, CMOS-process compatibility, and nonvolatility, it is well suited to implementing the MOS/MTJ-hybrid logic circuit with a logic-in-memory architecture. As typical examples of the proposed nonvolatile logic-in-memory circuitry, an MTJ-based nonvolatile Look-Up Table (LUT) circuit for an instant power-ON/OFF Field Programmable Gate Array and an MTJ-based nonvolatile Ternary Content-Addressable Memory are also demonstrated together with the fabricated test-chip results.

Takahiro Hanyu received the B.E., M.E. and D.E. degrees in Electronic Engineering from Tohoku University, Sendai, Japan, in 1984, 1986 and 1989, respectively. He is currently a Professor in the Research Institute of Electrical Communication, Tohoku University. His general research interests include nonvolatile logic circuits and their applications to ultra-low-power VLSI processors. He received the Sakai Memorial Award from the Information Processing Society of Japan in 2000, the Judge's Special Award at the 9th LSI Design of the Year from the Semiconductor Industry News of Japan in 2002, the APEX Paper Award of the Japanese Society of Applied Physics in 2009, the Excellent Paper Award of IEICE, Japan in 2010, the Ichikawa Academic Award in 2010, and the Best Paper Award at the IEEE Computer Society International Symposium on VLSI 2010. Dr. Hanyu is a senior member of the IEEE.

The Expanding Universe of Embedded Imaging

Masaki Hiraga (Morpho, Inc., Japan)

Abstract: Embedded devices, especially mobile phones, have been evolving at a tremendous speed for the past 10 years. Multi-core CPUs and GPGPUs are becoming increasingly popular, and the resolution of display devices as well as digital cameras keeps increasing. Image processing is mainly performed in parallel, so it benefits directly from the advancement of hardware. As a result, highly complex imaging technology that was once used only on supercomputers and workstations now runs on embedded devices. By combining imaging technology with the portability of mobile devices and network communications, image processing applications based on new concepts are now emerging in the market. In this session, the evolution of mobile phone hardware and software over the past 10 years will be reviewed from an image processing perspective, and present and future imaging technology will be discussed.

Masaki Hiraga is President of Morpho, Inc. He received his DSc degree from the University of Tokyo, Graduate School of Science, Department of Information Science. He founded Morpho, Inc. in 2004. Morpho, Inc. is a leading company in software imaging solutions for mobile devices. Customers utilizing Morpho's software technologies include carriers, processing platform providers and mobile device manufacturers, making the company a global leader in mobile imaging.

Application Scalability - Key to Low Power, Performance Growth, and Exascale

Wen-mei Hwu (Illinois Univ., USA)

Abstract: Parallelism has become the main avenue of performance growth and power reduction. Once an application achieves good performance for a given hardware and data set, it must be able to scale effectively in terms of hardware parallelism and data size. Parallelism scalability allows the application to take advantage of a wide range of current and future generation hardware. Data scalability allows the application to handle the ever-increasing data sizes of the real world while managing ever more limited memory bandwidth. The rise of CPU-GPU heterogeneous computing has significantly boosted the pace of progress in this field. There has been rapid progress in numeric methods, algorithm design, programming techniques, compiler transformations and optimization tools for developing scalable applications. In preparing petascale applications for deployment on Blue Waters, we have been further accelerating this revolution. In this talk, I will discuss these recent advances and their implications for the future course of computing and computer design.

Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. He is also CTO of MulticoreWare Inc., chief scientist of the UIUC Parallel Computing Institute and director of the IMPACT research group. He co-directs the UIUC CUDA Center of Excellence and serves as one of the principal investigators of the $208M NSF Blue Waters Petascale computer project. For his contributions, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the ISCA Influential Paper Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.

The IBM Blue Gene/Q Supercomputer

George Liang-Tai Chiu (IBM, USA)

Abstract: Blue Gene/Q™ is the third generation in the IBM Blue Gene® line of massively parallel supercomputer systems, and is scalable to deliver a peak performance of twenty PetaFLOP/s and beyond. The aim of the Blue Gene platform remains the same, namely to build a massively parallel high performance computing (HPC) system out of highly power-efficient processor chips. Such power-efficient chips, in turn, allow very dense packaging, which consequently results in superior power efficiency, space utilization, and total cost of ownership. A focus on reliability during all phases of the design also contributes to the feasibility of scaling to large but reliable systems. The heart of a Blue Gene/Q system is its Compute chip, implemented as a System-on-a-Chip (SOC) design. It combines processors, memory hierarchy and network communications on a single ASIC. Integrating these functions on a single chip reduces the number of chip-to-chip interfaces, thereby reducing power, while increasing performance, reliability and bandwidth. It also reduces network cost substantially. This presentation will discuss the Blue Gene/Q Compute chip architecture and design, emphasizing the aspects that result in a peak performance increase of 15x versus the previous generation, Blue Gene/P, while achieving a power efficiency increase of 5.6x. The Blue Gene/Q Compute (BQC) chip is a 19 x 19 mm chip in IBM's Cu-45 (45nm SOI) technology. The chip functionally contains 18 processor cores, intended to be used as 16 user cores, 1 core for operating system services, and 1 core as a spare. The processor core is an augmented version of the 4-way multithreaded Power A2 core used on the IBM PowerEN™ chip. Blue Gene/Q-specific modifications include a Quad Floating Point Unit (QPU) with a 4-way SIMD architecture supporting integrated scalar and vector floating-point arithmetic. 
The QPU can concurrently execute up to 8 floating-point operations (based on a 4-wide FMA instruction), a store instruction and a load instruction. The QPU also provides a set of permute instructions to support efficient vector data reorganization, and instructions for complex number arithmetic that act on adjacent vector element pairs. In addition, each processor core interfaces, via a sophisticated L1-prefetching unit and a crossbar switch, to a 32 MB central L2 cache, which uses embedded DRAM for data storage. The L2 cache allows for the storage of multiple data versions per address. The versioning can be used for advanced cache management techniques such as Speculative Execution (SE) and Transactional Memory (TM). These techniques support aggressive multithreading of applications, as hardware will detect and deal with access conflicts. L2 cache access misses are handled by two integrated memory controllers that interface to DDR3 memory (16 GB, directly attached to the BQC chip). The BQC on-chip networking logic supports 10 bidirectional 2 GB/s links to neighboring chips, allowing the chips to be interconnected into a high-bandwidth, low-latency 5-D torus network, as well as providing for an additional IO link. The on-chip network logic incorporates routing between these ports, DMA facilities to support remote memory access, and hardware-assist facilities for broadcast and reduction operations. As a result of these architectural features, BQC is a power-efficient compute chip, optimized for a wide range of parallel applications. Blue Gene/Q systems have held the top spot on the Green500 list three consecutive times since November 2010, achieving a power efficiency of ~2 GigaFLOPS/Watt. Blue Gene/Q also received the top honor of Graph500 in November 2011 in a data analytics application.

George Liang-Tai Chiu (Fellow, IEEE) is the Senior Manager of Advanced High Performance Systems in the Systems Department at the Thomas J. Watson Research Center, responsible for the overall hardware and software of the Blue Gene Platform. He received a Ph.D. degree in astrophysics from the University of California at Berkeley in 1978, and an MS degree in Computer Science from Polytechnic University in 1995. He joined IBM in 1980 after having been on the staff of Yale University. Dr. Chiu has worked on picosecond device and internal node characterization, laser beam and electron beam contactless testing techniques, functional testing of chips and packages, optical lithography, display technologies, computer packaging, and supercomputing. Dr. Chiu is one of the three co-founders of the Blue Gene project, and he has been in charge of the Blue Gene supercomputer since 1999. In 2007, he became the Principal Investigator of the Nuclear Energy Advanced Modeling and Simulations (NEAMS) project. In 2010, he was appointed as an Industrial Council Member of the CASL (Consortium of Advanced Simulation for Light water reactors) organization overseeing the Oak Ridge nuclear reactor research. He has published over 400 papers and taught numerous short courses in the areas mentioned above. He holds fifty-two patents internationally. He received an IBM Corporate Award in 2005, the Gerstner Award for Client Excellence in 2005, the EE ACE Awards as part of the Blue Gene/L System Design Team in 2005, three IBM Outstanding Technical Achievement Awards, nine Invention Achievement Awards from IBM, and the National Medal of Technology and Innovation on Blue Gene from the US Department of Energy in 2009. Dr. Chiu is a member of the International Astronomical Union, the IBM Academy of Technology, and a Fellow of the Institute of Electrical and Electronics Engineers.

Panel Discussion

"Technology exchange: Supercomputing and Embedded computing"

Organizer and Moderator:
Hideharu Amano (Keio Univ., Japan)

Panelists:
George Liang-Tai Chiu (IBM, USA)
Yuichiro Ajima (Fujitsu)
Wen-mei Hwu (Univ. of Illinois)
Felipe Cruz (Nagasaki Univ.)
Toru Shimizu (Renesas Electronics)

Abstract: The most important challenge for next-generation supercomputers is packing in as many computing elements as possible within limited energy and space. The rapid advance of personal mobile devices has likewise pushed embedded systems to provide powerful computing functions within limited energy and space. The common keys are many-core systems and accelerators. Programming techniques for making the best use of complicated hierarchical multi-core systems are another key. This panel discusses which techniques from one field can be useful in the other, and how to exchange them across the market barrier between the two.

Panelists' biographies

Felipe Cruz is a Postdoctoral Research Fellow at Nagasaki University. He works at the Nagasaki Advanced Computing Center where he focuses on Scientific Computing for low-cost and energy-efficient high performance computing systems. For more details, please visit his homepage.

For the other panelists, please see their biographies under the keynote presentations.

Special Invited Presentation

Seahawk - Optimizing power efficiency in high performance Cortex-A15 processor implementations

Dermot O'Driscoll and Sumit Sahai (ARM, UK)

Abstract: TBA

Special Sessions (invited lectures)

Advanced Virtual Prototyping of Multiprocessor SoCs

Frédéric Pétrot (TIMA Laboratory, France)

Abstract: Virtual prototyping is a technology whose goal is to simulate the behavior of an entire digital system, including the software running on the processors and the digital hardware. It relies on specific modeling approaches, at different levels of abstraction, so that speed/accuracy trade-offs can be made. This talk will review the challenges of virtual prototyping techniques, and introduce the levels of abstraction that have been agreed upon. We will then focus more specifically on the interpretation of software code and detail two techniques, an interpretive one based on dynamic binary translation and a native one making use of hardware-assisted virtualization.

Frédéric Pétrot received the DEA (master) and PhD degrees in Computer Science from Université Pierre et Marie Curie (Paris VI), Paris, France, in 1990 and 1994, respectively. From 1995 to 2004, he was an assistant professor, and contributed actively to the Alliance VLSI CAD System and the Disydent ESL environment. F. Pétrot joined TIMA in September 2004, and holds a professor position at the Grenoble Institute of Technology, France, where, since 2007, he has headed the System Level Synthesis group. His main research interests are in system-level design of integrated systems, and include computer-aided design of digital systems, and architecture and software for homogeneous and heterogeneous multiprocessor systems on chip.

The Challenges of Analyzing Embedded Processor Behavior In the Age of Complex SoCs

Markus Levy (EEMBC, USA)

Abstract: Drawing on the experience of the Embedded Microprocessor Benchmark Consortium (EEMBC), this presentation will detail the methodology used to develop benchmarks that target horizontal technologies such as floating-point and multicore and vertical technologies such as smartphones, automotive, and Android. In addition to performance-related aspects, I will also discuss battery-life measurement techniques for smartphones, a subject that is often fraught with misinterpretation and abuse. The advanced development effort of these benchmarks is faced with many challenges such as ensuring repeatability, portability, and the ability to defeat unwarranted optimizations. Furthermore, these diverse and popular topics present the design engineer with unique challenges in trying to understand how to analyze embedded processor and system behavior. Therefore, this presentation will also explain how to apply these benchmark techniques to the design of next-generation processors and systems, and how system designers can use them when making tradeoffs between performance and power.

Markus Levy is founder and president of EEMBC. He is also president of The Multicore Association and chairman of Multicore Developers Conference. Mr. Levy was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded industry. Levy began his career in the semiconductor industry at Intel Corporation, where he served as both a senior applications engineer and customer training specialist for Intel's microprocessor and flash memory products. He is the co-author of Designing with Flash Memory, the only technical book on this subject, and received several patents while at Intel for his ideas related to flash memory architecture and usage as a disk drive alternative. He is also a volunteer firefighter.