Call for Participation | COOL Chips 21

A PDF version is available (as of 2018-Mar-15).

Keynote Presentations

AMD EPYC Microprocessor Architecture

Jay Fleischman (Advanced Micro Devices, Inc.)

Abstract: AMD will present the next-generation AMD EPYC™ microprocessor. This advanced processor is a Multi-Chip-Module (MCM) composed of up to four System-on-a-Chip (SoC) dies, codenamed “Zeppelin”. Each “Zeppelin” SoC contains eight high-performance AMD x86 cores, codenamed “Zen”, caches, memory controllers, IO controllers (such as PCIe® and SATA), and integrated x86 southbridge chipset capabilities. All these functions are connected on the SoC and between multichip packages and multi-socket systems by AMD Infinity Fabric. Utilizing GLOBALFOUNDRIES’ 14nm LPP FinFET process technology, the four-die MCM EPYC™ microprocessor has over 19.2B transistors.

Jay Fleischman is a Senior Fellow at AMD driving CPU core-architecture, where he participates in the designs of Zen1, Zen2, Zen3, Zen4, and Zen5 x86 CPUs. He received a B.S. in ECE and in CS from the University of Wisconsin, Madison, and an M.S. and Ph.D. in EECS from the University of California, Berkeley in 1993. He worked at Hewlett Packard on PA-RISC CPUs until 2006 when he joined AMD to work on x86 CPUs and systems. His primary focuses are in floating point, high-performance low-power core micro-architecture and holistic core/cache/system design.

Multiscale Dataflow ASICs – Easy, Fast, Low Cost

Oskar Mencer (Maxeler Technologies / Imperial College London)

Abstract: AI algorithms are rapidly evolving while a lot of investment is being directed into custom ASICs for AI which can take years to design and build. Developing a single chip for AI is very difficult since we do not know which algorithms will be most popular for a particular task by the time the chip is finished, and in addition there are many tasks with a wide range of different optimal AI algorithms. We propose Multiscale Dataflow as a methodology and infrastructure to minimize the time and cost for making a new ASIC, and therefore allow for very fast and very efficient development of new AI chips immediately when new algorithms come out, or for specialist domains and challenges. Thus, our dataflow methodology allows us to adapt quickly and create AI chips for all future AI algorithms.

Oskar Mencer is the Founder of Maxeler, and affiliated with the Computing Department at Imperial College London. Oskar is fascinated by human decision making, and is driving the development of a new science of Multiscale Dataflow and Space-Time-Value discretization. Prior to Maxeler, Oskar was in Computing Sciences (1127) at Bell Labs in Murray Hill, at Stanford University, and a HIVIPS scholar at the Hitachi Central Research Laboratories in Tokyo. Today, Hitachi is one of the major sales and marketing partners of Maxeler. In 2018, Maxeler is spinning out ChipsAI, a chip company for AI. Oskar received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.

Designing Deep Neural Network Accelerators with Analog Memory – A Device and Circuit Perspective

Pritish Narayanan (IBM Research – Almaden)

Abstract: Deep Neural Networks (DNNs) have revolutionized the field of Artificial Intelligence in the last few years, demonstrating the capability to solve many challenging and meaningful machine learning tasks. However, training large neural net models using large amounts of data can often take days to weeks, even with today’s cutting-edge GPUs. Therefore, there is significant interest in the electrical engineering community to design and build new hardware systems that can accelerate these workloads and/or lower the energy consumption. One analog approach uses crossbar arrays of Non-Volatile Memory (NVM), wherein highly parallelized multiply-accumulate operations are performed at the location of the weight data. This is a non-Von Neumann architecture, which avoids expensive data transfers between the memory and the CPU, potentially achieving orders of magnitude performance/energy improvements. In this talk, I will present our group’s recent work towards achieving ‘computer science equivalent’ accuracies in such a system despite the presence of significant NVM non-idealities. I will also address circuit requirements, which tend to be significantly different from conventional memory design, and discuss tradeoffs that influence area, effective performance and power.
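As a rough illustration of the multiply-accumulate idea mentioned in the abstract, the sketch below models an idealized crossbar in plain Python. It is not taken from the talk: the function name `crossbar_mac`, the encoding of each weight as a difference of two conductances, and the unitless values are assumptions made for the example, and device non-idealities are ignored.

```python
def crossbar_mac(voltages, g_plus, g_minus):
    """Idealized crossbar multiply-accumulate (illustrative model only).

    voltages: input voltages applied to the rows (length R).
    g_plus, g_minus: R x C conductance matrices; each weight is assumed
    to be encoded as the difference g_plus[i][j] - g_minus[i][j].
    Returns the net current collected on each of the C columns.
    """
    rows = len(voltages)
    cols = len(g_plus[0])
    currents = [0.0] * cols
    for j in range(cols):
        for i in range(rows):
            # Ohm's law gives the current through each device (V * G);
            # Kirchhoff's current law sums the currents along a column.
            currents[j] += voltages[i] * (g_plus[i][j] - g_minus[i][j])
    return currents


# Hypothetical 2x2 example in arbitrary units.
v = [1.0, 0.5]
gp = [[2.0, 0.0], [1.0, 3.0]]
gm = [[0.0, 1.0], [1.0, 1.0]]
print(crossbar_mac(v, gp, gm))  # [2.0, 0.0]
```

In this model every column sum is formed where the weights are stored, which is the property the abstract contrasts with moving data between memory and a CPU.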

Pritish Narayanan received his PhD in Electrical and Computer Engineering from the University of Massachusetts Amherst. He joined IBM Research – Almaden as a Research Staff Member in 2013 as part of the Storage Class Memory/MIEC project, where he investigated circuit design challenges for access devices used in 3D crosspoint memory. His current research interests are in the area of ultra-high-performance hardware systems for Artificial Intelligence and Cognitive computing including i) Novel non-Von Neumann architectures based on emerging memory, where he is the lead circuit architect for two deep learning test sites based on Phase Change Memory (PCM) and mixed-signal hardware and ii) FPGA-based systems exploiting massive parallelism and/or approximate computing techniques. Dr. Narayanan has presented one prior keynote (International Memory Workshop 2017) and a tutorial session (Device Research Conference 2017), in addition to several invited talks. He won Best Paper Awards at IEEE Computer Society Symposium on VLSI 2008 and at IEEE Nanoarch 2013. He has also been a Guest Editor for the Journal of Emerging Technologies in Computing, the Program Chair at IEEE Nanoarch 2015, Special Session Chair for IEEE Nano 2016 and served on the Technical Program Committees of several conferences.

Tensor Processing Unit:
A processor for neural network designed by Google

Kaz Sato (Google Inc.)

Abstract: Tensor Processing Unit (TPU) is an LSI designed by Google for neural network processing. TPU features a large-scale systolic array matrix unit that achieves an outstanding performance-per-watt ratio. In this session we will learn how a minimalistic design philosophy and a tight focus on neural network inference use-cases enabled the high-performance neural network accelerator chip.

Kaz Sato is a Staff Developer Advocate on the Google Cloud team at Google Inc., focusing on Machine Learning and Data Analytics products such as TensorFlow, Cloud ML and BigQuery. Kaz has been invited to major events including Google Cloud Next SF, Google I/O and Strata NYC, has authored many GCP blog posts, and has supported developer communities for Google Cloud for over 8 years. He is also interested in hardware and IoT, and has been hosting FPGA meetups since 2013.

Designing a Power and Energy Stack for Exascale Systems

Martin Schulz (Technische Universität München)

Abstract: Both power and energy are major design constraints as we approach the exascale era. Hardware efforts alone will no longer be sufficient to tackle this problem; instead we need comprehensive software developments that go along with any hardware effort and that help manage these scarce resources. This will have a significant impact on all layers of the exascale software stack, starting from low-level measurement and control capabilities, interactions with runtimes and resource management systems, all the way to interfaces with applications. In this talk I will highlight several ongoing projects with partners in Japan, the US and Europe to create a comprehensive software stack that can tackle this challenge and help mitigate the impact we see in power- and energy-constrained systems. This will include work on reducing variability, active runtime control in parallel programs, OS-level resource management as well as suitable application interfaces. Combined, these efforts will lead us to a vertically integrated software stack that enables power- and energy-aware computing and can help deliver an exascale system in the coming years.

Martin Schulz is a Full Professor and Chair for Computer Architecture and Computer Organization at the Technische Universität München (TUM), which he joined in 2017. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. His research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.

Unlocking Hidden Performance: Examples from FPGA-Based Neural Nets

Ephrem Wu (Xilinx)

Abstract: Reconfigurable numerical solutions often leave performance on the table, typically achieving only a third to a half of the potential throughput. Data movement between memory and compute in parallel algorithms presents a particularly difficult “feeding-the-beast” problem. It is possible to meet this challenge by mapping parallel algorithms to a minimalist hardware architecture, and by selecting numerical representations to reduce memory capacity, bandwidth, and energy. Drawing from our experience with a reconfigurable compute unit for neural networks, we present some principles to unlock latent FPGA performance. We believe that these principles are general enough to be applicable to other parallel numerical applications.

Ephrem Wu is a Senior Director in Silicon Architecture at Xilinx. His current focus is neural-net accelerators. Since joining Xilinx in 2010, Ephrem led the definition of UltraRAM in the UltraScale+ family, the first new block memory since the BRAM, and spearheaded the design of the first 2.5D-stacked FPGA with 28 Gb/s serdes. From 2000 to 2010, Ephrem led backplane switch and security processor development at Velio Communications and LSI. Prior to Velio, he developed hardware and software at SGI, HP, Panasonic, and AT&T. Ephrem holds 29 U.S. patents. He earned a bachelor’s degree from Princeton University and a master’s degree from the University of California, Berkeley, both in EE.

Invited Presentation

Designing the Next Billion Chips:
How RISC-V is Revolutionizing Hardware

Yunsup Lee (SiFive)

Abstract: Open source has revolutionized software. Now it’s hardware’s turn. In this talk, I present the chip design economics for today, introduce the free and open RISC-V instruction set architecture, and talk about how RISC-V, open-source hardware, and SiFive are changing the chip design economics for the next billion chips that are being built for IoT, edge computing, machine learning, and artificial intelligence applications.

Yunsup Lee is SiFive’s Chief Technology Officer and co-founder, and is the technical committee chair of the RISC-V Foundation. Yunsup received his PhD from UC Berkeley, where he co-designed the RISC-V ISA and the first RISC-V microprocessors with Andrew Waterman, and led the development of the Hwacha decoupled vector-fetch extension. Yunsup also holds an MS in Computer Science from UC Berkeley and a BS in Computer Science and Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST).

Panel Discussion

Topic: “Challenges to the Scaling Limits: How Can We Achieve Sustainable Power-Performance Improvements?”

Organizer and Moderator:

Koji Inoue (Kyushu Univ.)

Panelists:

Takuya Araki (NEC)

Takumi Maruyama (FUJITSU LIMITED)

Takashi Oshima (Hitachi)

Martin Schulz (Technische Universität München)

Pritish Narayanan (IBM Research – Almaden)

Abstract: Moore’s Law, the doubling of the number of transistors in a chip every two years, has so far contributed to the evolution of computer systems, e.g., by employing large on-chip caches, increasing DRAM (or main memory) capacity, introducing manycore acceleration, etc. The growth of such hardware implementations makes many optimization opportunities available to software developers. Unfortunately, we can no longer expect transistors to keep shrinking, i.e., the end of Moore’s Law is coming. On the other hand, our society strongly requires sustainable computing efficiency for next-generation ICT applications such as AI, IoT, Big Data, etc. To satisfy such requirements, we have to rethink computer system designs from the bottom up. The goal of this panel is to discuss and explore the future direction of computer system architectures, focusing especially on high-performance, low-power computing such as data centers and supercomputers. We will first share the knowledge, experience, and opinions of our excellent panelists and then discuss the technical challenges of overcoming the scaling limits to obtain sustainable power-performance improvements.

Koji Inoue received the B.E. and M.E. degrees in computer science from Kyushu Institute of Technology, Japan in 1994 and 1996, respectively. He received the Ph.D. degree from the Department of Computer Science and Communication Engineering, Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan in 2001. In 1999, he joined Halo LSI Design & Technology, Inc., NY, as a circuit designer. He is currently a professor in the Department of I&E Visionaries, Kyushu University. His research interests include power-aware computing, high-performance computing, secure computer systems, 3D microprocessor architectures, multi/many-core architectures, nano-photonic computing, and single flux quantum computing.

Special Sessions (invited lectures)

Energy-Efficient and Energy-Scalable Processing – Meeting the Varied Needs of the Internet of Things at Its Edge

Massimo Alioto (National University of Singapore, Singapore)

Abstract: The Internet of Things (IoT) is evolving as a complex ecosystem that enables ubiquitous sensing through the deployment of ultra-low cost miniaturized devices (the “IoT nodes” at its edge). As the communication-computation tradeoff swings towards more computation to deal with power-hungry radios, energy-efficient processing is now necessary for any smart IoT node. It is often not sufficient, however, as occasional and vigorous performance boosts are required when performing on-chip data analytics and event evaluation. In other words, energy efficiency and scalability are both key properties of processing in IoT nodes. This talk addresses the fundamental challenges posed by IoT nodes, in terms of both energy efficiency and energy scalability. Operation at minimum energy is first discussed, describing the implications at the circuit and the architectural level. Techniques to scale down energy below the “natural” minimum energy point are discussed, leveraging recently proposed approaches at the circuit, microarchitectural and architectural level. The orthogonal design dimension of energy-quality scaling is then introduced as a very promising way to keep scaling down the energy, even when performance is constrained. Overall, recent and on-going research on energy-efficient and energy-scalable processing for IoT shows that there is still substantial room for energy improvements, ultimately making IoT nodes smarter, smaller and longer-lived.

Massimo Alioto is with the National University of Singapore, where he is Director of the Integrated Circuits and Embedded Systems area, and leads the Green IC research group. He previously held positions at the University of Siena, and visiting positions at Intel Labs–CRL, University of Michigan, Ann Arbor, BWRC–University of California, Berkeley, and EPFL. He has authored or co-authored 250 publications in journals and conference proceedings. He is co-author of three books, including Enabling the Internet of Things–from Integrated Circuits to Integrated Systems (Springer, 2017). His primary research interests include self-powered and wireless nodes, near-threshold circuits for green computing, widely energy-scalable VLSI systems, on-chip small data analytics, and hardware-level security, among others. In 2009-2010 he was Distinguished Lecturer of the IEEE Circuits and Systems Society, for which he is/was also a member of the Board of Governors (2015-2020) and Chair of the “VLSI Systems and Applications” Technical Committee (2010-2012). He serves as Associate Editor in Chief of the IEEE Transactions on VLSI Systems (since 2013), and Deputy Editor in Chief of the IEEE Journal on Emerging and Selected Topics in Circuits and Systems (since 2018). He served as Guest Editor of several IEEE journal special issues, and Associate Editor of a number of journals. He was Technical Program Chair (e.g., SOCC, ICECS, NEWCAS, PRIME) and Track Chair in a number of conferences (e.g., ICCD, ISCAS, ICECS, VLSI-SoC). Prof. Alioto is an IEEE Fellow.

High-Power-Efficiency Implementation of Neuromorphic Computing Systems with Memristors

Yiran Chen (Duke University, USA)

Abstract: Inspired by the working mechanism of human brains, a neuromorphic computing system (NCS) possesses a massively parallel architecture with closely coupled memory. An NCS can be efficiently implemented with nonvolatile memories, e.g. memristor crossbar arrays, because of their analogy to matrix multiplication and their high resistance, which results in low power consumption. However, the memristor fabrication process cannot produce perfect devices: limited high/low resistance ratio and resistance level, varying resistance range and nonlinearity bring difficulties to hardware implementation. In this talk, we will start with spike and level versions of memristor-based neuromorphic chip prototypes using an Integrate-and-Fire Circuit and their applications in pattern recognition, followed by a discussion of the challenges and our solutions for bridging the gap between software algorithm and hardware implementation. Both circuit design techniques and algorithm tailoring are included. (Authors: Bonan Yan, Qing Yang, Chenchen Liu, Hai Li, and Yiran Chen)

Yiran Chen received B.S. and M.S. degrees from Tsinghua University and a Ph.D. from Purdue University in 2005. After five years in industry, he joined the University of Pittsburgh in 2010 as Assistant Professor and was then promoted to Associate Professor with tenure in 2014, where he was a Bicentennial Alumni Faculty Fellow. He is now a tenured Associate Professor in the Department of Electrical and Computer Engineering at Duke University and serves as the co-director of the Duke Center for Evolutionary Intelligence (CEI), focusing on the research of new memory and storage systems, machine learning and neuromorphic computing, and mobile computing systems. Dr. Chen has published one book and more than 300 technical publications and has been granted 93 US patents. He is an associate editor of IEEE TNNLS, IEEE TCAD, IEEE D&T, IEEE ESL, ACM JETC, and ACM TCPS, and has served on the technical and organization committees of more than 40 international conferences. He received 6 best paper awards and 14 best paper nominations from international conferences. He is the recipient of an NSF CAREER award and an ACM SIGDA outstanding new faculty award. He is a Fellow of the IEEE.