FirePerf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design

Title: FirePerf: FPGA-Accelerated Full-System Hardware/Software Performance Profiling and Co-Design
Authors: Sagar Karandikar, Albert Ou, Alon Amid, Howard Mao, Randy Katz, Borivoje Nikolić, and Krste Asanović
Date: March 2020
Conference: Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2020), Lausanne, Switzerland

FASED: FPGA-Accelerated Simulation and Evaluation of DRAM

Title: FASED: FPGA-Accelerated Simulation and Evaluation of DRAM
Authors: David Biancolin, Sagar Karandikar, Donggyu Kim, Jack Koenig, Andrew Waterman, Jonathan Bachrach, and Krste Asanović
Date: February 2019
Conference: International Symposium on Field-Programmable Gate Arrays (FPGA 2019), Seaside, CA

This work presents the generator of high-fidelity, runtime-reconfigurable, last-level cache and DRAM timing models provided by MIDAS and used in FireSim.

A Hardware Accelerator for Tracing Garbage Collection

Title: A Hardware Accelerator for Tracing Garbage Collection
Authors: Martin Maas, Krste Asanović, and John Kubiatowicz
Date: June 2018
Conference: International Symposium on Computer Architecture (ISCA 2018), Los Angeles, CA

Selected as one of IEEE Micro's "Top Picks from Computer Architecture Conferences" for 2018

FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud

Title: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud
Authors: Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolić, Randy Katz, Jonathan Bachrach, and Krste Asanović
Date: June 2018
Conference: International Symposium on Computer Architecture (ISCA 2018), Los Angeles, CA

Selected as one of IEEE Micro's "Top Picks from Computer Architecture Conferences" for 2018

Sub-microsecond Adaptive Voltage Scaling in a 28nm FD-SOI Processor SoC

Title: Sub-microsecond Adaptive Voltage Scaling in a 28nm FD-SOI Processor SoC
Authors: Ben Keller, Martin Cochet, Brian Zimmer, Yunsup Lee, Milovan Blagojevic, Jaehwa Kwak, Alberto Puggelli, Stevo Bailey, Pi-Feng Chiu, Palmer Dabbelt, Colin Schmidt, Elad Alon, Krste Asanovic, Borivoje Nikolic
Date: September 2016
Conference: European Solid-State Circuits Conference (ESSCIRC), Lausanne, Switzerland

This work presents a RISC-V system-on-chip (SoC) with integrated voltage regulation and power management implemented in 28nm FD-SOI. A fully integrated switched-capacitor DC-DC converter, coupled with an adaptive clocking system, achieves 82-89% system conversion efficiency across a wide operating range, yielding a total system efficiency of 41.8 double-precision GFLOPS/W. Measurement circuits can detect changes in processor workload and an integrated power management unit responds by adjusting the core voltage at sub-microsecond timescales. The power management system reduces the energy consumption of a synthetic benchmark by 39.8% with negligible performance penalty and 2.0% area overhead, enabling extremely fine-grained (<1μs) adaptive voltage scaling for mobile devices.

Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL

Title: Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL
Authors: Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, Krste Asanovic
Date: June 2016
Conference: International Symposium on Computer Architecture (ISCA), Seoul, Korea

This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.
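The statistical step in this abstract, turning per-snapshot power measurements into an average with a confidence interval, can be illustrated with a minimal sketch. It is not Strober's implementation: the function name `estimate_average_power`, the synthetic trace, and the sample count of 30 are assumptions made for illustration, and the interval uses a plain normal approximation.

```python
import random
from statistics import NormalDist, fmean, stdev


def estimate_average_power(sample_powers, confidence=0.99):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    sample_powers holds one power value per replayed snapshot (hypothetical
    input format; the abstract does not specify one).
    """
    n = len(sample_powers)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    mean = fmean(sample_powers)
    # Standard error of the mean from the sample standard deviation.
    std_err = stdev(sample_powers) / n ** 0.5
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * std_err


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic per-window power trace in watts (made-up data, not measurements).
    trace = [1.0 + 0.2 * rng.random() for _ in range(10_000)]
    # Pick 30 windows uniformly at random, standing in for replayed snapshots.
    samples = rng.sample(trace, 30)
    mean, half = estimate_average_power(samples)
    print(f"average power ~ {mean:.3f} W +/- {half:.3f} W (99% confidence)")
```

With few samples, a Student's t critical value would give a slightly wider and more conservative interval than the normal approximation used here.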

A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC–DC Converters in 28 nm FDSOI

Title: A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC–DC Converters in 28 nm FDSOI
Authors: Brian Zimmer, Yunsup Lee, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtic, Ben Keller, Stevo Bailey, Milovan Blagojevic, Pi-Feng Chiu, Hanh-Phuc Le, Po-Hung Chen, Nick Sutardja, Rimas Avizienis, Andrew Waterman, Brian Richards, Philippe Flatresse, Elad Alon, Krste Asanovic, Bora Nikolic
Date: April 2016
Conference: IEEE Journal of Solid-State Circuits (JSSC)

This work demonstrates a RISC-V vector microprocessor implemented in 28 nm FDSOI with fully integrated simultaneous-switching switched-capacitor DC–DC (SC DC–DC) converters and adaptive clocking that generates four on-chip voltages between 0.45 and 1 V using only 1.0 V core and 1.8 V IO voltage inputs. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge-sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. Details about the implementation of the DC–DC switches, DC–DC controller, and adaptive clock are provided, and the sources of conversion loss are analyzed based on measured results. This system pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20 ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80%–86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices.

An Agile Approach to Building RISC-V Microprocessors

Title: An Agile Approach to Building RISC-V Microprocessors
Authors: Yunsup Lee, Andrew Waterman, Henry Cook, Brian Zimmer, Ben Keller, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtic, Stevo Bailey, Milovan Blagojevic, Pi-Feng Chiu, Rimas Avizienis, Brian Richards, Jonathan Bachrach, David Patterson, Elad Alon, Bora Nikolic, Krste Asanovic
Date: April 2016
Conference: IEEE Micro

The final phase of CMOS technology scaling provides continued increases in already vast transistor counts, but minimal improvements in energy efficiency, thus requiring innovation in circuits and architectures. However, even huge teams are struggling to complete large, complex designs on schedule using traditional rigid development flows. This article presents our agile hardware development methodology, which we have adopted for eleven RISC-V microprocessor tapeouts on modern 28 nm and 45 nm CMOS processes in the past five years, and discusses how it enabled small teams to build energy-efficient, cost-effective, and industry-competitive high-performance microprocessors in a matter of months. Our agile methodology relies on rapid iterative improvement of fabricatable prototypes using hardware generators written in Chisel, a new hardware description language embedded in a modern programming language. The parameterized generators construct highly customized systems based on the free, open, and extensible RISC-V platform. We present a case study of one such prototype featuring a RISC-V vector microprocessor integrated with a switched-capacitor DC-DC converter alongside an adaptive clock generator in a 28 nm fully-depleted silicon-on-insulator (FD-SOI) process.

Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server

Title: Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
Authors: Scott Beamer, Krste Asanovic, David Patterson
Date: October 2015
Conference: International Symposium on Workload Characterization (IISWC), Atlanta

Abstract: Graph processing is an increasingly important application domain and is typically communication-bound. In this work, we analyze the performance characteristics of three high performance graph algorithm codebases using hardware performance counters on a conventional dual-socket server. Unlike many other communication-bound workloads, graph algorithms struggle to fully utilize the platform’s memory bandwidth and so increasing memory bandwidth utilization could be just as effective as decreasing communication. Based on our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system.

A RISC-V vector processor with tightly-integrated switched-capacitor DC-DC converters in 28nm FDSOI

Title: A RISC-V vector processor with tightly-integrated switched-capacitor DC-DC converters in 28nm FDSOI
Authors: Brian Zimmer, Yunsup Lee, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtic, Ben Keller, Stevo Bailey, Milovan Blagojevic, Pi-Feng Chiu, Hanh-Phuc Le, Po-Hung Chen, Nick Sutardja, Rimas Avizienis, Andrew Waterman, Brian Richards, Philippe Flatresse, Elad Alon, Krste Asanovic, Bora Nikolic
Date: June 2015
Conference: Symposium on Very Large Scale Integrated Circuits (VLSI), Kyoto

This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated non-interleaved switched-capacitor DCDC (SC-DCDC) converters and adaptive clocking that generates four on-chip voltages between 0.5V and 1V using only 1.0V core and 1.8V IO voltage inputs. The design pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80-86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices.

Trash Day: Coordinating Garbage Collection in Distributed Systems

Title: Trash Day: Coordinating Garbage Collection in Distributed Systems
Authors: Martin Maas, Tim Harris, Krste Asanovic, John Kubiatowicz
Date: May 2015
Conference: 15th Workshop on Hot Topics in Operating Systems (HotOS '15), Kartause Ittingen, Switzerland

Abstract: Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications.

In this paper, we show that distributed applications suffer from each node’s language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes.

We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.

GhostRider: A Hardware-Software System for Memory Trace Oblivious Computation

Title: GhostRider: A Hardware-Software System for Memory Trace Oblivious Computation
Authors: Chang Liu, Austin Harris, Martin Maas, Michael Hicks, Mohit Tiwari, Elaine Shi
Date: March 2015
Conference: International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15), Istanbul, Turkey

Abstract: This paper presents a new, co-designed compiler and architecture called GhostRider for supporting privacy preserving computation in the cloud. GhostRider ensures all programs satisfy a property called memory-trace obliviousness (MTO): Even an adversary that observes memory, bus traffic, and access times while the program executes can learn nothing about the program’s sensitive inputs and outputs. One way to achieve MTO is to employ Oblivious RAM (ORAM), allocating all code and data in a single ORAM bank, and to also disable caches or fix the rate of memory traffic. This baseline approach can be inefficient, and so GhostRider’s compiler uses a program analysis to do better, allocating data to non-oblivious, encrypted RAM (ERAM) and employing a scratchpad when doing so will not compromise MTO. The compiler can also allocate to multiple ORAM banks, which sometimes significantly reduces access times. We have formalized our approach and proved it enjoys MTO. Our FPGA-based hardware prototype and simulation results show that GhostRider significantly outperforms the baseline strategy.

The Case for the Holistic Language Runtime System

Title: The Case for the Holistic Language Runtime System
Authors: Martin Maas, Krste Asanovic, Tim Harris, John Kubiatowicz
Date: April 2014
Conference: International Workshop on Rack-scale Computing (WRSC '14), Amsterdam, Netherlands

Abstract: We anticipate that, by 2020, the basic unit of warehouse-scale cloud computing will be a rack-sized machine instead of an individual server. At the same time, we expect a shift from commodity hardware to custom SoCs that are specifically designed for use in warehouse-scale computing. In this paper, we make the case that the software for such custom rack-scale machines should move away from the model of running managed language workloads in separate language runtimes on top of a traditional operating system and instead run a distributed language runtime system capable of handling different target languages and frameworks. All applications will execute within this runtime, which performs most traditional OS and cluster manager functionality such as resource management, scheduling and isolation.

PHANTOM: Practical Oblivious Computation in a Secure Processor

Title: PHANTOM: Practical Oblivious Computation in a Secure Processor
Authors: Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari, Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song
Date: November 2013
Conference: ACM Conference on Computer and Communications Security (CCS '13), Berlin, Germany

Abstract: We introduce Phantom, a new secure processor that obfuscates its memory access trace. To an adversary who can observe the processor’s output pins, all memory access traces are computationally indistinguishable (a property known as obliviousness). We achieve obliviousness through a cryptographic construct known as Oblivious RAM or ORAM. We first improve an existing ORAM algorithm and construct an empirical model for its trusted storage requirement. We then present Phantom, an oblivious processor whose novel memory controller aggressively exploits DRAM bank parallelism to reduce ORAM access latency and scales well to a large number of memory channels. Finally, we build a complete hardware implementation of Phantom on a commercially available FPGA-based server, and through detailed experiments show that Phantom is efficient in both area and performance. Accessing 4KB of data from a 1GB ORAM takes 26.2 μs (13.5 μs for the data to be available), a 32× slowdown over accessing 4KB from regular memory, while SQLite queries on a population database see 1.2–6× slowdown. Phantom is the first demonstration of a practical, oblivious processor and can provide strong confidentiality guarantees when offloading computation to the cloud.

GPUs as an Opportunity for Offloading Garbage Collection

Title: GPUs as an Opportunity for Offloading Garbage Collection
Authors: Martin Maas, Philip Reames, Jeffrey Morlan, Krste Asanović, Anthony D. Joseph, John Kubiatowicz
Date: June 2012
Conference: International Symposium on Memory Management (ISMM '12), Beijing, China

Abstract: GPUs have become part of most commodity systems. Nonetheless, they are often underutilized when not executing graphics-intensive or special-purpose numerical computations, which are rare in consumer workloads. Emerging architectures, such as integrated CPU/GPU combinations, may create an opportunity to utilize these otherwise unused cycles for offloading traditional systems tasks. Garbage collection appears to be a particularly promising candidate for offloading, due to the popularity of managed languages on consumer devices.

We investigate the challenges for offloading garbage collection to a GPU, by examining the performance trade-offs for the mark phase of a mark & sweep garbage collector. We present a theoretical analysis and an algorithm that demonstrates the feasibility of this approach. We also discuss a number of algorithmic design trade-offs required to leverage the strengths and capabilities of the GPU hardware. Our algorithm has been integrated into the Jikes RVM and we present promising performance results.
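For readers unfamiliar with the mark phase mentioned above, the following is a minimal sequential sketch of it: starting from the roots, every reachable object is marked by following references. It is not the paper's GPU algorithm. The dictionary-based heap and the function name `mark` are assumptions made for illustration, and the traversal is organized by frontier only to show where per-object work is independent.

```python
def mark(heap, roots):
    """Return the set of objects reachable from roots.

    heap maps each object id to the list of object ids it references
    (a hypothetical representation chosen for this sketch).
    """
    marked = set(roots)
    frontier = list(marked)
    while frontier:
        next_frontier = []
        # Each object in the current frontier can be scanned independently;
        # this loop is where a parallel collector would distribute work.
        for obj in frontier:
            for ref in heap[obj]:
                if ref not in marked:
                    marked.add(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return marked


if __name__ == "__main__":
    # "d" references "a" but nothing reachable references "d", so it is garbage.
    heap = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    live = mark(heap, ["a"])
    print(sorted(live))
    print(sorted(set(heap) - live))
```

A sweep phase would then reclaim every object left unmarked, which in the example is only `"d"`.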

Selected Publications