Selected Publications
More can be found at http://people.eecs.berkeley.edu/~krste/publications.html
More can be found at http://people.eecs.berkeley.edu/~krste/publications.html
This work presents the generator of high-fidelity, runtime-reconfigurable, last-level cache and DRAM timing models provided by MIDAS and used in FireSim.
Selected as one of IEEE Micro's "Top Picks from Computer Architecture Conferences" for 2018
Selected as one of IEEE Micro's "Top Picks from Computer Architecture Conferences" for 2018
This work presents a RISC-V system-on-chip (SoC) with integrated voltage regulation and power management implemented in 28nm FD-SOI. A fully integrated switched-capacitor DC-DC converter, coupled with an adaptive clocking system, achieves 82-89% system conversion efficiency across a wide operating range, yielding a total system efficiency of 41.8 double-precision GFLOPS/W. Measurement circuits can detect changes in processor workload and an integrated power management unit responds by adjusting the core voltage at sub-microsecond timescales. The power management system reduces the energy consumption of a synthetic benchmark by 39.8% with negligible performance penalty and 2.0% area overhead, enabling extremely fine-grained (<1μs) adaptive voltage scaling for mobile devices.
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary RTL and workloads, our methodology guarantees a minimum of four-orders-of-magnitude speedup over commercial CAD gate-level simulation tools and gives average energy estimates guaranteed to be within 5% of the true average energy with 99% confidence. We believe our open-source sample-based energy simulation tool Strober can not only rapidly provide ground truth for more abstract power models, but can enable productive design-space exploration early in the RTL design process.
This work demonstrates a RISC-V vector microprocessor implemented in 28 nm FDSOI with fully integrated simultaneous-switching switched-capacitor DC–DC (SC DC–DC) converters and adaptive clocking that generates four on-chip voltages between 0.45 and 1 V using only 1.0 V core and 1.8 V IO voltage inputs. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge-sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. Details about the implementation of the DC–DC switches, DC–DC controller, and adaptive clock are provided, and the sources of conversion loss are analyzed based on measured results. This system pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20 ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80%–86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices.
The final phase of CMOS technology scaling provides continued increases in already vast transistor counts, but minimal improvements in energy efficiency, thus requiring innovation in circuits and architectures. However, even huge teams are struggling to complete large, complex designs on schedule using traditional rigid development flows. This article presents our agile hardware development methodology, which we have adopted for eleven RISC-V microprocessor tapeouts on modern 28 nm and 45 nm CMOS processes in the past five years, and discusses how it enabled small teams to build energy-efficient, cost-effective, and industry-competitive high-performance microprocessors in a matter of months. Our agile methodology relies on rapid iterative improvement of fabricatable prototypes using hardware generators written in Chisel, a new hardware description language embedded in a modern programming language. The parameterized generators construct highly customized systems based on the free, open, and extensible RISC-V platform. We present a case study of one such prototype featuring a RISC-V vector microprocessor integrated with a switched-capacitor DC-DC converter alongside an adaptive clock generator in a 28 nm fully-depleted silicon-on-insulator (FD-SOI) process.
Abstract: Graph processing is an increasingly important application domain and is typically communication-bound. In this work, we analyze the performance characteristics of three high performance graph algorithm codebases using hardware performance counters on a conventional dual-socket server. Unlike many other communication-bound workloads, graph algorithms struggle to fully utilize the platform’s memory bandwidth and so increasing memory bandwidth utilization could be just as effective as decreasing communication. Based on our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system.
This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated non-interleaved switched-capacitor DCDC (SC-DCDC) converters and adaptive clocking that generates four on-chip voltages between 0.5V and 1V using only 1.0V core and 1.8V IO voltage inputs. The design pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80-86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices.
Abstract: Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications.
In this paper, we show that distributed applications suffer from each node’s language runtime system making GC-related decisions independently. We first demonstrate this problem on two widely-used systems (Apache Spark and Apache Cassandra). We then propose solving this problem using a Holistic Runtime System, a distributed language runtime that collectively manages runtime services across multiple nodes.
We present initial results to demonstrate that this Holistic GC approach is effective both in reducing the impact of GC pauses on a batch workload, and in improving GC-related tail-latencies in an interactive setting.
Abstract: This paper presents a new, co-designed compiler and architecture called GhostRider for supporting privacy preserving computation in the cloud. GhostRider ensures all programs satisfy a property called memory-trace obliviousness (MTO): Even an adversary that observes memory, bus traffic, and access times while the program executes can learn nothing about the program’s sensitive inputs and outputs. One way to achieve MTO is to employ Oblivious RAM (ORAM), allocating all code and data in a single ORAM bank, and to also disable caches or fix the rate of memory traffic. This baseline approach can be inefficient, and so GhostRider’s compiler uses a program analysis to do better, allocating data to non-oblivious, encrypted RAM (ERAM) and employing a scratchpad when doing so will not compromise MTO. The compiler can also allocate to multiple ORAM banks, which sometimes significantly reduces access times. We have formalized our approach and proved it enjoys MTO. Our FPGA-based hardware prototype and simulation results show that GhostRider significantly outperforms the baseline strategy.
Abstract: We anticipate that, by 2020, the basic unit of warehouse-scale cloud computing will be a rack-sized machine instead of an individual server. At the same time, we expect a shift from commodity hardware to custom SoCs that are specifically designed for the use in warehouse-scale computing. In this paper, we make the case that the software for such custom rack-scale machines should move away from the model of running managed language workloads in separate language runtimes on top of a traditional operating system but instead run a distributed language runtime system capable of handling different target languages and frameworks. All applications will execute within this runtime, which performs most traditional OS and cluster manager functionality such as resource management, scheduling and isolation.
Abstract: We introduce Phantom, a new secure processor that obfuscates its memory access trace. To an adversary who can observe the processor’s output pins, all memory access traces are computationally indistinguishable (a property known as obliviousness). We achieve obliviousness through a cryptographic construct known as Oblivious RAM or ORAM. We first improve an existing ORAM algorithm and construct an empirical model for its trusted storage requirement. We then present Phantom, an oblivious processor whose novel memory controller aggressively exploits DRAM bank parallelism to reduce ORAM access latency and scales well to a large number of memory channels. Finally, we build a complete hardware implementation of Phantom on a commercially available FPGA-based server, and through detailed experiments show that Phantom is efficient in both area and performance. Accessing 4KB of data from a 1GB ORAM takes 26.2us (13.5us for the data to be available), a 32× slowdown over accessing 4KB from regular memory, while SQLite queries on a population database see 1.2−6× slowdown. Phantom is the first demonstration of a practical, oblivious processor and can provide strong confidentiality guarantees when offloading computation to the cloud.
Abstract: GPUs have become part of most commodity systems. Nonetheless, they are often underutilized when not executing graphicsintensive or special-purpose numerical computations, which are rare in consumer workloads. Emerging architectures, such as integrated CPU/GPU combinations, may create an opportunity to utilize these otherwise unused cycles for offloading traditional systems tasks. Garbage collection appears to be a particularly promising candidate for offloading, due to the popularity of managed languages on consumer devices.
We investigate the challenges for offloading garbage collection to a GPU, by examining the performance trade-offs for the mark phase of a mark & sweep garbage collector. We present a theoretical analysis and an algorithm that demonstrates the feasibility of this approach. We also discuss a number of algorithmic design trade-offs required to leverage the strengths and capabilities of the GPU hardware. Our algorithm has been integrated into the Jikes RVM and we present promising performance results.