FireBox

The FireBox project aims to develop a system architecture for third-generation Warehouse-Scale Computers (WSCs). FireBox scales up to a ~1 MegaWatt WSC containing up to 10,000 compute nodes and up to an Exabyte (2^60 Bytes) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. The FireBox project will produce custom datacenter SoCs, distributed simulation tools for warehouse-scale machines, and systems software for FireBox-style disaggregated datacenters.

Background

The complete FireBox design contains petabytes of flash storage, large quantities of bulk DRAM, as well as high-bandwidth on-package DRAM. Each FireBox node contains a custom System-on-a-Chip (SoC) with combinations of application processors, vector machines, and custom hardware accelerators. Fast SoC network interfaces reduce the software overhead of communicating between application services, while high-radix network backplane switches connected by Terabit/sec optical fibers reduce the network's contribution to tail latency. The very large non-volatile store directly supports in-memory databases, and pervasive encryption ensures that data is always protected in transit and in storage. These system characteristics raise a number of novel questions in programming environments, operating systems, and hardware design.

See Krste Asanovic's talk from FAST'14 for more details.

The following sections summarize a variety of ongoing FireBox-related projects in the BAR group.

Hardware

FireSim

The FireSim project aims to enable accurate prototyping and benchmarking of radically new datacenter designs like FireBox. FireSim simulates these datacenters at full scale, running full software stacks (e.g. BITS, PerfKit), by taking advantage of existing clusters with distributed reconfigurable computing elements.

The modularity of FireSim allows users to trade off between simulation accuracy and performance at the granularity of individual components in the warehouse-scale machine. Models used in FireSim can be derived from silicon-proven RTL, while handwritten software models can be used for prototyping before RTL is available.
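As a rough illustration of this trade-off, the Python sketch below is hypothetical: it is not FireSim's actual configuration interface, and the component names and slowdown figures are invented. It only shows the idea of assembling a simulated node from a mix of RTL-derived and handwritten software models.

```python
# Hypothetical sketch (not FireSim's actual API): selecting per-component
# model fidelity when assembling a simulated datacenter node.
from dataclasses import dataclass
from enum import Enum


class Fidelity(Enum):
    RTL_DERIVED = "rtl"         # cycle-exact model generated from silicon-proven RTL
    SOFTWARE_MODEL = "swmodel"  # faster handwritten abstract model


@dataclass
class ComponentModel:
    name: str
    fidelity: Fidelity
    approx_slowdown: float  # simulation slowdown vs. real time (assumed numbers)


def build_node(prototype_nic: bool = True) -> list[ComponentModel]:
    """Assemble one simulated node, trading accuracy for simulation speed
    component by component."""
    return [
        ComponentModel("core-tile", Fidelity.RTL_DERIVED, approx_slowdown=1000.0),
        ComponentModel("last-level-cache", Fidelity.RTL_DERIVED, approx_slowdown=1000.0),
        # Before NIC RTL exists, prototype with a handwritten software model.
        ComponentModel(
            "network-interface",
            Fidelity.SOFTWARE_MODEL if prototype_nic else Fidelity.RTL_DERIVED,
            approx_slowdown=50.0 if prototype_nic else 1000.0,
        ),
    ]


if __name__ == "__main__":
    for c in build_node():
        print(f"{c.name:20s} {c.fidelity.value:8s} ~{c.approx_slowdown:.0f}x slowdown")
```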

FireChip

FireChip is a planned series of tapeouts to prototype FireBox SoCs containing out-of-order RISC-V cores, accelerators, and high-performance I/O devices.

System Software

Wabash

Motivation

Future high-performance clusters, whether in supercomputing (e.g. exascale systems) or in the datacenter (e.g. warehouse-scale computers), will have several novel properties that present challenges for operating system design. CPUs are getting more specialized and self-contained in a move away from traditional NUMA architectures. The advent of silicon photonics will allow high-radix links at very high bandwidth and low latency. Finally, novel NVM and bulk-DRAM technologies are leading to a new level in the memory hierarchy. Together these technologies suggest future clusters will involve many SoCs connected over photonic links to both each other and bulk disaggregated memory. Examples of these trends can already be seen in systems like the Cori supercomputer (which has many Intel Xeon Phi processors and bulk memory "burst buffers") or HP's "The Machine" (which has NVM connected over photonic links).

Along with hardware trends, current software trends will influence OS design. Current datacenter/HPC applications must be able to scale out; few applications will use a single CPU, and resource demands may change dynamically. Often, scaling will occur through a service-oriented architecture, where many specialized and latency-sensitive applications interact to form a larger job. All of this will occur in a multi-tenant environment where many independent users require performance and security isolation. Finally, increasing power/cooling costs and overprovisioning will lead to power-constrained systems that require sophisticated power management from the OS.

Wabash OS

Wabash is a project to provide a cluster-first OS to address the challenges of future datacenter and HPC environments. By cluster-first we mean that the entire cluster is managed as a single machine and individual modules are simply resources. Mesosphere has already begun this trend with the concept of a "Datacenter Operating System" (DCOS), which deploys several distributed cluster-management tools over a number of traditional Linux nodes. Indeed, the "Wabash 1.0" system is simply the Xen hypervisor running application-specific unikernels, paired with a cluster-management framework (e.g. Mesos, OpenStack, ...).
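As a concrete (if simplified) picture of this base system, the sketch below shows how a single node might boot an application-specific unikernel as a Xen guest using the standard xl toolstack. The image path, domain name, and resource sizes are illustrative placeholders, and placement decisions are assumed to come from the cluster manager; this is not Wabash code.

```python
# Minimal sketch: booting an application-specific unikernel as a Xen guest
# via the standard `xl` toolstack. The image path, names, and sizes are
# illustrative placeholders, not part of the actual Wabash system.
import subprocess
import tempfile

XL_CONFIG_TEMPLATE = """\
name    = "{name}"
kernel  = "{kernel}"
memory  = {memory_mb}
vcpus   = {vcpus}
vif     = ['bridge=xenbr0']
"""


def launch_unikernel(name: str, kernel: str, memory_mb: int = 256, vcpus: int = 1) -> None:
    """Write an xl domain config and ask the Xen toolstack to create the guest."""
    cfg = XL_CONFIG_TEMPLATE.format(name=name, kernel=kernel,
                                    memory_mb=memory_mb, vcpus=vcpus)
    with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
        f.write(cfg)
        cfg_path = f.name
    # A cluster manager (Mesos, OpenStack, ...) would decide *where* this runs;
    # here we only show the per-node step.
    subprocess.run(["xl", "create", cfg_path], check=True)


if __name__ == "__main__":
    launch_unikernel("memcached-svc", "/srv/unikernels/memcached.xen")
```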

From this base system, we move toward a true cluster-first OS through various sub-projects:

Remote/Non-Volatile Memory: Nephele

As systems become more modular, memory and compute can fail independently, so data held in remote or non-volatile memory can outlive the compute node that produced it. This persistence and high availability is only useful if the data remains in a consistent state at all times. Nephele implements a transactional interface over critical memory, allowing an application to ensure that its critical in-memory data is always consistent. It does this through a simple allocator-based interface that doesn't require explicit serialization/deserialization (pointers continue to work after recovery). Changes are automatically detected, and only the changed data is replicated at commit points.
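Nephele's actual interface is not reproduced here; the Python sketch below is a minimal, hypothetical illustration of the idea: objects are allocated from a tracked heap, modified in place without the application serializing anything itself, and only the objects that changed since the last commit are copied to a replica at commit points.

```python
# Hypothetical sketch of an allocator-based transactional interface in the
# spirit of Nephele (names and semantics are illustrative, not Nephele's
# actual API). Objects live in a tracked heap; at commit, only objects whose
# contents changed since the last commit are copied to the replica.
import pickle


class TransactionalHeap:
    def __init__(self):
        self._objects = {}   # handle -> live object
        self._shadows = {}   # handle -> serialized state at last commit
        self._replica = {}   # stand-in for remote / non-volatile storage
        self._next = 0

    def alloc(self, obj):
        """Place an object under transactional management and return it."""
        handle = self._next
        self._next += 1
        self._objects[handle] = obj
        self._shadows[handle] = pickle.dumps(obj)
        return handle, obj

    def commit(self):
        """Detect changed objects and replicate only those."""
        for handle, obj in self._objects.items():
            snapshot = pickle.dumps(obj)
            if snapshot != self._shadows[handle]:
                self._replica[handle] = snapshot   # ship only the dirty data
                self._shadows[handle] = snapshot

    def recover(self):
        """Rebuild the heap contents from the replica after a failure."""
        return {h: pickle.loads(s) for h, s in self._replica.items()}


if __name__ == "__main__":
    heap = TransactionalHeap()
    h, table = heap.alloc({"balance": 100})
    table["balance"] -= 25    # in-place update, no explicit serialization by the app
    heap.commit()             # consistent point: only the changed object is replicated
    print(heap.recover()[h])  # {'balance': 75}
```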

Embrace the Noise

As parallelism increases, applications become increasingly sensitive to performance uncertainty ("noise"). This has long been true for bulk-synchronous applications, which are common in scientific computing. However, large interactive applications (which are common in cloud settings) now suffer from noise as well, in the form of tail latency. There are many potential approaches to reducing the noise of a system. One technique is batch scheduling, where entire nodes are allocated to jobs until completion; while common in HPC and big-data systems, it would leave utilization unacceptably low for interactive applications. Another solution is gang scheduling, where tasks are multiplexed but all threads of a task are scheduled at the same time. We implemented such a scheduler in Xen and ran several benchmarks to evaluate its performance.
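The toy Python model below (not the Xen implementation) illustrates the basic gang-scheduling idea and where its utilization cost comes from: in each time slice, all threads of one task run together, and any cores that task cannot fill are left idle.

```python
# Toy illustration (not the actual Xen scheduler): gang scheduling runs all
# threads of one task in the same time slice, so no thread ever waits on a
# descheduled sibling; cores the task cannot fill sit idle that slice, which
# is where the utilization cost comes from.
def gang_schedule(tasks, num_cores, num_slices):
    """tasks: dict mapping task name -> thread count (each <= num_cores)."""
    order = list(tasks)
    timeline = []
    for t in range(num_slices):
        task = order[t % len(order)]             # round-robin over whole tasks
        threads = tasks[task]
        slice_alloc = [f"{task}:{i}" for i in range(threads)]
        slice_alloc += ["idle"] * (num_cores - threads)   # unused cores stay idle
        timeline.append(slice_alloc)
    return timeline


if __name__ == "__main__":
    tasks = {"mpi-job": 4, "web-tier": 2}
    for t, cores in enumerate(gang_schedule(tasks, num_cores=4, num_slices=4)):
        print(f"slice {t}: {cores}")
```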

While gang scheduling succeeded in eliminating much of the uncertainty, it too had poor utilization. In fact, many other techniques to eliminate noise come with significant drawbacks. For instance, dynamic voltage and frequency scaling (DVFS), which changes a processor's performance in response to changing thermal/power conditions, is itself a source of noise and is therefore often turned off in high-performance clusters. However, DVFS yields significant power savings and will likely become more aggressive as process scaling slows, making it costly to disable.

Instead, we pursue the philosophy of "Embrace the Noise". Like faults, performance uncertainty is an unavoidable fact in large clusters. Instead of trying to eliminate noise, we need to introduce a certain flexibility into all layers of the software stack. It has long been understood that a truly fault-free OS is not practical; likewise, a noise-free OS is not practical, although there are practical techniques to reduce noise significantly. We plan to pursue this idea through research on latency-aware schedulers, unikernels, and application-level techniques (such as the Taurus garbage-collection coordination framework).

Applications

FireCaffe

Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. FireCaffe successfully scales deep neural network training across a cluster of GPUs.

FireCaffe is designed with FireBox-style warehouse-scale computing in mind. First, we select network hardware that achieves high bandwidth between GPU servers. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes.
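To see why the reduction tree wins at scale, the simplified communication model below compares the serialized transfer time at a single parameter server, which grows linearly with the number of GPUs, against a fan-in-2 reduction tree, which grows logarithmically. This is illustrative only: it is not FireCaffe code, and the bandwidth and gradient-size numbers are made up.

```python
# Simplified communication model (not FireCaffe code): time to aggregate one
# gradient buffer of `grad_bytes` across `workers` GPUs, ignoring overlap and
# latency terms, with per-link bandwidth `bw` in bytes/sec.

def param_server_time(workers, grad_bytes, bw):
    # The single parameter server must receive every worker's gradients and
    # send updates back, so its link serializes O(workers) transfers.
    return 2 * workers * grad_bytes / bw


def reduction_tree_time(workers, grad_bytes, bw, fan_in=2):
    # A tree of height ceil(log_fan_in(workers)) reduces gradients upward and
    # broadcasts updates back down; each level serializes at most fan_in transfers.
    height, span = 0, 1
    while span < workers:
        span *= fan_in
        height += 1
    return 2 * height * fan_in * grad_bytes / bw


if __name__ == "__main__":
    GRAD_BYTES = 50e6   # ~50 MB of gradients (illustrative)
    BW = 1.25e9         # ~10 Gb/s per link (illustrative)
    for n in (8, 32, 128):
        ps = param_server_time(n, GRAD_BYTES, BW)
        tree = reduction_tree_time(n, GRAD_BYTES, BW)
        print(f"{n:4d} GPUs: parameter server {ps:6.2f}s  reduction tree {tree:5.2f}s")
```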

When training GoogLeNet and Network-in-Network on ImageNet on a cluster of 128 GPUs, we achieve speedups of 47x and 39x, respectively.

Read the FireCaffe paper on arXiv.

Sponsors

FireBox is funded by DARPA, STARnet, CFAR, Intel, Google, HP, Huawei, LG, Nvidia, Oracle, and Samsung.