The FireBox project aims to develop a system architecture for third-generation Warehouse-Scale Computers (WSCs). FireBox scales up to a ~1 MegaWatt WSC containing up to 10,000 compute nodes and up to an Exabyte (2^60 Bytes) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. The FireBox project will produce custom datacenter SoCs, distributed simulation tools for warehouse-scale machines, and systems software for FireBox-style disaggregated datacenters.
The complete FireBox design contains petabytes of flash storage, large
quantities of bulk DRAM, as well as high-bandwidth on-package DRAM. Each FireBox
node contains a custom System-on-a-Chip (SoC) with combinations of application
processors, vector machines, and custom hardware accelerators. Fast SoC network
interfaces reduce the software overhead of communicating between application
services, and high-radix network backplane switches connected by Terabit/sec
optical fibers reduce the network's contribution to tail latency. The very large
non-volatile store directly supports in-memory databases, and pervasive encryption
ensures that data is always protected in transit and in storage. These system
characteristics raise a number of novel questions in programming environments,
operating systems, and hardware design.
For more details, see Krste Asanovic's talk from FAST'14.
The following sections summarize a variety of ongoing FireBox-related projects
in the BAR group.
The FireSim project aims to enable accurate prototyping and benchmarking of
radically new datacenter designs like FireBox. FireSim simulates these datacenters at
full-scale and using full software stacks (e.g. BITS,
PerfKit), by taking advantage
of existing clusters with distributed reconfigurable computing elements.
The modularity of FireSim allows users to trade off simulation accuracy against
performance at the granularity of individual components in the warehouse-scale
machine. Models used in FireSim can be derived from silicon-proven RTL, while
handwritten software models can also be used for prototyping before RTL is available.
FireChip is a planned series of tapeouts to prototype FireBox SoCs containing
out-of-order RISC-V cores, accelerators, and high-performance I/O devices.
Future high-performance clusters, whether in supercomputing (e.g. exascale) or in the datacenter (e.g. warehouse-scale computers), will have several novel properties that present challenges for operating system design. CPUs are becoming more specialized and self-contained in a move away from traditional NUMA architectures. The advent of silicon photonics will allow high-radix links at very high bandwidth and low latency. Finally, novel NVM and bulk-DRAM technologies are leading to a new level in the memory hierarchy. Together, these technologies suggest that future clusters will involve many SoCs connected over photonic links both to each other and to bulk disaggregated memory. Examples of these trends can already be seen in systems like the Cori supercomputer (which has many Intel Xeon Phi processors and bulk memory "burst buffers") or HP's "The Machine" (which has NVM connected over photonic links).
Along with hardware trends, current software trends will influence OS design. Current datacenter/HPC applications must be able to scale out; few applications will use a single CPU, and resource demands may change dynamically. Often, scaling will occur through a service-oriented architecture, where many specialized and latency-sensitive applications interact to form a larger job. All of this will occur in a multi-tenant environment where many independent users require performance and security isolation. Finally, increasing power/cooling costs and overprovisioning will lead to power-constrained systems that require sophisticated power management from the OS.
Wabash is a project to provide a cluster-first OS that addresses the challenges of future datacenter and HPC environments. By cluster-first we mean that the entire cluster is managed as a single machine and individual modules are simply resources. Mesosphere has already begun this trend with the concept of a "Datacenter Operating System" (DCOS), which deploys several distributed cluster-management tools over a number of traditional Linux nodes. Indeed, the "Wabash 1.0" system is simply the Xen hypervisor running application-specific unikernels, paired with a cluster management framework (e.g. Mesos, OpenStack, ...).
From this base system, we move toward a true cluster-first OS through various sub-projects:
Remote/Non-Volatile Memory: Nephele
As systems become more modular, memory and compute can fail independently, so data in memory can persist and remain available across compute failures. This persistence and high availability are only useful if the data remains in a consistent state at all times. Nephele implements a transactional interface over critical memory, allowing an application to ensure that its critical in-memory data is in a consistent state at all times. It does this through a simple allocator-based interface that doesn't require explicit serialization/deserialization (pointers continue to work after recovery). Changes are automatically detected and only the changed data is replicated at commit points.
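The commit behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Nephele's actual interface: the class `TransactionalRegion` and its methods `alloc`, `write`, `read`, `commit`, and `recover` are hypothetical names, the "replica" is an in-process dictionary standing in for remote or non-volatile memory, and changes are tracked through an explicit `write` call rather than detected automatically.

```python
import copy


class TransactionalRegion:
    """Toy model of a transactional region of critical memory (hypothetical API)."""

    def __init__(self):
        self.working = {}   # live objects, keyed by handle
        self.replica = {}   # stand-in for the remote / non-volatile copy
        self.dirty = set()  # handles changed since the last commit
        self._next = 0

    def alloc(self, value):
        """Allocate an object in the region and return a stable handle."""
        handle = self._next
        self._next += 1
        self.working[handle] = value
        self.dirty.add(handle)
        return handle

    def write(self, handle, value):
        """Update an object and record it as changed (explicit in this sketch)."""
        if handle not in self.working:
            raise KeyError(handle)
        self.working[handle] = value
        self.dirty.add(handle)

    def read(self, handle):
        return self.working[handle]

    def commit(self):
        """Replicate only the changed objects; return how many were sent."""
        sent = len(self.dirty)
        for handle in self.dirty:
            self.replica[handle] = copy.deepcopy(self.working[handle])
        self.dirty.clear()
        return sent

    def recover(self):
        """Drop uncommitted changes and restore the last committed state."""
        self.working = copy.deepcopy(self.replica)
        self.dirty.clear()
```

In this model, handles play the role of pointers: a handle obtained before a failure still refers to the same object after `recover`, and a commit following a single `write` replicates one object rather than the whole region.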
Embrace the Noise
As parallelism increases, applications become increasingly sensitive to performance uncertainty ("noise"). This has long been true for bulk-synchronous applications, which are common in scientific computing. However, large interactive applications (which are common in cloud settings) are now suffering from noise as well, in the form of tail latency. There are many potential approaches to reducing the noise of a system. One technique is batch scheduling, where entire nodes are allocated to jobs until completion. While this is common in HPC and big-data systems, it would lead to unacceptable levels of utilization for interactive applications. Another solution is gang scheduling, where tasks are multiplexed, but all threads of a task are scheduled at the same time. We implemented such a scheduler in Xen and ran several benchmarks to evaluate its performance.
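A minimal sketch of the gang-scheduling idea follows. It is not the Xen scheduler mentioned above: the functions `gang_schedule` and `utilization`, the greedy round-robin packing policy, and the example workload are all assumptions made for illustration. The one property it preserves is that a task's threads are either all running in a time slice or none are.

```python
from collections import deque


def gang_schedule(gangs, num_cores, num_slices):
    """Build a schedule in which each gang runs all of its threads or none.

    gangs: dict mapping a task name to its number of threads.
    Returns a list of time slices, each a list of the task names run together.
    """
    for name, threads in gangs.items():
        if threads > num_cores:
            raise ValueError(f"gang {name!r} needs more cores than exist")
    queue = deque(gangs)
    schedule = []
    for _ in range(num_slices):
        free = num_cores
        running = []
        # Greedily admit whole gangs in queue order; never split a gang.
        for name in list(queue):
            if gangs[name] <= free:
                running.append(name)
                free -= gangs[name]
        # Move the gangs that ran to the back of the queue for fairness.
        for name in running:
            queue.remove(name)
            queue.append(name)
        schedule.append(running)
    return schedule


def utilization(schedule, gangs, num_cores):
    """Fraction of core-slices that were actually occupied by a thread."""
    used = sum(gangs[name] for time_slice in schedule for name in time_slice)
    return used / (num_cores * len(schedule))
```

With three tasks of 3, 3, and 2 threads on 4 cores, no two gangs fit in the same slice under this policy, so each slice leaves cores idle and utilization over three slices is 8/12. This is the kind of fragmentation that makes utilization a concern for gang scheduling.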
While gang-scheduling succeeded in eliminating much of the uncertainty, it too had poor utilization. In fact, many other techniques to eliminate noise come with significant drawbacks. For instance, dynamic voltage and frequency scaling (DVFS), which can change the performance of a processor in response to changing thermal/power conditions, is often turned off in high-performance clusters. However, DVFS results in significant power savings and will likely become more aggressive as process scaling slows.
Instead, we pursue the philosophy of "Embrace the Noise". Like fault-tolerance, performance uncertainty is an unavoidable fact in large clusters. Rather than trying to eliminate noise, we need to introduce a certain flexibility into all layers of the software stack. It has long been understood that a truly fault-tolerant OS is not practical. Likewise, a noise-free OS is not practical, although there are practical techniques to reduce the noise significantly. We plan to pursue this idea through research on latency-aware schedulers, unikernels, and application-level techniques (such as the Taurus garbage-collection coordination framework).
Long training times for high-accuracy deep neural networks (DNNs) impede
research into new DNN architectures and slow the development of high-accuracy
DNNs. FireCaffe successfully scales deep neural network training across
a cluster of GPUs.
FireCaffe is designed with FireBox-style warehouse-scale computing in mind.
First, we select network hardware that achieves high bandwidth between GPU
servers. Second, we consider a number of communication algorithms, and we find
that reduction trees are more efficient and scalable than the traditional
parameter server approach. Third, we optionally increase the batch size to
reduce the total quantity of communication during DNN training, and we identify
hyperparameters that allow us to reproduce the small-batch accuracy while
training with large batch sizes.
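The scaling argument for reduction trees can be sketched in a few lines. This is an illustrative model, not FireCaffe's implementation: the function names `tree_reduce` and `parameter_server_receives` are hypothetical, gradients are plain Python lists, and the parameter-server cost assumes a single server that receives one gradient message per worker, one after another.

```python
def tree_reduce(grads):
    """Sum per-worker gradient vectors using a binary reduction tree.

    grads: non-empty list of equal-length lists, one per worker.
    Returns (summed_vector, rounds), where each round is one step in which
    all sender/receiver pairs could communicate in parallel.
    """
    bufs = [list(g) for g in grads]
    p = len(bufs)
    rounds = 0
    stride = 1
    while stride < p:
        # In this round, worker dst receives from worker dst + stride.
        for dst in range(0, p, 2 * stride):
            src = dst + stride
            if src < p:
                bufs[dst] = [a + b for a, b in zip(bufs[dst], bufs[src])]
        stride *= 2
        rounds += 1
    return bufs[0], rounds


def parameter_server_receives(p):
    """Serialized receives at a single parameter server (assumed cost model)."""
    return p
```

Under these assumptions, 128 workers need 7 tree rounds (the base-2 logarithm of 128), compared with 128 serialized receives at a single server, which is the logarithmic-versus-linear gap behind the preference for reduction trees.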
When training GoogLeNet and Network-in-Network on ImageNet on a cluster of
128 GPUs, we achieve speedups of 47x and 39x, respectively.