Computational storage devices are the new must-have peripherals for intensive storage applications.
BEN WHITEHEAD, Storage Product Specialist, Mentor, a Siemens Business, and LAURO RIZZATTI, Verification Expert
July 2020
In the world of computing, moving data is an expensive proposition, slowing performance and increasing power consumption. The challenge is magnified as storage developers attempt to replace hard-disk drives (HDDs) with solid-state drives (SSDs).
The design community has looked for ways to decrease data movement between storage and compute engines since the early days of SSDs. Out of that quest emerged a new approach, the computational storage device (CSD), that reduces overall energy consumption and increases performance.
When comparing data movement against computation, accessing data in storage consumes orders of magnitude more energy than processing the same data in the compute engine.
The energy cost of various computations versus memory accesses is recapped in Table 1. An 8-bit fixed-point integer addition consumes 0.03 picojoules (pJ), while the same addition in 8-bit floating point burns roughly 10 times more energy. Reading from an 8-kilobyte SRAM consumes 5 pJ, another order of magnitude more, while reading from a large DRAM burns 640 pJ, two orders of magnitude more than the SRAM access.
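To put these figures in perspective, here is a minimal back-of-the-envelope sketch in Python using the numbers quoted above; the workload model (one memory read per 8-bit addition) is an illustrative assumption, not data from the article.

```python
# Energy figures quoted from Table 1 (picojoules per operation).
INT8_ADD_PJ  = 0.03    # 8-bit fixed-point integer addition
SRAM_READ_PJ = 5.0     # read from an 8 KB SRAM
DRAM_READ_PJ = 640.0   # read from a large DRAM

def total_energy_pj(num_ops: int, read_pj: float, add_pj: float = INT8_ADD_PJ) -> float:
    """Energy to fetch one operand per addition and perform the addition (assumed workload)."""
    return num_ops * (read_pj + add_pj)

ops = 1_000_000  # one million 8-bit additions
for name, read_cost in [("SRAM", SRAM_READ_PJ), ("DRAM", DRAM_READ_PJ)]:
    total = total_energy_pj(ops, read_cost)
    print(f"{name}: {total / 1e6:.1f} microjoules "
          f"(each access costs {read_cost / INT8_ADD_PJ:.0f}x the add itself)")
```

Even in this simplified model, the memory access, not the arithmetic, dominates the energy bill, which is the imbalance computational storage sets out to fix.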
As shown in FIGURE 1, a CSD eliminates the back-and-forth data transfers with the host computer by performing computations locally within the SSD.
In the SSD example, the top blue arrow represents the data request issued by the host to storage. The two orange arrows represent the read data moving from the SSD across the data bus to the host, and the data manipulated by the host moving back to the SSD. The orange arrows highlight massive power consumption and performance degradation. The CSD concept adds a small processing element next to the storage, so the data requested by the host can be computed locally by that processing element, eliminating the transfer to and from the host, saving power and accelerating execution.

Two implementations of the CSD architecture are available today, based either on a dedicated field programmable gate array (FPGA) or on an application processor (AP). In the first, an FPGA attached to the SSD on the same board is configured with compute elements, such as an Arm core or DSPs, to process data locally. The benefit is improved performance; the drawback is that the FPGA must be made file-aware of the activity inside the SSD before it can retrieve and process files, which requires a system for file management that reads, processes and writes files back to the storage fabric.
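The data-movement saving is easy to see in a toy model. The sketch below, a conceptual illustration with hypothetical byte counters and a placeholder workload, contrasts the traditional host-side path with in-situ processing; it is not tied to any particular CSD product.

```python
# Conceptual comparison of host-side processing vs. in-situ (CSD) processing.
# `process` stands in for any application kernel (filtering, sorting, compression, ...).

def host_side(stored: bytes, process):
    """Traditional SSD: the data crosses the host bus twice (read, then write-back)."""
    bus_bytes = len(stored)        # read: SSD -> host
    result = process(stored)       # compute on the host CPU
    bus_bytes += len(result)       # write-back: host -> SSD
    return result, bus_bytes

def in_situ(stored: bytes, process):
    """CSD: the on-drive processor computes next to the NAND; bulk data never leaves the drive."""
    result = process(stored)
    return result, 0               # only the small request/response crosses the bus

data = bytes(64 * 1024)                 # pretend 64 KB file resident on the drive
workload = lambda b: bytes(reversed(b)) # placeholder computation
_, host_traffic = host_side(data, workload)
_, csd_traffic = in_situ(data, workload)
print(f"host path moved {host_traffic} bytes over the bus, CSD path moved {csd_traffic}")
```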
The alternative is to deploy an AP that runs Linux. Linux is natively file-aware, readily available, ensures the consistency of the file system and is backed by an open-source developer community, existing infrastructure, a range of development tools and many applications and protocols. The setup is known as “on-drive Linux.”
Installed in a PCIe slot, a CSD appears to the host as a standard non-volatile memory express (NVMe) SSD. Because Linux runs locally, the user can open a secure-shell (SSH) connection into the drive and treat it like any other host on the network: the CSD becomes a network-attached node, in effect a headless server running inside an SSD. Applications can directly manipulate files stored on the NAND.
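As a concrete illustration of this model, the hedged sketch below uses Python's Paramiko library to open an SSH session into such a drive and run a command on its on-drive Linux. The address, credentials, command and file path are all hypothetical placeholders.

```python
# Treating the CSD as a headless server: SSH in and run an application on the drive itself.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# Hypothetical network address and credentials for the drive's on-drive Linux.
client.connect("192.0.2.10", username="csd", password="example")

# The command executes next to the NAND; only its small result returns to us.
stdin, stdout, stderr = client.exec_command("grep -c ERROR /data/logs/app.log")
print("matches found on the drive:", stdout.read().decode().strip())
client.close()
```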
Disaggregated storage will become even more important as storage systems move to the non-volatile memory express over fabrics (NVMe-oF) protocol, which connects hosts to storage across a network fabric using NVMe. With CSDs, designers can disaggregate storage and move compute “in situ” to improve performance, lower power usage and free PCIe bandwidth for the rest of the system, while NVMe-oF removes some of the storage network bottlenecks.
While adding SSDs to servers in data centers scales the amount of storage, it does not scale processing power. Adding CSDs, by contrast, scales both total capacity and performance, thanks to the local processor in each drive: as the number of CSDs in a data center grows, the number of processors grows with it, and performance scales accordingly.
The TERASORT benchmark in FIGURE 2 compares SSD and CSD deployments as the number of drives increases from 1 to 8. Adding SSDs does not change performance (orange line), while adding CSDs accelerates it (blue line), with the crossover point at four units. One CSD takes approximately 850 milliseconds (ms) to run the TERASORT benchmark; eight CSDs run the same benchmark in about 500 ms.
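For reference, the speedup implied by the two quoted data points can be computed directly; only the 1-drive and 8-drive figures come from the benchmark, everything else here is annotation.

```python
# End points quoted from FIGURE 2 (approximate TERASORT run times).
one_csd_ms = 850
eight_csd_ms = 500

speedup = one_csd_ms / eight_csd_ms
print(f"8 CSDs vs. 1 CSD: {speedup:.2f}x faster")  # roughly 1.7x
# SSDs without local compute stay flat (the orange line), so past the
# crossover at about four units the CSD configuration wins outright.
```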
CSD verification
Adding processors plus an entire Linux stack and applications to the already complex hardware/firmware of a traditional SSD makes CSD verification a challenging task.
Traditional verification approaches are inadequate for this added complexity, and a recent discovery found that the non-deterministic nature of SSD storage interferes with hyperscale data center requirements. The solution to both problems is hardware emulation-based virtual verification, which allows pre-silicon performance and latency testing within 5% of actual silicon. Veloce VirtuaLAB from Mentor, a Siemens Business, is an example of this virtualization methodology. A new set of tools for CSD design verification, from block level to system level, was needed, built on top of the tools and expertise developed to support a networking verification methodology.
System-level verification includes six parts starting with a PCIe/NVMe standard host interface setup:
1. Virtual NVMe/PCIe host running real-world applications on Quick EMUlator (QEMU) to implement host traffic
2. Veloce Protocol Analyzer for visibility on all interfaces such as NVMe, PCIe and the NAND fabric
3. Hybrid configurable platform with software stack to boot Linux and run applications virtually, with the ability to save the state of the system at any point and restart from it as needed
4. Veloce emulation platform to emulate the CSD design under test (DUT) in pre-silicon with real-world traffic
5. Soft models for