For the past few years, the battle over AI, deep learning, and other HPC (High-Performance Computing) workloads has been mostly a two-horse race between Nvidia, the first company to launch a GPGPU architecture that could theoretically handle such workloads, and Intel, which has continued to focus on increasing the number of floating-point operations its Core processors can perform per clock cycle. AMD is ramping up its own Radeon Instinct and Vega Frontier Edition cards to tackle AI as well, though the company has yet to win much market share in that arena. But now there’s an emerging fourth player: Fujitsu.
Fujitsu’s new DLU (Deep Learning Unit) is meant to be 10x faster than existing solutions from its competitors, with support for Fujitsu’s torus interconnect. It’s not clear if this refers to Tofu (torus fusion) 1, which the existing K computer uses, or if the platform will also support Tofu 2, which improves bandwidth from 40Gbps to 100Gbps (from 5GBps to 12.5GBps). Tofu 2 would seem to be the much better choice, but Fujitsu hasn’t clarified that point yet.
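The two sets of bandwidth figures above are the same numbers in different units, since a byte is 8 bits. A minimal sketch of the conversion (the function name is ours, purely for illustration):

```python
def gbps_to_gigabytes_per_second(gigabits_per_second: float) -> float:
    """Convert a link rate from gigabits/s to gigabytes/s (8 bits per byte)."""
    return gigabits_per_second / 8


tofu1 = gbps_to_gigabytes_per_second(40)   # Tofu 1 link rate quoted in the article
tofu2 = gbps_to_gigabytes_per_second(100)  # Tofu 2 link rate quoted in the article

print(tofu1, tofu2, tofu2 / tofu1)  # 5.0 12.5 2.5
```

In other words, Tofu 2 offers 2.5x the per-link bandwidth of the original Tofu.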
Underneath the DLU are an unspecified number of DPUs (Deep Learning Processing Units). The DPUs are capable of running FP32, FP16, INT16, and INT8 data types. According to Top500, Fujitsu has previously demonstrated that INT8 can be used without a significant loss of accuracy. Depending on the design specs, this may be one way Fujitsu hopes to hit its performance-per-watt targets.
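To give a feel for why INT8 can hold up, here is a minimal sketch of generic symmetric quantization applied to a dot product. This is not Fujitsu's method, which hasn't been published in detail; the function names, the random test data, and the single per-tensor scale are all assumptions made for illustration.

```python
import random


def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    # One scale for the whole list; fall back to 1.0 if every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


random.seed(0)  # hypothetical data, fixed seed for repeatability
weights = [random.uniform(-1, 1) for _ in range(1000)]
inputs = [random.uniform(-1, 1) for _ in range(1000)]

# Reference result in full floating point.
exact = sum(w * x for w, x in zip(weights, inputs))

# Integer multiply-accumulate, followed by a single float rescale at the end.
qw, sw = quantize_int8(weights)
qx, sx = quantize_int8(inputs)
approx = sum(a * b for a, b in zip(qw, qx)) * sw * sx

print(f"exact={exact:.4f} approx={approx:.4f}")
```

Each value is off by at most half a quantization step, and those small errors tend to partially cancel across a long sum, which is one reason neural network inference often tolerates 8-bit integers.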
Here’s what we know about the underlying design:
Each of the DPUs contains 16 DPEs (Deep Learning Processing Elements), and each DPE has 8 SIMD units with a very large register file (no cache) under software control. The entire DPU is controlled by a separate master core, which manages execution as well as memory access between the DPU and its on-chip memory controller.
So just to clarify: The DLU is the entire silicon chip, including memory, register files, and everything else. Each DPU is controlled by a separate master core, which also manages its memory accesses. The DPUs are made up of DPEs with their 8 SIMD units, and this is where the number crunching takes place. At a very high level, both AMD and Nvidia group resources in a similar way: certain resources are duplicated per Compute Unit (CU), and each CU has an associated number of cores.
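The hierarchy can be sketched as a small model. The per-DPU and per-DPE counts come from the figures above; the number of DPUs per chip is unspecified, so the value used in the example is purely hypothetical, as are the class and method names.

```python
from dataclasses import dataclass

DPES_PER_DPU = 16       # stated in the article
SIMD_UNITS_PER_DPE = 8  # stated in the article


@dataclass
class DLU:
    """Toy model of the chip: a DLU holds some number of DPUs."""

    num_dpus: int  # not disclosed by Fujitsu; any value here is an assumption

    def simd_units_per_dpu(self) -> int:
        return DPES_PER_DPU * SIMD_UNITS_PER_DPE

    def total_simd_units(self) -> int:
        return self.num_dpus * self.simd_units_per_dpu()


chip = DLU(num_dpus=4)  # hypothetical DPU count, for illustration only
print(chip.simd_units_per_dpu())  # 128
print(chip.total_simd_units())    # 512
```

Whatever the real DPU count turns out to be, each DPU contributes 128 SIMD units to the total.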
Fujitsu is already planning a second-generation core that will be integrated directly with a CPU, rather than being a distinct off-chip component. The company hopes to have the first-generation device ready for sale sometime in 2018, with no firm date given for the introduction of the second-gen device.