Supporting Extremely Heterogeneous Computing in HPC, AI, and Data Analytics
MCL aims at abstracting the low-level hardware details of a system, supporting the execution of complex workflows that consist of multiple, independent applications (e.g., a scientific simulation coupled with in-situ analysis, or AI frameworks that analyze the results of a physics simulation), and performing efficient, asynchronous execution of computation tasks. MCL is not meant to be the programming model employed by domain scientists to implement their algorithms, but rather to support several high-level Domain-Specific Languages (DSLs) and programming-model runtimes. Currently, MCL supports OpenMP, OpenACC [5], TACO [6], MPI, and pthreads. Work is in progress to support AI frameworks, such as TensorFlow, and other DSLs for chemistry applications.
An MCL application consists of a sequence of tasks that need to be executed on the available computing resources. The MCL programming
model Application Programming Interface (API) allows users to specify tasks and to express control dependencies among them. Once submitted, tasks are scheduled for
execution on a specific device by the MCL scheduler, according to the scheduling algorithm in use.
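This submit-then-wait pattern can be sketched as C-flavored pseudocode. The function names, flags, and signatures below are illustrative approximations modeled on MCL's task API, not its exact interface:

```c
/* C-flavored pseudocode sketch of the MCL task model.
 * All names and signatures here are assumptions for illustration. */
mcl_init(num_workers, flags);                 /* set up the MCL runtime */

mcl_handle *t = mcl_task_create();            /* describe a new task */
mcl_task_set_kernel(t, "vadd.cl", "vector_add", /*nargs=*/3);
mcl_task_set_arg(t, 0, a, size, MCL_ARG_INPUT  | MCL_ARG_BUFFER);
mcl_task_set_arg(t, 1, b, size, MCL_ARG_INPUT  | MCL_ARG_BUFFER);
mcl_task_set_arg(t, 2, c, size, MCL_ARG_OUTPUT | MCL_ARG_BUFFER);

/* Asynchronous submission: the scheduler picks a device
 * (CPU, GPU, FPGA, ...) according to the policy in use. */
mcl_exec(t, global_size, local_size, MCL_TASK_ANY);

mcl_wait(t);                                  /* block until completion */
mcl_finit();                                  /* tear down the runtime */
```

The key point of the model is that the application only describes tasks and their data; device selection and asynchronous execution are delegated entirely to the scheduler.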
MCL leverages the OpenCL library and API to interface with computing devices and to express computational kernels. Normally, users do not need to directly
write OpenCL kernels, as they are automatically generated by the higher-level DSL compiler (e.g., TACO), though directly writing OpenCL kernels and
implementing an algorithm using the MCL API is certainly possible. OpenCL allows MCL to execute the same computational kernel on different computing
devices, including CPUs, GPUs, and FPGAs, as well as some of the novel AI engines, such as the NVIDIA Deep Learning Accelerator (DLA).
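As a concrete illustration, a hand-written kernel of the kind MCL dispatches might look like the following standard OpenCL C vector addition (an example for exposition, not code from MCL's distribution):

```c
// OpenCL C kernel: element-wise vector addition.
// The same kernel source can be dispatched by MCL to a CPU,
// a GPU, or an FPGA device without modification.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);   // one work-item per output element
    c[i] = a[i] + b[i];
}
```

Because OpenCL defers compilation of such kernels to the target device's driver, a single kernel source is portable across all device classes MCL supports.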
MCL has been shown to effectively leverage heterogeneous computing resources [2], scaling up to complex multi-device systems and down to efficient embedded
systems. Code developed on a laptop computer seamlessly scales to powerful multi-GPU workstations without any modification, automatically achieving 5-17x
speedups on an 8-GPU node.