Description
Modern lattice QCD (LQCD) measurement workflows require the iterative solution of hundreds or thousands of linear systems, each with a distinct source but the same discrete Dirac fermion stencil. Batching multiple independent linear solves that share a fixed stencil improves compute throughput by exposing additional data parallelism and increasing data reuse, since the stencil data is loaded once and applied to many vectors. Because these gains multiply across the many solves in an LQCD workflow, batched solves improve time-to-science with minimal additional effort from users. The publicly available QUDA library for all GPUs now includes a feature-complete implementation of batched solves, including support for batched deflation and multigrid algorithms. In this poster, we present results from real science workflows driven by the MILC and Chroma applications and accelerated by the new batched algorithms in QUDA.
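As a rough illustration of the batching idea, the following Python sketch solves several systems that share one fixed stencil in a single conjugate gradient loop. It is not QUDA's API or implementation: `apply_stencil` and `batched_cg` are hypothetical names, and the stencil is a toy real, symmetric positive-definite 1D operator standing in for the Dirac stencil. Each column keeps its own CG recurrence, so the solves stay independent while every stencil application acts on the whole block of vectors at once.

```python
import numpy as np


def apply_stencil(x):
    """Apply one fixed stencil to every column of x, shape (n_sites, n_rhs).

    Toy stand-in (assumption): a shifted periodic 1D Laplacian, which is
    symmetric positive definite with eigenvalues in [0.5, 4.5].
    """
    return 2.5 * x - np.roll(x, 1, axis=0) - np.roll(x, -1, axis=0)


def batched_cg(b, tol=1e-10, max_iter=500):
    """Solve A x_k = b_k for all columns k with independent CG recurrences."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = np.sum(r * r, axis=0)   # squared residual norm of each system
    b2 = np.sum(b * b, axis=0)   # squared source norm of each system
    iters = 0
    while iters < max_iter:
        active = rr > tol**2 * b2  # systems that have not converged yet
        if not np.any(active):
            break
        Ap = apply_stencil(p)      # one stencil pass serves every system
        pAp = np.sum(p * Ap, axis=0)
        # Converged systems get alpha = beta = 0, which freezes their solution.
        alpha = np.where(active, rr / np.where(active, pAp, 1.0), 0.0)
        x += alpha * p
        r -= alpha * Ap
        rr_new = np.sum(r * r, axis=0)
        beta = np.where(active, rr_new / np.where(active, rr, 1.0), 0.0)
        p = r + beta * p
        rr = rr_new
        iters += 1
    return x, iters


# Usage: eight random sources on a 64-site periodic lattice.
rng = np.random.default_rng(0)
b = rng.standard_normal((64, 8))
x, iters = batched_cg(b)
print("iterations:", iters)
print("max residual:", np.max(np.abs(apply_stencil(x) - b)))
```

Storing the sources as columns of one array lets a single stencil call serve all systems, which is the data reuse the abstract refers to. A production solver would apply the same idea to complex fields on a 4D lattice.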