Workshop: 12th Workshop on Accelerator Programming and Directives (WACCPD 2025)
Authors: Fernando Vazquez-Novoa (Barcelona Supercomputing Center (BSC)), Pedro López and José Flich (Universidad Politecnica de Valencia), and Rosa M. Badia (Barcelona Supercomputing Center (BSC))
Abstract: Training large neural networks is computationally demanding and, in distributed environments, often limited by synchronization overhead. Traditional data-parallel frameworks, such as Horovod or PyTorch DDP, average gradients after every batch, and the resulting communication can become a scalability bottleneck.
In this work, we propose two novel data-parallel strategies that reduce synchronization by averaging weights and biases only at the end of each epoch. These methods are implemented using the PyCOMPSs task-based programming model and integrated into dislib, enabled by a new distributed tensor abstraction (ds-tensor) that supports multidimensional data structures suitable for deep learning workloads.
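As an illustration of the general idea of epoch-level parameter averaging, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use PyCOMPSs, dislib, or ds-tensor. The toy linear model, the synthetic data, and the helper names (`local_epoch`, `train_epoch_averaging`, `make_shards`) are all assumptions made for this example. Each simulated worker runs one full epoch of SGD on its own shard, and the weight and bias are averaged once per epoch instead of after every batch.

```python
import random


def local_epoch(w, b, shard, lr=0.05):
    """Run one epoch of plain SGD on one worker's shard, starting from (w, b)."""
    for x, y in shard:
        err = (w * x + b) - y  # prediction error of the linear model w*x + b
        w -= lr * err * x      # gradient step on the weight
        b -= lr * err          # gradient step on the bias
    return w, b


def train_epoch_averaging(shards, epochs=200, lr=0.05):
    """Train with one synchronization per epoch: average weights and biases."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Each worker trains independently for a whole epoch (sequential here;
        # in a task-based runtime these calls could run in parallel).
        results = [local_epoch(w, b, shard, lr) for shard in shards]
        # Single synchronization point: average the parameters across workers.
        w = sum(r[0] for r in results) / len(results)
        b = sum(r[1] for r in results) / len(results)
    return w, b


def make_shards(n_workers=4, per_worker=25, seed=0):
    """Hypothetical noiseless data y = 2x + 1, split evenly across workers."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_workers * per_worker)]
    data = [(x, 2.0 * x + 1.0) for x in xs]
    return [data[i::n_workers] for i in range(n_workers)]


shards = make_shards()
w, b = train_epoch_averaging(shards)
print(f"w = {w:.4f}, b = {b:.4f}")
```

In this sketch each worker exchanges parameters once per epoch, whereas per-batch gradient averaging with the same shards would synchronize 25 times per epoch (once per sample, since the batch size here is one).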
We evaluate our approach on classification and regression tasks using real-world datasets and federated learning scenarios. Results show up to 95% training time reduction and strong scalability up to 64 workers, while maintaining or improving model accuracy. Our strategies enable asynchronous, communication-efficient training and are well-suited for heterogeneous and large-scale HPC systems.