Hyppää sisältöön
    • Suomeksi
    • In English
Trepo
  • Suomeksi
  • In English
  • Kirjaudu
Näytä viite 
  •   Etusivu
  • Trepo
  • TUNICRIS-julkaisut
  • Näytä viite
  •   Etusivu
  • Trepo
  • TUNICRIS-julkaisut
  • Näytä viite
JavaScript is disabled for your browser. Some features of this site may not work without it.

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Solanti, Jan; Babej, Michal; Ikkala, Julius; Malamal Vadakital, Vinod Kumar; Jääskeläinen, Pekka (2022)

 
Avaa tiedosto
PoCL_R_A_Scalable_Low_Latency.pdf (637.2Kt)
Lataukset: 



Solanti, Jan
Babej, Michal
Ikkala, Julius
Malamal Vadakital, Vinod Kumar
Jääskeläinen, Pekka
2022

This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
doi:10.1007/978-3-031-04580-6_6
Näytä kaikki kuvailutiedot
Julkaisun pysyvä osoite on
https://urn.fi/URN:NBN:fi:tuni-202205044345

Kuvaus

Peer reviewed
Tiivistelmä
Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer that can provide both low latency and scalable utilization of remote heterogeneous computing resources is needed. To this end, we propose a latency-optimized scalable distributed heterogeneous computing runtime implementing the standard OpenCL API.In the proposed runtime, network-induced latency is reduced by means of peer-to-peer data transfers and event synchronization as well as a streamlined control protocol implementation. Further improvements can be obtained streaming of source data directly from the producer device to the compute cluster. Compute cluster scalability is improved by distributing the command and event processing responsibilities to remote compute servers. We also show how a simple optional dynamic content size buffer OpenCL extension can significantly speed up applications that utilize variable length data.For evaluation we present a smartphone-based augmented reality rendering case study which, using the runtime, receives 19× improvement in frames per second and 17× improvement in energy per frame when offloading parts of the rendering workload to a nearby GPU server. The remote kernel execution latency overhead of the runtime is only 60 ms on top of the network roundtrip time. The scalability on multi-server multi-GPU clusters is shown with a distributed large matrix multiplication application.
Kokoelmat
  • TUNICRIS-julkaisut [23480]
Kalevantie 5
PL 617
33014 Tampereen yliopisto
oa[@]tuni.fi | Tietosuoja | Saavutettavuusseloste
 

 

Selaa kokoelmaa

TekijätNimekkeetTiedekunta (2019 -)Tiedekunta (- 2018)Tutkinto-ohjelmat ja opintosuunnatAvainsanatJulkaisuajatKokoelmat

Omat tiedot

Kirjaudu sisäänRekisteröidy
Kalevantie 5
PL 617
33014 Tampereen yliopisto
oa[@]tuni.fi | Tietosuoja | Saavutettavuusseloste