Applying Software-Managed Caching and CPU/GPU Task Scheduling for Accelerating Dynamic Workloads

Owens, in GPU Computing Gems Jade Edition, 2012

36.1 Introduction, Problem Statement, and Context

This chapter endeavors to assist developers in overcoming two major bottlenecks of the high-end GPU platforms: memory bandwidth to the main (global) memory of the GPU, and the CPU-GPU communications. We faced both these problems when developing an application for computing the probability of evidence in probabilistic networks, and only by solving both did we achieve the desired performance improvement. Yet we believe that our techniques are applicable in a general context, and can be employed together and separately. In the chapter we describe the solution for each problem and demonstrate their combined effect on a real application as a whole.

Memory access optimization is among the main tools for improving application performance in CPUs and GPUs. It is of added importance if the algorithm has a low compute-to-memory access ratio. Often the same data are reused many times, and reorganizing the computations to exploit small but fast on-die caches might thus reduce the main memory bandwidth pressure and improve performance. Hardware caches employ input-independent replacement algorithms, such as Least Recently Used (LRU). Maximizing cache performance to exploit data reuse requires restructuring the code so that the actual access pattern matches the cache replacement algorithm. Unfortunately, high performance is difficult and sometimes even impossible to achieve without the ability to control the replacement decisions.

Modern NVIDIA GPUs expose fast scratchpad memory shared by multiple streaming processors on a multiprocessor. By design, the scratchpad memory lacks hardware caching support; hence, it is the responsibility of the kernel to implement a software-managed cache, which implies determining which data to stage from the main memory and when to stage it. For cases where this determination is data-dependent, the decision must be made at runtime. The main challenge, then, is to minimize the overhead of the cache management code, which resides on the critical path of every memory access. In Section 36.3.1 we introduce techniques for analyzing the data access patterns and designing a read-only low-overhead software-managed cache for NVIDIA GPUs.
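To make the staging idea concrete, below is a minimal sketch of a kernel that uses the scratchpad (declared `__shared__` in CUDA) as a software-managed cache for a small read-only table that every thread in a block reads repeatedly. The kernel name, tile size, and indexing scheme are illustrative assumptions, not the cache design of Section 36.3.1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // number of table entries each block stages (assumed size)

__global__ void lookup_kernel(const float* __restrict__ table,
                              const int* __restrict__ indices,
                              float* out, int n)
{
    // Software-managed cache: one tile of the read-only table,
    // staged from global memory once per block.
    __shared__ float cache[TILE];

    // Cooperative staging: each thread copies a slice of the tile.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        cache[i] = table[i];
    __syncthreads();  // the tile must be complete before any thread reads it

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        // Repeated lookups now hit the on-die scratchpad instead of global
        // memory; the modulo merely keeps this sketch in bounds, as indices
        // are assumed to fall inside the staged tile.
        out[gid] = cache[(unsigned)indices[gid] % TILE];
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_table, *d_out;
    int *d_idx;
    cudaMalloc(&d_table, TILE * sizeof(float));
    cudaMalloc(&d_idx,   n * sizeof(int));
    cudaMalloc(&d_out,   n * sizeof(float));
    // ... fill d_table and d_idx with application data ...
    lookup_kernel<<<(n + 255) / 256, 256>>>(d_table, d_idx, d_out, n);
    cudaDeviceSynchronize();
    return 0;
}
```

In this sketch the staging decision is fixed before launch; when the hot region of the table depends on the input, the set of entries to stage must itself be computed at runtime, which is exactly where the cache-management overhead on the critical path of every access comes from.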
Kernel performance optimization, however, is only one component in making the complete application run faster. Often, despite optimizations, the kernel performance may vary substantially for different inputs. In some cases executing the kernel on a GPU may actually decrease the performance, such as when not enough parallelism is available. Furthermore, the overhead of the CPU-GPU communications over the PCI Express bus may reduce or completely cancel out the advantages of using a GPU. In Section 36.3.3 we focus on optimizing the choice of the processor for the kernel execution in applications with multiple inter-dependent kernels.

A simple approach is to greedily assign the device providing the best overall performance for a given input. It will work well for isolated kernels, where both the kernel input and output must reside on a CPU. For such cases, the data will always be transferred from the CPU to the GPU and back, thus allowing for a local decision that considers only the performance of a given kernel on each device. However, for applications composed of multiple kernels with data dependencies, whereby the subsequent kernels use the results of the previous ones, different assignments or schedules of the computations on a CPU or a GPU may decisively influence the application running time. A schedule that optimizes the performance of each kernel separately is no longer sufficient for obtaining the best performance of the application as a whole.

Figure 36.1. An illustration of the program task dependency graph for computing A × B + C of matrices A, B, C.

Were the scheduler to consider the performance of each kernel alone, it would assign the product kernel to a CPU and the summation kernel to a GPU, yielding an execution time of 65 time units.
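A small host-side enumeration illustrates why the greedy choice can mislead for dependent kernels. All per-device timings and the transfer cost below are invented for illustration; only the 65-unit total of the greedy schedule is taken from the text. The point is that the PCI Express transfer term couples the two placement decisions.

```cuda
#include <cstdio>

static const char* dev(int d) { return d ? "GPU" : "CPU"; }

int main()
{
    // Hypothetical timings (time units): the product kernel is faster on
    // the CPU, the summation kernel on the GPU, and moving the intermediate
    // matrix across PCIe costs 25 units when the two devices differ.
    const double prod[2]  = {30.0, 35.0};  // product kernel   [CPU, GPU]
    const double sum[2]   = {20.0, 10.0};  // summation kernel [CPU, GPU]
    const double transfer = 25.0;          // paid only if devices differ

    // Greedy: pick each kernel's best device in isolation.
    int gp = prod[0] <= prod[1] ? 0 : 1;
    int gs = sum[0]  <= sum[1]  ? 0 : 1;
    double greedy = prod[gp] + sum[gs] + (gp != gs ? transfer : 0.0);
    printf("greedy: product on %s, sum on %s -> %.0f units\n",
           dev(gp), dev(gs), greedy);

    // Global: enumerate all four assignments of the dependent pair.
    double best = 1e30; int bp = 0, bs = 0;
    for (int p = 0; p < 2; ++p)
        for (int s = 0; s < 2; ++s) {
            double t = prod[p] + sum[s] + (p != s ? transfer : 0.0);
            if (t < best) { best = t; bp = p; bs = s; }
        }
    printf("global: product on %s, sum on %s -> %.0f units\n",
           dev(bp), dev(bs), best);
    return 0;
}
```

With these assumed numbers, the greedy scheduler picks product-on-CPU and summation-on-GPU for 65 units, while keeping both kernels on the GPU avoids the transfer entirely and finishes in 45 units.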