With the rapid growth of deep learning models and deep learning-based applications, how to accelerate the inference of deep neural networks, especially neural network operators, has become an increasingly important research area. As a bridge between a front-end deep learning framework and a back-end hardware platform, deep learning compilers aim to optimize various deep learning models for a range of hardware platforms with model- and hardware-specific optimizations. Apache TVM (or TVM for short), a well-known open-source deep learning compiler, uses a customized domain-specific language, called Tensor Expression Language, to define hardware-specific optimizations for neural network operators. TVM also allows users to write tensor expressions to design customized optimizations for specific operators. However, TVM provides users with neither supporting information, such as what computations are performed within an operator, nor tools for optimizing the operators in a deep learning model. In addition, tensor expressions have an entirely different syntax from imperative languages and are not easy to get started with. Furthermore, although TVM comes with an auto-tuning module, called AutoTVM, which facilitates the tuning of optimization configurations (e.g., tiling size and loop order), AutoTVM takes quite a long time to search for the optimum configurations for a set of optimizations. In this paper, we present DLOOPT, an optimization assistant that assists optimization developers in designing effective optimizations for neural network operators and/or obtaining optimum optimization configurations in a timely manner. DLOOPT specifically addresses three key aspects: (1) with DLOOPT, developers can focus solely on designing optimizations, since it offers sufficient information about the operators of a given model and provides an easier way to write optimizations, (2) DLOOPT minimizes the number of optimizations that developers need to design by allowing optimizations to be reused, and (3) DLOOPT greatly simplifies the tuning process by implementing a set of tuning strategies in AutoTVM. The evaluation results showed that DLOOPT reduced the time needed to develop adequate optimizations for the operators in a model by more than 99%. We believe that DLOOPT is friendly to optimization developers and allows them to quickly develop effective optimizations for neural network operators.
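The tiling size and loop order mentioned above are exactly the kind of schedule knobs an auto-tuner explores. The following self-contained C++ sketch is illustrative only (it is not TVM, AutoTVM, or DLOOPT code, and the tile sizes TI, TJ, and TK are hypothetical parameters); it shows how a tiled matrix multiplication exposes those knobs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled matrix multiplication C = A * B (all N x N, row-major).
// TI, TJ and TK play the role of the tuning knobs (tile sizes) that an
// auto-tuner would search over; the loop nest order (i-k-j here) is another
// tunable choice.
template <std::size_t TI = 32, std::size_t TJ = 32, std::size_t TK = 32>
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t ii = 0; ii < N; ii += TI)
        for (std::size_t kk = 0; kk < N; kk += TK)
            for (std::size_t jj = 0; jj < N; jj += TJ)
                // Work on one TI x TJ tile of C at a time.
                for (std::size_t i = ii; i < std::min(ii + TI, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TK, N); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + TJ, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

An auto-tuner explores exactly this space: different tile sizes and loop orders are instantiated and measured, and the fastest configuration is kept.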
Cyber-Physical Systems (CPS) are increasingly used in many complex applications, such as autonomous delivery drones, automotive CPS designs, power grid control systems, and medical robotics. However, existing programming languages lack certain design patterns for CPS designs, including temporal semantics and concurrency models. Future research directions may involve programming language extensions to support CPS designs. On the other hand, JSF++, MISRA, and MISRA C++ provide specifications intended to increase the reliability of safety-critical systems. This article also describes the development of rule checkers based on the MISRA C++ specification using the Clang open-source tool, which allows for the annotation of code and the easy extension of the MISRA C++ specification to other programming languages and systems. This is potentially useful for future CPS language research that extends the approach to work with software reliability specifications using the Clang tool. Experiments were performed using key C++ benchmarks to validate our method in comparison with the well-known Coverity commercial tool. We illustrate key rules related to class, inheritance, template, overloading, and exception handling. Open-source benchmarks that violate the rules detected by our checkers are also illustrated. A random graph generator is further used to generate diamond-shaped multiple-inheritance test data for validating our software. The experimental results demonstrate that our method can provide information that is more detailed than that obtained using Coverity for nine open-source C++ benchmarks. Since the Clang tool is widely used, it will further allow developers to annotate their own extensions.
Finding a good compiler autotuning methodology, particularly for selecting the right set of optimisations and finding the best ordering of these optimisations for a given code fragment, has been a long-standing problem. With the rapid development of machine learning techniques, tackling the problem of compiler autotuning using machine learning or deep learning has become increasingly common in recent years. Many deep learning models have been proposed to solve problems such as predicting the optimal heterogeneous mapping or thread-coarsening factor; however, very few have revisited the problem of optimisation phase tuning. In this paper, we intend to revisit and tackle the problem using deep learning techniques. Unfortunately, the problem is too complex to be addressed in its full scope. We present a new problem, called reduced O3 subsequence labelling, where a reduced O3 subsequence is defined as a subsequence of the O3 optimisation passes that contains no useless passes; this simplified problem is expected to be a stepping stone towards solving the optimisation phase tuning problem. We formulated the problem and attempted to solve it using genetic algorithms. We believe that with mature deep learning techniques, a machine learning model that predicts the reduced O3 subsequences or even the best O3 subsequence for a given code fragment could be developed and used to prune the search space of the optimisation phase tuning problem, thereby shortening the tuning process and also providing more effective tuning results.
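As a rough illustration of the genetic-algorithm approach described above, the following C++ sketch searches over which passes of a fixed pass list to keep. It is not the paper's implementation: the pass list, the fitness function (a caller-supplied `measure` callback standing in for "compile and evaluate"), and all parameters are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

// A candidate keeps (true) or drops (false) each pass of a fixed O3 pass list.
using Candidate = std::vector<bool>;
// Stand-in for "compile the code fragment with exactly the kept passes and
// return a cost" (e.g. runtime); lower is better.
using Measure = std::function<double(const std::vector<std::string>&)>;

static double evaluate(const Candidate& keep,
                       const std::vector<std::string>& passes,
                       const Measure& measure) {
    std::vector<std::string> seq;
    for (std::size_t i = 0; i < keep.size(); ++i)
        if (keep[i]) seq.push_back(passes[i]);
    return measure(seq);
}

Candidate genetic_search(const std::vector<std::string>& passes,
                         const Measure& measure,
                         std::size_t pop_size = 32, int generations = 50) {
    std::mt19937 rng(42);
    std::bernoulli_distribution coin(0.5);
    std::uniform_int_distribution<std::size_t> gene(0, passes.size() - 1);
    std::uniform_int_distribution<std::size_t> member(0, pop_size - 1);

    // Random initial population.
    std::vector<Candidate> pop(pop_size, Candidate(passes.size()));
    for (auto& cand : pop)
        for (std::size_t i = 0; i < cand.size(); ++i) cand[i] = coin(rng);

    auto cost = [&](const Candidate& c) { return evaluate(c, passes, measure); };

    for (int g = 0; g < generations; ++g) {
        std::vector<Candidate> next;
        while (next.size() < pop_size) {
            // Tournament selection: the cheaper of two random members wins.
            const Candidate& a = pop[member(rng)], &b = pop[member(rng)];
            const Candidate& p1 = cost(a) < cost(b) ? a : b;
            const Candidate& c = pop[member(rng)], &d = pop[member(rng)];
            const Candidate& p2 = cost(c) < cost(d) ? c : d;
            // One-point crossover followed by a single point mutation.
            Candidate child(p1.begin(), p1.begin() + p1.size() / 2);
            child.insert(child.end(), p2.begin() + p2.size() / 2, p2.end());
            std::size_t m = gene(rng);
            child[m] = !child[m];
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
    return *std::min_element(pop.begin(), pop.end(),
                             [&](const Candidate& x, const Candidate& y) {
                                 return cost(x) < cost(y);
                             });
}
```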
Binary translators, which translate binary executables from one instruction set to another, are useful tools. Indirect branches are one of the key factors that affect the efficiency of binary translators. In previous research, our lab developed an LLVM-based binary translation framework, called Rabbit. Rabbit introduces novel optimisations, platform-dependent hyperchaining and platform-independent hyperchaining, for improving the emulation of indirect branch instructions. Indirect branch instructions may have several destinations, and these destinations are not known until runtime. Both platform-independent and platform-dependent hyperchaining establish a search table for each indirect branch instruction to record the visited branch destinations at runtime. In this work, we focus on the translation from AArch64 binaries to RISC-V binaries and further develop a profile-guided optimisation for indirect branches, which collects runtime information, including the branch destinations and the execution frequency of each destination for each indirect branch instruction, and then uses this information to improve hyperchaining (i.e. to accelerate the process of finding the branch destination). The profile-guided optimisation can be divided into profile-guided platform-independent hyperchaining and profile-guided platform-dependent hyperchaining. We finally use the SPEC CPU 2006 CINT benchmarks to evaluate the optimisations. The experimental results indicate that, compared with (1) no chaining, (2) platform-independent hyperchaining and (3) platform-dependent hyperchaining, profile-guided platform-independent hyperchaining provides 1.123×, 1.066× and 1.098× speedups, respectively. Similarly, profile-guided platform-dependent hyperchaining achieves 1.106×, 1.047× and 1.083× speedups with respect to the above three configurations, respectively.
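The per-branch search table described above can be pictured with a small C++ sketch. This is not Rabbit's actual data structure; the type names and the `slow_lookup` fallback are illustrative, and a profile-guided variant would pre-populate the table with the hottest destinations recorded during profiling.

```cpp
#include <cstdint>
#include <unordered_map>

// Entry point of one translated host code region.
using HostCode = void (*)();

// One search table per indirect-branch site: it maps guest branch
// destinations observed at run time to the corresponding translated host
// code, so later executions of the same branch can skip the slow lookup.
struct IndirectBranchSite {
    std::unordered_map<std::uint64_t, HostCode> targets;

    HostCode resolve(std::uint64_t guest_pc,
                     HostCode (*slow_lookup)(std::uint64_t)) {
        auto it = targets.find(guest_pc);
        if (it != targets.end())
            return it->second;                  // fast path: destination already visited
        HostCode code = slow_lookup(guest_pc);  // translate or find the block
        targets.emplace(guest_pc, code);        // remember it for next time
        return code;
    }
};
```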
Heterogeneous computing has become popular in the past decade. Many frameworks have been proposed to provide a uniform way to program for accelerators, such as GPUs, DSPs, and FPGAs. Among them, an open and royalty-free standard, OpenCL, is widely adopted by the industry. However, many OpenCL-enabled accelerators and the standard itself do not support preemptive multitasking. To the best of our knowledge, previously proposed techniques are not portable or cannot handle ill-designed kernels (the codes that are executed on the accelerators) that never finish. This paper presents a framework (called CLPKM) that provides an abstraction layer between OpenCL applications and the underlying OpenCL runtime to enable preemption of a kernel execution instance based on a software checkpointing mechanism. CLPKM includes (1) an OpenCL runtime library that intercepts OpenCL API calls, (2) a source-to-source compiler that performs the preemption-enabling transformation, and (3) a daemon that schedules OpenCL tasks using priority-based preemptive scheduling techniques. Experiments demonstrated that CLPKM reduced the slowdown of high-priority processes from 4.66x to 1.52–2.23x under up to 16 low-priority, heavy-workload processes running in the background and caused an average of 3.02–6.08x slowdown for low-priority processes.
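The preemption-enabling transformation can be illustrated with a plain C++ sketch of one loop kernel: the transformed body periodically polls a preemption flag, saves its live state to a checkpoint, and returns early so it can be resumed later. This is only a conceptual analogue of what CLPKM's source-to-source compiler emits for OpenCL C; the names and the checkpoint layout are assumptions.

```cpp
#include <atomic>
#include <cstddef>

// State that the transformed kernel saves when it yields; in CLPKM this would
// live in a checkpoint buffer in device memory (fields are illustrative).
struct Checkpoint {
    std::size_t next_i = 0;    // where to resume the loop
    float       acc    = 0.0f; // live value across the preemption point
    bool        done   = false;
};

// Original kernel body: for (i = 0; i < n; ++i) acc += in[i] * in[i];
// Transformed body: poll a preemption flag at intervals, save live state,
// and return early so a higher-priority task can run.
void kernel_slice(const float* in, std::size_t n,
                  Checkpoint& cp, const std::atomic<bool>& preempt) {
    std::size_t i = cp.next_i;   // restore state from the checkpoint
    float acc = cp.acc;
    for (; i < n; ++i) {
        acc += in[i] * in[i];
        if ((i & 0x3FF) == 0 && preempt.load(std::memory_order_relaxed)) {
            cp.next_i = i + 1;   // save state and yield
            cp.acc = acc;
            return;
        }
    }
    cp.acc = acc;
    cp.done = true;
}
```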
Graphics processing units (GPUs) are now widely used in embedded systems for manipulating computer graphics and even for general-purpose computation. However, many embedded systems have to manage highly restricted hardware resources in order to achieve high performance or energy efficiency. The number of registers is one of the common limiting factors in an embedded GPU design. Programs that run with a low number of registers may suffer from high register pressure if register allocation is not properly designed, especially on a GPU in which a register is divided into four elements and each element can be accessed separately, because allocating a register for a vector-type variable that does not contain values in all elements wastes register space. In this article, we present a vector-aware register allocation framework to improve register utilization on shader architectures. The framework involves two major components: (1) element-based register allocation, which allocates registers based on the element requirement of variables, and (2) register packing, which rearranges elements of registers in order to increase the number of contiguous free elements, thereby keeping more live variables in registers. Experimental results on a cycle-approximate simulator showed that the proposed framework eliminated 92% of register spills in total and made 91.7% of 14 common shader programs spill-free. These results indicate an opportunity for energy management of the space that is used for storing spilled variables, with the framework improving the performance by a geometric mean of 8.3%, 16.3%, and 29.2% for general shader processors in which variables are spilled to memory with 5-, 10-, and 20-cycle access latencies, respectively. Furthermore, the reduction in the register requirement of programs enabled another 11 programs with high register pressure to be runnable on a lightweight GPU.
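The element-based view of registers can be sketched in C++ as follows; the four-element register model and the helper functions are illustrative simplifications, not the framework's actual allocation and packing algorithms.

```cpp
#include <cstdint>

// Each physical register has four elements (x, y, z, w); a 4-bit mask records
// which elements are currently free. Element-based allocation places a
// variable that needs k elements into a register with k contiguous free
// elements instead of reserving a whole register for it.
struct Reg4 { std::uint8_t free_mask = 0b1111; };

// Return the starting element index if `need` contiguous free elements exist
// in `r`, or -1 otherwise.
int find_slot(const Reg4& r, int need) {
    for (int start = 0; start + need <= 4; ++start) {
        int mask = ((1 << need) - 1) << start;
        if ((r.free_mask & mask) == mask) return start;
    }
    return -1;
}

// Register packing compacts the occupied elements toward element 0 so that
// the remaining free elements become contiguous, letting find_slot succeed
// for wider variables; only the resulting free mask is modelled here (the
// real transformation also moves the live values between elements).
std::uint8_t pack(const Reg4& r) {
    int free_elems = 0;
    for (int e = 0; e < 4; ++e) free_elems += (r.free_mask >> e) & 1;
    return static_cast<std::uint8_t>((0b1111 << (4 - free_elems)) & 0b1111);
}
```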
The Dalvik virtual machine (VM) is an integral component used to execute applications in Android, which is one of the leading operating systems for mobile devices. The Dalvik VM is an interpreter and is equipped with a trace-based just-in-time compiler for enhancing the execution performance of frequently executed paths, or traces. However, traces generated by the Dalvik VM may end at a conditional branch or a method call/return, which means that these traces usually have a short lifetime, decreasing the effectiveness of the compiler optimizations applied to them. Furthermore, the just-in-time compiler applies only a few simple optimizations because of performance considerations. In this article, we present a traces-to-region (T2R) framework that extends traces into regions and statically compiles these regions into native binaries so as to improve the execution of Android applications. The T2R framework involves three main stages: (i) the profiling stage, in which the run-time trace information of an application is extracted; (ii) the compilation stage, in which regions are constructed from the extracted traces and are statically compiled into a native binary; and (iii) the execution stage, in which the compiled binary is loaded into the code cache when the application starts to execute. Experiments performed on an Android tablet demonstrated that the T2R framework was effective in improving the execution performance of applications by 10.5–16.2% and decreasing the size of the code cache by 4.6–28.5%.
The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article, we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.
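The contention- and communication-aware idea behind the history-based scheduler can be sketched as a simple cost model in C++: pick the device whose queued work, plus the estimated data-transfer time, plus the kernel's own run time is smallest. The fields and the cost formula below are illustrative assumptions, not VirtCL's actual implementation.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Per-device bookkeeping kept by the scheduler (illustrative fields only).
struct DeviceState {
    double queued_time = 0.0;        // estimated seconds of work already queued
    std::size_t resident_bytes = 0;  // bytes of this task's buffers already on the device
};

// Pick the device that minimizes the estimated completion time, combining
// contention (queued work) and communication (data still to be copied to the
// device) with the kernel's estimated run time.
std::size_t pick_device(const std::vector<DeviceState>& devices,
                        double est_kernel_time, std::size_t task_bytes,
                        double bytes_per_second) {
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t d = 0; d < devices.size(); ++d) {
        std::size_t to_copy = task_bytes > devices[d].resident_bytes
                                  ? task_bytes - devices[d].resident_bytes : 0;
        double cost = devices[d].queued_time        // wait behind queued kernels
                    + to_copy / bytes_per_second    // host-to-device transfer
                    + est_kernel_time;              // the kernel itself
        if (cost < best_cost) { best_cost = cost; best = d; }
    }
    return best;
}
```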
Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment, the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into the SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on the Wattch toolkit. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. These results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
The importance of heterogeneous multicore programming is increasing, and Open Computing Language (OpenCL) is an open industrial standard for parallel programming that provides a uniform programming model for programmers to write efficient, portable code for heterogeneous computing devices. However, OpenCL is not supported in the system virtualization environments that are often used to improve resource utilization. In this paper, we propose an OpenCL virtualization framework based on the Kernel-based Virtual Machine with API remoting to enable multiplexing of multiple guest virtual machines (guest VMs) over the underlying OpenCL resources. The framework comprises three major components: (i) an OpenCL library implementation in guest VMs for packing/unpacking OpenCL requests/responses; (ii) a virtual device, called virtio-CL, that is responsible for the communication between guest VMs and the hypervisor (also called the VM monitor); and (iii) a thread, called CL thread, that is used for the OpenCL API invocation. Although the overhead of the proposed virtualization framework is directly affected by the amount of data to be transferred between the OpenCL host and devices because of the primitive nature of API remoting, experiments demonstrated that our virtualization framework has a small virtualization overhead (mean of 6.8%) for six common device-intensive OpenCL programs and performs well when the number of guest VMs involved in the system increases. These results indirectly suggest that the framework allows for effective resource utilization of OpenCL devices.
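The packing step of API remoting can be sketched as follows in C++; the wire format, opcode values, and handle representation are hypothetical and serve only to illustrate why the framework's overhead grows with the amount of data that must cross the guest/hypervisor boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative wire format for API remoting: the guest-side library packs the
// opcode and arguments of one OpenCL call into a byte buffer; the hypervisor
// side unpacks it and invokes the real API (field layout is hypothetical).
enum class Op : std::uint32_t { EnqueueWriteBuffer = 2 };

std::vector<std::uint8_t> pack_write_buffer(std::uint64_t queue_handle,
                                            std::uint64_t buffer_handle,
                                            std::uint64_t offset,
                                            const void* data, std::size_t size) {
    std::vector<std::uint8_t> msg(sizeof(Op) + 3 * sizeof(std::uint64_t) + size);
    std::uint8_t* p = msg.data();
    Op op = Op::EnqueueWriteBuffer;
    std::memcpy(p, &op, sizeof(op));                        p += sizeof(op);
    std::memcpy(p, &queue_handle, sizeof(queue_handle));    p += sizeof(queue_handle);
    std::memcpy(p, &buffer_handle, sizeof(buffer_handle));  p += sizeof(buffer_handle);
    std::memcpy(p, &offset, sizeof(offset));                p += sizeof(offset);
    std::memcpy(p, data, size);  // the payload dominates the transfer cost
    return msg;
}
```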
Graphics processing units (GPUs) are now being widely adopted in system-on-a-chip designs, and they are often used in embedded systems for manipulating computer graphics or even for general-purpose computation. Energy management is of concern to both hardware and software designers. In this article, we present an energy-aware code-motion framework for a compiler to generate concentrated accesses to input and output (I/O) buffers inside a GPU. Our solution attempts to gather the I/O buffer accesses into clusters, thereby extending the time period during which the I/O buffers are clock or power gated. We performed experiments in which the energy consumption was simulated by incorporating our compiler-analysis and code-motion framework into an in-house compiler tool. The experimental results demonstrated that our mechanisms were effective in reducing the energy consumption of the shader processor by an average of 13.1% and decreasing the energy-delay product by 2.2%.
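The effect of the code motion can be illustrated with a plain C++ before/after pair (the functions below are illustrative; the actual transformation operates on shader code inside the compiler): clustering the buffer accesses creates long windows in which each I/O buffer is idle and can therefore be clock or power gated.

```cpp
// Before: reads from the input buffer and writes to the output buffer are
// interleaved with computation, so both buffers must stay powered throughout.
void shade_interleaved(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        float v = in[i];      // input buffer access
        v = v * v + 1.0f;     // computation
        out[i] = v;           // output buffer access
    }
}

// After: all reads are gathered at the start and all writes at the end, so
// the input buffer is idle after the first loop and the output buffer is idle
// before the last one, extending the window for clock or power gating.
void shade_clustered(const float* in, float* out, int n) {
    float tmp[256];           // sketch only: assumes n <= 256
    for (int i = 0; i < n; ++i) tmp[i] = in[i];                   // clustered reads
    for (int i = 0; i < n; ++i) tmp[i] = tmp[i] * tmp[i] + 1.0f;  // computation
    for (int i = 0; i < n; ++i) out[i] = tmp[i];                  // clustered writes
}
```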
Embedded processors developed within the past few years have employed novel hardware designs to reduce the ever-growing complexity, power dissipation, and die area. Although a distributed register file architecture is considered to require fewer read/write ports than traditional unified register file structures, it presents challenges for compilation techniques to generate efficient code for such architectures. This paper presents a novel scheme for register allocation that includes global and local components on a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, an optimization phase performed prior to conventional global/local register allocation, named global/local register file assignment (RFA), is used to minimize various register file communication costs. A heuristic algorithm is proposed for global RFA to make suitable decisions based on local RFA. Experiments were performed by applying our schemes to a novel VLIW DSP processor with non-uniform register files. The results indicate that the compilation based on our proposed approach delivers significant performance improvements compared with a solution that does not use our proposed global register allocation scheme.
The compiler is generally regarded as the most important software component for making a processor design successful. This paper describes our application of the Open Research Compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high-quality code. Our preliminary experimental results indicate that our compiler can efficiently utilize the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those developing compilers for similar architectures.
Reducing the size of code is a significant concern in modern industrial settings. This has led to the exploration of various strategies, including the use of function call inlining via compiler optimizations. However, modern compilers like GCC and LLVM often rely on heuristics, which occasionally yield suboptimal outcomes. As a response to this challenge, autotuning mechanisms have been introduced, one of which is the local inlining autotuner that has received attention in previous research. This autotuner has been found to reduce code size by 4.9% compared to LLVM's -Oz optimization level on SPEC2017 by fine-tuning function inlining decisions. However, the local inlining autotuner has limitations: it refines each function inlining decision individually before combining them, and the potential interference between function calls can therefore increase tuning durations and result in larger code sizes. Empirical investigations have revealed that in most cases, inlining a function call mainly affects nearby function calls, which are referred to as "neighbors." From this observation, we can substantially reduce the recompilation overheads entailed by the autotuner. To tackle the interference problem and expedite the tuning process, we propose an enhanced autotuner for function inlining, called the interference-aware inlining autotuner. This autotuner considers the repercussions of inlining a function call when formulating subsequent decisions and exploits the neighbor relationships between function calls to improve tuning efficiency. Experimental evaluations have validated the effectiveness of the interference-aware inlining autotuner, delivering an average code size reduction of 0.4% (up to 1.5%) across the SPEC2017 benchmark suite compared to the local inlining autotuner. Furthermore, the interference-aware autotuner achieved an average code size reduction of 5.3% compared to LLVM's -Oz optimization level. In terms of tuning time, the serial interference-aware inlining autotuner exhibited a 2.9x acceleration (3.5x for resource-intensive tasks) compared to the parallel local inlining autotuner.
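A minimal C++ sketch of the neighbor-based tuning idea is given below. It is not the interference-aware autotuner itself: call sites are grouped by their enclosing caller as a stand-in for the paper's neighbor relation, and `measure` is a placeholder for recompiling with a given set of inlining decisions and reading back the code size.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// One inlining decision per call site; a site is identified by its caller
// and an index within that caller.
struct CallSite { std::string caller; int index; };
using Decisions = std::vector<bool>;

// Greedy, group-wise tuning: call sites that share a caller are treated as
// "neighbors" and revisited together, since flipping one decision mainly
// affects code size within that caller. `measure` stands in for "compile
// with these decisions and return the resulting code size".
Decisions tune(const std::vector<CallSite>& sites,
               const std::function<std::size_t(const Decisions&)>& measure) {
    Decisions best(sites.size(), true);          // start from "inline everything"
    std::size_t best_size = measure(best);

    std::map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < sites.size(); ++i)
        groups[sites[i].caller].push_back(i);

    for (const auto& group : groups) {
        for (std::size_t i : group.second) {     // flip each neighbor in turn
            Decisions trial = best;
            trial[i] = !trial[i];
            std::size_t size = measure(trial);   // one recompile per neighbor
            if (size < best_size) { best_size = size; best = std::move(trial); }
        }
    }
    return best;
}
```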
With the increasing demand for heterogeneous computing, OpenMP has, since version 4.0, provided an offloading feature that allows programmers to offload a task to a device (e.g., a GPU or an FPGA) by adding appropriate directives to the task. Compared to other low-level programming models, such as CUDA and OpenCL, OpenMP significantly reduces the burden on programmers to ensure that tasks are performed correctly on the device. However, OpenMP still has a data-mapping problem, which arises from the separate memory spaces of the host and the device. It is still necessary for programmers to specify data-mapping directives to indicate how data are transferred between the host and the device. When using complex data structures such as linked lists and graphs, it becomes more difficult to compose reliable and efficient data-mapping directives. Moreover, the OpenMP runtime library may incur substantial overhead due to data-mapping management. In this paper, we propose a compiler and runtime collaborative framework, called OpenMP-UM, to address the data-mapping problem. Using the CUDA unified memory mechanism, OpenMP-UM eliminates the need for data-mapping directives and reduces the overhead associated with data-mapping management. The key concept behind OpenMP-UM is to use unified memory as the default memory storage for all host data, including automatic, static, and dynamic data. Experiments have demonstrated that OpenMP-UM not only removed the programmers' burden of writing data-mapping directives for offloading in OpenMP applications but also achieved an average of 7.3x speedup for applications that involve deep copies and an average of 1.02x speedup for regular applications.
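For reference, the following C++ snippet shows the kind of standard OpenMP data-mapping directives that programmers normally have to write when offloading (the function itself is a made-up example); OpenMP-UM's goal is to make such `map` clauses unnecessary by backing host data with unified memory.

```cpp
#include <cstddef>
#include <vector>

// Standard OpenMP offloading with explicit data-mapping clauses: the input
// array is copied to the device and the output array is copied back.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    std::size_t n = x.size();
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

With complex, pointer-based structures such as linked lists, each node would need its own mapping (a deep copy), which is exactly the case where explicit clauses become hard to write correctly and efficiently.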
More and more applications are shifting from the traditional desktop to the web due to the prevalence of mobile devices and recent advances in wireless communication technologies. The Web Workers API has been proposed to allow offloading computation-intensive tasks from an application's main browser thread, which is responsible for managing user interfaces and interacting with users, to other worker threads (or web workers), thereby improving user experience. Prior studies have further offloaded computation-intensive tasks to remote servers by dispatching web workers to the servers and demonstrated their effectiveness in improving the performance of web applications. However, the approaches proposed by these prior studies expose potential vulnerabilities of servers due to their design and implementation and do not consider multiple web workers executing in a concurrent or parallel manner. In this paper, we propose an offloading framework (called Offworker) that transparently enables concurrent web workers to be offloaded to edge or cloud servers and provides a more secure execution environment for web workers. We also design a benchmark suite (called Rodinia-JS), which is a JavaScript version of the Rodinia parallel benchmark suite, to evaluate the proposed framework. Experiments demonstrated that Offworker effectively improved the performance of parallel applications (with up to a 4.8x speedup) when web workers were offloaded from a mobile device to a server. Offworker introduced only a geometric mean overhead of 12.1% against native execution for computation-intensive applications. We believe Offworker offers a promising and secure solution for computation offloading of parallel web applications.
Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs and ensure a better user experience. With this demand, managing these concurrent tasks, and especially their lifetimes, efficiently and easily has become a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support such a paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, there are no documents or studies in the literature that indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of "structure" extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why structured concurrency emerged only more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming started in the 1960s. In this paper, we introduce a new theory to complement the origin of structured concurrency from historical and technical perspectives: it is the foundation established by coroutines that gave rise to structured concurrency.
Binary translation translates binary programs from one instruction set to another. It is widely used in virtual machines and emulators. We extend mc2llvm, an LLVM-based retargetable 32-bit binary translator developed in our lab over the past several years, to support the 64-bit ARM instruction set. In this paper, we report the translation of AArch64 floating-point instructions in mc2llvm. For floating-point instructions, due to the lack of support for such floating-point behaviors in LLVM [13, 14], we add support for the flush-to-zero mode, not-a-number processing, floating-point exceptions, and various rounding modes. On average, mc2llvm-translated binaries achieve 47% and 24.5% of the performance of natively compiled x86-64 binaries on the statically translated EEMBC benchmarks and the dynamically translated SPEC CINT2006 benchmarks, respectively. Compared to QEMU-translated binaries, mc2llvm-translated binaries run 2.92x, 1.21x, and 1.41x faster on the statically translated EEMBC benchmarks and the dynamically translated SPEC CINT2006 and CFP2006 benchmarks, respectively. (Note that the benchmarks contain both floating-point instructions and other instructions, such as load and store instructions.)
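The flush-to-zero and default-NaN handling can be illustrated with small C++ helpers of the kind a translator's runtime needs; the function names below are illustrative and are not mc2llvm's actual routines.

```cpp
#include <cmath>
#include <limits>

// Flush-to-zero mode: a denormal result is replaced with a signed zero.
inline float flush_to_zero(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL ? std::copysign(0.0f, v) : v;
}

// Default-NaN behavior: any NaN result is replaced with the canonical quiet
// NaN instead of propagating an input payload.
inline float canonical_nan(float v) {
    return std::isnan(v) ? std::numeric_limits<float>::quiet_NaN() : v;
}

// Example: an emulated FADD whose result respects both modes when enabled.
inline float emulated_fadd(float a, float b, bool flush, bool default_nan) {
    float r = a + b;
    if (default_nan) r = canonical_nan(r);
    if (flush)       r = flush_to_zero(r);
    return r;
}
```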
CUDA is a C-extended programming model that allows programmers to write code for both central processing units and graphics processing units (GPUs). In general, GPUs require high thread-level parallelism (TLP) to reach their maximal performance, but the TLP of a CUDA program is deeply affected by the resource allocation of GPUs, including the allocation of shared memory and registers, since these allocation results directly determine the number of active threads on a GPU. Some research has focused on the management of memory allocation for performance enhancement, but none has proposed an effective approach to speeding up programs whose TLP is limited by insufficient registers. In this paper, we propose a TLP-aware register-pressure reduction framework that reduces the register requirement of a CUDA kernel to a desired degree so as to allow more threads to be active and thereby hide the long-latency global memory accesses among these threads. The framework includes two schemes: register rematerialization and register spilling to shared memory. The experimental results demonstrate that the framework is effective in improving the performance of CUDA kernels by a geometric average of 14.8%, while the geometric average performance improvement for CUDA programs is 5.5%.
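Register rematerialization can be illustrated with a plain C++ before/after pair (the CUDA-specific details, including spilling to shared memory, are omitted, and the functions are illustrative): instead of keeping a derived value live across a long region, it is recomputed from values that are live anyway, trading a little extra arithmetic for one fewer live register.

```cpp
// Higher register pressure: `base` stays live across the whole loop and
// therefore occupies a register for its entire duration.
float sum_keep_live(const float* in, int n, int row, int stride) {
    const float* base = in + row * stride;    // held live across the loop
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += base[i];
    return acc;
}

// Rematerialized: the address is recomputed at each use from `in`, `row` and
// `stride`, which are live anyway, so no extra value must be kept live. A
// compiler may prefer this when registers are scarce and recomputation is
// cheaper than spilling.
float sum_rematerialized(const float* in, int n, int row, int stride) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += *(in + row * stride + i);      // recompute instead of keeping base live
    return acc;
}
```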