With the rapid growth of deep learning models and deep learning-based applications, how to accelerate the inference of deep neural networks, especially neural network operators, has become an increasingly important research area. As a bridge between a front-end deep learning framework and a back-end hardware platform, deep learning compilers aim to optimize various deep learning models for a range of hardware platforms with model- and hardware-specific optimizations. Apache TVM (or TVM for short), a well-known open-source deep learning compiler, uses a customized domain-specific language, called Tensor Expression Language, to define hardware-specific optimizations for neural network operators. TVM also allows users to write tensor expressions to design customized optimizations for specific operators. However, TVM provides users with neither supporting information, such as what computations are performed within an operator, nor tools for optimizing the operators in a deep learning model. In addition, tensor expressions have an entirely different syntax from imperative languages and are not easy to get started with. Furthermore, although TVM comes with an auto-tuning module, called AutoTVM, which facilitates the tuning of optimization configurations (e.g., tiling size and loop order), AutoTVM takes quite a long time to search for the optimum configurations for a set of optimizations. In this paper, we present DLOOPT, an optimization assistant that assists optimization developers in designing effective optimizations for neural network operators and/or obtaining optimum optimization configurations in a timely manner. DLOOPT specifically addresses three key aspects: (1) with DLOOPT, developers can focus solely on designing optimizations, since it offers sufficient information about the operators of a given model and provides an easier way to write optimizations, (2) DLOOPT minimizes the number of optimizations that developers need to design by allowing optimizations to be reused, and (3) DLOOPT greatly simplifies the tuning process by implementing a set of tuning strategies in AutoTVM. The evaluation results showed that DLOOPT reduced the time needed to develop adequate optimizations for the operators in a model by more than 99%. We believe that DLOOPT is friendly to optimization developers and allows them to quickly develop effective optimizations for neural network operators.
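The tiling size and loop order mentioned above are exactly the kind of schedule knobs an auto-tuner explores. The following self-contained C++ sketch is illustrative only (it is not TVM, AutoTVM, or DLOOPT code, and the tile sizes TI, TJ, and TK are hypothetical parameters); it shows how a tiled matrix multiplication exposes those knobs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled matrix multiplication C = A * B (all N x N, row-major).
// TI, TJ and TK play the role of the tuning knobs (tile sizes) that an
// auto-tuner would search over; the loop nest order (i-k-j here) is another
// tunable choice.
template <std::size_t TI = 32, std::size_t TJ = 32, std::size_t TK = 32>
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t ii = 0; ii < N; ii += TI)
        for (std::size_t kk = 0; kk < N; kk += TK)
            for (std::size_t jj = 0; jj < N; jj += TJ)
                // Work on one TI x TJ tile of C at a time.
                for (std::size_t i = ii; i < std::min(ii + TI, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TK, N); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + TJ, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

An auto-tuner explores exactly this space: different tile sizes and loop orders are instantiated and measured, and the fastest configuration is kept.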
Cyber-Physical Systems (CPS) are increasingly used in many complex applications, such as autonomous delivery drones, automotive CPS designs, power grid control systems, and medical robotics. However, existing programming languages lack certain design patterns for CPS designs, including temporal semantics and concurrency models. Future research directions may involve programming language extensions to support CPS designs. On the other hand, JSF++, MISRA, and MISRA C++ provide specifications intended to increase the reliability of safety-critical systems. This article also describes the development of rule checkers based on the MISRA C++ specification using the Clang open-source tool, which allows for the annotation of code and the easy extension of the MISRA C++ specification to other programming languages and systems. This is potentially useful for future CPS language research that extends the approach to work with software reliability specifications using the Clang tool. Experiments were performed using key C++ benchmarks to validate our method in comparison with the well-known Coverity commercial tool. We illustrate key rules related to class, inheritance, template, overloading, and exception handling. Open-source benchmarks that violate the rules detected by our checkers are also illustrated. A random graph generator is further used to generate diamond-shaped multiple-inheritance test data for validating our software. The experimental results demonstrate that our method can provide information that is more detailed than that obtained using Coverity for nine open-source C++ benchmarks. Since the Clang tool is widely used, it will further allow developers to annotate their own extensions.
Finding a good compiler autotuning methodology, particularly for selecting the right set of optimisations and finding the best ordering of these optimisations for a given code fragment, has been a long-standing problem. With the rapid development of machine learning techniques, tackling the problem of compiler autotuning using machine learning or deep learning has become increasingly common in recent years. Many deep learning models have been proposed to solve problems such as predicting the optimal heterogeneous mapping or thread-coarsening factor; however, very few have revisited the problem of optimisation phase tuning. In this paper, we intend to revisit and tackle the problem using deep learning techniques. Unfortunately, the problem is too complex to be addressed in its full scope. We present a new problem, called reduced O3 subsequence labelling, where a reduced O3 subsequence is defined as a subsequence of the O3 optimisation passes that contains no useless passes; this simplified problem is expected to be a stepping stone towards solving the optimisation phase tuning problem. We formulated the problem and attempted to solve it using genetic algorithms. We believe that with mature deep learning techniques, a machine learning model that predicts the reduced O3 subsequences or even the best O3 subsequence for a given code fragment could be developed and used to prune the search space of the optimisation phase tuning problem, thereby shortening the tuning process and also providing more effective tuning results.
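As a rough illustration of the genetic-algorithm approach described above, the following C++ sketch searches over which passes of a fixed pass list to keep. It is not the paper's implementation: the pass list, the fitness function (a caller-supplied `measure` callback standing in for "compile and evaluate"), and all parameters are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <vector>

// A candidate keeps (true) or drops (false) each pass of a fixed O3 pass list.
using Candidate = std::vector<bool>;
// Stand-in for "compile the code fragment with exactly the kept passes and
// return a cost" (e.g. runtime); lower is better.
using Measure = std::function<double(const std::vector<std::string>&)>;

static double evaluate(const Candidate& keep,
                       const std::vector<std::string>& passes,
                       const Measure& measure) {
    std::vector<std::string> seq;
    for (std::size_t i = 0; i < keep.size(); ++i)
        if (keep[i]) seq.push_back(passes[i]);
    return measure(seq);
}

Candidate genetic_search(const std::vector<std::string>& passes,
                         const Measure& measure,
                         std::size_t pop_size = 32, int generations = 50) {
    std::mt19937 rng(42);
    std::bernoulli_distribution coin(0.5);
    std::uniform_int_distribution<std::size_t> gene(0, passes.size() - 1);
    std::uniform_int_distribution<std::size_t> member(0, pop_size - 1);

    // Random initial population.
    std::vector<Candidate> pop(pop_size, Candidate(passes.size()));
    for (auto& cand : pop)
        for (std::size_t i = 0; i < cand.size(); ++i) cand[i] = coin(rng);

    auto cost = [&](const Candidate& c) { return evaluate(c, passes, measure); };

    for (int g = 0; g < generations; ++g) {
        std::vector<Candidate> next;
        while (next.size() < pop_size) {
            // Tournament selection: the cheaper of two random members wins.
            const Candidate& a = pop[member(rng)], &b = pop[member(rng)];
            const Candidate& p1 = cost(a) < cost(b) ? a : b;
            const Candidate& c = pop[member(rng)], &d = pop[member(rng)];
            const Candidate& p2 = cost(c) < cost(d) ? c : d;
            // One-point crossover followed by a single point mutation.
            Candidate child(p1.begin(), p1.begin() + p1.size() / 2);
            child.insert(child.end(), p2.begin() + p2.size() / 2, p2.end());
            std::size_t m = gene(rng);
            child[m] = !child[m];
            next.push_back(std::move(child));
        }
        pop = std::move(next);
    }
    return *std::min_element(pop.begin(), pop.end(),
                             [&](const Candidate& x, const Candidate& y) {
                                 return cost(x) < cost(y);
                             });
}
```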
Binary translators, which translate binary executables from one instruction set to another, are useful tools. Indirect branches are one of the key factors that affect the efficiency of binary translators. In previous research, our lab developed an LLVM-based binary translation framework, called Rabbit. Rabbit introduces novel optimisations, platform-dependent hyperchaining and platform-independent hyperchaining, for improving the emulation of indirect branch instructions. Indirect branch instructions may have several destinations, and these destinations are not known until runtime. Both platform-independent and platform-dependent hyperchaining establish a search table for each indirect branch instruction to record the visited branch destinations at runtime. In this work, we focus on the translation from AArch64 binaries to RISC-V binaries and further develop a profile-guided optimisation for indirect branches, which collects runtime information, including the branch destinations and the execution frequency of each destination for each indirect branch instruction, and then uses this information to improve hyperchaining (i.e. to accelerate the process of finding the branch destination). The profile-guided optimisation can be divided into profile-guided platform-independent hyperchaining and profile-guided platform-dependent hyperchaining. We finally use the SPEC CPU 2006 CINT benchmarks to evaluate the optimisations. The experimental results indicate that, compared with (1) no chaining, (2) platform-independent hyperchaining and (3) platform-dependent hyperchaining, profile-guided platform-independent hyperchaining provides 1.123×, 1.066× and 1.098× speedups, respectively. Similarly, profile-guided platform-dependent hyperchaining achieves 1.106×, 1.047× and 1.083× speedups with respect to the above three configurations, respectively.
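The per-branch search table described above can be pictured with a small C++ sketch. This is not Rabbit's actual data structure; the type names and the `slow_lookup` fallback are illustrative, and a profile-guided variant would pre-populate the table with the hottest destinations recorded during profiling.

```cpp
#include <cstdint>
#include <unordered_map>

// Entry point of one translated host code region.
using HostCode = void (*)();

// One search table per indirect-branch site: it maps guest branch
// destinations observed at run time to the corresponding translated host
// code, so later executions of the same branch can skip the slow lookup.
struct IndirectBranchSite {
    std::unordered_map<std::uint64_t, HostCode> targets;

    HostCode resolve(std::uint64_t guest_pc,
                     HostCode (*slow_lookup)(std::uint64_t)) {
        auto it = targets.find(guest_pc);
        if (it != targets.end())
            return it->second;                  // fast path: destination already visited
        HostCode code = slow_lookup(guest_pc);  // translate or find the block
        targets.emplace(guest_pc, code);        // remember it for next time
        return code;
    }
};
```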
Heterogeneous computing has become popular in the past decade. Many frameworks have been proposed to provide a uniform way to program for accelerators, such as GPUs, DSPs, and FPGAs. Among them, an open and royalty-free standard, OpenCL, is widely adopted by the industry. However, many OpenCL-enabled accelerators and the standard itself do not support preemptive multitasking. To the best of our knowledge, previously proposed techniques are not portable or cannot handle ill-designed kernels (the codes that are executed on the accelerators) that never finish. This paper presents a framework (called CLPKM) that provides an abstraction layer between OpenCL applications and the underlying OpenCL runtime to enable preemption of a kernel execution instance based on a software checkpointing mechanism. CLPKM includes (1) an OpenCL runtime library that intercepts OpenCL API calls, (2) a source-to-source compiler that performs the preemption-enabling transformation, and (3) a daemon that schedules OpenCL tasks using priority-based preemptive scheduling techniques. Experiments demonstrated that CLPKM reduced the slowdown of high-priority processes from 4.66x to 1.52–2.23x under up to 16 low-priority, heavy-workload processes running in the background and caused an average of 3.02–6.08x slowdown for low-priority processes.
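The preemption-enabling transformation can be illustrated with a plain C++ sketch of one loop kernel: the transformed body periodically polls a preemption flag, saves its live state to a checkpoint, and returns early so it can be resumed later. This is only a conceptual analogue of what CLPKM's source-to-source compiler emits for OpenCL C; the names and the checkpoint layout are assumptions.

```cpp
#include <atomic>
#include <cstddef>

// State that the transformed kernel saves when it yields; in CLPKM this would
// live in a checkpoint buffer in device memory (fields are illustrative).
struct Checkpoint {
    std::size_t next_i = 0;    // where to resume the loop
    float       acc    = 0.0f; // live value across the preemption point
    bool        done   = false;
};

// Original kernel body: for (i = 0; i < n; ++i) acc += in[i] * in[i];
// Transformed body: poll a preemption flag at intervals, save live state,
// and return early so a higher-priority task can run.
void kernel_slice(const float* in, std::size_t n,
                  Checkpoint& cp, const std::atomic<bool>& preempt) {
    std::size_t i = cp.next_i;   // restore state from the checkpoint
    float acc = cp.acc;
    for (; i < n; ++i) {
        acc += in[i] * in[i];
        if ((i & 0x3FF) == 0 && preempt.load(std::memory_order_relaxed)) {
            cp.next_i = i + 1;   // save state and yield
            cp.acc = acc;
            return;
        }
    }
    cp.acc = acc;
    cp.done = true;
}
```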
Graphics processing units (GPUs) are now widely used in embedded systems for manipulating computer graphics and even for general-purpose computation. However, many embedded systems have to manage highly restricted hardware resources in order to achieve high performance or energy efficiency. The number of registers is one of the common limiting factors in an embedded GPU design. Programs that run with a low number of registers may suffer from high register pressure if register allocation is not properly designed, especially on a GPU in which a register is divided into four elements and each element can be accessed separately, because allocating a register for a vector-type variable that does not contain values in all elements wastes register space. In this article, we present a vector-aware register allocation framework to improve register utilization on shader architectures. The framework involves two major components: (1) element-based register allocation, which allocates registers based on the element requirement of variables, and (2) register packing, which rearranges elements of registers in order to increase the number of contiguous free elements, thereby keeping more live variables in registers. Experimental results on a cycle-approximate simulator showed that the proposed framework eliminated 92% of register spills in total and made 91.7% of 14 common shader programs spill-free. These results indicate an opportunity for energy management of the space that is used for storing spilled variables, with the framework improving the performance by a geometric mean of 8.3%, 16.3%, and 29.2% for general shader processors in which variables are spilled to memory with 5-, 10-, and 20-cycle access latencies, respectively. Furthermore, the reduction in the register requirement of programs enabled another 11 programs with high register pressure to be runnable on a lightweight GPU.
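The element-based view of registers can be sketched in C++ as follows; the four-element register model and the helper functions are illustrative simplifications, not the framework's actual allocation and packing algorithms.

```cpp
#include <cstdint>

// Each physical register has four elements (x, y, z, w); a 4-bit mask records
// which elements are currently free. Element-based allocation places a
// variable that needs k elements into a register with k contiguous free
// elements instead of reserving a whole register for it.
struct Reg4 { std::uint8_t free_mask = 0b1111; };

// Return the starting element index if `need` contiguous free elements exist
// in `r`, or -1 otherwise.
int find_slot(const Reg4& r, int need) {
    for (int start = 0; start + need <= 4; ++start) {
        int mask = ((1 << need) - 1) << start;
        if ((r.free_mask & mask) == mask) return start;
    }
    return -1;
}

// Register packing compacts the occupied elements toward element 0 so that
// the remaining free elements become contiguous, letting find_slot succeed
// for wider variables; only the resulting free mask is modelled here (the
// real transformation also moves the live values between elements).
std::uint8_t pack(const Reg4& r) {
    int free_elems = 0;
    for (int e = 0; e < 4; ++e) free_elems += (r.free_mask >> e) & 1;
    return static_cast<std::uint8_t>((0b1111 << (4 - free_elems)) & 0b1111);
}
```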
The Dalvik virtual machine (VM) is an integral component used to execute applications in Android, which is one of the leading operating systems for mobile devices. The Dalvik VM is an interpreter and is equipped with a trace-based just-in-time compiler for enhancing the execution performance of frequently executed paths, or traces. However, traces generated by the Dalvik VM may end at a conditional branch or a method call/return, which means that these traces usually have a short lifetime, decreasing the effectiveness of the compiler optimizations applied to them. Furthermore, the just-in-time compiler applies only a few simple optimizations because of performance considerations. In this article, we present a traces-to-region (T2R) framework that extends traces into regions and statically compiles these regions into native binaries so as to improve the execution of Android applications. The T2R framework involves three main stages: (i) the profiling stage, in which the run-time trace information of an application is extracted; (ii) the compilation stage, in which regions are constructed from the extracted traces and are statically compiled into a native binary; and (iii) the execution stage, in which the compiled binary is loaded into the code cache when the application starts to execute. Experiments performed on an Android tablet demonstrated that the T2R framework was effective in improving the execution performance of applications by 10.5–16.2% and decreasing the size of the code cache by 4.6–28.5%.
The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreover, the distributed memory model defined in OpenCL, where each device has its own memory space, increases the complexity of managing the memory among multiple GPU devices. In this article, we propose a framework (called VirtCL) that reduces the programming burden by acting as a layer between the programmer and the native OpenCL run-time system for abstracting multiple devices into a single virtual device and for scheduling computations and communications among the multiple devices. VirtCL comprises two main components: (1) a front-end library, which exposes primary OpenCL APIs and the virtual device, and (2) a back-end run-time system (called CLDaemon) for scheduling and dispatching kernel tasks based on a history-based scheduler. The front-end library forwards computation requests to the back-end CLDaemon, which then schedules and dispatches the requests. We also propose a history-based scheduler that is able to schedule kernel tasks in a contention- and communication-aware manner. Experiments demonstrated that the VirtCL framework introduced a small overhead (mean of 6%) but outperformed the native OpenCL run-time system for most benchmarks in the Rodinia benchmark suite, which was due to the abstraction layer eliminating the time-consuming initialization of OpenCL contexts. We also evaluated different scheduling policies in VirtCL with a real-world application (clsurf) and various synthetic workload traces. The results indicated that the VirtCL framework provides scalability for multiple kernel tasks running on multi-GPU systems.
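The contention- and communication-aware idea behind the history-based scheduler can be sketched as a simple cost model in C++: pick the device whose queued work, plus the estimated data-transfer time, plus the kernel's own run time is smallest. The fields and the cost formula below are illustrative assumptions, not VirtCL's actual implementation.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Per-device bookkeeping kept by the scheduler (illustrative fields only).
struct DeviceState {
    double queued_time = 0.0;        // estimated seconds of work already queued
    std::size_t resident_bytes = 0;  // bytes of this task's buffers already on the device
};

// Pick the device that minimizes the estimated completion time, combining
// contention (queued work) and communication (data still to be copied to the
// device) with the kernel's estimated run time.
std::size_t pick_device(const std::vector<DeviceState>& devices,
                        double est_kernel_time, std::size_t task_bytes,
                        double bytes_per_second) {
    std::size_t best = 0;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t d = 0; d < devices.size(); ++d) {
        std::size_t to_copy = task_bytes > devices[d].resident_bytes
                                  ? task_bytes - devices[d].resident_bytes : 0;
        double cost = devices[d].queued_time        // wait behind queued kernels
                    + to_copy / bytes_per_second    // host-to-device transfer
                    + est_kernel_time;              // the kernel itself
        if (cost < best_cost) { best_cost = cost; best = d; }
    }
    return best;
}
```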
Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment, the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into the SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on the Wattch toolkit. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%, and by an average of 4.27% when the leakage contribution is set to 10%. These results demonstrate that our mechanisms are effective in reducing the leakage energy of BSP multithread programs.
The importance of heterogeneous multicore programming is increasing, and Open Computing Language (OpenCL) is an open industrial standard for parallel programming that provides a uniform programming model for programmers to write efficient, portable code for heterogeneous computing devices. However, OpenCL is not supported in the system virtualization environments that are often used to improve resource utilization. In this paper, we propose an OpenCL virtualization framework based on the Kernel-based Virtual Machine with API remoting to enable multiplexing of multiple guest virtual machines (guest VMs) over the underlying OpenCL resources. The framework comprises three major components: (i) an OpenCL library implementation in guest VMs for packing/unpacking OpenCL requests/responses; (ii) a virtual device, called virtio-CL, that is responsible for the communication between guest VMs and the hypervisor (also called the VM monitor); and (iii) a thread, called CL thread, that is used for the OpenCL API invocation. Although the overhead of the proposed virtualization framework is directly affected by the amount of data to be transferred between the OpenCL host and devices because of the primitive nature of API remoting, experiments demonstrated that our virtualization framework has a small virtualization overhead (mean of 6.8%) for six common device-intensive OpenCL programs and performs well when the number of guest VMs involved in the system increases. These results indirectly suggest that the framework allows for effective resource utilization of OpenCL devices.
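The packing step of API remoting can be sketched as follows in C++; the wire format, opcode values, and handle representation are hypothetical and serve only to illustrate why the framework's overhead grows with the amount of data that must cross the guest/hypervisor boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative wire format for API remoting: the guest-side library packs the
// opcode and arguments of one OpenCL call into a byte buffer; the hypervisor
// side unpacks it and invokes the real API (field layout is hypothetical).
enum class Op : std::uint32_t { EnqueueWriteBuffer = 2 };

std::vector<std::uint8_t> pack_write_buffer(std::uint64_t queue_handle,
                                            std::uint64_t buffer_handle,
                                            std::uint64_t offset,
                                            const void* data, std::size_t size) {
    std::vector<std::uint8_t> msg(sizeof(Op) + 3 * sizeof(std::uint64_t) + size);
    std::uint8_t* p = msg.data();
    Op op = Op::EnqueueWriteBuffer;
    std::memcpy(p, &op, sizeof(op));                        p += sizeof(op);
    std::memcpy(p, &queue_handle, sizeof(queue_handle));    p += sizeof(queue_handle);
    std::memcpy(p, &buffer_handle, sizeof(buffer_handle));  p += sizeof(buffer_handle);
    std::memcpy(p, &offset, sizeof(offset));                p += sizeof(offset);
    std::memcpy(p, data, size);  // the payload dominates the transfer cost
    return msg;
}
```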
Graphics processing units (GPUs) are now being widely adopted in system-on-a-chip designs, and they are often used in embedded systems for manipulating computer graphics or even for general-purpose computation. Energy management is of concern to both hardware and software designers. In this article, we present an energy-aware code-motion framework for a compiler to generate concentrated accesses to input and output (I/O) buffers inside a GPU. Our solution attempts to gather the I/O buffer accesses into clusters, thereby extending the time period during which the I/O buffers are clock or power gated. We performed experiments in which the energy consumption was simulated by incorporating our compiler-analysis and code-motion framework into an in-house compiler tool. The experimental results demonstrated that our mechanisms were effective in reducing the energy consumption of the shader processor by an average of 13.1% and decreasing the energy-delay product by 2.2%.
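The effect of the code motion can be illustrated with a plain C++ before/after pair (the functions below are illustrative; the actual transformation operates on shader code inside the compiler): clustering the buffer accesses creates long windows in which each I/O buffer is idle and can therefore be clock or power gated.

```cpp
// Before: reads from the input buffer and writes to the output buffer are
// interleaved with computation, so both buffers must stay powered throughout.
void shade_interleaved(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) {
        float v = in[i];      // input buffer access
        v = v * v + 1.0f;     // computation
        out[i] = v;           // output buffer access
    }
}

// After: all reads are gathered at the start and all writes at the end, so
// the input buffer is idle after the first loop and the output buffer is idle
// before the last one, extending the window for clock or power gating.
void shade_clustered(const float* in, float* out, int n) {
    float tmp[256];           // sketch only: assumes n <= 256
    for (int i = 0; i < n; ++i) tmp[i] = in[i];                   // clustered reads
    for (int i = 0; i < n; ++i) tmp[i] = tmp[i] * tmp[i] + 1.0f;  // computation
    for (int i = 0; i < n; ++i) out[i] = tmp[i];                  // clustered writes
}
```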
Embedded processors developed within the past few years have employed novel hardware designs to reduce the ever-growing complexity, power dissipation, and die area. Although a distributed register file architecture is considered to require fewer read/write ports than traditional unified register file structures, it presents challenges for compilation techniques to generate efficient code for such architectures. This paper presents a novel scheme for register allocation that includes global and local components on a VLIW DSP processor with distributed register files whose port access is highly restricted. In the scheme, an optimization phase performed prior to conventional global/local register allocation, named global/local register file assignment (RFA), is used to minimize various register file communication costs. A heuristic algorithm is proposed for global RFA to make suitable decisions based on local RFA. Experiments were performed by applying our schemes to a novel VLIW DSP processor with non-uniform register files. The results indicate that the compilation based on our proposed approach delivers significant performance improvements compared with a solution that does not use our proposed global register allocation scheme.
The compiler is generally regarded as the most important software component for making a processor design successful. This paper describes our application of the Open Research Compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high-quality code. Our preliminary experimental results indicate that our compiler can efficiently utilize the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those developing compilers for similar architectures.
Reducing the size of code is a significant concern in modern industrial settings. This has led to the exploration of various strategies, including the use of function call inlining via compiler optimizations. However, modern compilers like GCC and LLVM often rely on heuristics, which occasionally yield suboptimal outcomes. As a response to this challenge, autotuning mechanisms have been introduced, one of which is the local inlining autotuner that has received attention in previous research. This autotuner has been found to reduce code size by 4.9% compared to LLVM's -Oz optimization level on SPEC2017 by fine-tuning function inlining decisions. However, the local inlining autotuner has limitations: it refines each function inlining decision individually before combining them, and the potential interference between function calls can therefore increase tuning durations and result in larger code sizes. Empirical investigations have revealed that in most cases, inlining a function call mainly affects nearby function calls, which are referred to as "neighbors." From this observation, we can substantially reduce the recompilation overheads entailed by the autotuner. To tackle the interference problem and expedite the tuning process, we propose an enhanced autotuner for function inlining, called the interference-aware inlining autotuner. This autotuner considers the repercussions of inlining a function call when formulating subsequent decisions and exploits the neighbor relationships between function calls to improve tuning efficiency. Experimental evaluations have validated the effectiveness of the interference-aware inlining autotuner, delivering an average code size reduction of 0.4% (up to 1.5%) across the SPEC2017 benchmark suite compared to the local inlining autotuner. Furthermore, the interference-aware autotuner achieved an average code size reduction of 5.3% compared to LLVM's -Oz optimization level. In terms of tuning time, the serial interference-aware inlining autotuner exhibited a 2.9x acceleration (3.5x for resource-intensive tasks) compared to the parallel local inlining autotuner.
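A minimal C++ sketch of the neighbor-based tuning idea is given below. It is not the interference-aware autotuner itself: call sites are grouped by their enclosing caller as a stand-in for the paper's neighbor relation, and `measure` is a placeholder for recompiling with a given set of inlining decisions and reading back the code size.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// One inlining decision per call site; a site is identified by its caller
// and an index within that caller.
struct CallSite { std::string caller; int index; };
using Decisions = std::vector<bool>;

// Greedy, group-wise tuning: call sites that share a caller are treated as
// "neighbors" and revisited together, since flipping one decision mainly
// affects code size within that caller. `measure` stands in for "compile
// with these decisions and return the resulting code size".
Decisions tune(const std::vector<CallSite>& sites,
               const std::function<std::size_t(const Decisions&)>& measure) {
    Decisions best(sites.size(), true);          // start from "inline everything"
    std::size_t best_size = measure(best);

    std::map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < sites.size(); ++i)
        groups[sites[i].caller].push_back(i);

    for (const auto& group : groups) {
        for (std::size_t i : group.second) {     // flip each neighbor in turn
            Decisions trial = best;
            trial[i] = !trial[i];
            std::size_t size = measure(trial);   // one recompile per neighbor
            if (size < best_size) { best_size = size; best = std::move(trial); }
        }
    }
    return best;
}
```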
With the increasing demand for heterogeneous computing, OpenMP has, since version 4.0, provided an offloading feature that allows programmers to offload a task to a device (e.g., a GPU or an FPGA) by adding appropriate directives to the task. Compared to other low-level programming models, such as CUDA and OpenCL, OpenMP significantly reduces the burden on programmers to ensure that tasks are performed correctly on the device. However, OpenMP still has a data-mapping problem, which arises from the separate memory spaces of the host and the device. It is still necessary for programmers to specify data-mapping directives to indicate how data are transferred between the host and the device. When using complex data structures such as linked lists and graphs, it becomes more difficult to compose reliable and efficient data-mapping directives. Moreover, the OpenMP runtime library may incur substantial overhead due to data-mapping management. In this paper, we propose a compiler and runtime collaborative framework, called OpenMP-UM, to address the data-mapping problem. Using the CUDA unified memory mechanism, OpenMP-UM eliminates the need for data-mapping directives and reduces the overhead associated with data-mapping management. The key concept behind OpenMP-UM is to use unified memory as the default memory storage for all host data, including automatic, static, and dynamic data. Experiments have demonstrated that OpenMP-UM not only removed the programmers' burden of writing data-mapping directives for offloading in OpenMP applications but also achieved an average of 7.3x speedup for applications that involve deep copies and an average of 1.02x speedup for regular applications.
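For reference, the following C++ snippet shows the kind of standard OpenMP data-mapping directives that programmers normally have to write when offloading (the function itself is a made-up example); OpenMP-UM's goal is to make such `map` clauses unnecessary by backing host data with unified memory.

```cpp
#include <cstddef>
#include <vector>

// Standard OpenMP offloading with explicit data-mapping clauses: the input
// array is copied to the device and the output array is copied back.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    std::size_t n = x.size();
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

With complex, pointer-based structures such as linked lists, each node would need its own mapping (a deep copy), which is exactly the case where explicit clauses become hard to write correctly and efficiently.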
More and more applications are shifting from the traditional desktop to the web due to the prevalence of mobile devices and recent advances in wireless communication technologies. The Web Workers API has been proposed to allow offloading computation-intensive tasks from an application's main browser thread, which is responsible for managing user interfaces and interacting with users, to other worker threads (or web workers), thereby improving user experience. Prior studies have further offloaded computation-intensive tasks to remote servers by dispatching web workers to the servers and demonstrated their effectiveness in improving the performance of web applications. However, the approaches proposed by these prior studies expose potential vulnerabilities of servers due to their design and implementation and do not consider multiple web workers executing in a concurrent or parallel manner. In this paper, we propose an offloading framework (called Offworker) that transparently enables concurrent web workers to be offloaded to edge or cloud servers and provides a more secure execution environment for web workers. We also design a benchmark suite (called Rodinia-JS), which is a JavaScript version of the Rodinia parallel benchmark suite, to evaluate the proposed framework. Experiments demonstrated that Offworker effectively improved the performance of parallel applications (with up to a 4.8x speedup) when web workers were offloaded from a mobile device to a server. Offworker introduced only a geometric mean overhead of 12.1% against native execution for computation-intensive applications. We believe Offworker offers a promising and secure solution for computation offloading of parallel web applications.
Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs and ensure a better user experience. With this demand, managing these concurrent tasks, and especially their lifetimes, efficiently and easily has become a new challenge. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support such a paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, there are no documents or studies in the literature that indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of "structure" extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why structured concurrency emerged only more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming started in the 1960s. In this paper, we introduce a new theory to complement the origin of structured concurrency from historical and technical perspectives: it is the foundation established by coroutines that gave rise to structured concurrency.
Binary translation translates binary programs from one instruction set to another. It is widely used in virtual machines and emulators. We extend mc2llvm, an LLVM-based retargetable 32-bit binary translator developed in our lab over the past several years, to support the 64-bit ARM instruction set. In this paper, we report the translation of AArch64 floating-point instructions in mc2llvm. For floating-point instructions, due to the lack of support for such floating-point behaviors in LLVM [13, 14], we add support for the flush-to-zero mode, not-a-number processing, floating-point exceptions, and various rounding modes. On average, mc2llvm-translated binaries achieve 47% and 24.5% of the performance of natively compiled x86-64 binaries on the statically translated EEMBC benchmarks and the dynamically translated SPEC CINT2006 benchmarks, respectively. Compared to QEMU-translated binaries, mc2llvm-translated binaries run 2.92x, 1.21x, and 1.41x faster on the statically translated EEMBC benchmarks and the dynamically translated SPEC CINT2006 and CFP2006 benchmarks, respectively. (Note that the benchmarks contain both floating-point instructions and other instructions, such as load and store instructions.)
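The flush-to-zero and default-NaN handling can be illustrated with small C++ helpers of the kind a translator's runtime needs; the function names below are illustrative and are not mc2llvm's actual routines.

```cpp
#include <cmath>
#include <limits>

// Flush-to-zero mode: a denormal result is replaced with a signed zero.
inline float flush_to_zero(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL ? std::copysign(0.0f, v) : v;
}

// Default-NaN behavior: any NaN result is replaced with the canonical quiet
// NaN instead of propagating an input payload.
inline float canonical_nan(float v) {
    return std::isnan(v) ? std::numeric_limits<float>::quiet_NaN() : v;
}

// Example: an emulated FADD whose result respects both modes when enabled.
inline float emulated_fadd(float a, float b, bool flush, bool default_nan) {
    float r = a + b;
    if (default_nan) r = canonical_nan(r);
    if (flush)       r = flush_to_zero(r);
    return r;
}
```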
CUDA is a C-extended programming model that allows programmers to write code for both central processing units and graphics processing units (GPUs). In general, GPUs require high thread-level parallelism (TLP) to reach their maximal performance, but the TLP of a CUDA program is deeply affected by the resource allocation of GPUs, including the allocation of shared memory and registers, since these allocation results directly determine the number of active threads on a GPU. Some research has focused on the management of memory allocation for performance enhancement, but none has proposed an effective approach to speeding up programs whose TLP is limited by insufficient registers. In this paper, we propose a TLP-aware register-pressure reduction framework that reduces the register requirement of a CUDA kernel to a desired degree so as to allow more threads to be active and thereby hide the long-latency global memory accesses among these threads. The framework includes two schemes: register rematerialization and register spilling to shared memory. The experimental results demonstrate that the framework is effective in improving the performance of CUDA kernels by a geometric average of 14.8%, while the geometric average performance improvement for CUDA programs is 5.5%.
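Register rematerialization can be illustrated with a plain C++ before/after pair (the CUDA-specific details, including spilling to shared memory, are omitted, and the functions are illustrative): instead of keeping a derived value live across a long region, it is recomputed from values that are live anyway, trading a little extra arithmetic for one fewer live register.

```cpp
// Higher register pressure: `base` stays live across the whole loop and
// therefore occupies a register for its entire duration.
float sum_keep_live(const float* in, int n, int row, int stride) {
    const float* base = in + row * stride;    // held live across the loop
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += base[i];
    return acc;
}

// Rematerialized: the address is recomputed at each use from `in`, `row` and
// `stride`, which are live anyway, so no extra value must be kept live. A
// compiler may prefer this when registers are scarce and recomputation is
// cheaper than spilling.
float sum_rematerialized(const float* in, int n, int row, int stride) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += *(in + row * stride + i);      // recompute instead of keeping base live
    return acc;
}
```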