EFFICIENT FINE-GRAIN COOPERATIVE EXECUTION OF DYNAMIC TASK PARALLELISM ON HETEROGENEOUS MULTI/MANYCORE SYSTEMS
Abstract
Since the end of Dennard’s scaling, computer architects have fully embraced parallelism to continue improving the performance and energy efficiency of general-purpose processors. Multicore processors with a few to tens of high-performance processor cores have been the centerpiece of many computing platforms ranging from mobile devices to data centers. Manycore processors with hundreds or thousands of simple processing elements have demonstrated their ability to achieve even higher throughput and energy efficiency when abundant explicit parallelism exists in the workloads. However, large-scale manycore processors often lack hardware-based cache coherence. There is a growing trend towards a tighter integration between multicore and manycore processors, forming heterogeneous multi/manycore systems. These systems use heterogeneous cache coherence (HCC), with hardware-based cache coherence within the multicore and software-centric cache coherence within the manycore. Unfortunately, programming heterogeneous multi/manycore systems to enable collaborative execution is challenging, especially when considering dynamic task parallelism. This thesis uses a combination of lightweight software and hardware techniques to elegantly address this problem. It provides a detailed description of how to implement a work-stealing runtime to enable dynamic task parallelism on heterogeneous cache-coherent systems with a unified task-based programming model. This thesis also proposes direct task stealing (DTS), a new technique based on user-level interrupts to bypass the memory system and thus improve the performance and energy efficiency of work stealing.
The cycle-level results in this thesis demonstrate that executing dynamic task-parallel applications on a 64-core system (4 big, 60 tiny) with complexity-effective HCC and DTS can achieve: 7× speedup over a single big core; 1.4× speedup over an area-equivalent eight big-core system with hardware-based cache coherence; and 21% better performance and similar energy efficiency compared to a 64-core system (4 big, 60 tiny) with full-system hardware-based cache coherence. This thesis also describes a realistic hardware implementation of heterogeneous multi/manycore systems based on an open-source hardware prototyping framework, OpenPiton. Using a VLSI methodology, this thesis shows that the heterogeneous multi/manycore approach achieves 3× hardware parallelism with the same area compared to a traditional homogeneous manycore.
Committee Member
Zhang, Zhiru