eCommons

 

EFFICIENT FINE-GRAIN COOPERATIVE EXECUTION OF DYNAMIC TASK PARALLELISM ON HETEROGENEOUS MULTI/MANYCORE SYSTEMS

Abstract

Since the end of Dennard’s scaling, computer architects have fully embraced parallelism to continue improving the performance and energy efficiency of general-purpose processors. Multicore processors with a few to tens of high-performance processor cores have been the centerpiece of many computing platforms ranging from mobile devices to data centers. Manycore processors with hundreds or thousands of simple processing elements have demonstrated their ability to achieve even higher throughput and energy efficiency when abundant explicit parallelism exists in the workloads. However, large-scale manycore processors often lack hardware-based cache coherence. There is a growing trend towards a tighter integration between multicore and manycore processors, forming heterogeneous multi/manycore systems. These systems use heterogeneous cache coherence (HCC) with hardware-based cache coherence within the multicore and software-centric cache coherence within the manycore. Unfortunately, programming heterogeneous multi/manycore systems to enable collaborative execution is challenging, especially when considering dynamic task parallelism. This thesis uses a combination of light-weight software and hardware techniques to elegantly address this problem. It provides a detailed description of how to implement a work-stealing runtime to enable dynamic task parallelism on heterogeneous cache-coherent systems with a unified task-based programming model. This thesis also proposes direct task stealing (DTS), a new technique based on user-level interrupts to bypass the memory system and thus improve the performance and energy efficiency of work stealing.
The cycle-level results in this thesis demonstrate that executing dynamic task-parallel applications on a 64-core system (4 big, 60 tiny) with complexity-effective HCC and DTS can achieve: 7× speedup over a single big core; 1.4× speedup over an area-equivalent eight big-core system with hardware-based cache coherence; and 21% better performance and similar energy efficiency compared to a 64-core system (4 big, 60 tiny) with full-system hardware-based cache coherence. This thesis also describes a realistic hardware implementation of heterogeneous multi/manycore systems based on an open-source hardware prototyping framework, OpenPiton. Using a VLSI methodology, this thesis shows that the heterogeneous multi/manycore approach achieves 3× hardware parallelism with the same area compared to a traditional homogeneous manycore.

Description

112 pages

Date Issued

2021-05

Keywords

cache coherence; computer architecture; parallel programming; task-based; work stealing

Committee Chair

Batten, Christopher

Committee Member

Martínez, José F.
Zhang, Zhiru

Degree Discipline

Electrical and Computer Engineering

Degree Name

Ph. D., Electrical and Computer Engineering

Degree Level

Doctor of Philosophy

Rights

Attribution 4.0 International

Types

dissertation or thesis
