Show simple item record

dc.contributor.author: Kim, Ji Yun
dc.date.accessioned: 2017-04-04T20:26:49Z
dc.date.available: 2017-04-04T20:26:49Z
dc.date.issued: 2017-01-30
dc.identifier.other: Kim_cornellgrad_0058F_10048
dc.identifier.other: http://dissertations.umi.com/cornellgrad:10048
dc.identifier.other: bibid: 9905960
dc.identifier.uri: https://hdl.handle.net/1813/47714
dc.description.abstract: Computer architects are increasingly turning to programmable accelerators tailored for narrower classes of applications in order to achieve high performance and energy efficiency. A continuing challenge with accelerators is enabling the programmer to easily extract maximum performance without intimate knowledge of the underlying microarchitecture. It is important to consider productivity and portability, in addition to performance, as first-class metrics (the 3Ps) when developing and evaluating modern computing platforms. Software-centric approaches to achieving 3P computing platforms are compelling, but sacrifice efficiency and flexibility by hiding parallel abstractions from hardware and limiting the scope of the application domain. This thesis proposes a new software/hardware co-design approach to achieving 3P platforms, called the loop-task accelerator (LTA) platform, that provides high productivity and portability without sacrificing performance or efficiency across a wide range of applications. The LTA platform addresses the weaknesses of existing approaches that are identified through detailed experimentation with and analysis of modern application development. Discussion of an early attempt at a hardware-centric approach to achieving 3P platforms provides insight into area-efficient accelerator designs and highlights the need for innovations in both software and hardware. The LTA platform focuses on exploiting loop-task parallelism by exposing loop-tasks as a common parallel abstraction at the programming API, runtime, ISA, and microarchitectural levels. The LTA programming API uses the parallel_for construct to express loop-tasks that can be exploited both across cores and within a core, the LTA runtime distributes loop-tasks across cores, and a new xpfor instruction explicitly encodes loop-tasks as functions applied to a range of loop iterations.
This thesis introduces a novel task-coupling taxonomy that captures how tasks can be coupled in both space and time. The LTA engine template can be configured at design time with variable spatial and temporal task coupling to accelerate the execution of both regular and irregular loop-tasks within a core. The LTA platform is evaluated with respect to the 3Ps using a vertically integrated research methodology. Compared to an in-order multi-core baseline, the LTA platform yields average improvements of 5.5× in raw performance, 2.5× in performance per area, and 1.2× in energy efficiency, while offering high productivity and portability.
dc.language.iso: en_US
dc.subject: Energy-Efficient Accelerators
dc.subject: Programmable Accelerators
dc.subject: Software/Hardware Co-Design
dc.subject: Spatial and Temporal Task Coupling
dc.subject: Work-Stealing Runtimes
dc.subject: Computer engineering
dc.subject: Computer science
dc.title: Software/Hardware Co-Design to Improve Productivity, Portability, and Performance of Loop-Task Parallel Applications
dc.type: dissertation or thesis
thesis.degree.discipline: Electrical and Computer Engineering
thesis.degree.grantor: Cornell University
thesis.degree.level: Doctor of Philosophy
thesis.degree.name: Ph. D., Electrical and Computer Engineering
dc.contributor.chair: Batten, Christopher
dc.contributor.committeeMember: Manohar, Rajit
dc.contributor.committeeMember: Martinez, Jose F.
dcterms.license: https://hdl.handle.net/1813/59810
dc.identifier.doi: https://doi.org/10.7298/X4NC5Z64

