Iterative Graph Computation In The Big Data Era

Xie, Wenlei

Iterative Graph Computation In The Big Data Era

Files

wx49.pdf (1.25 MB)

Permanent Link(s)

https://hdl.handle.net/1813/41169

Collections

Cornell Theses and Dissertations

Full item page

Author(s)

Xie, Wenlei

Abstract

Iterative graph computation is a key component in many real-world applications, as the graph data model naturally captures complex relationships between entities. The big data era has seen the rise of several new challenges to this classic computation model. In this dissertation we describe three projects that address different aspects of these challenges. First, because of the increasing volume of data, it is increasingly important to scale iterative graph computation to large graphs. We observe that an important class of graph applications performing little computation per vertex scales poorly when running on multiple cores. These computationally light applications are limited by memory access rates, and cannot fully utilize the benefits of multiple cores. We propose a new block-oriented computation model which creates two levels of iterative computation. On each processor, a small block of highly connected vertices is iterated locally, while the blocks are updated iteratively at the global level. We show that block-oriented execution reduces the communication-to-computation ratio and significantly improves the performance of various graph applications. Second, because of the increasing velocity of data generation, iterative computation over graph streams is an increasingly important problem. Extracting insights from such graph data streams is difficult because most graph mining algorithms are designed for static graph structures. Existing systems are based on either a snapshot model or a sliding window model. Both of the models embody a binary view of an edge's role: they either forget old edges beyond a fixed period abruptly, or fail to emphasize recent data sufficiently. We propose a novel probabilistic edge decay model that samples edges according to a probability that decreases over time. We exploit the overlap between samples to reduce memory consumption and accelerate the analytic procedure. We design and implement the end-to-end system on top of Spark. Third, because of the increasing variety of data, people often perform the same iterative computation over a parametric family of graphs. These graphs have the same structure but associate different weights to the same vertices and edges. We focus on the parametric PageRank problem, usually known as personalized PageRank. This method has been widely used to find vertices in a graph that are most relevant to a query or user. However, edge-weighted personalized PageRank has been an open problem for over a decade. We describe the first fast algorithm for edge-weighted personalized PageRank on general graphs. With a reduced model built in the preprocessing stage, we can solve problems in a much smaller reduced space producing good approximate results. This opens opportunities to many applications that were previously infeasible, such as interactive learning-to-rank based on user's feedback.

Date Issued

2015-08-17

Keywords

Iterative computations, Graph data; Parallel computing, Time decay; Personalized PageRank, Model reduction

Committee Chair

Gehrke,Johannes E.

Committee Member

Bindel,David S.
Kleinberg,Robert David
Demers,Alan J.

Degree Discipline

Computer Science

Degree Name

Ph. D., Computer Science

Degree Level

Doctor of Philosophy

Types

dissertation or thesis

Iterative Graph Computation In The Big Data Era

Files

No Access Until

Permanent Link(s)

Collections

Other Titles

Author(s)

Abstract

Journal / Series

Volume & Issue

Description

Sponsorship

Date Issued

Publisher

Keywords

Location

Effective Date

Expiration Date

Sector

Employer

Union

Union Local

NAICS

Number of Workers

Committee Chair

Committee Co-Chair

Committee Member

Degree Discipline

Degree Name

Degree Level

Related Version

Related DOI

Related To

Related Part

Based on Related Item

Has Other Format(s)

Part of Related Item

Related To

Related Publication(s)

Link(s) to Related Publication(s)

References

Link(s) to Reference(s)

Previously Published As

Government Document

ISBN

ISMN

ISSN

Other Identifiers

Rights

Rights URI

Types

Accessibility Feature

Accessibility Hazard

Accessibility Summary

Link(s) to Catalog Record