Iterative Graph Computation In The Big Data Era
Iterative graph computation is a key component in many real-world applications, as the graph data model naturally captures complex relationships between entities. The big data era has seen the rise of several new challenges to this classic computation model. In this dissertation we describe three projects that address different aspects of these challenges. First, because of the increasing volume of data, it is increasingly important to scale iterative graph computation to large graphs. We observe that an important class of graph applications performing little computation per vertex scales poorly when running on multiple cores. These computationally light applications are limited by memory access rates, and cannot fully utilize the benefits of multiple cores. We propose a new block-oriented computation model which creates two levels of iterative computation. On each processor, a small block of highly connected vertices is iterated locally, while the blocks are updated iteratively at the global level. We show that block-oriented execution reduces the communication-to-computation ratio and significantly improves the performance of various graph applications. Second, because of the increasing velocity of data generation, iterative computation over graph streams is an increasingly important problem. Extracting insights from such graph data streams is difficult because most graph mining algorithms are designed for static graph structures. Existing systems are based on either a snapshot model or a sliding window model. Both of the models embody a binary view of an edge's role: they either forget old edges beyond a fixed period abruptly, or fail to emphasize recent data sufficiently. We propose a novel probabilistic edge decay model that samples edges according to a probability that decreases over time. We exploit the overlap between samples to reduce memory consumption and accelerate the analytic procedure. We design and implement the end-to-end system on top of Spark. Third, because of the increasing variety of data, people often perform the same iterative computation over a parametric family of graphs. These graphs have the same structure but associate different weights to the same vertices and edges. We focus on the parametric PageRank problem, usually known as personalized PageRank. This method has been widely used to find vertices in a graph that are most relevant to a query or user. However, edge-weighted personalized PageRank has been an open problem for over a decade. We describe the first fast algorithm for edge-weighted personalized PageRank on general graphs. With a reduced model built in the preprocessing stage, we can solve problems in a much smaller reduced space producing good approximate results. This opens opportunities to many applications that were previously infeasible, such as interactive learning-to-rank based on user's feedback.
Iterative computations, Graph data; Parallel computing, Time decay; Personalized PageRank, Model reduction
Bindel,David S.; Kleinberg,Robert David; Demers,Alan J.
Ph.D. of Computer Science
Doctor of Philosophy
dissertation or thesis