U Kang

Mining Tera-Scale Graphs: Theory, Engineering and Discoveries Degree Type: Ph.D. in Computer Science
Advisor(s): Christos Faloutsos
Graduated: May 2012

Abstract:

How do we find patterns and anomalies, on graphs with billions of nodes and edges, which do not fit in memory? How to use parallelism for such Tera- or Peta-scale graphs? In this thesis, we propose PEGASUS, a large scale graph mining system implemented on the top of the HADOOP platform, the open source version of MAPREDUCE. PEGASUS includes algorithms which help us spot patterns and anomalous behaviors in large graphs.

PEGASUS enables the structure analysis on large graphs. We unify many different structure analysis algorithms, including the analysis on connected components, PageRank, and radius/diameter, into a general primitive called GIM-V. GIM-V is highly optimized, achieving good scale-up on the number of edges and available machines. We discover surprising patterns using GIM-V, including the 7-degrees of separation in one of the largest publicly available Web graphs, with 7 billion edges.

PEGASUS also enables the inference and the spectral analysis on large graphs. We design an efficient distributed belief propagation algorithm which infer the states of unlabeled nodes given a set of labeled nodes. We also develop an eigensolver for computing top k eigenvalues and eigenvectors of the adjacency matrices of very large graphs. We use the eigensolver to discover anomalous adult advertisers in the who-follows-whom Twitter graph with 3 billion edges. In addition, we develop an efficient tensor decomposition algorithm and use it to analyze a large knowledge base tensor.

Finally, PEGASUS allows the management of large graphs. We propose efficient graph storage and indexing methods to answer graph mining queries quickly. We also develop an edge layout algorithm for better compressing graphs

Thesis Committee:
Christos Faloutsos (Chair)
Tom Mitchell
Garth Gibson
Robert Grossman (University of Chicago)

Jeannette Wing, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords:
Graph mining, MAPREDUCE, HADOOP, graph structure analysis, radius plot, diameter, connected component, inference, spectral graph analysis, eigensolver, tensor analysis, graph management, graph indexing, graph compression

CMU-CS-12-112.pdf (11.91 MB) ( 178 pages)
Copyright Notice