Mugizi Robert Rwebangira

Techniques for Exploiting Unlabeled Data

Degree Type: Ph.D. in Computer Science
Advisor(s): Avrim Blum, John Lafferty
Graduated: December 2008

Abstract:

In many machine learning application domains, obtaining labeled data is expensive, but obtaining unlabeled data is much cheaper. For this reason, there has been growing interest in algorithms that can take advantage of unlabeled data. In this thesis, we develop several methods for exploiting unlabeled data in classification and regression tasks.

Specific contributions include:

  • A method for improving the performance of the graph mincut algorithm of Blum and Chawla [12] by taking randomized mincuts. We give theoretical motivation for this approach and we present empirical results showing that randomized mincut tends to outperform the original graph mincut algorithm, especially when the number of labeled examples is very small.
  • An algorithm for semi-supervised regression based on manifold regularization using local linear estimators. This is the first extension of local linear regression to the semi-supervised setting. We present experimental results on both synthetic and real data and show that this method tends to perform better than methods that use only the labeled data.
  • An investigation of practical techniques for using the Winnow algorithm (which is not directly kernelizable) together with kernel functions and general similarity functions via unlabeled data. We expect such techniques to be particularly useful when we have a large feature space as well as additional similarity measures that we would like to use together with the original features. This method is also suited to situations where the best performing measure of similarity does not satisfy the properties of a kernel. We present some experiments on real and synthetic data to support this approach.
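
As a rough illustration of the first contribution, the following Python sketch labels the nodes of a small similarity graph by repeated minimum cuts. It is not the thesis's implementation: the abstract does not say how the cuts are randomized, so this sketch assumes that each round adds uniform noise to the edge weights and that the rounds are combined by averaging votes. The function names (`min_cut_labels`, `randomized_mincut`), the parameters `rounds` and `noise`, and the toy graph are all hypothetical.

```python
import random
from collections import deque

INF = float("inf")

def min_cut_labels(n, edges, positives, negatives):
    """Return the nodes on the positive (source) side of a minimum cut.

    Nodes are 0..n-1, edges are undirected (u, v, weight) triples, and
    positives/negatives are disjoint lists of labeled nodes.
    """
    source, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]
    for u, v, w in edges:
        # An undirected edge gets capacity w in both directions.
        cap[u][v] = cap[u].get(v, 0.0) + w
        cap[v][u] = cap[v].get(u, 0.0) + w
    for p in positives:
        # Infinite capacity ties a positive example to the source.
        cap[source][p] = INF
        cap[p].setdefault(source, 0.0)
    for q in negatives:
        # Infinite capacity ties a negative example to the sink.
        cap[q][sink] = INF
        cap[sink].setdefault(q, 0.0)

    while True:
        # Breadth-first search for an augmenting path (Edmonds-Karp).
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No path left: the reachable nodes form the source side.
            return {v for v in parent if v < n}
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck, v = INF, sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u

def randomized_mincut(n, edges, positives, negatives,
                      rounds=25, noise=0.1, seed=0):
    """Return, per node, the fraction of noisy cuts that labeled it positive."""
    rng = random.Random(seed)
    votes = [0] * n
    for _ in range(rounds):
        # Assumption: randomize by perturbing every edge weight.
        noisy = [(u, v, w + rng.uniform(0.0, noise)) for u, v, w in edges]
        for v in min_cut_labels(n, noisy, positives, negatives):
            votes[v] += 1
    return [votes[v] / rounds for v in range(n)]

# Toy graph: two tight clusters {0, 1, 2} and {3, 4, 5} joined by a weak edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 0.1)]
scores = randomized_mincut(6, edges, positives=[0], negatives=[5])
```

On this toy graph the weak edge between the clusters stays the cheapest cut under any noise draw, so every round assigns nodes 0 to 2 to the positive side. On less clear-cut graphs the per-node vote fraction can serve as a rough confidence score, which a single deterministic mincut does not provide.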

Thesis Committee:
Avrim Blum (Co-Chair)
John Lafferty (Co-Chair)
William Cohen
Xiaojin (Jerry) Zhu (Wisconsin)

Peter Lee, Head, Computer Science Department
Randy Bryant, Dean, School of Computer Science

Keywords:
Semi-supervised, regression, unlabeled data, similarity

CMU-CS-08-164.pdf (889.9 KB) (114 pages)