eCommons

 

Probabilistic Hashing Techniques For Big Data

dc.contributor.authorShrivastava, Anshumali
dc.contributor.chairLi,Ping
dc.contributor.committeeMemberGehrke,Johannes E.
dc.contributor.committeeMemberSamorodnitsky,Gennady
dc.date.accessioned2015-10-15T18:01:00Z
dc.date.available2020-08-17T06:01:30Z
dc.date.issued2015-08-17
dc.description.abstractWe investigate probabilistic hashing techniques for addressing computational and memory challenges in large scale machine learning and data mining systems. In this thesis, we show that the traditional idea of hashing goes far beyond near-neighbor search and there are some striking new possibilities. We show that hashing can improve state of the art large scale learning algorithms, and it goes beyond the conventional notions of pairwise similarities. Despite being a very well studied topic in literature, we found several opportunities for fundamentally improving some of the well know textbook hashing algorithms. In particular, we show that the traditional way of computing minwise hashes is unnecessarily expensive and without loosing anything we can achieve an order of magnitude speedup. We also found that for cosine similarity search there is a better scheme than SimHash. In the end, we show that the existing locality sensitive hashing framework itself is very restrictive, and we cannot have efficient algorithms for some important measures like inner products which are ubiquitous in machine learning. We propose asymmetric locality sensitive hashing (ALSH), an extended framework, where we show provable and practical efficient algorithms for Maximum Inner Product Search (MIPS). Having such an efficient solutions to MIPS directly scales up many popular machine learning algorithms. We believe that this thesis provides significant improvements to some of the heavily used subroutines in big-data systems, which we hope will be adopted.
dc.identifier.otherbibid: 9255145
dc.identifier.urihttps://hdl.handle.net/1813/40886
dc.language.isoen_US
dc.subjectLarge Scale Machine Learning
dc.subjectRandomized Algorithms for Big-Data
dc.subjectHashing, Sketching
dc.titleProbabilistic Hashing Techniques For Big Data
dc.typedissertation or thesis
thesis.degree.disciplineComputer Science
thesis.degree.grantorCornell University
thesis.degree.levelDoctor of Philosophy
thesis.degree.namePh. D., Computer Science

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
as2446.pdf
Size:
3.78 MB
Format:
Adobe Portable Document Format