An Excursion in Probabilistic Hashing Techniques for Big Data

Wednesday, September 23, 2015

3:30 pm - 4:30 pm
Gross Hall 330


Anshumali Shrivastava, Rice University

12 noon: learner luncheon for postdocs/grad students
3:30 pm: reception/seminar

Abstract: Large-scale machine learning and data mining applications routinely deal with datasets at TB scale, and the anticipation is that they will soon reach PB levels. At this scale, conventional algorithms fail, and even simple data mining operations such as search, learning, and clustering become challenging. In this talk, I will introduce probabilistic hashing techniques for large-scale search and learning. I will show how the old hashing framework, originally meant for sub-linear search, can be converted into fast learning algorithms. I will talk about our recent success in constructing hash functions for the dot product by making use of asymmetry. Such a construction is not possible in the conventional setting and was a known hard problem. I will further show the direct consequences of hashing inner products for speeding up popular learning algorithms. Later, I will discuss recent improvements that I found to some decade-old textbook hashing algorithms, including the fastest way of performing minwise hashing in practice. I will demonstrate the utility of these techniques on various real applications, including search, learning, collaborative filtering, and record linkage.