Ville Hyvönen defends his PhD thesis on a Machine Learning Approach to Approximate Nearest Neighbor Search

On Monday the 23rd of October 2023, M.Sc. Ville Hyvönen defends his PhD thesis on a Machine Learning Approach to Approximate Nearest Neighbor Search. The thesis is related to research done in the Department of Computer Science and in the Complex Systems Computation group.

M.Sc. Ville Hyvönen defends his doctoral thesis "A Machine Learning Approach to Approximate Nearest Neighbor Search" on Monday the 23rd of October 2023 at 14 o'clock in the University of Helsinki Chemicum building, Auditorium A129 (A.I. Virtasen aukio 1, 1st floor). His opponent is Professor Sanjoy Dasgupta (University of California, San Diego, USA) and custos Professor Teemu Roos (University of Helsinki). The defence will be held in English.

The thesis of Ville Hyvönen is a part of research done in the Department of Computer Science and in the Complex Systems Computation group at the University of Helsinki. His supervisor has been Professor Teemu Roos  (University of Helsinki).

A Machine Learning Approach to Approximate Nearest Neighbor Search

Approximate nearest neighbor (ANN) search is a fundamental algorithmic problem. The task is to build an index structure that facilitates fast but approximate nearest neighbor queries. ANN search is often used in downstream applications where data sets are large and high-dimensional and real-time responses are required. Important modern application areas include recommendation systems, training of large language models, inference in large-scale models, and information retrieval.

ANN algorithms can be classified into hashing-, tree-, graph-, and quantization-based methods. In this thesis, we consider partition-based ANN algorithms, i.e., tree- and hashing-based methods. The contributions of this thesis can be divided into two parts. First, we increase the efficiency and practicality of tree-based methods for ANN search. Second, we formulate candidate set selection for ANN search as a multilabel classification problem and explore the implications of the resulting machine learning framework.

In particular, we increase the efficiency of space-partitioning trees for ANN search by proposing sparse principal component (PCA) and random projection (RP) trees that are faster to build, require less memory, and have faster query times than their dense counterparts. Further, we propose voting search as a candidate set selection method for ANN search, and demonstrate that it is strictly superior to the earlier lookup search. Finally, we make the resulting ANN algorithm (an ensemble of sparse RP trees combined with voting search) practical by designing an automatic hyperparameter tuning algorithm for it. We released the ANN algorithm as an open source C++ library with Python bindings.

The second part of the thesis is based on the observation that the candidate set selection for ANN search can be formulated as a multilabel classification problem. In this multilabel classification framework partition-based index structures are interpreted as partitioning classifiers. This interpretation directly suggests the natural classifier as a new candidate set selection method. According to our experiments, the natural classifier has a superior performance compared to the earlier lookup search and voting search. The natural classifier can be used in combination with any partition-based index structure to increase its performance for ANN search.

In addition to increasing the performance of earlier partition-based index structures, the established framework enables using any existing multilabel classifier as an index structure for ANN search, thus opening a new research direction into an established algorithmic problem. We demonstrate this by using a multilabel random forest as an index structure for ANN search.

The formulation of the candidate set selection as a multilabel classification problem also enables considering ANN search in the standard statistical learning framework. Specifically, we provide a sufficient condition for the consistency of a partition-based index structure for ANN search. We establish the consistency of chronological k-d trees and both dense and sparse RP trees by verifying this sufficient condition.

Avail­ab­il­ity of the dis­ser­ta­tion

An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-9955-3.

Printed copies will be available on request from Ville Hyvönen: ville.o.hyvonen@helsinki.fi.