Mr. GAN Junhao and Prof. Yufei TAO Won the Best Paper Award at ACM Conference on Management of Data (SIGMOD) 2015

Congratulations to Mr. GAN Junhao and Prof. Yufei TAO, who won the Best Paper Award at the ACM Conference on Management of Data (SIGMOD) 2015.

SIGMOD is the Association for Computing Machinery’s Special Interest Group on Management of Data, which specializes in large-scale data management problems and databases. The annual SIGMOD conference is one of the most prestigious conferences in database research. It is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. The best paper award recognizes the highest quality paper at the conference.

Title: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Paper link: http://dl.acm.org/citation.cfm?doid=2723372.2737792

Abstract:
DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method’s vast success is the research community’s quest for its efficient computation. The original KDD’96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n^2) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem.

In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n^(4/3)) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. This (i) explains why the community’s search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter’s computational intractability.
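
For readers unfamiliar with DBSCAN, the following is a minimal illustrative sketch of exact DBSCAN computed by brute force, which makes the quadratic cost discussed in the abstract concrete. It is not the algorithm from the KDD’96 paper, nor the ρ-approximate algorithm from the awarded paper; the function name `brute_force_dbscan`, the `NOISE` label, and the parameter names `eps` and `min_pts` are assumptions chosen for illustration only.

```python
from collections import deque
from math import dist

NOISE = -1  # illustrative label for points that belong to no cluster


def brute_force_dbscan(points, eps, min_pts):
    """Exact DBSCAN via all-pairs distance checks (quadratic in len(points))."""
    n = len(points)

    # Each point's eps-neighborhood (including itself), found by scanning
    # every other point: this all-pairs step is what costs O(n^2).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points within distance eps.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster outward from an unlabelled core point.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    # Only core points keep expanding the cluster;
                    # border points are labelled but not expanded.
                    if is_core[q]:
                        queue.append(q)
        cluster_id += 1
    return labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(brute_force_dbscan(sample, eps=1.5, min_pts=3))
```

On the small sample above, the two tight groups of three points form separate clusters and the isolated point is labelled as noise. The neighborhood computation alone compares every pair of points, which is why this baseline becomes impractical as n grows.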