Home

Random sampling techniques for space e cient online computation of order statistics of large datasets


Author(s) : Gurmeet Singh Manku, 
Publisher : N/A
Publication Date : 1999
ISSN : N/A
Abstract : In a recent paper [MRL98], we had described a general framework for single pass approximate quantile nding algorithms. This framework included several known algorithms as special cases. We had identi ed a new algorithm, within the framework, which had a signi cantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space e cient algorithms for approximate quantile nding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present anovel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1 % of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quanti ably better when estimating extreme values than is the case with the median. 1,