In data structures, the range mode query problem asks to build a data structure on some input data to efficiently answer queries asking for the mode of any consecutive subset of the input.

Problem statement edit

Given an array  , we wish to answer queries of the form  , where  . The mode   of any array   is an element   such that the frequency of   is greater than or equal to the frequency of  . For example, if  , then   because it occurs three times, while all other values occur fewer times. In this problem, the queries ask for the mode of subarrays of the form  .

Theorem 1 edit

Let   and   be any multisets. If   is a mode of   and  , then   is a mode of  .

Proof edit

Let   be a mode of   and   be its frequency in  . Suppose that   is not a mode of  . Thus, there exists an element   with frequency   that is the mode of  . Since   is the mode of   and that  , then  . Thus,   should be the mode of   which is a contradiction.

Results edit

Space Query Time Restrictions Source
    [1]
      is the word size [1]
    [2]
      [1]
      [2]

Lower bound edit

Any data structure using   cells of   bits each needs   time to answer a range mode query.[3]

This contrasts with other range query problems, such as the range minimum query which have solutions offering constant time query time and linear space. This is due to the hardness of the mode problem, since even if we know the mode of   and the mode of  , there is no simple way of computing the mode of  . Any element of   or   could be the mode. For example, if   and its frequency is  , and   and its frequency is also  , there could be an element   with frequency   in   and frequency   in  .  , but its frequency in   is greater than the frequency of   and  , which makes   a better candidate for   than   or  .

Linear space data structure with square root query time edit

This method by Chan et al.[1] uses   space and   query time. By setting  , we get   and   bounds for space and query time.

Preprocessing edit

Let   be an array, and   be an array that contains the distinct values of A, where   is the number of distinct elements. We define   to be an array such that, for each  ,   contains the rank (position) of   in  . Arrays   can be created by a linear scan of  .

Arrays   are also created, such that, for each  ,  . We then create an array  , such that, for all  ,   contains the rank of   in  . Again, a linear scan of   suffices to create arrays   and  .

It is now possible to answer queries of the form "is the frequency of   in   at least  " in constant time, by checking whether  .

The array is split B into   blocks  , each of size  . Thus, a block   spans over  . The mode and the frequency of each block or set of consecutive blocks will be pre-computed in two tables   and  .   is the mode of  , or equivalently, the mode of  , and   stores the corresponding frequency. These two tables can be stored in   space, and can be populated in   by scanning     times, computing a row of   each time with the following algorithm:

algorithm computeS_Sprime is
    input: Array B = [0:n - 1], 
        Array D = [0:Delta - 1], 
        Integer s
    output: Tables S and Sprime
    let S ← Table(0:n - 1, 0:n - 1)
    let Sprime ← Table(0:n - 1, 0:n - 1)
    let firstOccurence ← Array(0:Delta - 1)
    for all i in {0, ..., Delta - 1} do
        firstOccurence[i] ← -1 
    end for
    for i ← 0:s - 1 do    
        let j ← i × t
        let c ← 0
        let fc ← 0
        let noBlock ← i
        let block_start ← j
        let block_end ← min{(i + 1) × t - 1, n - 1}
        while j < n do    
            if firstOccurence[B[j]] = -1 then
                firstOccurence[B[j]] ← j
            end if		
            if atLeastQInstances(firstOccurence[B[j]], block_end, fc + 1) then
                c ← B[j]
                fc ← fc + 1
            end if		
            if j = block_end then
                S[i * s + noBlock] ← c
                Sprime[i × s + noBlock] ← fc			
                noBlock ← noBlock + 1
                block_end ← min{block_end + t, n - 1}
            end if
        end while
        for all j in {0, ..., Delta - 1} do
            firstOccurence[j] ← -1 
        end for
    end for

Query edit

We will define the query algorithm over array  . This can be translated to an answer over  , since for any  ,   is a mode for   if and only if   is a mode for  . We can convert an answer for   to an answer for   in constant time by looking in   or   at the corresponding index.

Given a query  , the query is split in three parts: the prefix, the span and the suffix. Let   and  . These denote the indices of the first and last block that are completely contained in  . The range of these blocks is called the span. The prefix is then   (the set of indices before the span), and the suffix is   (the set of indices after the span). The prefix, suffix or span can be empty, the latter is if  .

For the span, the mode   is already stored in  . Let   be the frequency of the mode, which is stored in  . If the span is empty, let  . Recall that, by Theorem 1, the mode of   is either an element of the prefix, span or suffix. A linear scan is performed over each element in the prefix and in the suffix to check if its frequency is greater than the current candidate  , in which case   and   are updated to the new value. At the end of the scan,   contains the mode of   and   its frequency.

Scanning procedure edit

The procedure is similar for both prefix and suffix, so it suffice to run this procedure for both:

Let   be the index of the current element. There are three cases:

  1. If  , then it was present in   and its frequency has already been counted. Pass to the next element.
  2. Otherwise, check if the frequency of   in   is at least   (this can be done in constant time since it is the equivalent of checking it for  ).
    1. If it is not, then pass to the next element.
    2. If it is, then compute the actual frequency   of   in   by a linear scan (starting at index  ) or a binary search in  . Set   and  .

This linear scan (excluding the frequency computations) is bounded by the block size  , since neither the prefix or the suffix can be greater than  . A further analysis of the linear scans done for frequency computations shows that it is also bounded by the block size.[1] Thus, the query time is  .

Subquadratic space data structure with constant query time edit

This method by [2] uses   space for a constant time query. We can observe that, if a constant query time is desired, this is a better solution than the one proposed by Chan et al.,[1] as the latter gives a space of   for constant query time if  .

Preprocessing edit

Let   be an array. The preprocessing is done in three steps:

  1. Split the array   in   blocks  , where the size of each block is  . Build a table   of size   where   is the mode of  . The total space for this step is  
  2. For any query  , let   be the block that contains   and   be the block that contains  . Let the span be the set of blocks completely contained in  . The mode   of the block can be retrieved from  . By Theorem 1, the mode can be either an element of the prefix (indices of   before the start of the span), an element of the suffix (indices of   after the end of the span), or  . The size of the prefix plus the size of the suffix is bounded by  , thus the position of the mode isstored as an integer ranging from   to  , where  indicates a position in the prefix/suffix and   indicates that the mode is the mode of the span. There are   possible queries involving blocks   and  , so these values are stored in a table of size  . Furthermore, there are   such tables, so the total space required for this step is  . To access those tables, a pointer is added in addition to the mode in the table   for each pair of blocks.
  3. To handle queries   where   and   are in the same block, all such solutions are precomputed. There are   of them, they are stored in a three dimensional table   of this size.

The total space used by this data structure is  , which reduces to   if we take  .

Query edit

Given a query  , check if it is completely contained inside a block, in which case the answer is stored in table  . If the query spans exactly one or more blocks, then the answer is found in table  . Otherwise, use the pointer stored in table   at position  , where   are the indices of the blocks that contain respectively   and  , to find the table   that contains the positions of the mode for these blocks and use the position to find the mode in  . This can be done in constant time.

References edit

  1. ^ a b c d e f Chan, Timothy M.; Durocher, Stephane; Larsen, Kasper Green; Morrison, Jason; Wilkinson, Bryan T. (2013). "Linear-Space Data Structures for Range Mode Query in Arrays" (PDF). Theory of Computing Systems. Springer: 1–23.
  2. ^ a b c Krizanc, Danny; Morin, Pat; Smid, Michiel H. M. (2003). "Range Mode and Range Median Queries on Lists and Trees" (PDF). ISAAC: 517–526. arXiv:cs/0307034. Bibcode:2003cs........7034K.
  3. ^ Greve, M; Jørgensen, A.; Larsen, K.; Truelsen, J. (2010). "Cell probe lower bounds and approximations for range mode". Automata, Languages and Programming: 605–616.