String sorting algorithms


In computer science, string sorting algorithms are a special case of sorting algorithms, where the input is an array of strings with characters chosen from an alphabet Σ.

An example of sorting strings

Unlike traditional sorting algorithms that deal with atomic keys, string sorting encounters unique challenges. Sorting strings with conventional atomic sorting algorithms, which treat keys as indivisible objects, is inefficient because comparing entire strings can be costly and must be performed many times. Efficient string sorting algorithms, in contrast, inspect most characters of the input at most once during the entire sorting process and examine only those characters that are necessary to establish the correct ordering. Another challenge is that strings are typically represented as arrays of pointers to the character data. This representation results in indirect access to string characters, leading to cache misses even when simply scanning an array of strings; this is in contrast to sorting atomic keys, where scanning is notably cache-efficient. The efficiency of string sorting algorithms depends upon multiple factors, including the size of the dataset (n), the distinguishing prefix size (D), which is the minimal number of characters that need to be examined to sort the strings, the number of subproblems (σ) into which an algorithm breaks down the problem, and the underlying hardware. This indicates that no single algorithm is universally optimal.
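
The distinguishing prefix size can be made concrete with a small worked example. The following Python sketch (the function name and the pairwise computation are illustrative, not taken from the literature) sums, for each string, the length of the shortest prefix that separates it from every other string, capped at the string's own length.

```python
def distinguishing_prefix_size(strings):
    # D: for each string, count the characters of the shortest prefix that
    # distinguishes it from every other string (at most the whole string).
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    total = 0
    for i, s in enumerate(strings):
        longest = max((lcp(s, t) for j, t in enumerate(strings) if j != i), default=0)
        total += min(len(s), longest + 1)
    return total

# Example: for ["apple", "apply", "apricot"], n = 3 and D = 5 + 5 + 3 = 13,
# since "apple" and "apply" must be read completely, while "apr" already
# separates "apricot" from the other two strings.
```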

Sequential methods

Multikey quicksort

 
An example of multikey quicksort for sorting strings

Developed by Bentley and Sedgewick in 1997, this algorithm is an adaptation of traditional quicksort, tailored for string sorting.[1] Given a set of strings with a common prefix of length h, it uses the (h+1)-th character c of a pivot string as a splitter, organizing the strings into three distinct arrays according to whether their (h+1)-th character is smaller than, equal to, or larger than c. The smaller and larger arrays are sorted recursively with the same common prefix length h, while the equal array, whose strings now share a common prefix of length h + 1, is recursed upon with h + 1; recursion terminates when an array contains at most one string or when the splitter marks the end of the strings. With insertion sort as a base case sorter for constant input sizes, multikey quicksort has a complexity of O(D + n log n).
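
The following Python sketch illustrates the three-way partitioning described above. It is a minimal illustration, not the original in-place formulation: the splitter is simply taken from the middle string, partitioning is done with list comprehensions, and no insertion-sort base case is used.

```python
def multikey_quicksort(strings, h=0):
    # Sketch of Bentley-Sedgewick multikey quicksort; all strings are
    # assumed to share a common prefix of length h.
    if len(strings) <= 1:
        return list(strings)

    def key(s):
        # The (h+1)-th character, or '' as an end-of-string sentinel that
        # sorts before every real character.
        return s[h] if h < len(s) else ''

    pivot = key(strings[len(strings) // 2])  # illustrative splitter choice
    smaller = [s for s in strings if key(s) < pivot]
    equal   = [s for s in strings if key(s) == pivot]
    larger  = [s for s in strings if key(s) > pivot]

    # The 'equal' strings share one more character; once the splitter is the
    # end-of-string sentinel they are identical and need no further work.
    middle = equal if pivot == '' else multikey_quicksort(equal, h + 1)
    return multikey_quicksort(smaller, h) + middle + multikey_quicksort(larger, h)
```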

Most significant digit (MSD) radix sort

 
An example of MSD radix sort for sorting strings

Most significant digit (MSD) radix sort is especially efficient for sorting large datasets, particularly when the alphabet size is small.[2] Given a set of strings with a common prefix of length h, the algorithm initiates sorting by examining the (h+1)-th character of each string, dividing the dataset into up to σ distinct subproblems. Each subproblem is then recursively sorted with the common prefix length h + 1. This strategy, which is a natural approach to string sorting, has been subject to numerous refinements and improvements across various studies in the literature.[3][4][5] The time complexity is O(D) plus the time required for sorting the base cases. For example, with multikey quicksort as the base case sorter, MSD radix sort has a complexity of O(D + n log σ).
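
A minimal Python sketch of this strategy is shown below. The dictionary of buckets, the threshold, and the use of the built-in comparison sort as base case sorter are illustrative choices; the complexity stated above assumes multikey quicksort for the base cases instead.

```python
from collections import defaultdict

def msd_radix_sort(strings, h=0, threshold=32):
    # All strings are assumed to share a common prefix of length h.
    if len(strings) <= 1:
        return list(strings)
    if len(strings) < threshold:
        return sorted(strings)          # base case sorter for small subproblems

    finished = []                       # strings of length exactly h: already in place
    buckets = defaultdict(list)         # one subproblem per (h+1)-th character
    for s in strings:
        if len(s) == h:
            finished.append(s)
        else:
            buckets[s[h]].append(s)

    result = finished
    for c in sorted(buckets):           # visit the buckets in character order
        result += msd_radix_sort(buckets[c], h + 1, threshold)
    return result
```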

Burstsort

Burstsort uses a trie-based structure with containers at the leaves for sorting the strings.[6][7] Upon reaching a predefined threshold, these containers "burst", redistributing the strings into new containers based on their next character. These new containers are then attached to the appropriate child nodes of the trie. The sorting process involves traversing the trie and individually sorting the small containers. Key factors influencing the runtime efficiency of burstsort include the trie implementation, the design of the containers, the burst threshold, and the chosen base algorithm for sorting the containers. Sinha and Zobel used an array for each trie node and unordered dynamic arrays of string pointers for the leaf containers, with a bursting threshold set at 8192.[8] With this configuration and multikey quicksort for sorting the leaves, burstsort has a complexity of O(D + n log σ).
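
The following Python sketch illustrates the insert-burst-traverse cycle. It deliberately simplifies the configuration described above: each trie node uses a dictionary instead of an array, the threshold is tiny, and the leaf containers are sorted with the built-in comparison sort rather than multikey quicksort.

```python
BURST_THRESHOLD = 8          # small value for illustration; Sinha and Zobel used 8192

class Node:
    # A trie node is either a leaf holding an unsorted container of
    # (string, depth) entries, or, after bursting, an internal node with
    # one child per character.
    def __init__(self):
        self.children = {}   # next character -> Node (used after bursting)
        self.container = []  # leaf container; set to None when the node bursts
        self.ended = []      # strings that end exactly at this internal node

def insert(root, s):
    node, depth = root, 0
    while node.container is None:                    # walk down through burst nodes
        if depth == len(s):
            node.ended.append(s)
            return
        node = node.children.setdefault(s[depth], Node())
        depth += 1
    node.container.append((s, depth))
    if len(node.container) > BURST_THRESHOLD:
        burst(node)

def burst(node):
    # Redistribute the container by the next character, turning the leaf into
    # an internal node. (A full implementation would also burst any child that
    # immediately exceeds the threshold.)
    entries, node.container = node.container, None
    for s, depth in entries:
        if depth == len(s):
            node.ended.append(s)
        else:
            node.children.setdefault(s[depth], Node()).container.append((s, depth + 1))

def traverse(node):
    # Depth-first traversal: sort each small container individually and
    # concatenate the results in character order.
    if node.container is not None:
        return [s for s, d in sorted(node.container, key=lambda e: e[0][e[1]:])]
    result = list(node.ended)
    for c in sorted(node.children):
        result += traverse(node.children[c])
    return result

def burstsort(strings):
    root = Node()
    for s in strings:
        insert(root, s)
    return traverse(root)
```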

LCP-mergesort

LCP-mergesort is an adaptation of the traditional merge sort algorithm that stores and reuses the longest common prefixes (LCPs) of consecutive strings in the sorted subproblems.[9] This strategy enhances the efficiency of string comparisons. In the conventional method, two strings s1 and s2 must be compared character by character. If, however, the LCPs of s1 and s2 with a previously output string p that is lexicographically no larger than either of them are known, part of this work can be skipped: if the LCP between p and s1 is shorter than that between p and s2, then s2 precedes s1 in lexicographical order, because s2 shares a longer common prefix with the smaller string p than s1 does. The symmetric case holds as well, and only when the two LCPs are equal do the characters beyond the shared prefix have to be compared. LCP-mergesort has a worst-case time complexity of O(D + n log n).
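
The heart of the merge step is the LCP-aware comparison described above, sketched here in Python; the function name and return convention are illustrative. Given two candidate strings whose LCPs with the previously output string p are known, characters are compared only when those two LCPs are equal.

```python
def lcp_compare(a, lcp_pa, b, lcp_pb):
    # Order two strings a and b that are both lexicographically >= the
    # previously output string p, given lcp_pa = lcp(p, a) and
    # lcp_pb = lcp(p, b). Returns (a_first, lcp_ab), where lcp_ab is kept
    # for whichever string stays in the merge.
    if lcp_pa > lcp_pb:
        return True, lcp_pb    # a shares the longer prefix with p, so a <= b
    if lcp_pa < lcp_pb:
        return False, lcp_pa   # symmetric case: b precedes a
    # Equal LCPs: a and b agree on their first lcp_pa characters, so the
    # character-by-character comparison starts only from that position.
    k = lcp_pa
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    a_first = k == len(a) or (k < len(b) and a[k] < b[k])
    return a_first, k
```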

Insertion sort

Insertion sort is frequently used as the base case sorter for small sets of strings.[10] The algorithm maintains an ordered array and inserts the unsorted items into their appropriate positions through linear scanning. This method treats strings as atomic units, necessitating full string comparisons during the linear scan to ensure the correct order. It has a worst-case time complexity of O(n · D), so it is particularly well suited to small n and D, also because of the cache-efficient manner in which the array of strings is scanned.
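
A minimal Python sketch of string insertion sort is given below; it relies on the language's built-in lexicographic comparison, which performs the character-by-character work of the full string comparisons mentioned above.

```python
def string_insertion_sort(strings):
    # Sorts a small array of strings by scanning the already sorted prefix
    # linearly and shifting larger strings one position to the right.
    a = list(strings)
    for i in range(1, len(a)):
        current = a[i]
        j = i - 1
        while j >= 0 and a[j] > current:   # full string comparison
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = current
    return a
```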

Parallel methods

The exploration of parallel string sorting algorithms remains limited, yet parallelism has become the principal way to turn the transistor growth described by Moore's law into additional performance, since modern processors gain cores rather than clock speed.[11] The scalability of an algorithm in a parallel computing environment depends on various factors, similar to those affecting sequential methods. Many of the algorithms discussed in the sequential context can be adapted for parallel execution.

References

  1. ^ Bentley, Jon L.; Sedgewick, Robert. "Fast algorithms for sorting and searching strings". Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 1997, pp. 360–369.
  2. ^ Kärkkäinen, Juha; Rantala, Tommi. "Engineering radix sort for strings". International Symposium on String Processing and Information Retrieval, Springer, Berlin, Heidelberg, 2008, pp. 3–14.
  3. ^ Kärkkäinen, Juha; Rantala, Tommi. "Engineering radix sort for strings". International Symposium on String Processing and Information Retrieval, Springer, Berlin, Heidelberg, 2008, pp. 3–14.
  4. ^ Ng, Waihong; Kakehi, Katsuhiko. "Cache efficient radix sort for string sorting". IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 90, no. 2, 2007, pp. 457–466.
  5. ^ Andersson, Arne; Nilsson, Stefan. "Implementing radixsort". Journal of Experimental Algorithmics (JEA), vol. 3, 1998, article 7.
  6. ^ Sinha, Ranjan; Zobel, Justin. "Efficient trie-based sorting of large sets of strings". Proceedings of ACSC, 2003, pp. 11–18.
  7. ^ Heinz, Steffen; Zobel, Justin; Williams, Hugh E. "Burst tries: a fast, efficient data structure for string keys". ACM Transactions on Information Systems (TOIS), vol. 20, no. 2, 2002, pp. 192–223.
  8. ^ Sinha, Ranjan; Zobel, Justin. "Cache-conscious sorting of large sets of strings with dynamic tries". Journal of Experimental Algorithmics (JEA), vol. 9, 2004, article 1.5.
  9. ^ Ng, Waihong; Kakehi, Katsuhiko. "Merging string sequences by longest common prefixes". IPSJ Digital Courier, vol. 4, 2008, pp. 69–78.
  10. ^ McClellan, Michael T.; Minker, Jack. "The Art of Computer Programming, Vol. 3: Sorting and Searching". 1974.
  11. ^ Moore, Gordon E. "Cramming more components onto integrated circuits". Proceedings of the IEEE, vol. 86, no. 1, 1998, pp. 82–85.