Talk:External sorting

Some recent research (IEEE link) may help address several of the points listed in the Wishlist below. 47.211.219.243 (talk) 18:42, 5 May 2023 (UTC)

What is the complexity, in terms of disc accesses? --80.250.159.240 (talk) 15:38, 29 May 2009 (UTC)

Wishlist

  • Vocab cleanup. What we call a "one-pass sort" others call "two-pass". There's confusion with internal mergesort as flagged in the next section.
  • Citations. The Vitter source and the papers at sortbenchmark.org probably have most of the info we need or could ever want about this.
  • Detailed analysis of performance. People writing, using, or interested in external sorts could use a formula or two to calculate things like the number of seeks and the amount of data transferred they'll need (see the sketch at the end of this section) -- that must be in Vitter, or very similar to stuff in it. It can help folks model the runtime, as well as what would happen if they added RAM, got faster or more disks, etc.
  • Common applications. Databases and full-text indexers do this. Maybe Index (search engine) or Lucene could use a bit about the algorithms, and link here.
  • Cutting-edge applications. There's lots of interesting stuff with solid citations in the CS literature to back it up:
    • STXXL (toolkit for easy external sorting)
    • External memory suffix array creation for more powerful full-text indexes
    • Pipelining algorithms to reduce I/O when several sorting and filtering steps are involved —Preceding unsigned comment added by 24.7.68.35 (talk) 07:35, 10 January 2010 (UTC)

24.5.242.3 (talk) 23:28, 10 January 2011 (UTC)
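
Since the wishlist item above asks for a formula or two, here is a minimal back-of-envelope sketch, assuming the usual run-formation-plus-k-way-merge model (one I/O block per input run, one block reserved for output). The function name and the example numbers are made up for illustration; the real analysis is in Vitter.

```python
# Rough cost model for an external merge sort: number of passes, bytes moved,
# and a crude seek count. Formulas follow the standard model; parameters are
# illustrative only.
from math import ceil, log

def external_sort_cost(data_bytes, ram_bytes, block_bytes):
    runs = ceil(data_bytes / ram_bytes)            # sorted runs after the initial pass
    fan_in = ram_bytes // block_bytes - 1          # input buffers per merge pass (one block reserved for output)
    merge_passes = ceil(log(runs, fan_in)) if runs > 1 else 0
    passes = 1 + merge_passes                      # run formation + merge passes
    transferred = 2 * data_bytes * passes          # every pass reads and writes all the data
    seeks = 2 * ceil(data_bytes / block_bytes) * passes   # roughly one seek per block read and per block write
    return passes, transferred, seeks

# Example: 50 GB of data, 1 GB of RAM, 100 MB I/O blocks -> 3 passes, ~300 GB moved, ~3000 seeks
print(external_sort_cost(50 * 10**9, 10**9, 100 * 10**6))
```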

External Sort != Merge Sort

The links to the Merge Sort page are wrong. The n-way merge done in an external sort is different from the internal merge sort algorithm. Knuth Vol. 3 contains the n-way merge algorithm. 75.35.77.112 (talk) 19:40, 17 December 2010 (UTC)

Issue with external merge sort algorithm based on k-way merge

The main issue is the k-way merge in the description. Step 4 of the external merge sort description is "Read the first 10 MB (= 100 MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer." The issue with this is that on a modern hard drive, given rotational delays and head settle time relative to the transfer rate, performing 9 random accesses to read the first 10 MB of 9 different 100 MB chunks takes almost as long as reading 900 MB sequentially. The process will end up doing 90 somewhat random accesses to read the 900 MB of chunks. It's generally faster to stick with a conventional 2-way merge, doing larger sequential reads and writes of data, in spite of the increased number of passes.

Rcgldr (talk) 08:53, 12 January 2012 (UTC)

What if we increase all the sizes 100x? It will still be 90 random accesses, but sequential reading times will be 100x longer, and it might become practical. Also we need to remember that even if the algorithm is not the most efficient one under certain circumstances, it still has a place in Wikipedia (along with WP:V analysis of the performance issues, if any). Ipsign (talk) 13:00, 12 January 2012 (UTC)
The stated sizes are fine; my math was wrong. I was thinking of 1/10th of the stated sizes. With a modern hard drive, a 10 MB read or write would take about 0.1 second or more, while the random access overhead would be about 0.02 second, so the random access overhead would not be that significant. Considering that a PC could have 4 GB to 32 GB of RAM, one could use 1 GB of RAM for the sort, and 100 MB reads or writes would take about 1 second each. 1.7 GB would be enough for a 16-way merge.
A side issue is that Merge_sort#Use_with_tape_drives contains a link to External_sorting, and in the case of tape drives, a k-way merge sort normally requires 2k tape drives. Perhaps the link from the tape section of merge sort should be removed. Rcgldr (talk) 16:53, 12 January 2012 (UTC)
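
For what it's worth, a quick check of the corrected figures above (about 0.1 s to transfer 10 MB, i.e. roughly 100 MB/s, and about 0.02 s of random-access overhead per read; both numbers are taken from the comment, not measured):

```python
# Seek overhead vs. sequential transfer for the 10-way merge discussed above,
# using the assumed drive figures from the comment (~100 MB/s, ~0.02 s/access).
transfer_rate = 100e6      # bytes per second (assumed)
access_overhead = 0.02     # seconds per random access (assumed)

total = 900e6              # 900 MB of sorted chunks to merge
buffer = 10e6              # 10 MB input buffers

accesses = total / buffer                  # 90 buffer refills
seek_time = accesses * access_overhead     # ~1.8 s of access overhead
read_time = total / transfer_rate          # ~9 s of sequential transfer
print(seek_time, read_time)                # overhead is ~20% of the transfer time
```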

quicksort is not a stable sort

The article mentions using quicksort to sort 100 MB chunks on the initial pass, but I'm wondering if stability of the sort (the order of records being preserved on equal compares) is an issue for a typical file sort. An example might be sorting a file by one key for some reason, then sorting later by a different key, but wanting "equal" records in the second sort to retain the first key ordering. If sorting a list of pointers to records, where comparison overhead may be greater than the cost of moving pointers, a RAM-based merge sort will usually be faster. If the amount of natural ordering is significant, then a natural merge sort would be significantly faster. It might be enough to just leave the reference to quicksort and mention merge sort in RAM if stability is required, or in some situations where a merge sort would be faster ... or, simpler still, just mention sorting in RAM by any (stable if needed) sort algorithm. Rcgldr (talk) 17:13, 12 January 2012 (UTC)
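
As a side note on the stability point: a stable sort by the second key is enough to preserve the first key's ordering on ties. A minimal sketch, using Python's built-in sort (which happens to be a stable merge-based sort); the record layout is made up:

```python
# Records that compare equal on the second sort's key keep the order
# established by the first sort, because the second sort is stable.
records = [("smith", 3), ("jones", 1), ("smith", 2), ("jones", 4)]

by_id = sorted(records, key=lambda r: r[1])     # first sort: by numeric id
by_name = sorted(by_id, key=lambda r: r[0])     # second sort: by name, stable

print(by_name)   # within each name, ids stay in ascending order from the first sort
```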

External sorting == file sorting?

I've seen people use the term "file sort(ing)" a few times as a synonym for the subject, most recently on reddit. My English skills are rather low, and even lower for terminology. Can someone verify whether these actually are synonymous? In that case we should create a redirect page. --Alexander N. Malakhov (talk) 07:15, 24 July 2012 (UTC)

An internal sort is done in the computer's main (internal) memory. An external sort uses external memory such as tape or disk drives (but not necessarily organized in a computer file system; the access may be raw). The internal/external terms are well defined. One can certainly say "sort a file" or "the file was sorted" in the context of a computer file, but that is applying the sort operation to an object (which might be a file, a database, a directory, or even a non-computer object such as recipes). In some contexts, "file sort" will mean an external sort, but I'm not sure it necessarily means an external sort: I might read the file into an application, internally sort the contents, and then write the file out. Furthermore, if I say "file sorting", one of the images that comes to mind is alphabetizing a lot of manila file folders in a dusty file cabinet. That involves sorting many manila "files" -- not records within the files. Consequently, I'm reluctant to add a file sort entry pointing here. Glrx (talk) 17:14, 25 July 2012 (UTC)

Have the economics changed?

The text now notes that hooking up more hard disks for sequential bandwidth can be a cost-efficient improvement. That's certainly still true, but it's been a long time since I did the math on whether, say, an SSD as a layer between RAM and spinning disks can sometimes be superior to just getting more RAM or more disks. As of a few years ago SSDs only helped if you were willing to pay a lot per GB of dataset, but they've gotten both faster and cheaper.

Another potentially relevant topic is, roughly, the pragmatics of huge-dataset sorting on real-world clusters. There's an obvious example in the big public clouds, where you have a menu of node/disk types and explicit cost figures for each, plus some known network/reliability constraints. But even in typical private environments, folks have the same problems of choosing the best of some imperfect hardware options and dealing with networking and reliability. I don't know just what we can say besides laying out some principles and maybe example calculations, but it's certainly the sort of problem that people concerned with external sorts actually face.

Obviously we have to write without original research, but if there's something worth the time to say, it's probably out there. 24.7.67.251 (talk) 21:51, 7 July 2013 (UTC)

So I considered the arithmetic a little more. For (very roughly) the price of a 256 GB SSD you could add 16 GB of ECC RAM to your machine. So an SSD is only likely to provide noticeable benefits at affordable cost 1) if your data is few enough GBs that you can afford to use *only* SSD for external storage (hooray), or 2) if your data is enough TBs that a purely RAM/HDD setup would spend substantial time on HDD seeks while merging, even with 16 GB of RAM. (16 TB is enough for this to kick in--by then you need a thousand merge buffers and a million seeks on a 16 GB RAM box; assuming the sequential part goes at 1 GB/s and you do 160 seeks/s, you might spend a third of the sort seeking--enough time to matter but not enough to justify another pass.) In the huge-data case, the SSD could cut the number of HDD seeks to roughly what it would be if you had 256 GB of RAM: the initial sort phase would sort SSD-sized chunks using two passes, and you could do larger reads from fewer chunks into scratch space on the SSD during the merge phase. Finally, a power-constrained environment (I don't know, maybe a laptop workstation?), or one where you share the system and so don't want to use all the RAM for your sort (plausible), may point towards using SSDs. 24.7.67.251 (talk) 20:54, 22 September 2013 (UTC) 24.7.67.251 (talk) 20:18, 27 November 2013 (UTC)
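
Spelling out the parenthetical arithmetic above, under exactly the stated assumptions (16 TB of data, 16 GB of RAM, 1 GB/s sequential, 160 seeks/s); this only counts the merge-pass reads, so treat it as a rough sanity check rather than a full model:

```python
# The "16 TB on a 16 GB RAM box" arithmetic from the comment above.
data = 16e12            # 16 TB of data
ram = 16e9              # 16 GB of RAM
seq_rate = 1e9          # 1 GB/s sequential transfer (assumed)
seeks_per_sec = 160     # random accesses per second (assumed)

runs = data / ram                     # ~1000 sorted runs to merge
buffer = ram / runs                   # ~16 MB per merge buffer
seeks = data / buffer                 # ~1,000,000 buffer refills
seek_time = seeks / seeks_per_sec     # ~6,250 s spent seeking
read_time = data / seq_rate           # ~16,000 s of sequential reading in the merge pass
print(seek_time / (seek_time + read_time))   # ~0.28, i.e. roughly a third of the merge-pass read time goes to seeks
```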

Distributed sorts

It would be good to sketch out what that looks like, with some links to what Sort Benchmark or Hadoop (or anything else that's been written up) do. Maybe individual machines sort their stuff, then everyone agrees on how to partition the data, then destination machines receive and merge streams from their sources. Maybe the machines just sample the data at the start and then partition before sorting. Maybe some wacky permutation of that; it would be neat to hear from folks who've implemented it. Also, I'm curious how it connects with other tasks (aggregation), which edges towards talking about systems like Hadoop/MapReduce.

We've also written in detail only about sort-chunks-and-merge, which seems like a common strategy. But we could cover bucket-then-sort, especially if there are practical uses (producing the first records earlier?). — Preceding unsigned comment added by 50.184.203.28 (talk) 07:49, 15 March 2015 (UTC)
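
A toy, single-process sketch of the sample-then-partition idea described above (sample the input, choose split points, route each record to the machine that owns its key range, then sort locally on each machine); the machine count and data are made up, and real systems obviously do the routing over a network:

```python
# Sample-then-partition sketch: partitions are disjoint key ranges, so
# concatenating the locally sorted partitions yields the global order.
import random
from bisect import bisect_right

data = [random.randrange(10**6) for _ in range(100_000)]
machines = 4

# 1. Sample the data and pick machines - 1 split points.
sample = sorted(random.sample(data, 1000))
splits = [sample[i * len(sample) // machines] for i in range(1, machines)]

# 2. Partition: each key goes to the "machine" owning its key range.
partitions = [[] for _ in range(machines)]
for key in data:
    partitions[bisect_right(splits, key)].append(key)

# 3. Each machine sorts its partition locally; concatenation is globally sorted.
result = [key for part in partitions for key in sorted(part)]
assert result == sorted(data)
```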

Not unrelatedly, the stuff about three passes doesn't ring quite right to me now. If the data is huge, you turn it into a distributed sort, or add RAM, or maybe add SSD, and 500 GB as "a lot of data" is maybe true in some casual sense or in comparison to 1 GB of RAM, but it doesn't reflect any fundamental CS truth. I think you need something about the math of sort cost as size varies, maybe with examples. It can just end up looking silly if it's too tied to a certain desktoppish scenario as of a certain time. — Preceding unsigned comment added by 24.7.64.61 (talk) 19:13, 27 March 2015 (UTC)
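
For what it's worth, the back-of-envelope version of that math: with memory M and I/O block size B, a single merge pass handles up to roughly M * (M/B - 1) of data, two merge passes handle up to M * (M/B - 1)^2, and so on. With 1 GB of RAM and 100 MB blocks that works out to about 9 GB for one merge pass and about 81 GB for two, so whether 500 GB counts as "a lot" depends entirely on M and B (and on seek costs) rather than on any fixed threshold. The sketch near the top of this page computes the same thing numerically.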

Performance - increasing hardware speed

If the internal sort consists of sorting an array of pointers to records rather than sorting the records themselves, then the "descriptors" for the scatter/gather hard disk controllers (used to deal with scattered physical data from a contiguous virtual memory block) could be programmed to write the records according to the sorted array of pointers in a single disk write command, avoiding having to move the actual records before doing a write. Although this should be obvious to anyone aware of scatter/gather controllers and sorting via an array of pointers, some Japanese company managed to get a patent for this (back in the 1980s or earlier, so the patent has expired by now). As for whether sorting via an array of pointers is faster than directly sorting the records, it depends on the record size: by my testing, the break-even point is somewhere between 128 bytes and 256 bytes, so for classical record sizes of 80 or 120 bytes, a conventional approach would be faster. Rcgldr (talk) 23:12, 6 February 2018 (UTC)
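
A rough user-space analogue of the gather-write idea above, in case it helps make it concrete: sort an array of indices (standing in for pointers), then express the sorted order purely through the buffer list handed to a single gather write, so the records never move in memory. This uses POSIX writev via Python's os.writev; the file name, record size, and key layout are made up, and a real implementation would program the controller's descriptors rather than go through the file system:

```python
# Gather-write of records in sorted order without rearranging them in memory.
import os

RECORD_SIZE = 128
NUM_RECORDS = 1000                                           # kept under typical IOV_MAX limits
records = bytearray(os.urandom(RECORD_SIZE * NUM_RECORDS))   # fixed-size records, back to back

def key(i):
    # Sort key: the first 8 bytes of record i (layout is made up).
    return bytes(records[i * RECORD_SIZE : i * RECORD_SIZE + 8])

order = sorted(range(NUM_RECORDS), key=key)                  # sort indices, not records

# One buffer per record, listed in sorted order, written with a single gather write.
buffers = [memoryview(records)[i * RECORD_SIZE : (i + 1) * RECORD_SIZE] for i in order]
fd = os.open("sorted_run.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.writev(fd, buffers)
os.close(fd)
```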