Friday, October 26, 2007

THE UNSHUFFLE ALGORITHM

----------------------------------
By Art S. Kagel


One of the most common program utilities is the sort. And one of the most common applications of a sort utility is to sort a list of data items that may already be well-ordered or indeed, sorted.

Sort algorithms are optimal for some classes of data while exhibiting worst-case behavior for others. Most sort algorithms handle well-ordered data no better than random data, and a few actually perform slightly better when the data to be sorted is random.

An algorithm that deals efficiently with well-ordered data is useful in many programming situations, such as sorting a user-entered list of similar items. People tend to enter a list of related items in a logical, though usually imperfect, order. If asked to enter a list of U.S. presidents, for example, most people will enter them in either chronological or alphabetical order, with some “errors” (for example, Ronald Reagan might be listed first in either case, since he is the current president and would probably come to many people's minds first).

Another major application for such an algorithm would be to repeatedly sort a list to which disorder may or may not be introduced from time to time. An example would be adding new items to the end of an already sorted list.

In developing the algorithm shown in Listing 1, I sought to create a method for sorting well-ordered data. What evolved was an extremely fast and efficient algorithm for data that is either well-ordered, well-ordered but reversed, or partially ordered. It compares favorably with the performance of other algorithms when applied to random data.

I have termed this algorithm the unshuffle algorithm (as opposed to sort) because of the principle behind it. My basic premise was to find an algorithm intelligent enough to know when a list is well-ordered or already sorted. (This concept is, of course, in the realm of artificial intelligence; indeed, my goal was not reached entirely.) I began by examining the nature of unordered lists with the idea that if I could determine how they become unordered, I could write an algorithm to efficiently reorder them.

Shuffling the deck

Think of a sorted list as a deck of cards, all one suit, in ascending order. If one wanted to create a condition of fairly random disorder within this deck, one might cut the deck into two or more subdecks and shuffle them back together. The resulting deck would appear to be randomly ordered (or rather, disordered).

In reality, each of the subdecks is still sorted after this first shuffle; the cards are simply intermingled. Undoing this shuffle would be a relatively simple problem to program. Of course, after several such shuffles the situation becomes more complex, and the problem of unshuffling the deck becomes equally complex but, one would hope, still manageable.

You might go about undoing the shuffle by turning the cards over one at a time onto piles. You could put a card on top of a pile if its value is greater than the top card showing on an existing pile, or start a new pile if it is less than the top card of every pile. The result when you finished would be several sorted piles of cards (unless the deck was already sorted, in which case the first pile would hold all the cards).

To get back to one sorted deck, you might select the largest card showing on any pile and turn it over as the bottom card of the final deck, then select the largest card still showing and turn it over on top of the first one, continuing in this manner until all the cards had been merged back into the original deck in sorted order.

This method of undoing the shuffle is the essence of the unshuffle algorithm. The algorithm tries to put each item on either the top or the bottom of a pile before testing it against the next pile. It then merges the piles one at a time into the first pile using a modification of the classic two-pile merge.
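
As a concrete illustration, here is a minimal sketch of the distribution phase in Python (my own rendering for exposition, not the author's MS Pascal listing), with each pile held as a double-ended queue:

from collections import deque

def distribute(items):
    """Distribution phase: deal the input onto piles, each kept in sorted order."""
    piles = []
    for x in items:
        for pile in piles:
            if x <= pile[0]:        # no larger than the pile's top: put it on top
                pile.appendleft(x)
                break
            if x >= pile[-1]:       # no smaller than the pile's bottom: put it underneath
                pile.append(x)
                break
        else:                       # fits no existing pile, so start a new one
            piles.append(deque([x]))
    return piles

On an already sorted input every element lands on the bottom of the first pile after two tests; on a reversed input every element lands on top of the first pile after one test; only data that keeps falling strictly inside the first pile's range spills onto further piles.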

Distribution and merge

This algorithm is more efficient for a list that is already well-ordered (like the first-order shuffle) because, unlike a sort (which tries to introduce order into a disorderly world), the unshuffle algorithm tries to extract what order it can find. The algorithm simply goes about distributing cards onto piles until there are no more in the deck. It then performs a simple merge of the piles. If the data is orderly there will be fewer piles, and both the distribution phase and the merge phase will run more efficiently, making fewer comparisons.

Most sort algorithms need to complete their tasks even if the data is completely sorted. A classic shell sort, for example, would need to make at least N/I passes (where I is the number of initial partitions—see D. Knuth’s The Art of Computer Programming, vol. 3, page 84ff), even over an already sorted list of N elements, making from N/K(1) to N-1 comparisons (where K(i)= K(i-1)/I and K(1)=N/I) on each pass.

The unshuffle algorithm can perform its distribution phase in just one pass with from N-1 to N**2/2 comparisons, while the merge phase performs from N-4 to N**2/8 comparisons. In addition, the algorithm gains efficiency over other sorts because it never exchanges two elements and never inserts an element between two others—both expensive operations. It is also space efficient since relatively few piles are created (except in the worst case, where there are N/2 piles).

The distribution phase of the unshuffle algorithm exhibits its worst-case behavior under the rare circumstance that the smallest item is followed by the largest, which is followed by the next smallest followed by the next largest, etc. (for example, 1, 10, 2, 9, 3, 8, 4, 7, 5, 6). The order of the distribution phase in this case becomes ((N**2)-1)/2, which is proportional to N**2.

Fortunately, this class of data also represents the best general case for the merge phase of the algorithm, which in this instance is of order N. The best case for the distribution occurs when the list to be operated on is perfectly ordered but reversed (such as 10, 9, … 2, 1). In this case the order of the distribution phase of the algorithm becomes (N-1), approximately N.

For a perfectly ordered list the order is (N-1)*2, which is also proportional to N. (The orders of the sorted and sorted-but-reversed cases could be exchanged by reversing the sequence in which the next element is tested against the top and bottom elements of each pile.) The order of the merge phase in the perfectly ordered cases is zero, as only one pile is generated and there is nothing to merge.
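
These counts can be checked empirically. The following is a small, hypothetical counting harness (my own, and it assumes the distribute() sketch shown earlier is in scope): it wraps each value so that every <= or >= test is tallied, then runs the interleaved worst case and the reversed best case.

class Counted:
    """Wraps a value and counts every comparison made against it."""
    calls = 0
    def __init__(self, v):
        self.v = v
    def __le__(self, other):
        Counted.calls += 1
        return self.v <= other.v
    def __ge__(self, other):
        Counted.calls += 1
        return self.v >= other.v

def distribution_comparisons(values):
    Counted.calls = 0
    distribute([Counted(v) for v in values])
    return Counted.calls

n = 10
worst = [x for pair in zip(range(1, n // 2 + 1), range(n, n // 2, -1)) for x in pair]
print(distribution_comparisons(worst))                  # interleaved: 1, 10, 2, 9, ...
print(distribution_comparisons(list(range(n, 0, -1))))  # reversed: 10, 9, ..., 1

Because this sketch makes two one-sided tests per pile rather than the single three-way test described later, the tallies it prints may differ slightly from the formulas above.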

The merge phase of the unshuffle algorithm takes advantage of a property of the piles to minimize the number of comparisons needed and to eliminate unnecessary code. This property is a result of the logic used during the distribution phase. In a simple merge of two lists (see The Art of Computer Programming, vol. 3, page 159f) containing M and N elements, M+N-1 comparisons (at most) would be necessary. On the other hand, three lists of M, N and L elements merged in pairs would take at most 2*(M+N)+L-2 comparisons.

In general, the order of a pairwise merge of P equal piles is (N/(P-1)*(P*(P+1)/2))/2. But due to certain properties of the distribution onto piles, the merge phase of the unshuffle algorithm will need at most N*(P-1)-2*(P-1)**2 comparisons when all piles contain two elements except the first (where P is the number of piles).

When all piles have two elements except the last, the merge phase exhibits its best general case behavior, performing N-2 comparisons. At its worst (when P=N/4+1), this general equation is marginally better than a general merge (one with no knowledge). At its best (when P=N/2, where only N-4 comparisons are performed), it is considerably better.
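As a quick check of the worst case, with N = 16 and P = N/4+1 = 5 the formula gives 16*4 - 2*4**2 = 32 comparisons, which is exactly N**2/8.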

The distribution of the elements of the original list onto piles guarantees that the smallest element will be on top of the first pile created and the largest element will be on the bottom of the first pile. Each pile is wholly contained between the first and the last elements of the pile created before it. Therefore, when merging the Rth pile into the list that resulted from merging the first through (R-1)th piles, it is not necessary to compare elements in the Rth pile to any element in the previously merged list that is smaller than the element that was originally the smallest on the (R-1)th pile.

In addition, only the pile being merged, and never the previously merged list, can be exhausted first, as long as you begin the merge with the first pile and end it with the last pile. This property eliminates the complex logic needed to keep track of which pile will become empty first.
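
Continuing the Python sketch from the distribution phase above (again my own illustration, not the published Pascal), the merge phase folds each pile in turn into the list built so far; the containment property means the scan can never run off the end of the merged list before the pile is exhausted:

def merge_piles(piles):
    """Merge phase: fold each pile, in creation order, into the merged list."""
    if not piles:
        return []
    merged = list(piles[0])
    for pile in piles[1:]:
        out = []
        i = 0
        for x in pile:
            # The last element of merged is the overall largest, so this scan
            # always stops before i runs past the end of the merged list.
            while merged[i] <= x:
                out.append(merged[i])
                i += 1
            out.append(x)
        out.extend(merged[i:])
        merged = out
    return merged

def unshuffle(items):
    return merge_piles(distribute(items))

When the input is already sorted, distribute() returns a single pile and there is nothing for merge_piles() to do, which matches the zero-comparison merge case described earlier.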

Unshuffle in action

Table 1 (not available here) presents comparative timings that resulted from an implementation of the unshuffle algorithm and a similar implementation of shell sort. The implementation was written in MS Pascal using standard language constructs and MS library procedure calls. Since it was designed as a library utility it is generally useful, with user-supplied procedures (passed as parameters) for comparisons and for managing data items. It is therefore independent of the data item's storage structure and type.

However, this implementation is somewhat slower than a pure implementation of the algorithm optimized for a specific data structure, due to procedure-call overhead. The user-supplied comparison procedure looks at one element in relation to two others to determine whether it is less than or equal to the first, greater than or equal to the second, or somewhere between the two. This procedure improves performance in all cases except the well-ordered-but-reversed case.
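
The published interface is not reproduced here, but a rough sketch of that three-way test (in Python, with names of my own invention) might look like this:

from enum import Enum

class Fit(Enum):
    TOP = 1       # element is less than or equal to the pile's top
    BOTTOM = 2    # element is greater than or equal to the pile's bottom
    BETWEEN = 3   # element falls strictly between: try the next pile

def fit_against(element, top, bottom):
    """Decide in one call where an element stands relative to a pile's ends."""
    if element <= top:
        return Fit.TOP
    if element >= bottom:
        return Fit.BOTTOM
    return Fit.BETWEEN

One call of fit_against() per pile replaces separate top and bottom tests, which is where the procedure-call saving noted in the timing discussion comes from.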

The shell sort algorithm used is a similar utility that also has user-supplied procedures for comparisons and data item management. The shell sort uses the K(i)=K(i-1)/I, K(1)=N/I, and I=2 algorithm, where K is the interval of comparison and I is the number of partitions for the first pass. It is a good implementation that is considerably more efficient for well-ordered data than for random data.

The timings were run on an IBM PC/AT with word data stored in a singly linked list allocated on the long heap. Both algorithms were called to sort identical lists created using a random number generator (values were in the range 0…64K). Both were then asked to resort their output to acquire the timings for presorted data. Times were taken from the system clock immediately before and after the calls to the respective sort procedures.

It is obvious from the timings in Table 1 that for already-sorted data the order of the sort is approximately N. This variance from the expected order of ~2N is due to the implementation's use of a single procedure call to compare an element to two others; the gain is a savings of procedure-call overhead rather than algorithmic efficiency.

(Note from the timings that the order of the unshuffle algorithm is apparently 3N for the pseudorandom data that was generated. I have not attempted to determine the theoretical order of the algorithm for the random case and would welcome any reader input on the subject.)

Listing 1 shows the unshuffle algorithm in words. Listing 2 is a pure implementation of the algorithm in MS Pascal for integer data stored on the heap in a singly linked list. A third listing (which shows the generalized implementation used for the timings in Table 1) can be found on the COMPUTER LANGUAGE Bulletin Board Service and CompuServe Forum.

I should note that the unshuffle algorithm does have one restriction: you can only create implementations of it for sorting a linked list containing either data elements or pointers to data elements, including array indices and file positions. I have not yet been able to write an implementation that can sort arrays directly.







-----------------------------------------------------------------------------------

Art Kagel has over five years’ experience as an analyst and programmer developing applications for media market research firms and the insurance industry. He is currently a consultant with Business Logic Inc., New York, N.Y., a consulting firm providing computerized solutions to business problems.
(This biographical note was written around the 1980s.)

(I found this article interesting and so I published it. It is taken from a very old magazine about computers (I forget the name), from around the 1980s.)
