Your post is full of assumptions, omissions, etc. No worries. I'll sort you out.
Quote:
Tycho, believe it or not, the statements about quicksort are based on empirical data. It is true that merge sort parallelizes particularly well, but the best external-sort routine I've seen for distributed systems is based on column sort. Most of the complexity deals with handling the slowest cases, an issue you don't even appear to be considering.
I'm aware of that, which is why I stressed not only highly randomized data but also presorted and reverse-sorted data. If there's a worst case, these three different data sets are going to nail it.
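For reference, here's roughly what I mean by those three data sets (a sketch written for this post, not my actual harness; sizes and types are placeholders):
Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

enum class Shape { Presorted, ReverseSorted, Randomized };

// Build one of the three stress data sets.
std::vector<uint32_t> make_dataset(std::size_t n, Shape shape) {
    std::vector<uint32_t> v(n);
    std::iota(v.begin(), v.end(), 0u);              // 0,1,2,... = presorted
    if (shape == Shape::ReverseSorted)
        std::reverse(v.begin(), v.end());           // descending order
    if (shape == Shape::Randomized) {
        std::mt19937 rng(12345);                    // fixed seed: repeatable runs
        std::shuffle(v.begin(), v.end(), rng);
    }
    return v;
}
Randomized data exercises the average case; presorted and reverse-sorted data hit the classic pathological cases (naive quicksort pivot selection, for instance).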
Quote:
Also, just about all the pages mentioned the algorithm's asymptotic execution time at some point. I just didn't quote that part.
My point wasn't that they didn't cite the O notation, but that they didn't use the word 'theoretically'. This is the problem with computer science majors: they think in terms of what would work best on ideal hardware, where there are no latencies and everything runs without delay.
Quote:
Quote:
O(n log n) may be true when n is 10, 1000, 10000. But when n is a billion, or a hundred billion, you start seeing that the timing is no longer what you expected. You will start getting 10n log n, 1000n log n, and so forth. Why is this? What are the computer science majors missing here?
By definition, high values of N make complexity bounds more important, not less. In this particular case, passing some particular value of N turns the problem into an external sort, which is a considerably different problem. Quicksort's locality of reference still gives it good performance in many configurations. Your optimal algorithm might well change with the next disk upgrade. I've explained how merge sort's main advantage is going to disappear when random disk accesses become as fast as sequential ones.
Your first sentence shows you're not thinking about this practically. I can guarantee you that your "worst case" timing will get dramatically worse with a large enough dataset, because the constant hiding inside that O(n log n) balloons every time the working set spills out of another level of the memory hierarchy. You don't have to push the envelope very far to see this yourself. I've seen this firsthand.
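If you want to watch it happen on your own box, here's a minimal harness (my own sketch for this post, not the test I actually ran): it times std::sort at growing sizes and prints the effective constant in front of n log n. Watch it climb as the working set falls out of L1, then L2, then RAM.
Code:
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    for (std::size_t n = 1 << 12; n <= (std::size_t(1) << 26); n <<= 2) {
        std::vector<int> v(n);
        for (int& x : v) x = static_cast<int>(rng());

        auto t0 = std::chrono::steady_clock::now();
        std::sort(v.begin(), v.end());
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        // The "constant" in O(n log n): flat in theory, not in practice.
        std::printf("n=%10zu   ns/(n log2 n) = %.3f\n",
                    n, ns / (n * std::log2(double(n))));
    }
    return 0;
}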
Ironically, you bring up one of the very points that I find crucial, and you've taken the side that I do. What algorithm is ideal, and how will it be ideal 5 years from now? 10? 20? There's no way you can program for the future. Take the change from Pentium III to Pentium 4, for example. The Pentium 4 had design flaws (a topic for another time) that made it very difficult to write code for and broke compatibility with the tricks people had used for years to optimize for the Pentium, Pentium II and Pentium III. Intel saw the flaws and treated them as though they were a feature of the new chip, writing carefully worded optimization docs that tried both to work around the flaws and to keep them from being readily apparent.
This is one of the many things I'm working to solve. In the past decade, we've seen no fewer than eight x86 SIMD extensions in the MMX line (MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE5). This is ridiculous. Intel and AMD keep having fights that have no bearing on what's best for the future of the x86 architecture.
I think the solution is a hypervisor that runs at the lowest possible level and emulates the instructions from the numerous MMX-line extensions, as well as any other extraneous instructions, combined with a stripped x86 chip containing perhaps only the 586 instruction set. This is a project I'm involved in right now, working with fellow programmer and computer engineer Darek Mihocka (http://www.emulators.com).
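To make the idea concrete, here's a toy illustration (entirely my own sketch for this post, not code from the project): the stripped core traps an instruction it doesn't implement, and the hypervisor emulates it in software. Take SSE2's PADDB, a packed add of 16 bytes:
Code:
#include <cstdint>

// Software model of an XMM register (hypothetical, for illustration only).
struct XmmReg { uint8_t b[16]; };

// Emulate PADDB xmm_dst, xmm_src: 16 independent byte adds with
// wrap-around, which is what the real instruction does in one shot.
void emulate_paddb(XmmReg& dst, const XmmReg& src) {
    for (int i = 0; i < 16; ++i)
        dst.b[i] = static_cast<uint8_t>(dst.b[i] + src.b[i]);
}
The hard part, of course, is the trap-and-decode plumbing and making it fast enough to be usable; the point is only that every one of those extensions is expressible in baseline code.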
Quote:
The computer science majors are trying to write code that will still work efficiently in ten years, or twenty, when all the hardware looks different and all the thresholds have changed. How's that optimized code for MS-DOS holding up? All that hand-tuned assembly language to write to the I/O ports directly and swap in overlays from disk really comes in handy, doesn't it? What's that? On modern machines, you have to run the program in an emulator and you've slowed the program down so much that it stutters? Those assumptions about the hardware you're running on are so completely outdated that it would be easier just to start over from scratch?
Quote:
Quote:
They can’t tell you the difference between an AMD Athlon XP and an Intel Pentium 4. They can’t explain why the new Core 2 is such a good thing.
Any optimization you make specifically to speed up performance on the Core 2 processor will be counterproductive five years from now. If you're writing a console game or embedded app, designed to run only on identical platforms for the rest of its life cycle, you can make more assumptions. If real-time response is a priority, then you also need to make whatever trade-offs you have to today on each individual platform.
You're partially correct. First of all, yes, computer science majors have written algorithms that will theoretically stand the test of time. I'm not calling for hand-tuned assembly, though; that was an interesting conclusion for you to draw. I'm saying you need to optimize for the here and now. There are many ways to do this in C and C++ without writing a single line of assembly. Most of the time, it's a matter of learning how the compiler optimizes and coaxing it into doing better than it otherwise would.
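Here's one example of the kind of thing I mean (a sketch I wrote for this post, not code from the challenge): restructuring a floating-point reduction into several independent accumulators. The compiler won't reassociate floating-point math on its own, because that changes rounding, so breaking the serial dependency chain by hand lets the CPU's pipelines overlap the additions.
Code:
#include <cstddef>

// Naive reduction: one serial dependency chain; each add has to wait
// for the previous one to retire.
double sum_naive(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Four independent chains: the adds can run overlapped in the pipeline.
double sum_unrolled(const double* a, std::size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];   // leftover tail
    return (s0 + s1) + (s2 + s3);
}
Not an assembly instruction in sight, and on most pipelined chips the second version wins handily.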
But why am I telling you to optimize for current hardware? I very carefully named this challenge a computer engineering problem. It's not about trees, linked lists and Java. It's about fine-tuning something to work as fast as possible on modern hardware, recognizing hardware constraints and overcoming them. That's what engineering is.
The problem could just as easily have been stated in early-1980s terms: you have a floppy disk full of a few tens of thousands of integers and an Apple II with, say, 48K of RAM; sort the data on the floppy disk. The issue isn't how large the data set is. The issue is how you sort when the dataset is much larger than your "fast memory", which forces you to write an algorithm that takes real-world resource limitations into account. Real software solutions have to deal with such issues.
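To show what taking those constraints into account looks like, here's a bare-bones external merge sort (my own sketch; file names and the run size are placeholders, and error handling is omitted): sort the data in RAM-sized runs, then stream-merge the runs so every disk access is sequential.
Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// Phase 1: read the input in RAM-sized chunks, sort each, write a run.
std::vector<std::string> make_runs(const char* input, std::size_t run_len) {
    std::vector<std::string> runs;
    FILE* in = std::fopen(input, "rb");
    std::vector<uint32_t> buf(run_len);
    std::size_t got;
    while ((got = std::fread(buf.data(), sizeof(uint32_t), run_len, in)) > 0) {
        std::sort(buf.begin(), buf.begin() + got);
        std::string name = "run" + std::to_string(runs.size()) + ".tmp";
        FILE* out = std::fopen(name.c_str(), "wb");
        std::fwrite(buf.data(), sizeof(uint32_t), got, out);
        std::fclose(out);
        runs.push_back(name);
    }
    std::fclose(in);
    return runs;
}

// Phase 2: k-way merge of the runs with a min-heap; sequential I/O only.
void merge_runs(const std::vector<std::string>& runs, const char* output) {
    struct Head { uint32_t val; std::size_t which; };
    auto cmp = [](const Head& a, const Head& b) { return a.val > b.val; };
    std::priority_queue<Head, std::vector<Head>, decltype(cmp)> heap(cmp);
    std::vector<FILE*> fps;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        fps.push_back(std::fopen(runs[i].c_str(), "rb"));
        uint32_t v;
        if (std::fread(&v, sizeof v, 1, fps[i]) == 1) heap.push({v, i});
    }
    FILE* out = std::fopen(output, "wb");
    while (!heap.empty()) {
        Head h = heap.top(); heap.pop();
        std::fwrite(&h.val, sizeof h.val, 1, out);
        uint32_t v;
        if (std::fread(&v, sizeof v, 1, fps[h.which]) == 1)
            heap.push({v, h.which});
    }
    for (FILE* f : fps) std::fclose(f);
    std::fclose(out);
}
The run size is whatever fits in your "fast memory": a few thousand integers on the Apple II, hundreds of millions on a modern server. The algorithm doesn't change; the tuning does.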
Quote:
Quote:
Many of them can’t even read an x86 disassembly or tell me the first thing about how many registers a Pentium processor has or what the registers are for.
All I can say is that this was certainly a required part of the curriculum my sophomore year. Any 32-bit assembly language you write will immediately break on a 64-bit operating system, though, or on any hardware other than a PC, or at the very least hold up the port until your replacement rips it out and replaces it with high-level code. Even if it doesn't, a completely different set of instructions will be optimal on the next CPU that comes down the pike.
Your argument seems to be repeating itself. Again, we're talking computer engineering.
Quote:
Quote:
From my own testing, the fastest sort (tested on a data set of 100MB in RAM) has been MergeSort. QuickSort, HeapSort, and all the other sorts all finish in about 10-15 seconds. CombSort takes 20 seconds, and BubbleSort could take ages. MergeSort? 3 seconds.
What happens when you compare an external merge sort to my method for converting the external sort back into an internal one?
One data point, by itself, is meaningless, particularly since you're trying to determine the 'optimal' external sort by benchmarking an in-memory sort and your box probably doesn't look much like a server. In the first challenge, I easily got this much of a speedup over your entry just by compiling my program for a 64-bit processor. When I tried to do that for your program, it broke. There wasn't some minor fix for this. Your approach inherently could not scale up in any practical way because you hadn't thought about what would happen if you tried to run it on different input. Even on a 32-bit system, your program failed on valid input because you'd focused on optimizing the speed of one arbitrary benchmark on one box at the cost of correctness. I'm sorry to hear that you chose that approach deliberately. You did acknowledge that the abstract approach at least works.
One data point? I've tested on varying data sizes, something I agree I didn't mention here, but I did tell you that it was tested on 1) randomized, 2) sorted, and 3) reverse-sorted data. Again, that should find the worst case in ANY sort.
Which program are you referring to? I've compiled and run my primality test programs on 64-bit machines and have had no problem getting them working. If you find the code, PM it to me, and I'll PM it back with full 64-bit support. But that's not a priority, because you have missed the point here. The primality test program was a very practical real-world test of optimization, just as this challenge is. The point is not whether it's cross-platform, object-oriented, and 'elegant'; it's how fast it works on the hardware you're given.
You seem to think that optimizing for specific hardware is a terrible idea. I don't understand why; it's easily adaptable to whatever hardware comes your way in the future. Computer science majors and C++ programmers are a dime a dozen. At a recent job fair downtown here in Seattle, thousands of C++ programmers lined up with signs saying things like "HIRE ME, C++ PROGRAMMER". Why do you think they had to do this? Because the job market is flooded with them, and they're not valued. C++ programmers are very often hired for a single job and then laid off. Computer engineers are harder to find and do much better at optimization, and at software engineering in general.
You also seem to think that just because I say we should take a computer engineering approach, we should throw our algorithms out the window. Computer engineering isn't exclusively about how hardware works; it's a combination of electrical engineering and computer science. Understand how the hardware works, know what algorithms you can use, and optimize the algorithm to work best on the given hardware.
Let's put this challenge into better perspective. Your boss has given you the task of sorting 100GB of data. Now, the computer science approach may finish in 10-16 hours. That's fine if you're running it as an end-of-day job, where you can leave it going overnight and have the results in the morning. But what if you must have it sooner? This is where computer science becomes useless.
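Do the back-of-envelope math (round numbers of my own choosing): a disk that streams at roughly 60 MB/s covers 100GB in about half an hour per pass (100,000 MB ÷ 60 MB/s ≈ 28 minutes). A 10-16 hour run is therefore the equivalent of twenty-plus full passes over the data, which in practice means the drive is spending most of its time seeking instead of streaming. Keep the accesses sequential and the pass count down, and the same job finishes before lunch.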