 Post subject:
PostPosted:Thu Nov 15, 2007 7:31 pm 
 

Joined:Sun Jun 10, 2007 11:41 am
Posts:344
Location:Ninjaville
So a computer with a solid state HDD would theoretically benefit more from multithreading?


 Post subject:
PostPosted:Thu Nov 15, 2007 7:52 pm 
Literally Nine
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
So a computer with a solid state HDD would theoretically benefit more from multithreading?
The HD throughput is the bottleneck even in non-multithreaded solutions to this problem.

_________________
- Tycho



 Post subject:
PostPosted:Thu Nov 15, 2007 11:31 pm 
 

Joined:Thu Aug 18, 2005 8:19 pm
Posts:36
AOL:Fattierob
Location:f(x)
Well, yes - obviously, past a certain point threading becomes more of a burden than a helpful tool. That's why I suggested two threads and not more.

In all honesty, I'm more curious about your thoughts on my sorting method. I'd like to know if there's a name for it; I don't remember where I read about it, but it stuck in my mind.


 Post subject:
PostPosted:Sun Dec 02, 2007 5:03 pm 
 

Joined:Sat Dec 01, 2007 3:17 am
Posts:4
How about binary search trees? They are very good for sorting and searching because their structural properties let your algorithms make assumptions that keep things quick. In this example you can get a reasonably balanced tree by choosing a root name that starts with a letter near the middle of the alphabet, so searching is O(log n). Sorting, on the other hand, isn't quite as good when you factor in the overall time, especially on a small data set. BUT since this is a really big data set, you will probably have to read the data in incrementally, in which case the BST approach works well, since incremental sorting is what BSTs are best at (and you can read in chunks equal to the amount of memory you have available, thus avoiding paging). Then you just write each chunk into the output file stream as you finish it and start on the next chunk.

The other advantage of the BST method is that it allows a lot of feature expansion, like adding the ability to search for a name quickly. My guess on performance:

If the dataset fits into real memory, then static sorting wins any day.
If the dataset does NOT fit into memory, then this BST method wins.

I haven't done a lot with threading, and it's important now that multicore CPUs are more prevalent, but my guess is that you can make the BST method thread-safe more easily than static lists.
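A minimal sketch of the incremental idea in Python, assuming each entry is a single text line keyed by the whole line; the names used here are hypothetical, and it uses a plain unbalanced BST rather than the middle-of-the-alphabet root trick mentioned above:

Code:
class Node:
    """One entry in a plain (unbalanced) binary search tree."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the BST rooted at root; return the (possibly new) root."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                return root
            cur = cur.right

def write_in_order(root, out):
    """Iterative in-order traversal: emits the keys in sorted order."""
    stack, cur = [], root
    while stack or cur:
        while cur:
            stack.append(cur)
            cur = cur.left
        cur = stack.pop()
        out.write(cur.key)
        cur = cur.right

def sort_chunk(lines, run_path):
    """Build a BST from one memory-sized chunk of lines, then flush it
    to its own sorted output file."""
    root = None
    for line in lines:
        root = insert(root, line)
    with open(run_path, "w") as out:
        write_in_order(root, out)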


 Post subject:
PostPosted:Tue Dec 04, 2007 1:41 am 
Literally Nine
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Okay, I'm overdue for posting my solution to this problem. I wish I'd written my solution up earlier because it's going to take me a while to write it all up. But here goes...

So you've got a 100GB file: 100,000,000 1KB entries. The contents of these entries are truly irrelevant, because we aren't asking what specialized method is best for sorting a massive phone book (which is what my original post specifies the dataset to be), but how to deal with the massive amount of data.

Sorting the file in place would be extremely time consuming because of disk seek time among other latencies. If you don't care about how long the sort takes, then you could certainly sort in place, as it would WORK, it'd just be incredibly SLOW. So for performance reasons, we're not going to use an in-place sort.

It's not possible to fit 100GB of data into system RAM because of hardware and software limitations. 32-bit machines can address at most 4GB of memory, and you also have to account for the overhead of running an operating system, among other things. So at best we're dealing with 1GB or 2GB chunks of this 100GB file.

Note first of all that the solution I'm about to present is not one I came up with. It is, however, a real-world solution to a real-world problem. Consider Google, for instance, which has basically indexed the entire internet on their servers. That's a massive quantity of data, and they have to face this sort of challenge on a regular basis.

This very challenge was given to me by a friend of mine who recently left Microsoft to work on his own emulation projects and his proposed redesign of the AMD64 architecture.

So finally, let's look at the ideal solution:

Read the first 1GB of the file. Sort it in RAM (the optimal sort can be figured out on a slightly smaller scale first if need be). Write it to a new file on the disk. Repeat until we've finished reading the whole 100GB file.
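A rough sketch of this first pass, assuming fixed-size 1KB entries that sort correctly as raw bytes; the constants and file names below are illustrative, not part of the original challenge:

Code:
RECORD_SIZE = 1024                     # 1KB per entry, per the problem statement
RECORDS_PER_CHUNK = 1_000_000          # roughly 1GB of records held in RAM at once

def make_sorted_runs(input_path):
    """Pass 1: cut the big file into ~1GB chunks, sort each chunk in RAM,
    and write each one out as its own temporary sorted 'run' file."""
    run_paths = []
    with open(input_path, "rb") as src:
        while True:
            chunk = src.read(RECORD_SIZE * RECORDS_PER_CHUNK)
            if not chunk:
                break
            records = [chunk[i:i + RECORD_SIZE]
                       for i in range(0, len(chunk), RECORD_SIZE)]
            records.sort()             # in-memory sort of this chunk only
            run_path = "run_%03d.tmp" % len(run_paths)
            with open(run_path, "wb") as dst:
                dst.writelines(records)
            run_paths.append(run_path)
    return run_paths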

So now we have 100 1GB chunks sorted on the hard disk. Note that I didn't suggest splitting the dataset up into alphabetical buckets or anything like that, because of the potentially lopsided distribution of the data. Splitting the data at exact 1GB boundaries ensures that no one chunk will be larger or smaller than any other. This is particularly advantageous because we can be certain to not exceed physical RAM limitations.

How do we merge the 100 1GB chunks? This is another interesting problem, because there are many ways to do it, but few are optimal. There's a great data structure called a min-heap that you can use for this task. The primary property of a min-heap is that when you pop data off the top of it, it's always guaranteed to be the lowest-valued item. To make disk reads from the 1GB chunks optimal, I would suggest reading the first 10,000 entries from each chunk and adding them to the heap. Just keep popping entries off the heap and caching them (for a write to disk later) until you're out of entries read from a given 1GB temporary file. This is where you'd read the next 10,000 entries from that file and add them to the heap. Keep doing this until you've reached the end of all the 1GB chunks.
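A sketch of this merge pass, using Python's heapq as the min-heap and reusing RECORD_SIZE from the sketch above; the batch size and helper names are assumptions for illustration:

Code:
import heapq

BATCH = 10_000                         # entries read per run per refill, as above

def read_batch(f):
    """Read the next BATCH fixed-size records from an open run file."""
    data = f.read(RECORD_SIZE * BATCH)
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]

def merge_runs(run_paths, output_path):
    """Pass 2: k-way merge of the sorted runs through a min-heap."""
    files = [open(p, "rb") for p in run_paths]
    heap = []
    pending = []                       # entries from each run currently in the heap
    for idx, f in enumerate(files):
        batch = read_batch(f)
        for rec in batch:
            heapq.heappush(heap, (rec, idx))
        pending.append(len(batch))
    with open(output_path, "wb") as out:
        while heap:
            rec, idx = heapq.heappop(heap)   # always the smallest remaining entry
            out.write(rec)             # in practice, buffer and write large blocks
            pending[idx] -= 1
            if pending[idx] == 0:      # the heap has drained this run's batch
                batch = read_batch(files[idx])
                for r in batch:
                    heapq.heappush(heap, (r, idx))
                pending[idx] = len(batch)    # stays 0 once the run is exhausted
    for f in files:
        f.close()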

Voila, a 100GB sorted file. Let me review the ways in which this is optimal:
  • Disk latency is minimized by reading and writing large chunks at a time. Small reads and writes have longer latencies.
  • RAM is utilized to the best of our ability. On a 64-bit system with a lot of RAM, we could expand the chunks to be larger, but to best illustrate real-world limitations, we've used a 32-bit machine.
  • We read through the 100GB file only once (no pre-scan to build a distribution table and then a second read to sort it for real or anything like that).

_________________
- Tycho



 Post subject:
PostPosted:Tue Dec 04, 2007 2:03 am 
External Project Staff
 

Joined:Sun Oct 30, 2005 3:40 pm
Posts:371
Website:http://idlesoft.net
Location:~/
This is not what I expected the answer to be. This solution is just a practical way to sort 100GB of data quickly when you have a limited amount of RAM.

I expected, seeing as these are programming challenges, to come up with some sort of program, the fanciest algorithm, or something like that. I think the question was a bit too open; you should have narrowed it down a bit and described the problem a bit more. There is that, or I'm just stupid. I think both are true.

This solution is exactly what I use in my encryption programs. Just cache, encrypt, write. Or with sockets: cache, send, retrieve, write.

_________________
-- ChaosR



 Post subject:
PostPosted:Tue Dec 04, 2007 5:27 pm 
Literally Nine
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
This is not what I expected the answer to be. This solution is just a practical way to sort 100GB of data quickly when you have a limited amount of RAM.

I expected, seeing as these are programming challenges, to come up with some sort of program, the fanciest algorithm, or something like that. I think the question was a bit too open; you should have narrowed it down a bit and described the problem a bit more. There is that, or I'm just stupid. I think both are true.

This solution is exactly what I use in my encryption programs. Just cache, encrypt, write. Or with sockets: cache, send, retrieve, write.
From my first post:
Quote:
What's the most efficient way possible to sort all the entries (to a separate file)?
I don't care how "fancy" or "elegant" a solution is; I'm asking for efficiency, pure and simple. Yes, it could have been narrowed down, and I could have guided you directly to the answer I was looking for, but that wouldn't help you think about the problem on your own, consider the potential solutions, and weigh them appropriately.

This problem is about hardware limitations and how to deal with them in an efficient manner, not how to sort a phone book.

I hope this helps clear it up.

_________________
- Tycho



 Post subject:
PostPosted:Wed Dec 05, 2007 1:20 am 
External Project Staff
 

Joined:Sun Oct 30, 2005 3:40 pm
Posts:371
Website:http://idlesoft.net
Location:~/
Quote:
From my first post:
Quote:
What's the most efficient way possible to sort all the entries (to a separate file)?
The problem is, that could mean anything. It could also mean, as I first assumed, how to sort them in such a way that the lookup time is reduced.

It is probably just me being stupid, but I ask you to give more information next time, like: "We have file A and we need to get it into file B. In file B, all entries are sorted alphabetically by name. But file A is 100GB and has 100,000,000 entries of 1KB each. We only have 2GB of RAM, of which more than 1GB is free. What would be the fastest way possible with modern-day computers to sort this?"

Not a lot of information, but it does make clear what kind of solution you want. It's like asking: "What do we know about power?" There are a million answers to that, and yet my physics teacher keeps asking me questions like that.

I do not blame you; I'm just asking you to keep my stupidity in mind next time.

_________________
-- ChaosR



 Post subject:
PostPosted:Wed Dec 05, 2007 12:40 pm 
Literally Nine
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
If I'd stated all the hardware limitations, then people would have taken the hint. I wanted to see how many people would actually consider the practicality of sorting 100GB on modern hardware.

_________________
- Tycho



 Post subject:Re: Challenge Three: A Computer Engineering Problem
PostPosted:Sat May 24, 2008 3:48 pm 
 

Joined:Mon Jul 03, 2006 7:02 am
Posts:16
Tycho: Please correct me if I'm wrong, but your solution will not be able to sort the file contents. If you split the 100GB file into pieces and sort them separately, the final merge will be unsorted again. Here's a short example:

File1
John Smith
Paul Easter
Hank White
.....

File2
Danny Zope
Alan Poe
Jim McAllester
.....

Ok lets sort the files by last name:

File1
Paul Easter
John Smith
Hank White

File2
Jim McAllester
Alan Poe
Danny Zope

and now lets merge it:

FileMerged
Paul Easter
John Smith
Hank White
Jim McAllester
Alan Poe
Danny Zope


As you can see, the final result is unsorted again. Did I misunderstand something in your solution post?

Regards
Maik


 Post subject:Re: Challenge Three: A Computer Engineering Problem
PostPosted:Tue May 27, 2008 11:29 am 
Literally Nine
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
The merging process isn't a mere concatenation. You put the data into a min heap and the element you pop off the top is always the minimum element. So let's say you put these names into the min heap:

Paul Easter
John Smith
Hank White

Jim McAllester
Alan Poe
Danny Zope

The first to be popped off the min heap is Paul Easter, then Jim McAllester, then Poe, Smith, White, and Zope.

Since there's still the issue of having too much data in memory if you load all the sorted segments at once, you just grab the first X entries from each of the individual sorted files, and then, when the min heap runs dry of entries from any one file, you read the next X entries from that file to refill the heap.
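For the six names above, a few lines with Python's heapq show the pop order; keying on the surname via a simple split is an assumption for illustration:

Code:
import heapq

names = ["Paul Easter", "John Smith", "Hank White",
         "Jim McAllester", "Alan Poe", "Danny Zope"]

# Key each entry by surname; heapify turns the list into a min heap.
heap = [(name.split()[-1], name) for name in names]
heapq.heapify(heap)

while heap:
    _, name = heapq.heappop(heap)
    print(name)
# Prints: Paul Easter, Jim McAllester, Alan Poe, John Smith, Hank White, Danny Zope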

_________________
- Tycho


