Last visit was: It is currently Tue Dec 03, 2024 12:01 am


All times are UTC-05:00




Post new topic Reply to topic  [56 posts ] 
Author Message
 Post subject:Challenge Three: A Computer Engineering Problem
PostPosted:Sun Oct 07, 2007 2:20 pm 
Literally Nine
User avatar
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
I apologize for not having graded Challenge Two yet. I'll get around to it sometime. Just not soon enough.

However, people have asked when Challenge Three is starting, so I'm giving a very practical problem. I won't give any hints about it, because I want everyone to try to come up with the most efficient solution they can. I am not asking for actual code segments for this, though if you want to write a code segment to explain something or whatnot, I don't mind. I would prefer the actual solution you have be presented clearly and concisely.

Okay, here's the challenge (challenge closes November 30, 2007):

Let's say you have a phone book stored on your hard drive. 100,000,000 entries, 1K each. This means the file is roughly 100GB in size. These entries are not sorted in any fashion though, so it's basically 100GB of data that needs to be properly organized. What's the most efficient way possible to sort all the entries (to a separate file)?

_________________
- Tycho

Image


Top
Offline  
 Post subject:
PostPosted:Sun Oct 07, 2007 2:23 pm 
Connoisseur of the Godawful
User avatar
 

Joined:Tue Mar 01, 2005 9:00 am
Posts:456
ICQ:286315965
Website:http://rabidtinker.mine.nu/
Yahoo Messenger:alistair_lynn
AOL:Agent_Vast@mac.com
Location:127.0.0.1
Get the file size, mmap it, create another file of that size, mmap that for writing also, memcpy, in-place MergeSort.

Do I win a prize?

_________________
Alastair Lynn / Alumnus / Onlink Team


Top
Offline  
 Post subject:
PostPosted:Sun Oct 07, 2007 3:45 pm 
Literally Nine
User avatar
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
Get the file size, mmap it, create another file of that size, mmap that for writing also, memcpy, in-place MergeSort.

Do I win a prize?
No.

_________________
- Tycho

Image


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 8:57 am 
Literally Nine
User avatar
 

Joined:Tue Mar 01, 2005 9:00 am
Posts:1263
Methinks I know why you gave the date you did. Kek.

I should note that as a grader (probably) I will be grading for search-by-name time as well as a list-in-order time.

I'll give a hint; there's a very specific struct to use with this.


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 10:17 am 
Connoisseur of the Godawful
User avatar
 

Joined:Tue Mar 01, 2005 9:00 am
Posts:456
ICQ:286315965
Website:http://rabidtinker.mine.nu/
Yahoo Messenger:alistair_lynn
AOL:Agent_Vast@mac.com
Location:127.0.0.1
Ten-way tree!

_________________
Alastair Lynn / Alumnus / Onlink Team


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 6:14 pm 
Literally Nine
User avatar
 

Joined:Tue Mar 01, 2005 9:00 am
Posts:1263
<prophile> Why two when you have have a foursome? Or a tensome?

Anyone else hearing this? Prophile likes orgies?


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 6:21 pm 
Literally Nine
User avatar
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
I'll give a hint; there's a very specific struct to use with this.
Just talked with you about this... You're partially right. There's a VERY specific way to solve this. But there's no one data structure to do it.

_________________
- Tycho

Image


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 6:58 pm 
Literally Nine
User avatar
 

Joined:Tue Mar 01, 2005 9:00 am
Posts:1263
Quote:
Quote:
I'll give a hint; there's a very specific struct to use with this.
Just talked with you about this... You're partially right. There's a VERY specific way to solve this. But there's no one data structure to do it.
No common data structure, depending on how you answer the question I left you on Skype.


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 9:09 pm 
 

Joined:Thu Aug 09, 2007 9:09 am
Posts:26
Thanks for posting this; I see that I never did post my latest solution to challenge #2.

As for this, my solution would be:
  • Allocate an array large enough to hold the entire data set
  • Run quicksort (or your favorite sorting algorithm) on said array
  • Print the array to the new file.
Since the system can handle the original data file, it can potentially allocate that much virtual memory, or manipulate a file as if it were an array in memory. Maybe, because it's 100 gigabytes long, you can't literally keep the entire array in memory, and you end up doing a lot of seeking instead. That makes not one whit of difference to the algorithm itself.

One optimization that might allow you to store the full data set in memory is to index and hash the entries (in linear time) and sort the hashes. You could then print out the entries in the original file in the order they appear in the sorted array.


Top
Offline  
 Post subject:
PostPosted:Mon Oct 08, 2007 9:44 pm 
 

Joined:Thu Aug 09, 2007 9:09 am
Posts:26
To clarify what I mean, let's say you have the following entries:

Arnold Palmer (1048 bytes of data)
Eldrick Woods (1020 bytes of data)
Jack Nicklaus (1014 bytes of data)
Greg Norman (1080 bytes of data)

In linear time, we could make a pass through that file, and generate the following index:

PALMER ARNOLD, offset 0, length 1048
WOODS ELDRICK, offset 1048, length 1020
NICKLAUS JACK, offset 2068, length 1014
NORMAN GREG, offset 3082, length 1080

If we allow 14 characters for the name, 8 bytes for the file offset, and 2 bytes for the record length, we need 24 bytes per record, or a total of 2.4 GB of memory, for the array. A modern system will have this. We can now order any two entries with a constant-time lexicographic comparison of both 14-character name fields. (Only if these are identical will we need to look up the full names in the original records, which we can do from the offsets.) We can now quicksort or introsort this array of structures, into:

NICKLAUS JACK, offset 2068, length 1014
NORMAN GREG, offset 3082, length 1080
PALMER ARNOLD, offset 0, length 1048
WOODS ELDRICK, offset 1048, length 1020

From these lengths and offsets, reassembling the original file in sorted order is a linear-time operation.


Top
Offline  
 Post subject:
PostPosted:Wed Oct 10, 2007 1:01 pm 
External Project Staff
User avatar
 

Joined:Sun Oct 30, 2005 3:40 pm
Posts:371
Website:http://idlesoft.net
Location:~/
just dump everything in a mysql database and index it, done, oh wait, this are C++ challenges, heh

EDIT: if you need a more serious entry, I would go for the memory efficient way. built a file with hashes and the location in the file, then if you need a name, hash the name, lookup the hash in the hash table, find the location in the file, and grab the data from that location. You could also organise the hashes a bit, like if the hash is ABCDEF:
/hashes/A/B/C.txt, to make it easier to search trough files, or make a file that says: A entries start at place 0, B entries start at place 2345 (for the hashes, ofcourse)

_________________
-- ChaosR

Image


Top
Offline  
 Post subject:
PostPosted:Wed Oct 10, 2007 2:47 pm 
User avatar
 

Joined:Sat Mar 11, 2006 8:28 am
Posts:365
Location:Canada
I only know very basic C++ but, can't you just split the file and index it in pieces?

_________________
Best file compression around: "DEL *.*" = 100% compression
"640K ought to be enough for anybody." - Bill Gates, 1981


Top
Offline  
 Post subject:
PostPosted:Thu Oct 11, 2007 4:37 pm 
Literally Nine
User avatar
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
just dump everything in a mysql database and index it, done, oh wait, this are C++ challenges, heh
That's your response? This isn't a query, it's a freaking sort! Let's say you have the RESULT of a query and your job is to sort it.

_________________
- Tycho

Image


Top
Offline  
 Post subject:
PostPosted:Thu Oct 11, 2007 4:38 pm 
Literally Nine
User avatar
 

Joined:Sat Apr 02, 2005 3:31 pm
Posts:1171
Location:The vicinity of an area adjacent to a location.
Quote:
I only know very basic C++ but, can't you just split the file and index it in pieces?
How do you mean?

_________________
- Tycho

Image


Top
Offline  
 Post subject:
PostPosted:Thu Oct 11, 2007 6:33 pm 
User avatar
 

Joined:Sat Mar 11, 2006 8:28 am
Posts:365
Location:Canada
Well, lets say there is a file, 1000 mb (1 gb),

1. Split (5 pieces). (Or other size so its possible to fit it in ram)
2. Organize by A,B,C...
3. Combine then split again this time into letter groups.
4. Now you have a file for every letter group that is organized into alphabetical order.

Do this again if you want it organized by area code or something.

I have no clue how hard or how complicated the program would be or if its even possible, thats all I can think of for now.

_________________
Best file compression around: "DEL *.*" = 100% compression
"640K ought to be enough for anybody." - Bill Gates, 1981


Top
Offline  
Display posts from previous: Sort by 
Post new topic Reply to topic

All times are UTC-05:00


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group
Theme created by Miah with assistance from hyprnova