Thursday, October 18, 2012

Interview Question at LSSiData

There is a huge database listing of around 100 million records, which is divided into some 200 plain text files, with the record in csv format.

Given 5000 telephone numbers, how to retrieve the records matching those telephone numbers?

Brute-force is what comes up first:
Read the csv files one line at a time (each record is a line in the csv file), and get its telephone number field. Then compare the field with the given telephone number. If there is a match, print the record into that file. Otherwise, discard it.

Shortcoming: Need 5000 passes of the 100 million records, which is a performance nightmare.

How to improve the performance? Hash-table is the answer:
Organize the 5000 telephone numbers into a hash-table (hash the telephone numbers into a table, which is actually an array of keys computed from hashing the telephone numbers). Then, as we examine the record and find a match to the given telephone number, we just append the record to the linked list associated with the hash key. This way, we need only to iterate/traverse the 100 million records for once.

For details of hash-table, please refer to Section 2.9 and 3.3 of "The Practice of Programming" by Brian Kernighan and Rob Pike.

No comments:

Post a Comment