Wednesday, December 19, 2007

Power of SQL

Update: This is a part of a series now called "Power of SQL".

As a former developer at an elite government organization, I had the opportunity to leverage some of this SQL functionality. I had the opportunity to work on a project where they had a text file dump of Driver's license information. They also had a file dump (Select * from all_employees_of_all_employers_in_the_state). These statistics could show that a certain age, sex, ethnicity or even height of a person could help determine the wage of a person for a certain industry. If there was a similar correlation that short people get paid less than tall people, in a similar way that women get paid less than men, that could be a huge discovery to the HGH Industry - but I digress.

This program was to tie some "statistical" driver's license information collected by the state DMV to the wage information already collected under existing labor laws by NWD-DOL. After the program ties this information together OIT was then supposed to make sure the program would remove the "identifying" information (the coveted SSN), and then pass it along to the appropriate department (LMI). The program was inefficient and needed to have it optimized.

I opened the code up and like many programs I worked on, I was not terribly shocked by what I saw. Here is the pseudo-code:
Select * from wage_info Into wages
Select * from dmv_info into dmv
-- The dmv clearly shows that SSNs are not unique and are recycled.
IF dmv.deceased = 0
IF wages.ssn = dmv.ssn
insert into dmv_wage_match
Select * from dmv_wage_match into obfuscate
Set id = 0
BEGIN LOOP obfuscate
set id = id+1
update dmv_wage_match
set ssn = #id#
where ssn = #obfuscate.ssn#
So let me break this down for you. If there are 1,200,000 DMV records and 900,000 wage records, with a 750,000 of them that match ... well, you get the idea, the program was bloated.

Here's what I updated the code to:
INSERT INTO dmv_wage_match
select, d.height, d.weight, d.eye,, w.amount, w.industry_code ...
from dmv_info d
join wage_info w
on d.ssn = w.ssn
where d.deceased = 0
In this case, dmv_wage_match had an identity field that served to obfuscate the actual SSN. This also elegantly avoids the tedious insert / update that was being done in the previous version. This version ran in a few seconds as compared to the previous version that took almost an hour! Now that's an improvement.

This is a VERY simple example, but I continue to be amazed by the power of SQL. The key in my opinion though is knowing what SQL can do and then asking the right question, for example, "How can SQL do some of the heavy lifting for me?".

I hope this helps give some insight into those who are still using a spreadsheet or *tisk tisk* text files and then trying to do something meaningful with them. Yes, they hold data, but so does the phonebook. If you're not convinced yet, tell me, by glancing through the phonebook, what are all of your neighbor's names and phone numbers who share your street name? Please post back how long that took you, I'm sure the readers would be interested in knowing.

No comments: