Monday, October 12, 2009

Building a Sample Data Set

I had reason recently to build a sample data set for testing purposes. A little bit of background is required before I start showing you the code. We have a field representing a stock number. The first 3 characters of the stock number represent the vendor's product line. Before you start bashing me about the data design, this is old, old data and I did not design it. I know that we should have a separate field for the product line but we don’t. Anyway, I wanted to have three random records for each product line.

The data is in Oracle so I used the following as an expression for a column I named rank.

dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank


The rank() and dense_rank() analytical functions provide a way to rank over a grouping. This statement say rank the data grouped by the first 3 characters of the stock number in order by first those 3 characters and then some random value.

So, my full select statement looked like the following:

SELECT id, stk_num, rank
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 ;


This yielded the first 3 random records but what I really needed to do is remove all of the records except for these three random ones. So I just tacked this on to a DELETE statement like the following:

DELETE FROM product where id NOT IN 
(SELECT id
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 );
commit;


This took my test data set from over 222,000 records to less than 3200. This is probably not "technically" a random sampling but it should work much better than what we currently have in place.

Labels: , ,

0 Comments:

Post a Comment

<< Home