Building a Sample Data Set
I had reason recently to build a sample data set for testing purposes. A little bit of background is required before I start showing you the code. We have a field representing a stock number. The first 3 characters of the stock number represent the vendor's product line. Before you start bashing me about the data design, this is old, old data and I did not design it. I know that we should have a separate field for the product line but we don’t. Anyway, I wanted to have three random records for each product line.
The data is in Oracle so I used the following as an expression for a column I named rank.
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
The rank() and dense_rank() analytical functions provide a way to rank over a grouping. This statement say rank the data grouped by the first 3 characters of the stock number in order by first those 3 characters and then some random value.
So, my full select statement looked like the following:
SELECT id, stk_num, rank
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 ;
This yielded the first 3 random records but what I really needed to do is remove all of the records except for these three random ones. So I just tacked this on to a DELETE statement like the following:
DELETE FROM product where id NOT IN
(SELECT id
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 );
commit;
This took my test data set from over 222,000 records to less than 3200. This is probably not "technically" a random sampling but it should work much better than what we currently have in place.
Labels: Oracle, Sample Data, SQL Server
0 Comments:
Post a Comment
<< Home