The data is in Oracle so I used the following as an expression for a column I named rank.
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
The rank() and dense_rank() analytical functions provide a way to rank over a grouping. This statement say rank the data grouped by the first 3 characters of the stock number in order by first those 3 characters and then some random value.
So, my full select statement looked like the following:
SELECT id, stk_num, rank
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 ;
This yielded the first 3 random records but what I really needed to do is remove all of the records except for these three random ones. So I just tacked this on to a DELETE statement like the following:
DELETE FROM product where id NOT IN
(SELECT id
FROM
(
SELECT
product.*,
dense_rank() over (partition by substr(stk_num, 1, 3) order by substr(stk_num, 1, 3), dbms_random.Value) as rank
from product
where format <> 'V'
) r
where rank <= 3 );
commit;
This took my test data set from over 222,000 records to less than 3200. This is probably not "technically" a random sampling but it should work much better than what we currently have in place.
No comments:
Post a Comment