In our last post, we went over one way to get a sample of data. In the end was it right? Heck, no – at least not if we wanted a percentage of rows returned. Now, SAMPLE does work fine if you want a specific number of random rows returned each time.
But, let’s face it – sometimes we will be asked for 10% of a table – especially in this world of Machine Learning. So let’s attempt to find a way to make this happen, shall we?
data:image/s3,"s3://crabby-images/34ae7/34ae79191a290ca626db94496a79ea63887e3644" alt=""
Let’s say that we want 20% of a table. Probably one of the easiest that we can break the table into 100 pieces is to use one of my favorite windowing functions – NTILE.
I’m sure that you’ve used it in SQL Server and it works the same way here in Snowflake.
SELECT *,
NTILE(100) OVER (ORDER BY EMPLOYEE_ID) AS NumberOfPieces
FROM public.employees;
data:image/s3,"s3://crabby-images/46c84/46c84df71463f3712a9e3ef4e4a7744353d93428" alt=""
290 rows – 100 Pieces – as shown by the number directly following NTILE in our function. NTILE does require that we have an ORDER BY after it, so we have one listed. In this case, I ordered the pieces by employee id, but that’s not going to work in our final product – since it’s so not random.
It just so happens that there’s a function in Snowflake, called RANDOM(), that does exactly what it says it will – provide random numbers.
data:image/s3,"s3://crabby-images/021e7/021e7d014a594fe0769ad056d3122bc4c66709fb" alt=""
Let’s use the RANDOM function in our ORDER BY to generate total randomness in our NTILE function.
SELECT *,
NTILE(100) OVER (ORDER BY RANDOM()) AS NumberOfPieces
FROM public.employees;
All right, now we have 100 different groups showing up. Now we just have to find our “percentage.” This sounds like a case for the WHERE clause!
data:image/s3,"s3://crabby-images/703f5/703f56c25cca257845a2b350d5fe743cd9fc8a16" alt=""
Well, that didn’t go like we expected. We should just be able to tell Snowflake that we need anything in the first 10 pieces and call it a day! But what is this in the error message? “Qualify”?
data:image/s3,"s3://crabby-images/bbe4f/bbe4fd05345a4cc6159519dc78ca2878c611e68e" alt=""
Sweet! Works just like WHERE except not… Truly, you use QUALIFY in the place of a WHERE clause for windowing functions (at least for the qualifying of the data generated by the windowing function.)
And with this, we have a working version of how to get a percentage of rows from a table. There are two things though that I’ll leave for the reader to determine:
- What do you do differently if you’re asked for the top .5 % of a table? Yes, slight changes – hopefully, you won’t have to dust off that old math book to figure it out.
data:image/s3,"s3://crabby-images/6933f/6933f7250d86f117c2fa93266a20039e1ea757e8" alt=""
2. What do you do differently if you want to JOIN another table in your answers? I can think of two ways to get the data you need.
data:image/s3,"s3://crabby-images/dba44/dba44bcfeb08b1037572066ba30c18fc5a2dbd5e" alt=""
And with that, my friends, we have come to the end of yet another post. Hopefully, you’ve learned a few new things when it comes to Snowflake and can easily return a percentage of rows from a table the next time you’re asked.