Finding a Percentage of Rows from a Table

In our last post, we went over one way to get a sample of data. In the end was it right? Heck, no – at least not if we wanted a percentage of rows returned. Now, SAMPLE does work fine if you want a specific number of random rows returned each time.

But, let’s face it – sometimes we will be asked for 10% of a table – especially in this world of Machine Learning. So let’s attempt to find a way to make this happen, shall we?

Let’s say that we want 20% of a table. Probably one of the easiest that we can break the table into 100 pieces is to use one of my favorite windowing functions – NTILE.

I’m sure that you’ve used it in SQL Server and it works the same way here in Snowflake.

SELECT *,
    NTILE(100) OVER (ORDER BY EMPLOYEE_ID) AS NumberOfPieces
FROM public.employees;

290 rows – 100 Pieces – as shown by the number directly following NTILE in our function. NTILE does require that we have an ORDER BY after it, so we have one listed. In this case, I ordered the pieces by employee id, but that’s not going to work in our final product – since it’s so not random.

It just so happens that there’s a function in Snowflake, called RANDOM(), that does exactly what it says it will – provide random numbers.

Let’s use the RANDOM function in our ORDER BY to generate total randomness in our NTILE function.

SELECT *,
    NTILE(100) OVER (ORDER BY RANDOM()) AS NumberOfPieces
FROM public.employees;

All right, now we have 100 different groups showing up. Now we just have to find our “percentage.” This sounds like a case for the WHERE clause!

Well, that didn’t go like we expected. We should just be able to tell Snowflake that we need anything in the first 10 pieces and call it a day! But what is this in the error message? “Qualify”?

Sweet! Works just like WHERE except not… Truly, you use QUALIFY in the place of a WHERE clause for windowing functions (at least for the qualifying of the data generated by the windowing function.)

And with this, we have a working version of how to get a percentage of rows from a table. There are two things though that I’ll leave for the reader to determine:

  1. What do you do differently if you’re asked for the top .5 % of a table? Yes, slight changes – hopefully, you won’t have to dust off that old math book to figure it out.

2. What do you do differently if you want to JOIN another table in your answers? I can think of two ways to get the data you need.

And with that, my friends, we have come to the end of yet another post. Hopefully, you’ve learned a few new things when it comes to Snowflake and can easily return a percentage of rows from a table the next time you’re asked.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.