Today, I want to talk about that fun edge case when you’re having to join a table to itself in Snowflake. Does it happen often? Not unless your architect just hates you.
Let’s use the normal pieces of data that everyone uses for this kind of thing – employee/manager relationships. We have our employee table that we’ve been working off that we’ll play with for this example.
We are going to add a manager id to the employee table and update it as necessary. Hint: I’ll talk about how I do it in a future post, but for now, have fun deciding how to fill in the manager data.
With this data, we can now populate the employee’s id and last name with their manager’s id and last name by joining the employee table to itself.
Now, we can do it the same way we would on SQL Server.
But, where’s the fun in that, right? Simple join that we’ve probably done at least a few times. Let’s do something fun in Snowflake!
WITH cte AS (
SELECT employee_ID, manager_ID, last_name,
sys_connect_by_path(last_name, '~') AS Pathy,
level as lvl
FROM employees
START WITH manager_ID = 0
CONNECT BY
manager_ID = PRIOR employee_id
)
SELECT employee_id, last_name,
manager_id,
replace(split(pathy, '~')[lvl - 1], '"') as ManagerLastName
FROM cte
ORDER BY employee_ID;
Now, there is a lot to unpack here, so let’s go thru some of this goodness.
The “CONNECT BY manager_ID = PRIOR employee_id” portion sets up our actual self join by saying that we want to connect to itself by connecting the manager_id to the employee_id. Note that it could have been written as “CONNECT BY PRIOR employee_id = manager_ID” with no difference in how it works.
The sys_connect_by_path function is used to create a path from the root level to the leaf that we’re on by the field that is listed. In this case, I’ve decided to connect them by using the ~ character. This can be multiple characters if needed.
Notice the level field – even if it’s not used in our table. This comes from using the CONNECT BY join! It tells Snowflake what – you guessed it – level our data is at hierchacly.
The “START WITH manager_id = 0” tells Snowflake that we want to set the root node to be where manager_id = 0.
And one of my favorite functions – split – is used to split apart our data in multiple ways. Look for me to talk about the split function in upcoming blog posts.
And with that, we’ve come to yet another end of a blog post. Hopefully, you’ve seen a few ways to get heirachical data from Snowflake and work with it. You’ve learned quite a few new functions in this post and I hope you come back soon to learn even more! Until next time – connect with us!
1 comment