When we’re working with fields that have websites in them, one of the things that we should be doing with them is to double-check that they’re well-formed and usable. Thankfully, Snowflake has a function just for that.
With the PARSE_URL function, we can check out if everything is properly formed – meaning that we have everything we would need for screen-scraping, pushing data to a site, etc…
Here we tried one of my favorite websites to see what kind of data the function will return for us.
Sadly, in this case, not much of this would be useful. It tells us the host and the scheme. Everything else is rather NULLish…
For our next example, I want to try a fake site on Snowflake.com. This function won’t be able to tell the difference but it will let me see some of these parameters filled in.
We see our host – Snowflake.com – and a path! That path tells me the actual webpage I should goto to use the parameter we were provided! Plus we have the query and port as additional information! Hopefully, we have a lot fewer exclamation points coming up though…
Now since email addresses are truly just another URL, we can do the same kind of thing with email addresses!
The scheme is “mailto” and the actual email address just goes into the path.
Now, let’s try an email that is just not right.
SELECT PARSE_URL('mailto:GoodGigglyWiggly^@NotWorking.') AS Parsing;
Notice that there is a ^ in the username of the email. Add on to that the site isn’t even complete – it is just NotWorking. – no .com, no .edu, no nothing. Let’s try this in Snowflake:
Sigh. It did parse out the email. It just can’t tell us that the email won’t work for sending an email to. Yes, my friends. That is truly what RegEx is for – or it will at least get us closer to seeing if the email is good. If anyone would like to work with the complete regular expression to test, you can find it at this URL. Let me know how it goes!
Now what do we do if we have hundreds (or in our case, 4) websites, emails, and whatnot that we need to see if they’re in the correct format. Then we create a table and use functions! (Heck, we’re data people – what did you expect me to say?)
CREATE OR REPLACE TABLE Public.QuestionableSites
(TheSite varchar(500) NOT NULL);
INSERT INTO Public.QuestionableSites
SELECT 'https://www.sherpaofdata.com/' unION
SELECT 'HTTP://sherpaofdata.COM:1234/ShowMe.PHP?USER=1' UNION
SELECT 'mailto:info@sherpaofdata.com' UNION
SELECT 'info@sherpaofdata.ed'
After we created our table of goodness (and not so goodness) we need to test it!
One of our URLs doesn’t work! How do we determine which one?
So PARSE_URL has a parameter we haven’t been using up until this point! It is called the permissive parameter and only allows 0 or 1. 0 is the default and says to fail completely if the input is invalid. 1 will have it return an object in JSON format.
Now, since we have the permissive parameter to 1, we get everything back and can determine which rows have errors.
And with that, my friends, we have a function that can help us determine if URLs are good to use. Now, please remember, that it will only tell if the URL is set up properly – not if the website or email address is actually real.
Hopefully, everyone will have a great time playing with this function and can get some usage out of it.