A lot of times, you’re asked to find where two different tables have the same data. Thankfully, this is a fairly normal ask.
Sometimes, though, you’re asked to see if they have the same data in 20 – 30 fields! I don’t know about y’all, but I definitely don’t want to have to type in that much.
Yet again – SQL Server to the rescue! We can use a function that is commonly used with security – HASHBYTES!
HASHBYTES is a function that is commonly used in SQL Server to store passwords in a format that stops someone from being able to see data, such as passwords, in cleartext. (Yes, seeing passwords as they would be typed in by a user is bad – even if you have access to everything that the user would see.)
Normally, it would be used like this:
CREATE TABLE dbo.Users
(UserID INT IDENTITY(1, 1) PRIMARY KEY,
FirstName VARCHAR(100) NOT NULL,
LastName VARCHAR(100) NOT NULL,
UserPassword VARCHAR(100) NOT NULL)
INSERT INTO dbo.Users
( FirstName,
LastName,
UserPassword )
SELECT 'Fred',
'Flintstone',
'ThisIsNotAGoodPassword'
INSERT INTO dbo.Users
( FirstName,
LastName,
UserPassword )
SELECT 'Sherpa',
'Data',
HASHBYTES('SHA2_512', 'ThisIsMuchBetter')
When we look at the data in the dbo.Users table, we see the following:
Now, I don’t know about you, but I have no idea what Sherpa’s password is, but Fred’s – I got that all day long and twice on Sundays.
When we are comparing multiple fields – we want to do something similar.
First, we’ll add a field to two of our existing tables that we’ve messed with for some time now:
ALTER TABLE dbo.Person3
ADD PersonHash VARBINARY(2000) NULL
ALTER TABLE dbo.MockPersonData
ADD PersonHash VARBINARY(2000) NULL
Notice that the Hash field is a varbinary – something that most SQL developers will never have worked with too often. This data type is not used often, but can very useful – especially with our new function HASHBYTES.
We then need to populate it with the script below:
UPDATE dbo.Person3
SET PersonHash =
HASHBYTES('SHA1',
CONCAT(FirstName, '|', LastName, '|', Email))
UPDATE dbo.MockPersonData
SET PersonHash =
HASHBYTES('SHA1',
CONCAT(FirstName, '|', LastName, '|', Email))
* A note here for the security-minded. You’ll notice that I’m using the SHA1 algorithm. I would not recommend that for any security uses – even a simple password can be broken within minutes with SHA1, but it definitely serves our purposes for this!
Now we can do a really simple JOIN between these two tables and we have actually found where we match based on FirstName, LastName, and Email!
SELECT *
FROM dbo.MockPersonData MPD
INNER JOIN dbo.Person3 P3 ON p3.PersonHash = mpd.PersonHash
Knocking it down from having to join on 3 fields, much less 10 fields, to only on 1, is a huge time saver for the developers!
The one thing you have to remember to do when using this trick, my friends, is to make sure it is up to date. If not, you could be missing out on lots of data. So, if you’re keeping the hash field in your table, be sure to UPDATE it every time you update any of the fields that make up your hash field.
With that my friends, we have come to the conclusion of yet another blog post. Come back soon when we’ll talk again about strings and some of the glorious work you can do with them.