SENSITIVE DATA! It’s an interesting topic! In this post I’m trying to explain how to hash data to increase security during ETL. Assume that we have sensitive data stored in several secured source systems. The source systems are located in different countries and different regions. As the source systems themselves are secured, how we can cover data security needs during ETL process to read data from source systems and load into staging area? Apart from using secured network infrastructure, VPN, network tunnelling etc. we need to cover data layer security to extract sensitive data. One of the best ways is hashing data when it is extracting from source databases. Hashbytes is a T-SQL function that is available in SQL Server 2005 and later. As you might know there are many hashing algorithms, but, different SQL Server versions are supporting different range of hashing algorithms. For instance SHA1 is supported by SQL Server 2005 and later, but, if you are looking more secure hashing systems like SHA2, 256 (32 bytes) or 512 (64 bytes), you should use SQL Server 2012. Actually the hashbytes function will return null in earlier versions of SQL Server. If you are looking for a higher level of security like SHA3 that is originally known as “Keccak” you should wait for it for a long time as based on my investigations it is not supported even in SQL Server 2014 OR you can write your own SHA3 code OR just rely on some third party codes available on the Internet! So let’s get our hands dirty with using hashbytes in different versions of SQL Server.
SQL Server 2005:
SELECT @@version [SQL Server Version]
, hashbytes(‘SHA1’, ‘123456’) [SHA1]
, hashbytes(‘SHA2_256’, ‘123456’) [SHA2_256]
, hashbytes(‘SHA2_512’, ‘123456’) [SHA2_512]
Let’s run the same query in SQL Server 2008 and see the results:
Again the result for SHA2 is null.
And know we are testing SQL Server 2012:
We will see the same results retrieved from SQL server 2014.
So, the idea is DO NOT LOADING SENSITIVE DATA AT ALL. Consequently, it seems the only way that the data might leak is that somebody sniffs the SQL codes that are retrieving data in memory (note that our assumption is we have a secure network infrastructure). Now we can put our T-SQL code into an “OLE DB Source” component in SQL Server Integration Services (SSIS) and we will have the hashed data (VarBinary) in the staging area.