
Data warehouse database collation: case sensitive or not?

Various cases illustrated
Even today, database installation is not an entirely trivial task, especially for data warehouse databases: all settings need to be considered before you start. Fortunately it gets a lot easier in the cloud, right? But where some cloud vendors claim that their solution removes the need for extensive configuration, what usually happens is that they give you a lot of defaults that you can no longer control. Below the waterline, the same settings still play a role. One setting that can trip you up this way is collation.

The database collation defines how the database sorts and compares text: it determines how two strings are ordered and when they count as equal. Since data warehouses do a lot of key and attribute comparisons when loading data, this is a pretty big deal – much more so than in a standard OLTP system. Not all databases allow you to choose your own collation, but Microsoft SQL Server and Oracle do. On-premise, at least.
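To make the difference concrete, here is a minimal sketch in T-SQL; the collation names are standard SQL Server collations, and the COLLATE clause forces a specific collation for just this comparison:

    SELECT
        CASE WHEN 'Jones' = 'JONES' COLLATE Latin1_General_CI_AS
             THEN 'equal' ELSE 'different' END AS ci_result,
        CASE WHEN 'Jones' = 'JONES' COLLATE Latin1_General_CS_AS
             THEN 'equal' ELSE 'different' END AS cs_result;
    -- ci_result: 'equal', cs_result: 'different'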

Which collation you choose is not a decision to make lightly. It is easy to accept the defaults, but note that once you install a database with a certain collation you cannot change it afterwards: you need to rebuild the database from scratch. Moving data from a case insensitive (CI) database to a case sensitive (CS) one is easy, but the reverse may be quite hard, because “Jones” and “JONES” suddenly refer to the same person instead of two different ones.
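If you ever do have to move a case sensitive key into a case insensitive database, it pays to check for collisions first. A rough sketch, with a hypothetical table and column:

    -- Find key values that differ only in case and would collide
    -- once the database no longer distinguishes them.
    SELECT UPPER(CustomerKey)          AS normalized_key,
           COUNT(DISTINCT CustomerKey) AS case_variants
    FROM   dbo.Customer
    GROUP  BY UPPER(CustomerKey)
    HAVING COUNT(DISTINCT CustomerKey) > 1;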

Various databases handle collation in different ways. I’ll discuss the broader differences between Oracle and SQL Server here. If you have a different database you should check in advance how collation works for your server.

In Oracle, the metadata of the database is not case sensitive by default: collation applies to the data, not the metadata. This ensures that "SELECT * FROM myschema.table" is the same as "select * from MYSCHEMA.TABLE". The data in the tables, however, *is* case sensitive by default. You can change this on a query-by-query basis, and per session since Oracle 10g, but it’s rarely done because the wrong settings have some pretty horrendous implications for performance. For most applications these defaults make a lot of sense.

Once you’re used to Oracle’s way of doing things, SQL Server will come as a bit of a shock. On SQL Server you can install the database to be either case insensitive or case sensitive, and you can fine-tune the collation further to be sensitive to diacritical characters (accents) or not. The default settings depend on the system locale: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143508(v=sql.105). Microsoft recommends keeping the collation settings the same for all databases in the organisation, which is a wise recommendation. It fails to take into account the situation where the company has multiple database platforms, however, one of them perhaps being Oracle. And even if SQL Server is your only on-premise database, you will still receive files from external vendors. These may come with their own expected collations, for instance when they identify records with a key that treats Microsoft and MICROSOFT as different companies. That is relatively easy to identify and deal with; it becomes a lot harder when you have hashed or encrypted fields where case turns out to be important. It may be a while before you spot the issue.
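One way to handle such a vendor file in an otherwise case insensitive SQL Server database is to override the collation for just the key column of the staging table. A sketch, with hypothetical table and column names:

    CREATE TABLE stg.VendorCompanies (
        CompanyKey  VARCHAR(50)   COLLATE Latin1_General_CS_AS NOT NULL,
        CompanyName NVARCHAR(200) NULL
    );
    -- Comparisons and joins on CompanyKey now treat 'Microsoft' and
    -- 'MICROSOFT' as different values, whatever the database default is.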

Going for a cloud solution does not solve the issue. As described in https://azure.microsoft.com/en-us/blog/working-with-collations-in-sql-azure/, Azure SQL Database uses a default collation called "SQL_Latin1_General_CP1_CI_AS": the western Latin alphabet, case insensitive, but accent sensitive. If your own database was case sensitive, you are now in trouble.
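Before migrating, it is worth checking which collation you are actually running at the server, database and column level. In SQL Server and Azure SQL this can be done with the standard metadata functions:

    SELECT SERVERPROPERTY('Collation')                AS server_collation,
           DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS database_collation;

    -- Per-column collations (dbo.Customer is a hypothetical table):
    SELECT name, collation_name
    FROM   sys.columns
    WHERE  object_id = OBJECT_ID('dbo.Customer')
      AND  collation_name IS NOT NULL;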

Going case insensitive makes a lot of sense for a single application: it doesn’t have to deal with any system other than itself, and if you need to import data, you can wrangle it into the desired shape. Users will probably expect that typing “microsoft” will also find “Microsoft” and “MICROSOFT”. But when you deal with a data warehouse, things change. You can and will receive input from various systems, each with its own ideas about collation, alphabets or the treatment of accents.

Case sensitive collations are also a requirement for any permanent data store with heterogeneous inputs (aka a data warehouse): they prevent hashed or encrypted fields from suddenly becoming “equal” when they are in fact different. When looking up hashed keys in a Data Vault 2.0 approach this becomes a rather important point, because the input to a hash function can be case sensitive and the stored output can be case sensitive as well. Your best option is to treat both as case sensitive, and if case turns out not to matter, enforce a single case either during loading or with a view, as in the sketch below.
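A sketch of the “enforce the case during loading” option, assuming a hypothetical staging table and business key column; the hash input is upper-cased and trimmed so that case differences in the source can no longer produce different hash keys:

    SELECT CONVERT(CHAR(64),
               HASHBYTES('SHA2_256',
                         UPPER(LTRIM(RTRIM(CustomerBusinessKey)))),
               2) AS CustomerHashKey
    FROM   stg.Customer;
    -- Store the result in a column with a case sensitive (or binary)
    -- collation so later lookups compare the hashes unambiguously.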

Accent sensitivity is easy to defend. It’s the default option most of the time, which makes good sense: you really don’t want to treat “la” the same as “là”, and that is before we even get to customer names that you really don’t want to merge by accident.
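The same kind of quick check as before shows what accent (in)sensitivity does, again using standard SQL Server collation names:

    SELECT
        CASE WHEN N'la' = N'là' COLLATE Latin1_General_CI_AI
             THEN 'equal' ELSE 'different' END AS accent_insensitive,
        CASE WHEN N'la' = N'là' COLLATE Latin1_General_CI_AS
             THEN 'equal' ELSE 'different' END AS accent_sensitive;
    -- accent_insensitive: 'equal', accent_sensitive: 'different'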

So, to summarize: for a data warehouse I would use a case sensitive, accent sensitive collation wherever you have a choice. This prevents unforeseen data loss. But if you already know you will migrate to the cloud in the near future, check carefully how to deal with this issue in a case insensitive database.

If you have an opinion on this topic, feel free to leave a comment.

Encryption is not a silver bullet

Recently, well-known security researcher Troy Hunt, who runs the website Have I been pwned?, described how someone lost 324,000 records with full credit card details, including security codes, by posting them on a public server. Two parties were suspected of the data breach, but at first neither could find any breach. So both stated categorically that there was no breach: all data was 100% encrypted and completely secure on their servers, so the problem had to lie elsewhere. And they were right, all the data was encrypted.

Now, encrypted data should be safe. And to be honest, encryption is more and more the mainstay of securing your data. Firewalls can be breached, servers and companies infiltrated, but if the data is encrypted it should remain secure even if you publish it on the internet. This is somewhat correct – barring adversaries like national intelligence services, who are very likely able to decrypt most schemes at the moment. It is well known that the Dutch National Intelligence and Security Service (AIVD) is investing heavily in quantum computing research, for instance, which means the NSA probably has a working machine right now. But apart from those entities, it is still quite hard to crack decently encrypted data.

That is why the new SQL Server edition, SQL Server 2016, makes it possible to keep the data encrypted all the time; only the client can decrypt the data with their own keys. Barring vulnerabilities in the implementation this is a huge step forward: database administrators cannot access data they aren’t allowed to see, and the loss of a key only affects the data stored for that client. Both are very important steps towards enabling clients to trust databases in the cloud. That is one reason Microsoft is pressing forward on this: according to their own predictions they will become entirely dependent on Azure in less than a decade, which makes trust in Azure a make-or-break issue for the company, and their focus on improvements in security reflects this.

And let me be clear: this is a huge leap forward. The old situation could encrypt some data with server-side keys, but when you made a backup it was decrypted, and in several other scenarios it simply didn’t work on encrypted data. Now it works across the whole database, it is quite easy to set up, and you can even choose whether a column is encrypted in a deterministic way, which gives the same result every time you encrypt the same value and so enables searching and joining, or in a randomized way, where every encryption produces a different result. The latter gives more protection against attackers who encrypt “likely values” and check for a match, a classic attack against password files (see: rainbow tables / dictionary attacks).
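As a rough illustration of what the two options look like in a table definition (the column master key and the column encryption key, here called MyCEK, have to be created first and are assumptions in this sketch):

    CREATE TABLE dbo.Payments (
        PaymentId        INT IDENTITY PRIMARY KEY,
        -- Deterministic: the same plaintext always gives the same
        -- ciphertext, so the column supports equality searches and joins.
        CreditCardNumber VARCHAR(25) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = MyCEK,
                            ENCRYPTION_TYPE = DETERMINISTIC,
                            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'),
        -- Randomized: every encryption differs, which resists the
        -- "encrypt likely values and compare" attack but rules out searching.
        CardholderNotes  NVARCHAR(200) COLLATE Latin1_General_BIN2
            ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = MyCEK,
                            ENCRYPTION_TYPE = RANDOMIZED,
                            ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256')
    );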

In the picture you can see how it works by storing the keys on the client:
Always Encrypted SQL Server 2016

This means we can now store credit card data and other sensitive information in the cloud without having to rely solely on the goodwill of the Azure database administrator.

There is unfortunately also a downside. The fact that data is now safer does not mean it is safe in all circumstances. The way “always encrypted” works has consequences for your implementation that could blow your encryption scheme right out of the water if misused. The temptation to store sensitive but very interesting data because, hey, “it’s encrypted” and thus safe, can overcome common sense and even regulations; we should still firmly ignore that temptation.

The case I linked at the start showed everyone that even encrypted data is not always safe. There, too, the data was encrypted, and it still leaked. The reason was that the encryption keys were known to the organisation involved and were used to decrypt data for analysis; the resulting decrypted text file was then stored on a publicly accessible server. Encryption cannot mitigate that scenario: when the keys are part of the web application and the owner of the application can also access the data, anyone who can get to the keys can decrypt the information. After that, the security of the data once again depends on what that person does with it – such as putting it on a public server.

This is the reason that if you want to process credit card information, for instance, you need to be PCI compliant. PCI DSS is a set of regulations drafted by the financial industry that tells you what data you may store and how. Very sensitive details such as the security code should NEVER be stored; there are no rules for storing the security code securely, because storing it at all violates the rules, no matter what you do. The Regpack case shows that this is still true. What you store will eventually leak, even with encryption. Once quantum computers become widely available, current encryption schemes are broken and that nicely encrypted data on the internet that wasn’t a problem… is suddenly readable text.

So while “always encrypted” is a step forward, you still need to be very careful about what you store, and the platform that processes it still needs to be secure: to work with the data it has to be decrypted in memory, so on an insecure platform it can simply be intercepted there. Solutions are in the works (Philips, IBM and others are working on homomorphic encryption schemes), but they are currently not an option.

Recommendations

My recommendations on this subject are as follows.

  • Do not store any data you are not allowed to store.
    If you do this anyway and lose the data, you will get fined or even shut down when this comes to light.

  • Do not store any sensitive data you do not have to store.
    Everything you store is a security risk; if you don’t store anything, there are no risks. Being smart about what data to store is a big part of any security strategy.

  • If you do store sensitive data, let the owner of the data hold the key to that data if at all possible.
    After all, a file where every line is encrypted with a different key you don’t have is a file that will be pretty hard to decrypt and certainly can’t be decrypted by accident by one of your employees.

  • If you cannot do even that, and your application does the encrypting, make sure the decryption key is locked in hardware, such as a smart card, that is NOT reachable from any computer without physical presence.
    Violating this simple rule is what destroyed the Dutch certificate authority DigiNotar.

Some companies prioritize time-to-market and lower cost over data security. But eventually, those companies will be destroyed by that practice. The current digital environment is simply too hostile for such practices to survive very long.