Data warehouse database collation: case sensitive or not?

Various cases illustrated
Even today, database installation is not an entirely trivial task. Especially for data warehouse databases. All settings need to be considered before you start. Fortunately it gets a lot easier in the cloud, right? But where some Cloud vendors claim that their solution removes the need for extensive configuration of settings, what usually happens is that they provide you with a lot of defaults that you can no longer control. Below the waterline, the same settings can still play a role. A setting that can trip you up this way, is collation.

The database collation defines how the database sorts and compares text in the database. It determines how two strings compare with each other, are sorted, and when they are equal or not. Since data warehouses do a lot of key and attribute comparisons when loading data, this is a pretty big deal – much more so than in a standard OLTP system. Not all databases allow you to choose your own collation, but Microsoft SQL Server and Oracle do. On-premise, at least.

Which collation you choose is not a decision to make lightly. While it’s easy to accept the default settings, please note that once you install a database with a certain collation you cannot change it afterwards: you need to rebuild the database from scratch, and moving data from one to the other is easy when going from case insensitive (CI) to case sensitive (CS), but the reverse may be quite hard when “Jones” and “JONES” suddenly refer to the same person instead of two different ones.

Various databases handle collation in different ways. I’ll discuss the broader differences between Oracle and SQL Server here. If you have a different database you should check in advance how collation works for your server.

In Oracle, the metadata of the database is not case sensitive by default. Collation refers to the data, but not the metadata. This ensures that "SELECT * FROM myschema.table" is the same as "select * from MYSCHEMA.TABLE". However, the data in the tables *is* case-sensitive by default. You can change this on a query-by-query basis, and per session since Oracle 10g , but it’s rare to do so because there are some pretty horrendous implications for performance if you use the wrong settings. These default settings make a lot of sense for most applications, however.

Once you’re used to Oracle’s way of doing things, SQL Server will come as a bit of a shock. On SQL Server you can install the database to be either case insensitive, or case sensitive. You can fine-tune this collation even further to be sensitive to diacritical characters (accents) or not. The default settings for the database depend on the system locale settings: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143508(v=sql.105). Microsoft recommends keeping the collation settings the same for all databases in the organisation, which is a wise recommendation. This recommendation fails to take into account the situation where the company has multiple database platforms, however, with one of them perhaps being an Oracle database. And even if you do not have on-premise databases other than SQL Server, you will still receive files from external vendors. These may come with their own expected collations, for instance when they identify records with a key that says that Microsoft and MICROSOFT are different companies. This is relatively easy to identify and deal with. It becomes a lot harder when you have hashed or encrypted fields where case turns out to be important. It may be a while before you spot the issue.

Going for a Cloud solution does not solve the issue. As indicated in https://azure.microsoft.com/en-us/blog/working-with-collations-in-sql-azure/, Azure sql server uses a default collation called "SQL_Latin1_General_CP1_CI_AS". That is, the western latin alphabet, case insensitive, but accent sensitive. If your own database was case sensitive, you are now in trouble.

Going case insensitive makes a lot of sense for a single application. It doesn’t have to deal with any other system than itself. If you need to import data, you can wrangle it in the desired shape. Users will probably expect that if they type “microsoft” it will also find “Microsoft” and “MICROSOFT”. But when you deal with a data warehouse, things change. You can and will receive input from various systems with their own ideas about the required collation, alphabets or treatment of accents. 

Case sensitive collations are also a requirement for any permanent data store with heterogeneous inputs (aka a data warehouse). They prevent hashed or encrypted fields from suddenly becoming “equal” when they are in fact different. When looking up hashed keys in a Data Vault 2.0 approach, this becomes a rather important point: Here the input to a hash function can be case sensitive, but the output can *also* be case sensitive. Your best option is to treat it as case sensitive and if it happens to be the same anyway, enforce the case either during loading or with a view.

Accent sensitive is easy to defend. It’s the default option most of the time, which makes good sense. You really don’t want to treat “la” the same as “là”, and that is not even going into the subject of customer names that you really don’t want to merge by accident.

So to summarize: for a data warehouse, I would use the case sensitive/accent sensitive collation wherever you have a choice. This prevents unforeseen data loss. But if you already know you will migrate to the cloud in the near future, you may have to check carefully how to deal with this issue in a case insensitive database.

If you have an opinion on this topic, feel free to leave a comment.

Leave a Reply

Your email address will not be published.