data warehouse | Grundsätzlich IT

Several people have over the last few years advocated for a persistent staging area (PSA) as part of a data warehouse (DWH), for instance in this article by Roelant Vos. Roelant is the most notable advocate of this approach so I will use his article as the main argument in favor of a PSA. rightly says that some people (vehemently) disagree with this, as it introduces a second data warehouse. However, there are more reasons why I personally think that it *may* be a good idea in some situations, but a pretty bad idea if you apply it to every data warehouse.

So what *is* a persistent staging area? Simply put, it is your staging area, except you make all data historical, and never throw anything away. This has several benefits, as described in Roelant’s article:
1 – It allows you to re-create and fix the central Data Vault or Kimball model in case of issues.
2 – You can have an independent verification of what’s happening in the source systems.
3 – It allows you to start right away without having a central model in place, deferring (some) modelling, architectural, and design decisions to a later stage while still capturing history from day one.
4 – You can virtualize your data warehouse on top of the PSA.

However, I disagree that these are benefits. They are mainly stopgaps for issues you should have mitigated in another way, or issues arising from applying methods in a way that I would not do (and therefore, would not have the same problems with).

Let’s first take a look at the point that you can already capture history that you might need later on, and start right away. The only reason for leaving out data is when the person who is responsible for the data (which can *never* be IT or a project) tells you the data is irrelevant. So implementing a PSA for this part means that you (a) don’t trust the business to get it right, (b) don’t make them responsible for their own problems and (c) the historical data you didn’t capture is so important that it needed to be captured when you came in (which is a holy date on the business calendar somehow? Why is that data important, but the data from two months earlier is not?) and (d) there are no options to restore a backup of that data and (e) the source itself already tossed the data out. The combination of these factors is both unlikely and also not the problem of the DWH to solve. So why a DWH business owner has to tack on additional expenses to mitigate this is beyond me. There is of course the gap between when you start the project, and when you can first load data after you have designed and architected the data warehouse, as well as modelled it. This bespoke approach to data warehousing is fast becoming a serious issue in the market, and as antiquated as a horse-and-carriage. Given proper data warehouse automation tooling, a database to store the DWH in, and access to sources, it is now possible to have a rudimentary DWH up and running in the first week. Yes, if you spend six months setting it up you have a window where you don’t have historical data over those six months, but I fail to see why the data from those six months are suddenly more relevant than the decade in lost data where there was no data warehouse in place at all. The whole anxiousness about losing data is a bit strange: you are NOT the person responsible for this, the business is. I suggest you ask the business about it first – the usual answer is “who cares”.

Second, the whole ability to recreate the central persistence layer of your DWH by having another copy in place is just strange. All data that is stored is always available. I have made mistakes as well, but it was always possible to restore the mistakes since the data was captured in the Data Vault. Sure, maybe you didn’t capture a table that had changes and now someone notices it. The question is: if they notice it after a year or so, how important was that data exactly? And why do you feel a need to implement a complete on-line copy of the entire data warehouse just to mitigate against the chance of that happening? Isn’t it better to focus on automated testing, use cases, and spending time with the business owners to discuss the data in depth? After all, we can only spend our time and money once.
You may hear the counter that adding another persistence layer in staging is a very simple thing to do, that will hardly cost anything. But this is only partially right and therefore wholly wrong. The up front cost is low, sure, as it adds only a simple mechanism to the staging layer ETL. The maintenance cost on the other hand is very high, as every change in the data in every source needs to be implemented in your staging layer while maintaining the older data as well. Doing this negates a large part of the argument for data vault in the first place, which was “resilience to change due to modeling the business instead of the sources”. The PSA is a throwback to the bad old days, in that respect, and adds a lot more work than you intend. Just imagine you add twenty sources to the DWH, that has a single business model (or in some banks I know, five). If each source changes once every year, you have more than one change a month, sometimes drastic ones, even when the business does not change, purely for technical reasons. But the central persistence layer only changes drastically when the business changes, which reduces the maintenance costs. A staging area that is not persistent can deal with a source change in a few hours. A persistent staging area either needs to get a new version that is maintained from then on, or a costly migration. The versioning strategy is relatively low cost, but begs the question “how online *is* your data actually, and why is this better than a tape backup?”.

Third, and last, is the virtualization argument. Yes, the argument is correct that if you virtualize your Data Vault, you are in trouble without a persistent data store. And then you’re left with the PSA as the only option. You could also just build a good data vault model, and have none of those issues. After all: if you need to store the data somewhere in every case, what use is virtualization then? The only reasons I see for virtualization are:
– you are prototyping the reporting area
or
– you are prototyping the persistent storage area
The reason for doing the first is obvious. But this can be done on top of a data vault with precisely the same effort. The reason for doing the second is less obvious, but still valid, as the business can often see what you mean in a model much better with actual data. However, you can do this on top of the source system directly if you use Denodo’s or TIBCO’s virtualization software. Even a single one-time restore of the source system into your DWH will work for prototyping just as well. So I don’t think this is a compelling argument.

You might ask if there is no situation where the PSA is useful. Of course not, the PSA *is* useful, just not as often as you think it is. If you have to run the older data in your persistent storage area side by side with realtime current data in a combined virtual solution you suddenly need to either integrate on the natural keys, or on the source system keys. If you integrate on a data vault now, you suddenly have more work as one part of your solution is persisted data in a format that is different from the realtime data. In this case a persistent staging area may be the best solution – for that particular part of the solution.

Another situation where the PSA might be useful is the case where your sources are very stable, but your business model is not. For instance, a startup that uses existing solutions as sold by established vendors. In that case, the persistent staging area would, for now, probably be the option that has the least maintenance and thus the right choice, because as a rule of thumb, as much as eighty percent of the cost of any data warehouse is maintenance. Reducing the necessity of maintenance efforts reduces the technical debt and cost and therefore increases the longevity of the solution.

In general, I find the persistent staging area to be a (misguided) technical solution to issues that are the responsibility of the business to fix on the one hand, and an overlarge drain on the maintenance budget on the other hand. There are situations where it may fit, but if you don’t have to use it, I suggest you should not use it.

N.B.: this is a braindump for now. I will probably write a more considered post later on.

Grundsätzlich IT

information technology

Tag Archives: data warehouse

Some thoughts on the Persistent Staging Area