Category Archives: Development

This category contains all news on software development. This includes Android, Microsoft.Net, web front-end programming and of course all things related to Business Intelligence.

Legal form: a modelling journey, part III

Now that we have had the preliminaries in the previous articles in this series, it’s time to get down to the modelling itself. For that, I’m going to present a set of use cases and then work out the model.

We are going to work from the perspective of banks and financial institutions that need to assess the risk of their counterparties. One aspect of that assessment is the risk, tied to a particular legal form, of not getting your money back. Specifically, we will look at worldwide counterparty risk on legal forms for an international bank.

Note: all images were generated using dbdiagram.io.

Conventions

In this model we will use "abbreviation" a lot. You could make the case that legal forms have acronyms rather than abbreviations, since an acronym is an abbreviation formed from the initial letters of other words and pronounced as a word (e.g. NASA, EU, FIFA), and most legal form abbreviations fit that description. However, as I am not sure that all of them are acronyms, we will use "abbreviation" in this model. If you feel that "acronym" is more appropriate, use that instead.

Simple case: one country with one language

The simple case is as follows: we have a stakeholder (see the previous article) that needs some information about a legal person. I won't get into the details of legal persons that may or may not exist; at this point we'll just assume we have one. We want to register the legal form of that legal person, but:

  • We are in the same country
  • We have only one language
  • We understand the risks that we run for this specific legal form
  • We recognize a legal form by the abbreviation, like “LLC” or “Inc”.

This means we can restrict ourselves to only registering the following information:

The basic information required is the standard name for the legal form. For some legal forms, legal entities are required to communicate their legal form in their name by adding the standard abbreviation for that legal form. They are not allowed to communicate a misleading abbreviation: you cannot call yourself a limited liability company by adding Ltd. to your company name if you are not in fact registered as such. So the official abbreviation you are required to carry in the name is also helpful to register. But since we only deal with one language, we are now using the abbreviation as our code, which saves us an attribute. This will come back to haunt us, of course, as shortcuts usually do.

Do note that, depending on the specific rules, regulations and customs of the registrar, the full name of a legal person may include the legal form abbreviation itself, or the abbreviation may not be part of the official name but be used as a postfix or prefix to the name in all official communications. We will ignore this from here on.

Also note that there are legal forms without an abbreviation, and potentially even legal forms without an abbreviation or a full name. We will ignore this for now, but we'll get back to it later.

We can also add the registrar where the legal form was registered, but this is not part of the legal form itself. We could put this on the relationship between Legal Person and Legal Form, but that would imply you could have multiple Legal Forms for the same company with different registrars, which is unlikely to be a valid situation, so care must be taken to restrict the relationship appropriately. What you should probably do is add the registrar information to the Legal Person itself, because you create a legal person by registering it at a registrar.

This gives rise to the following diagram:
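For readers following along without the image, a minimal SQL sketch of this simple case could look as follows. The table and column names are illustrative (they are not taken verbatim from the diagram), and the abbreviation doubles as the key – exactly the shortcut described above.

    -- Simple case: one country, one language; the abbreviation acts as the code.
    CREATE TABLE legal_form (
        legal_form_abbreviation  VARCHAR(20)   NOT NULL PRIMARY KEY,  -- e.g. 'LLC', 'Inc'
        legal_form_name          VARCHAR(200)  NOT NULL
    );

    CREATE TABLE legal_person (
        legal_person_id          INT           NOT NULL PRIMARY KEY,
        legal_person_name        VARCHAR(200)  NOT NULL,
        registrar_name           VARCHAR(200)  NULL,   -- registrar kept on the legal person
        legal_form_abbreviation  VARCHAR(20)   NOT NULL
            REFERENCES legal_form (legal_form_abbreviation)
    );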

Next step: multiple languages

If there are multiple official languages in the country, you will have abbreviations for the legal form in multiple languages. They may be different, but the worst case is having two equal abbreviations that actually mean something different depending on the language. Fortunately very few countries let that happen. Still, it’s a theoretical possibility we need to cater for in our model.

Most of the time a legal person will probably carry only one abbreviation appended to the original name (if required), not several. However, in correspondence in other languages one may want to point out the legal form in a language the reader will understand. So having the abbreviation and full legal form name available in each official language is very helpful in those cases. This is why we now turn the Legal Form Code into a surrogate key plus an actual code, and the Legal Form can then have several Abbreviations and Names, one per language.

We choose ISO 639 part 1 (ISO 639-1, alpha-2 codes) for our languages. It contains the most used languages on the planet. But if you are dealing with "legal forms through the ages", you should probably use the full ISO 639 standard. Since this is a bit out of scope, I won't go into further detail; it would likely take a separate article to do it full justice.

This brings our diagram up to the following standard:
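Again as a sketch with illustrative names rather than the exact ones from the diagram: the abbreviation is no longer the key, the Legal Form gets a surrogate key plus a code, and the per-language names hang off it.

    CREATE TABLE legal_form (
        legal_form_id    INT          NOT NULL PRIMARY KEY,  -- surrogate key
        legal_form_code  VARCHAR(20)  NOT NULL UNIQUE        -- the actual code
    );

    CREATE TABLE legal_form_name (
        legal_form_id            INT           NOT NULL REFERENCES legal_form (legal_form_id),
        language_code            CHAR(2)       NOT NULL,     -- ISO 639-1 alpha-2
        legal_form_abbreviation  VARCHAR(20)   NULL,
        legal_form_name          VARCHAR(200)  NULL,
        PRIMARY KEY (legal_form_id, language_code)
    );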

While the Legal Form entity looks a bit empty for now, any information that is relevant to the legal form regardless of name or language can be placed in it. An example might be a flag that indicates whether there is transparent personal liability, meaning that the legal form allows a creditor to recover losses from the personal finances of the owners of the legal person, which is often the case for the "self-employed" legal form. There are various kinds of transparency, for instance from a tax perspective, or from a risk-assessment perspective where you need to know whether you can recover losses from the (legal or private) persons that have ownership.

Transliterations: language and script

We have already included translations by incorporating language codes in the key for the Legal Form Name. But now we also want to deal with the situation where we can have multiple scripts within the same language.

If you only work for a single country with a single script that happens to use only the characters of the Latin alphabet, good for you: you can skip this bit. But even when you stick to European countries, that's just not the situation. Several nations in Eastern Europe use Cyrillic scripts. For business purposes these are often transliterated into other scripts, which can give rise to interpretation issues: there are many official and unofficial ways to transliterate from Arabic to Latin, for instance, and that leads to many misunderstandings. Therefore, we also want to maintain both the original and the transliterated versions of the name and abbreviation.

Usually there is just one script per country, but there are exceptions to that rule, like Mongolia, which has two different scripts, neither of them Latin. Some nations in Eastern Europe, such as Serbia and Bulgaria, also use two scripts, but one of them is Latin. And then there are African languages that can be written in a Latin script (with some difficulty) but also have modern counterparts that use a different script, such as Dinka.

There is an ISO standard for script codes: ISO 15924. It contains a lot of script codes, but quite a few are of only historical interest – which could still be relevant for historical research into old legal forms.

However, our current use case (registering legal forms that are actually in use, worldwide) can mostly be dealt with by choosing an original script and a single script that everything is transliterated to (for us, Latin), so we are not going to add extra complexity for one remote scenario. That is a trade-off we make here, and you should document such choices when modelling your data. So we will just store the original name and abbreviation in the original script in separate fields, and leave it at that for now.

At this point, our diagram should look like this:
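In SQL terms the change is small: two extra columns on the language-specific entity, kept in a Unicode datatype (NVARCHAR on SQL Server) because the original script will usually not fit a plain Latin character set. A sketch, assuming the transliterated (Latin) values stay in the existing columns:

    ALTER TABLE legal_form_name ADD
        legal_form_abbreviation_original_script  NVARCHAR(20)   NULL,
        legal_form_name_original_script          NVARCHAR(200)  NULL;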

However, we’re not done yet.

Legal basis for legal form

Every legal form has to have a basis in law. The problem (which we'll get to at a later stage) is that you need the concept of jurisdiction to get this entirely correct, but for now it's enough to say that every legal form refers to one or more paragraphs in a law. And it should be referenced, because you can have legal forms that have no name but do have a legal basis. The legal basis can be recorded as a URL if you can refer to a stable website, or as a copy of the text. It is, however, language-specific. So we cannot add it to the Legal Form entity, and we also cannot simply add it to the Legal Form Name entity.

We now have two options. Either we rename “Legal Form Name” to something that indicates it’s the language-specific legal form information, or we add another entity to hold the information about the “Legal Form Legal Basis”. But there is a problem with adding another entity: you could potentially run into the scenario that you have a legal basis for one given combination of legal form and language, and legal form names for another combination. Even if your fields are mandatory, you can still get null values if you combine the two entities into a coherent view. If you want to make sure you always have both at the same time, you must combine them into one entity.

However, in our case the fields CAN be empty. It is possible to have a legal form without an abbreviation, such as "sole entrepreneurship". It's also possible (but highly unusual) to have a legal form with a legal basis but no actual name. This happened with the AnaCredit regulation, where the European Central Bank had a different opinion on legal forms from the Dutch Central Bank and created new legal forms that had no name, based on different paragraphs in Dutch law.

This means we should be able to accommodate both scenarios, and in that case you can either leave all fields optional, or separate the entities.

You could introduce a separate entity here, "Legal Form per Language". But since we still need to accommodate various legal frameworks, we shall leave that for later. So we end up with the following model for now:
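To make the chosen option concrete, here is a sketch of the language-specific entity with the legal basis included and all descriptive fields optional, as discussed above. The names are again illustrative, and the original-script columns from the previous step are left out for brevity.

    CREATE TABLE legal_form_name (
        legal_form_id            INT            NOT NULL REFERENCES legal_form (legal_form_id),
        language_code            CHAR(2)        NOT NULL,  -- ISO 639-1 alpha-2
        legal_form_abbreviation  NVARCHAR(20)   NULL,
        legal_form_name          NVARCHAR(200)  NULL,
        legal_basis_url          NVARCHAR(400)  NULL,      -- stable link to the law, if available
        legal_basis_text         NVARCHAR(MAX)  NULL,      -- or a copy of the relevant paragraphs
        PRIMARY KEY (legal_form_id, language_code)
    );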

In our next chapter we shall discuss change over time, and the concept of jurisdictions. Legal forms relate to language, but are only valid within a certain geographic boundary. That leads to some fairly nasty complications.

Edits:
11-jun-2024: I slightly changed the text for transliteration (clarified difference with translation) and updated the image to reflect the fact that the original script usually needs a unicode datatype.

Legal form: a modelling journey, part I

Abstract vision of a legal form for a company, in a landscape of country and language.

The AnaCredit regulation is an interesting regulation. Having worked closely with the Dutch modeller who drafted its logical model at the DNB, I consider it one of my favorite models to demonstrate. It compresses literally hundreds of pages of rules and regulations into a single model – to the dismay of most banks when they first saw it, but eventually to the benefit of all of its users.

However, one issue has always been a bit of a pain, and that is the concept of legal form. You are required to report it for the counterparties involved in loans. But when the regulation took effect, some banks had not used foreign legal forms at all but had mapped them onto their own home country's versions. Others had incomplete registrations. And most of them had issues with languages: in Belgium you can register a company in three different locations, with three different names for the legal form, yet they are all the same legal form. Even worse, you can potentially have the same code in different national languages meaning something different. This means that using the legal form code without any other descriptive attributes is a problem.

Legal form is a bit of a weird duck in a way. Everyone knows it. Everyone uses it. But when asked to describe what it is, you get wildly varying answers. Obviously, the ECB doesn’t know what it is either, or they would not have made the list they use now (you can check out the ECB List of legal forms yourself).

Definition

On the internet it's not that easy to find a definition of legal form. The ones I can find come close to the parable of the elephant described by a number of blind men: they describe the visible attributes but not the core, the "Ding an sich". See for instance this definition: while it starts off well with "the legal form (also known as legal status) is defined according to national legislation", it then goes on and on about its use.

So what is it then? The best I can come up with is that the legal form under which a company operates is defined in national legislation. The name of the legal form and its abbreviation are sometimes given in that legislation, but not always. In the end, any legal form code or name is shorthand for a reference to a particular paragraph in the national law of a given country or group of countries such as the EU.

For instance, in Dutch law we have a limited liability company. This is called a "besloten vennootschap" and abbreviated as "BV". It is defined in het Burgerlijk Wetboek (the Dutch civil code), book 2, article 175, which says that a BV is a legal person with registered shares, where the shareholders are not liable for any losses beyond the amount of capital they have put into the company. And that is the basic definition of this particular legal form (although further articles describe it in more detail).

Do not make the mistake of assuming that this particular legal form is the same as the limited liability company in other countries: liabilities can be limited in very different ways and can be undone on very different grounds. That is why you really need to include the country when you define legal form.

A legal form is also unique not just by country but by official language in that country. In Belgium, as you can see in the ECB List of legal forms, they have a legal form that has three names (and corresponding acronyms):

  • Unité TVA – UTVA (French)
  • BTW-eenheid – BTWE (Dutch)
  • Mehrwertsteuereinheit – MWSE (German)

This particular legal form has a single surrogate key, meaning that to the ECB, it is just one single legal form. So far, so good though.

Complications

Now, so far things are relatively simple. But they get a bit more difficult when you realize that some EU countries have no standardized set of legal forms. Try finding the ones for Portugal, for instance. On the official government website you can find nine legal forms. But the AnaCredit list has twenty of them, and one even has no acronym. The thing the website does well, though, is that it refers you to the exact paragraph in the law that describes and regulates each legal form. Portugal is an example, but not an exception.

Things get funnier when you add the ISO standard into the mix. Yes, there is an official ISO standard (ISO 20275:2017 – Financial Services – Entity Legal Form) for legal form, because the mess has not escaped the attention of the standardization committee. The data model is described in the standard, but there is also a registrar for worldwide legal forms, the Global Legal Entity Identifier Foundation (GLEIF). I’ve described this standard in an older post.

Some of the differences between the lists are:

  • The GLEIF list contains legal forms on a country subdivision code level (ISO 3166-2), used in for instance the USA and Canada, where the ECB list does not. This becomes interesting in the case of Madeira, which is subdivided in the GLEIF list but part of Portugal in the ECB list. Where the GLEIF lists two legal forms for Madeira, none are mentioned in the ECB list;
  • The GLEIF list contains 31 legal forms for Portugal, the ECB list only has 20, and the Portuguese government lists 9. Have fun trying to find out which one is valid for your case. Portugal is just an example here, the deviations are similar for many countries;
  • The local name is nice, but do you also want it in the local script? This becomes a relevant question for Cyrillic names in Europe, and for other non-Western scripts such as Chinese and Arabic that may not even be read from left to right. The GLEIF list gives the local name in the local script and provides a transcription into Western script as well, but the ECB list does not. If you are lucky you can get them from the GLEIF list, and otherwise, well, there's always ChatGPT;
  • The GLEIF list sometimes lists an abbreviation, sometimes not. The abbreviations can be in the local script, in that case there are usually, but not always, transliterations into the Western script.
  • The GLEIF list sometimes lists multiple versions of abbreviations for the same legal form for the same language for the same country. The ECB list only does that for different languages in the same country. Good luck matching them up.

So how do we get out of this mess? Well, we don't 🙂 The mess will remain. But you need to understand for what purpose you need the legal form of a legal entity. The main purpose is to determine the amount of risk and exposure you take on if the legal entity in question is your counterparty. If you do business globally, this matter becomes more urgent. A secondary purpose is to report it to the relevant authorities, notably the ECB under the AnaCredit reporting requirement, which does the same assessment at a higher (aggregate) level – except that exact legal forms are only required for European counterparties; for global ones you are allowed to approximate them.

We can create a data model that will enable us to fulfill most of these requirements. Given the data issues I doubt it is possible to get full coverage for all countries and all legal forms, but we can certainly do much better than just adding the nearest acronym to a legal entity and hoping for the best. That data model is the subject of the next post, however.

Some thoughts on the Persistent Staging Area

Several people have advocated over the last few years for a persistent staging area (PSA) as part of a data warehouse (DWH), for instance in this article by Roelant Vos. Roelant is the most notable advocate of this approach, so I will use his article as the main argument in favor of a PSA. He rightly says that some people (vehemently) disagree with this, as it introduces a second data warehouse. There are more reasons, however, why I personally think that it *may* be a good idea in some situations, but a pretty bad idea if you apply it to every data warehouse.

So what *is* a persistent staging area? Simply put, it is your staging area, except you make all data historical, and never throw anything away. This has several benefits, as described in Roelant’s article:
1 – It allows you to re-create and fix the central Data Vault or Kimball model in case of issues.
2 – You can have an independent verification of what’s happening in the source systems.
3 – It allows you to start right away without having a central model in place, deferring (some) modelling, architectural, and design decisions to a later stage while still capturing history from day one.
4 – You can virtualize your data warehouse on top of the PSA.
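To make the idea concrete: a PSA table is typically just the source table plus load metadata, with every new or changed version of a record kept forever. A minimal sketch (the table and column names are hypothetical, not taken from Roelant's article):

    -- Persistent staging for a hypothetical source table "customer":
    -- every load appends new or changed rows; nothing is ever deleted.
    CREATE TABLE psa_customer (
        customer_id    INT            NOT NULL,  -- source primary key
        load_dts       DATETIME2      NOT NULL,  -- when this version arrived
        record_source  VARCHAR(50)    NOT NULL,  -- which source system / extract
        customer_name  NVARCHAR(200)  NULL,      -- source attributes, copied as-is
        address        NVARCHAR(200)  NULL,
        PRIMARY KEY (customer_id, load_dts)
    );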

However, I disagree that these are benefits. They are mainly stopgaps for issues you should have mitigated in another way, or issues arising from applying methods in a way that I would not do (and therefore, would not have the same problems with).

Let's first take a look at the point that you can start right away and capture history you might need later on. The only reason for leaving out data is when the person responsible for the data (which can *never* be IT or a project) tells you the data is irrelevant. So implementing a PSA for this reason means that (a) you don't trust the business to get it right, (b) you don't make them responsible for their own problems, (c) the historical data you didn't capture is so important that it needed to be captured the moment you came in (is your start date a holy date on the business calendar somehow? why is that data important, but the data from two months earlier not?), (d) there is no option to restore a backup of that data, and (e) the source itself has already tossed the data out. The combination of these factors is unlikely, and it is also not the DWH's problem to solve. So why a DWH business owner has to tack on additional expenses to mitigate this is beyond me.

There is of course the gap between when you start the project and when you can first load data, after you have designed, architected and modelled the data warehouse. This bespoke approach to data warehousing is fast becoming a serious issue in the market, and as antiquated as a horse-and-carriage. Given proper data warehouse automation tooling, a database to store the DWH in, and access to sources, it is now possible to have a rudimentary DWH up and running in the first week. Yes, if you spend six months setting it up you have a window where you don't have historical data over those six months, but I fail to see why the data from those six months is suddenly more relevant than the decade of lost data from when there was no data warehouse in place at all. The whole anxiety about losing data is a bit strange: you are NOT the person responsible for this, the business is. I suggest you ask the business about it first – the usual answer is "who cares".

Second, the ability to recreate the central persistence layer of your DWH by having another copy in place is just strange. All data that is stored is always available. I have made mistakes as well, but it was always possible to repair them since the data was captured in the Data Vault. Sure, maybe you didn't capture a table that had changes and now someone notices it. The question is: if they notice it after a year or so, how important was that data exactly? And why do you feel the need to implement a complete online copy of the entire data warehouse just to mitigate the chance of that happening? Isn't it better to focus on automated testing, use cases, and spending time with the business owners to discuss the data in depth? After all, we can only spend our time and money once.
You may hear the counterargument that adding another persistence layer in staging is a very simple thing to do that will hardly cost anything. But this is only partially right and therefore wholly wrong. The up-front cost is low, sure, as it adds only a simple mechanism to the staging layer ETL. The maintenance cost, on the other hand, is very high, as every change in the data in every source needs to be implemented in your staging layer while maintaining the older data as well. Doing this negates a large part of the argument for Data Vault in the first place, which was "resilience to change due to modeling the business instead of the sources". The PSA is a throwback to the bad old days in that respect, and adds a lot more work than you intend. Just imagine you add twenty sources to a DWH that has a single business model (or, in some banks I know, five). If each source changes once every year, you have more than one change a month, sometimes drastic ones, even when the business does not change, purely for technical reasons. The central persistence layer, by contrast, only changes drastically when the business changes, which reduces the maintenance costs. A staging area that is not persistent can deal with a source change in a few hours. A persistent staging area either needs a new version that is maintained from then on, or a costly migration. The versioning strategy is relatively low cost, but it raises the question "how online *is* your data actually, and why is this better than a tape backup?".

Third, and last, is the virtualization argument. Yes, the argument is correct that if you virtualize your Data Vault, you are in trouble without a persistent data store. And then you’re left with the PSA as the only option. You could also just build a good data vault model, and have none of those issues. After all: if you need to store the data somewhere in every case, what use is virtualization then? The only reasons I see for virtualization are:
  • you are prototyping the reporting area, or
  • you are prototyping the persistent storage area
The reason for doing the first is obvious. But this can be done on top of a data vault with precisely the same effort. The reason for doing the second is less obvious, but still valid, as the business can often see what you mean in a model much better with actual data. However, you can do this on top of the source system directly if you use Denodo’s or TIBCO’s virtualization software. Even a single one-time restore of the source system into your DWH will work for prototyping just as well. So I don’t think this is a compelling argument.

You might ask whether there is any situation where the PSA is useful. Of course there is: the PSA *is* useful, just not as often as you think it is. If you have to run the older data in your persistent storage area side by side with realtime current data in a combined virtual solution, you suddenly need to integrate either on the natural keys or on the source system keys. If you integrate on a data vault now, you suddenly have more work, as one part of your solution is persisted data in a format that is different from the realtime data. In this case a persistent staging area may be the best solution – for that particular part of the solution.

Another situation where the PSA might be useful is the case where your sources are very stable, but your business model is not. For instance, a startup that uses existing solutions as sold by established vendors. In that case, the persistent staging area would, for now, probably be the option that has the least maintenance and thus the right choice, because as a rule of thumb, as much as eighty percent of the cost of any data warehouse is maintenance. Reducing the necessity of maintenance efforts reduces the technical debt and cost and therefore increases the longevity of the solution.

In general, I find the persistent staging area to be, on the one hand, a (misguided) technical solution to issues that are the business's responsibility to fix, and on the other hand an overlarge drain on the maintenance budget. There are situations where it may fit, but if you don't have to use it, I suggest you don't.

N.B.: this is a braindump for now. I will probably write a more considered post later on.

Data warehouse database collation: case sensitive or not?

Various cases illustrated
Even today, database installation is not an entirely trivial task, especially for data warehouse databases. All settings need to be considered before you start. Fortunately it gets a lot easier in the cloud, right? Well, where some cloud vendors claim that their solution removes the need for extensive configuration of settings, what usually happens is that they provide you with a lot of defaults that you can no longer control. Below the waterline, the same settings still play a role. A setting that can trip you up this way is collation.

The database collation defines how the database sorts and compares text: it determines how two strings are ordered and when they are considered equal. Since data warehouses do a lot of key and attribute comparisons when loading data, this is a pretty big deal – much more so than in a standard OLTP system. Not all databases allow you to choose your own collation, but Microsoft SQL Server and Oracle do. On-premises, at least.

Which collation you choose is not a decision to make lightly. While it's easy to accept the default settings, note that once you install a database with a certain collation you cannot change it afterwards: you need to rebuild the database from scratch. Moving data from one collation to the other is easy when going from case insensitive (CI) to case sensitive (CS), but the reverse may be quite hard when "Jones" and "JONES" suddenly refer to the same person instead of two different ones.

Various databases handle collation in different ways. I’ll discuss the broader differences between Oracle and SQL Server here. If you have a different database you should check in advance how collation works for your server.

In Oracle, the metadata of the database is not case sensitive by default. Collation applies to the data, but not to the metadata. This ensures that "SELECT * FROM myschema.table" is the same as "select * from MYSCHEMA.TABLE". However, the data in the tables *is* case sensitive by default. You can change this on a query-by-query basis, and per session since Oracle 10g, but it's rare to do so because there are some pretty horrendous implications for performance if you use the wrong settings. These default settings make a lot of sense for most applications, however.
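For completeness, the session-level switch looks roughly like this (a sketch; test the performance impact, and consider linguistic or function-based indexes, before relying on it):

    -- Oracle: make string comparisons in this session case insensitive.
    -- NLS_COMP = LINGUISTIC makes comparisons honour NLS_SORT;
    -- BINARY_CI sorts and compares in binary order but ignores case.
    ALTER SESSION SET NLS_COMP = LINGUISTIC;
    ALTER SESSION SET NLS_SORT = BINARY_CI;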

Once you’re used to Oracle’s way of doing things, SQL Server will come as a bit of a shock. On SQL Server you can install the database to be either case insensitive, or case sensitive. You can fine-tune this collation even further to be sensitive to diacritical characters (accents) or not. The default settings for the database depend on the system locale settings: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143508(v=sql.105). Microsoft recommends keeping the collation settings the same for all databases in the organisation, which is a wise recommendation. This recommendation fails to take into account the situation where the company has multiple database platforms, however, with one of them perhaps being an Oracle database. And even if you do not have on-premise databases other than SQL Server, you will still receive files from external vendors. These may come with their own expected collations, for instance when they identify records with a key that says that Microsoft and MICROSOFT are different companies. This is relatively easy to identify and deal with. It becomes a lot harder when you have hashed or encrypted fields where case turns out to be important. It may be a while before you spot the issue.
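A quick way to see the difference on SQL Server is to force both collations in a single query (the collation names are the documented Microsoft ones):

    -- The same comparison under a case-insensitive and a case-sensitive collation.
    SELECT
        CASE WHEN 'Jones' = 'JONES' COLLATE SQL_Latin1_General_CP1_CI_AS
             THEN 'equal' ELSE 'different' END AS case_insensitive,
        CASE WHEN 'Jones' = 'JONES' COLLATE SQL_Latin1_General_CP1_CS_AS
             THEN 'equal' ELSE 'different' END AS case_sensitive;
    -- Returns: equal, different.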

Going for a cloud solution does not solve the issue. As indicated in https://azure.microsoft.com/en-us/blog/working-with-collations-in-sql-azure/, Azure SQL uses a default collation called "SQL_Latin1_General_CP1_CI_AS". That is: the Western Latin alphabet, case insensitive, but accent sensitive. If your own database was case sensitive, you are now in trouble.

Going case insensitive makes a lot of sense for a single application. It doesn't have to deal with any system other than itself. If you need to import data, you can wrangle it into the desired shape. Users will probably expect that if they type "microsoft" it will also find "Microsoft" and "MICROSOFT". But when you deal with a data warehouse, things change. You can and will receive input from various systems with their own ideas about the required collation, alphabets or treatment of accents.

Case sensitive collations are also a requirement for any permanent data store with heterogeneous inputs (a.k.a. a data warehouse). They prevent hashed or encrypted fields from suddenly becoming "equal" when they are in fact different. When looking up hashed keys in a Data Vault 2.0 approach, this becomes a rather important point: the input to a hash function can be case sensitive, but the output can *also* be case sensitive. Your best option is to treat both as case sensitive, and if the case happens not to matter, enforce a single case either during loading or with a view.
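One way to enforce this during loading on SQL Server, as a sketch (the staging table and the trimming and upper-casing rules are assumptions here; use whatever your own hashing standard prescribes):

    -- Hash a normalized business key and keep the hex output in one consistent case,
    -- so a case-insensitive database cannot silently merge distinct keys later on.
    SELECT CONVERT(CHAR(64),
                   HASHBYTES('SHA2_256', UPPER(LTRIM(RTRIM(customer_business_key)))),
                   2) AS customer_hash_key
    FROM staging_customer;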

Accent sensitive is easy to defend. It’s the default option most of the time, which makes good sense. You really don’t want to treat “la” the same as “là”, and that is not even going into the subject of customer names that you really don’t want to merge by accident.

So to summarize: for a data warehouse, I would use the case sensitive/accent sensitive collation wherever you have a choice. This prevents unforeseen data loss. But if you already know you will migrate to the cloud in the near future, you may have to check carefully how to deal with this issue in a case insensitive database.

If you have an opinion on this topic, feel free to leave a comment.

PowerDesigner do’s and don’ts

Many people consider PowerDesigner to be the de facto standard data modelling tool. Many people are right. However, that does not mean the tool is perfect. As many users can testify, the version 16 release was quite buggy in the beginning, only stabilizing somewhat with 16.5. And this is not exceptional. The repository is still buggy, projects are a recipe for pain, and let's not start a discussion on license prices – we'd still be here next year.

However, if you avoid some practices and adopt others, using PowerDesigner is a breeze. Here is my take on things.

Do Not:

  • Use the repository
    The repository is a major cause of bugs. It looks nice, like a venus flytrap, and then it sucks you in and eats you for breakfast. Avoid it like the plague. You are better off spending some money on creating an extension to generate a model export to your own repository. You can buy this from I-Refact or other parties. The other functionality can be done better, cheaper and with less frustration and bugs by just using standard version control software (TFS, git, etc.). If you must compare models, you can do that from within PowerDesigner with very little effort – without losing parts of your model on the check-in/check-out.
    There is only one part of the repository that is actually semi-useful, which is the indication of whether your model is out of date compared to the repository version. As this functionality does not cooperate with replication, or with extensions that use replication, there is little point to it once you evolve beyond the basics. Also, it is much better to split up your models so as to avoid getting into a situation where 10 people are working on the same model, even potentially. If this is a risk, appoint a designated data modeller for such a model. The rest can get a read-only version.

  • Hide attributes on entities in diagrams
    Unless you use an extension to automate setting/un-setting this and also indicate this visually, it can create no end of trouble when the model shows tables and columns but leaves out certain columns that then get deployed anyway. It takes ages to debug that. If you must do this, make sure it’s an all or nothing proposition: either hide all standard attributes, or none.

  • Create shortcuts to other models
    While PowerDesigner does this automatically once you start creating mappings, there is no need to refer to models outside the scope of the folder, as this will render the models almost impossible to deploy without heaps of pop-ups asking about other models that you have not yet stored in the right location (and don’t even know where they should be located). Only consider this if you have an agreed-upon folder structure and even then I recommend you don’t do this.

  • Create Projects
    Sure, they’re good for having a dependency graph view. But you can create those anyway. And projects are buggy, especially when interacting with the repository. Half the bugs I found in PowerDesigner went away when I stopped using projects and moved to workspaces. No more disappearing models, or graphics. No more models that are impossible to check out or check in.

  • Work for long periods without saving
    The PowerDesigner auto-save function is nonexistent. After you work with PowerDesigner for a while, you will learn to save often. It becomes a reflex. Because it hurts when you lose hours of work through a crash. It’s not as bad as it was when you were still using version 16.5.0, with repository and projects, but still.

  • Use auto-layout without a current back-up
    Your gorgeous, handcrafted model could use a minor improvement and you used auto-layout. And then you pressed “save” automatically, because by now it’s a reflex. And when the screams died down, you realized you didn’t have a current backup. Ouch. Backup often. If you use Git: commit often.

  • Model the entire Logical Data Model as a hierarchy of subtypes
    I have seen them, with their entity types derived from the Object supertype and each other, six hierarchical layers deep. I dare you to try it with a non-trivial model and then generate a working physical model out of it. Go ahead, make my day…

  • Create a unique data domain for each attribute
    This sort of misses the point of data domains. Because while they are rather limited in PowerDesigner (no entity types or attribute groups), they are most useful when they provide a single point to change definitions of common attributes. Use them freely, but let the data architect decide which ones are available for use. It’s best to create a single model for this, that you can use as a template for the other models you create.

But Do:

  • Add metadata to your models
    Especially metadata that describes the following items: Title, Subject Area, Author, Version, Data (Model) Owner, Modified Date, Modifications, Validation Status

  • Add domains
    Create a list of standard attribute domains, then create a template model containing them. People can either copy the model file directly and use it as a template (this creates havoc in a repository though, because the internal model ID will be the same as that of the template model), or copy the attribute definitions into their own model. The definitions should be controlled by the data architect(s).

  • Add attribute groups
    If you create attribute groups of commonly grouped attributes in keyless entities, you can then inherit from multiple of these entities in order to combine them. Most useful when you have things like “firstname/lastname” pairs of attributes that you do not want to separate out to their own entity, for some reason. Use with caution.

  • Tie models together with separate workspaces for each project
    Workspaces are small files with zero overhead that tie different models together. They have no impact on the repository check-in/check-out, they are files that can be under source control, and they are pretty much bug-free. You can even edit them when necessary. Much better than projects.

  • Store your models in version control systems
    Seriously, I should NOT have to say this, but I keep meeting people who don’t seem to realize that MODELS ARE CODE. And with a VCS I do not mean that abortion they call the repository. I mean TFS, Git or even Subversion. Anything that works, basically.

  • Save often
    If you don’t, you’ll regret it.

  • Store backups
    Having version control is not the same as having backups, unless you commit often.

  • Create a folder structure that is the same for everyone and make it mandatory
    If you don’t, you’ll create unending pop-ups whenever someone opens a model they did not create themselves. If they check it in, it’s your turn the next time you open it from the repository.

Is data modeling the next job to be replaced by AI?


The more I read, the more I am convinced that data modeling is an activity that will be supported, perhaps even replaced, by deep learning AI like Watson real soon now, and that the true focus of data governance should be the business conceptual model. All other models derive from there.

After posting this on LinkedIn I received some interesting feedback. The main question was of course “how so?”. Well, here are a number of the things I read and experienced that led me to this conclusion.

1. Data Modeling Essentials (Simsion/Witt, 3rd ed. 2005). It’s not very strict but very readable and has a nice division between the three layers: conceptual, logical, physical. This is where I started. It explained clearly that these different layers are distinct, but not exactly *how*. Or rather: I didn’t understand it at the time.

2. Applied Mathematics for Database Professionals (2007, Lex de Haan/Toon Koppelaars). This explains how the logical model is built and how it differs from the conceptual and physical models. It starts and ends with the math behind it all (predicate logic and set theory) yet is easy to understand for most data professionals. It made the distinction between the different layers MUCH clearer to me, especially how the logical model is a fully mathematical construct. That creates a hard separation between the conceptual and the logical model, even though the semantic part of the logical model is still derived from the conceptual model. Because if you suddenly use different terms and verbs, it is obviously not a derivation but a different model.

3. The FCO-IM book by Bakema/Zwart/van der Lek, which explains how you can have a conceptual model that is fact-oriented and can then be transformed into a logical model. This means that when you can impose an ordering on the language of the conceptual model, you should be able to derive a logical model from it.

4. My own experience with fully verbalized data models, i.e. data models written down in sentences. Most data models have a small vocabulary: "One Client orders one or more Products. Our Company has a Relationship with a Party. A Client is a Party. A Legal Entity is a Party." The vocabulary is in principle unlimited if you so desire, but in practice it can be boiled down so far that there is even a standard for it: the Semantics of Business Vocabulary and Rules (SBVR).

5. Very influential: The “ISO TR 9007 – Information processing systems – Concepts and terminology”. This defined the conceptual model and was created by a number of very well-known information modeling scientists. It influenced me heavily because it defines a conceptual model as essentially something that conveys semantic meaning, i.e.: words on paper. It starts, for instance, with the Helsinki principle.

6. Discussions with Martijn Evers (and numerous other colleagues at the Dutch Central Bank and outside it) about data modeling.

7. The sorry state of modeling tools in general. If PowerDesigner is the best we have, we're in trouble. Sparx Enterprise Architect is actually pretty good, but you can't program it, so you have to make do with what it is. E/R Studio just crashed when I tried to evaluate it last time round. And https://www.vertabelo.com/ looks nice and is web-based, but it only does physical modeling. None of those tools do conceptual modeling anywhere near right. Collibra has the textual part down pat, but only relates business terms to other terms. An ontology, while nice, is the start of the process, not the end. Conceptual modeling is a text-based activity that encompasses the business glossary, the visualization, the business rules, etcetera; it's not just about creating lines and boxes on a canvas. There is currently no tool that does conceptual modeling as a text-based activity.

8. The rapidly advancing state of the art in deep learning. I had artificial intelligence courses in the 90’s, but it wasn’t very advanced at the time. Nowadays, it’s much further. I’ve been looking into Watson (IBM) capabilities over the past week. See for instance https://www.ibm.com/watson/services/natural-language-understanding/ where it says that Watson can: “Analyze text to extract meta-data from content such as concepts, entities, keywords, categories, relations and semantic roles.”

9. The experience of the Watson team that a combination of a skilled human and a decent AI will generally beat both a more highly skilled AI on its own and a more highly skilled human on their own.

10. The fact that the conceptual model is verbalized, and that we now have Wikipedia and Facebook. I think we can have "social modeling media" now. This should increase the speed of modeling immensely. Example: suppose that DNB owns definitions in the Dutch part of a global Finance Business Glossary, that we can derive logical models from that glossary, and physical models from those logical models. That would enormously speed up the process of building reports and applications that conform to regulations. It would also nail the nasty little business model of IBM and Teradata with its hide to the wall (and not by accident).

11. Finally, I can see as well as anyone the incredibly urgent need for data modeling and the serious and increasing lack of data modeling expertise in the labor market. We can either give up on modeling, or we can make it so easy everyone *can* do it. Therefore, the latter *will* be done. Example: the car market was once considered to be limited to the number of available professional chauffeurs. So, the need for professional chauffeurs was removed. Otherwise, Henry Ford wouldn’t have had a market.

I cannot say with certainty that AI will be able to fully automate the entire modeling process (especially not reverse engineering). But what I *can* say is that a large amount of what a modeler does can be automated even today. An even larger part can be done with AI, and the final part can be done together.

In the comments, one person said that AI deep learning and modeling are (mathematically speaking) fundamentally different activities and that therefore you cannot automate modeling. I agree that they are different, but I do not think it matters for the end result. Playing Go is fundamentally different from translating text, yet both can be handled by the same type of algorithm. I foresee the same will hold for modeling, in most cases. The fact that you cannot handle all cases is irrelevant: see item 9.

What about source models without meaning, which humans can model today, another comment asked. I do not think that is a relevant issue: reverse engineering will remain difficult, but it's not the issue I'm concerned with, because reverse engineering means turning an illegible set of fields and tables into a conceptual, textual model. My point is that the forward process of turning text into tables and fields is easy to automate; the other way round is inherently just as difficult as turning mayonnaise back into eggs: the arrow of time, entropy, just does not fly that way.

Martijn Evers stated that “given sufficient quality in models that represent semantic concerns, and sufficient quality in deriving logical models we can nowadays already generate all kinds of implementation models. No AI is needed here. This could change for very large and complex implementation models, but by and large we can do a lot without AI in this respect.” I agree. I just think that using AI reduces the necessary quality of the model to the point where even people with moderate amounts of training can handle this competently. Which is the point of item 11.

Most commenters agreed that a human-machine hybrid approach was in the works. One even pointed out an existing (and very good!) recent article about this topic: http://www.dataversity.net/artificial-intelligence-vs-human-intelligence-hey-robo-data-modeler/. The question thus is “when” and “how”, not “if”.

Please note: this article was also posted on LinkedIn: https://www.linkedin.com/pulse/data-modeling-next-job-replaced-ai-ronald-kunenborg

Image credits: created by Alejandro Zorrilal Cruz [public domain], via Wikimedia Commons. Source: https://commons.wikimedia.org/wiki/File:Artificial.intelligence.jpg

ISO standards for Finance business data

When I define a business glossary to prepare for the high-level corporate data model, I try to incorporate as many of the relevant standards as I can. Usually, knowing up front about a standard makes it much easier later on to integrate with other parties in the value chain, to report to regulatory authorities that use the same standards, and to apply Master Data Management. The more data adheres to international standards, the less work you have in managing it.

Below, I have provided a list of ISO standards that can be used to aid in the governance of your business glossary and data models, standards that provide metadata specific to Finance and standards that provide identification schemes for key entities.

Note that there are more finance data and metadata standards than just the ISO standards. These will be listed in a different post that I will then link from here (and vice versa).

Each entry below lists the standard, the area it covers (General, Finance, Metadata, Business), a description, and a note when the standard defines an identification scheme.
ISO 639 (General)
ISO 639 defines language codes, as opposed to country codes. The standard consists of six parts, some more detailed than others. The preferred part is ISO 639-3, which is the most comprehensive. Usually, we restrict ourselves to a subset of supported languages.

See for more information: https://en.wikipedia.org/wiki/ISO_639
ISO 3166 (General)
ISO 3166 is a standard published by the International Organization for Standardization (ISO) that defines codes for the names of countries, dependent territories, special areas of geographical interest, and their principal subdivisions (e.g. provinces or states). The official name of the standard is "Codes for the representation of names of countries and their subdivisions". It consists of three parts:

  • ISO 3166-1 contains all codes currently in use
  • ISO 3166-2 contains all codes for subdivisions
  • ISO 3166-3 contains all codes no longer in use

The three parts contain several code formats: alpha-2, alpha-3 and alpha-4. The alpha-2 code is the recommended code for general use.

See for more information: https://www.iso.org/iso-3166-country-codes.html
ISO 4217 (General)
ISO 4217 is the standard that defines codes for currencies, as well as funds and minor currency units. A currency can be represented as a three-letter code or as a three-digit numerical code, which is usually the same as the numerical country code from ISO 3166-1. The minor currency unit is given as an exponent of 10: if the exponent is 3, the currency can be divided into 1000 minor units. The name of the minor unit is not part of this standard.

The current version of the standard is ISO 4217:2015.

See for more information: https://www.iso.org/iso-4217-currency-codes.html
ISO 6166 (Finance)
The ISO 6166 standard is called "Securities and related financial instruments -- International securities identification numbering system (ISIN)". This standard describes and defines the International Securities Identification Number. The number applies to fungible and non-fungible securities and financial instruments.

ISINs consist of two alphabetic characters, which are the ISO 3166-1 alpha-2 code for the issuing country, nine alpha-numeric digits (the National Securities Identifying Number, or NSIN, which identifies the security), and one numeric check digit. The NSIN is issued by a national numbering agency (NNA) for that country. Regional substitute NNAs have been allocated the task of functioning as NNAs in those countries where NNAs have not yet been established.

ISINs are slowly being introduced worldwide. At present, many countries have adopted ISINs as a secondary measure of identifying securities, but as yet only some of those countries have moved to using ISINs as their primary means of identifying securities.

The current version of the standard is ISO 6166:2013.

See for more information: https://www.iso.org/standard/44811.html
http://www.isin.net/iso-6166/
Identification scheme: yes.
ISO 8601 (General)
ISO 8601 is about "Data elements and interchange formats – Information interchange – Representation of dates and times". It details how to represent dates and times when exchanging them with other systems in an unambiguous way.

The current version of the standard is ISO 8601:2014.

See for more information: https://www.iso.org/iso-8601-date-and-time-format.html
ISO 9362 (Finance)
This ISO standard defines the Business Identifier Code (BIC). BIC is an international standard for the identification of institutions within the financial services industry. BICs are used in automated processing. They unambiguously identify a financial institution or a non-financial institution. The ISO 9362 standard specifies the elements and the structure of a BIC. A BIC consists of either eight or eleven contiguous characters. These characters comprise either the first three, or all four, of the following components: party prefix, country code, party suffix, and branch identifier. The ISO has designated SWIFT as the BIC registration authority.

The EU regulation 260/2012, also known as the IBAN only rule, requires financial institutions to add the BIC code to IBAN payments.

The rule has applied to any domestic EURO payment since February 2014, to any cross-border EURO payment between EU countries since February 2016, and to any cross-border EURO payment from non-euro countries since October 2016.

See for more information: https://www.iso.org/obp/ui/#iso:std:iso:9362:ed-4:v1:en
https://www2.swift.com/sitesearch/#/?q=BIC
ISO 10383 (Finance)
ISO 10383 is called "Codes for exchanges and market identification (MIC)". It defines the Market Identifier Code (MIC).

This International Standard specifies a universal method of identifying exchanges, trading platforms, regulated or non-regulated markets and trade reporting facilities as sources of prices and related information in order to facilitate automated processing. Each such exchange, platform etc. receives a unique code from the registrar.

See for the current list: https://www.iso20022.org/10383/iso-10383-market-identifier-codes
Identification scheme: yes.
ISO 10962 (Finance)
ISO 10962 defines the structure and format for the classification of financial instruments, approved by the International Organization for Standardization (ISO). There are many types of financial instruments used for saving, investing, trading, hedging and speculating. These instruments are generally organized in groups called "asset classifications." The most common asset classifications are generally described using terms like "Equities (Stocks)," "Debt (Bonds)," "Derivatives (Contracts)," "Currencies," and a few other generalized terms.

ISO 10962 provides a global standard for these classifications in the form of specific codes. Classification of financial instrument (CFI) Code is used to define and describe financial instruments as a uniform set of codes for all market participants. The code is issued by the members of ANNA, the Association of National Numbering Agencies. The group is currently working to simplify the structure so that it can be adopted more widely by non-governmental market participants.

The letters from the ISO basic Latin alphabet in each position of this 6 character code reflect specific characteristics intrinsic to the financial instruments that are defined at the issue of the instrument, and which in most cases remain unchanged during the lifetime of the instrument (or by the market on which the instrument trades).

See for more information: https://en.wikipedia.org/wiki/ISO_10962 or visit the registrar homepage
ISO 11179 (Metadata)
The ISO/IEC 11179 Metadata Registry (MDR) standard is an international standard for representing metadata for an organization in a metadata registry. ISO/IEC 11179 claims that it is (also) a standard for metadata-driven exchange of data in a heterogeneous environment, based on exact definitions of data.

The ISO/IEC 11179 model is a result of two principles of semantic theory, combined with basic principles of data modelling. The first principle from semantic theory is the thesaurus type relation between wider and more narrow (or specific) concepts, e.g. the wide concept "income" has a relation to the more narrow concept "net income". The second principle from semantic theory is the relation between a concept and its representation, e.g., "buy" and "purchase" are the same concept although different terms are used.

The standard consists of six parts:
ISO/IEC 11179-1:2015 Framework (referred to as ISO/IEC 11179-1)
ISO/IEC 11179-2:2005 Classification
ISO/IEC 11179-3:2013 Registry metamodel and basic attributes
ISO/IEC 11179-4:2004 Formulation of data definitions
ISO/IEC 11179-5:2015 Naming and identification principles
ISO/IEC 11179-6:2015 Registration

Part 1 explains the purpose of each part. Part 3 specifies the metamodel that defines the registry. The other parts specify various aspects of the use of the registry. An additional part, Part 7: Datasets is currently under development.

For use in the creation of data models, part 4 and especially part 5 provide common standards that could be used in data governance to govern the creation of data models.

See for more information: https://en.wikipedia.org/wiki/ISO/IEC_11179
ISO 13616 (Finance)
The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross-border transactions with a reduced risk of transcription errors.

The ISO standard was split in two parts in 2007. ISO 13616-1:2007 "specifies the elements of an international bank account number (IBAN) used to facilitate the processing of data internationally in data interchange, in financial environments as well as within and between other industries" but "does not specify internal procedures, file organization techniques, storage media, languages, etc. to be used in its implementation". ISO 13616-2:2007 describes "the Registration Authority (RA) responsible for the registry of IBAN formats that are compliant with ISO 13616-1 [and] the procedures for registering ISO 13616-compliant IBAN formats".

The official IBAN registrar under ISO 13616-2:2007 is SWIFT.

The IBAN consists of up to 34 alphanumeric characters comprising: a country code; two check digits; and a number called the Basic Bank Account Number (BBAN) that includes the domestic bank account number, branch identifier, and potential routing information. The check digits enable a sanity check of the bank account number to confirm its integrity before submitting a transaction.
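
To make the check digits concrete, below is a minimal JavaScript sketch of the usual mod-97 integrity check: move the first four characters to the end, replace letters by numbers (A=10 … Z=35) and verify that the resulting number modulo 97 equals 1. It deliberately skips the country-specific length and format rules, so treat it as an illustration rather than a full validator.

// Minimal IBAN integrity check (mod 97), ES5. Skips country-specific length rules.
function isValidIban(iban) {
  var normalized = iban.replace(/\s+/g, '').toUpperCase();
  // Move the country code and check digits to the end.
  var rearranged = normalized.slice(4) + normalized.slice(0, 4);
  // Replace letters by numbers: A=10, B=11, ... Z=35.
  var digits = '';
  for (var i = 0; i < rearranged.length; i++) {
    var c = rearranged.charAt(i);
    if (c >= 'A' && c <= 'Z') {
      digits += (c.charCodeAt(0) - 55).toString();
    } else {
      digits += c;
    }
  }
  // Compute the remainder piecewise, because the full number is too large for a JavaScript Number.
  var remainder = 0;
  for (var j = 0; j < digits.length; j++) {
    remainder = (remainder * 10 + parseInt(digits.charAt(j), 10)) % 97;
  }
  return remainder === 1;
}

console.log(isValidIban('GB82 WEST 1234 5698 7654 32')); // true (the well-known example IBAN)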

The current versions of the standard are ISO 13616-1:2007 and ISO 13616-2:2007.

See for more information: https://en.wikipedia.org/wiki/International_Bank_Account_Number
Yes
ISO 15022 (Metadata - Finance)

ISO 15022 is the precursor to (and superseded by) ISO 20022.

See for more information: https://www.iso20022.org/15022/uhb
ISO 17442 (Business)

The International Organization for Standardization (ISO) 17442 standard defines a set of attributes or legal entity reference data that are the most essential elements of identification. The Legal Entity Identifier (LEI) code itself is neutral, with no embedded intelligence or country codes that could create unnecessary complexity for users.

Four key principles underlie the LEI:

  • It is a global standard.
  • A single, unique identifier is assigned to each legal entity.
  • It is supported by high data quality.
  • It is a public good, available free of charge to all users.
Once a legal entity has obtained an LEI, it will be published together with the related LEI reference data by the organization that has issued the LEI. This means the full data on the entire LEI population is publicly available for unrestricted use by any interested party at all times, including the set of reference data for each LEI code.

The LEI code is structured as follows:

  • Characters 1-4: Prefix used to ensure the uniqueness among codes from LEI issuers (Local Operating Units or LOUs).
  • Characters 5-18: Entity-specific part of the code generated and assigned by LOUs according to transparent, sound and robust allocation policies. As required by ISO 17442, it contains no embedded intelligence.
  • Characters 19-20: Two check digits as described in the ISO 17442 standard.
The current version of the standard is ISO 17442:2012.

See for more information: https://www.iso.org/standard/59771.html or visit the homepage of the registrar
Yes
ISO 18774 (Finance)

ISO 18774 defines the Financial Instrument Short Name (FISN): it standardizes short names and descriptions for financial instruments. The standard was approved in September 2014.

As of July 1 2017, the FISN will be globally assigned concurrently with the ISIN (ISO 6166) and CFI (ISO 10962) at the time of issuance of a new financial instrument.

The ISO 18774 standard incorporates the issuer short name and the abbreviated characteristics for the financial instrument. It has a maximum length of 35 alphanumeric characters.

Unlike other ISO-standard financial instrument identification codes, the FISN is not meant to be machine-readable, but to provide a short format for essential information about a security for human use.

See for more information: http://www.anna-web.org/standards/fisn-iso-18774/
ISO 19773 (Metadata)

This International Standard specifies small modules of data that can be used or reused in applications. These modules have been extracted from ISO/IEC 11179-3, ISO/IEC 19763, and OASIS EBXML, and have been refined further. These modules are intended to harmonize with current and future versions of the ISO/IEC 11179 series and the ISO/IEC 19763 series.

The standard includes, amongst others:

  • a data structure for UPU postal data
  • a data structure for ITU-T E.164 phone number data
The current version of the standard is ISO/IEC 19773:2011.

See for more information: https://www.iso.org/standard/41769.html
ISO 20022 (Metadata - Finance)

ISO 20022 is an ISO standard for electronic data interchange between financial institutions. It describes a metadata repository containing descriptions of messages and business processes, and a maintenance process for the repository content. The standard covers financial information transferred between financial institutions that includes payment transactions, securities trading and settlement information, credit and debit card transactions and other financial information.

The repository contains a huge amount of financial services metadata that has been shared and standardized across the industry. The metadata is stored in UML models with a special ISO 20022 UML Profile. Underlying all of this is the ISO 20022 meta-model - a model of the models. The UML profile is the meta-model transformed into UML. The metadata is transformed into the syntax of messages used in financial networks. The first syntax supported for messages was XML Schema.

The standard contains a number of external reference code lists, which are available on the website in the form of spreadsheets and documentation. The data dictionary present in ISO 15022 is no longer available as a spreadsheet, but can be downloaded as a 96 MB XML file.

See for more information: https://www.iso20022.org/
ISO 20275 (Finance)

The ISO 20275 standard defines Entity Legal Form (ELF) worldwide. The latest (and first) version is ISO 20275:2017 (en). It covers the legal forms available per country (or country grouping), as long as that country has an ISO 3166-1 alpha-2 code.

The standard can be obtained from ISO, but the code list itself is maintained by the Global Legal Entity Identifier Foundation (GLEIF) and can be obtained from the GLEIF website.

It is interesting to note that where the AnaCredit list goes slightly off the rails with the European legal form "Societas Europaea", this list solves it more elegantly by repeating the SE legal form for every country involved. Although using EU is actually allowed by ISO 3166-1, as it is an alpha-2 code reserved for the special use of the European Union, repeating the form per country is cleaner because you now only deal with countries. Since the AnaCredit list goes off the rails in more ways than this, you may want to use this ISO standard as your main reference data set and add a mapping to the ECB's rather ragtag list of legal forms.
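
If you go that route, the mapping itself can be as simple as a lookup table from AnaCredit legal form codes to the corresponding ELF codes. The sketch below is purely hypothetical: both the AnaCredit keys and the four-character ELF values are placeholders, not entries copied from either list.

// Hypothetical sketch of a mapping from AnaCredit legal form codes to ISO 20275 ELF codes.
// All codes below are placeholders, not real values from either code list.
var anaCreditToElf = {
  'NL_BV': '0000',   // Dutch besloten vennootschap, placeholder ELF code
  'DE_GMBH': '0001', // German GmbH, placeholder ELF code
  'EU_SE': '0002'    // Societas Europaea; ISO 20275 repeats this form per country, so a real mapping also needs the country
};

function toElfCode(anaCreditCode) {
  return anaCreditToElf[anaCreditCode] || null; // null means: no mapping yet, needs review
}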

Currently the list contains legal forms for 55 countries; jurisdictions that are still missing have to be added through a feedback form.

The ELF for companies in the LEI register has been made public as of March 1st, 2018.

Web development in 2017 – A Journey part II: VSCode and Git

The second part in this series is about which tooling you need nowadays and how to install it. No more Notepad, I’m afraid…

Please note: work in progress. This article will be updated to reflect new insights.

To install

 
Remember the previous article? Well, here are the first two tools to install. Just click on the link to jump to the relevant section if you’re impatient:

Make is no longer on the list, because Make isn’t worth the bother of installing Cygwin. Yes, if you do not install Git you need Cygwin for the GNU Core Utilities, but since Git ships with its own version of them, all the hassle you have to go through to get Git and Cygwin working together just isn’t worth it. However, for those foolhardy enough to want to experiment, I’ll explain how to run them side by side in a separate post.

Node.js and related tooling will be discussed in the next post. First, we discuss VS Code and Git.

 

Visual Studio Code

First, download and install Visual Studio Code. Just click on “Download for Windows (stable build)” and install the download. If for some arcane reason you’re not working on a Windows system, just click on the dropdown icon next to the download button and select a relevant installer or package.

During the installation of the fairly small (37.2 MB) package, you eventually get a number of questions that ask whether to add context menus, register the editor for the relevant file types and add the path to Code to the Windows path. Something similar will likely happen on other platforms. My urgent advice is to check all the boxes, unless you already have another development IDE installed (such as Visual Studio). Even then I’d register Code for everything, restart the other IDE afterwards and re-register the file types you want to keep associated with that IDE. Or don’t register anything and do it manually later. I just register the editor for everything, because nothing is as annoying as clicking a .js file and starting an interpreter or, worse, Visual Studio itself.

Once installed, verify that all is working by starting Code. If all went well, this can be done from the command line, shell or whatever you use. Type “code” and the editor should start.

It is possible that you get a warning about a TypeScript compiler. In that case, install the correct TypeScript compiler via npm (using the indicated version) with the command “npm install -g typescript@2.3.2”. This installs version 2.3.2; replace the version number if Code needs a different one. If an older version of TypeScript is already installed, you can remove it with “npm uninstall -g typescript”.

But we will assume that Code starts just fine. In that case we will first set our preferences. Go to File/Preferences and select the Color Theme (“Tomorrow Night Blue” for me) and File Icon Theme (I use the VSCode Icons but Seti is the popular choice and it’s easy to see why). Just select File / Open Folder and open a folder with source code to check what your icons look like.

Then, we add extensions. Open them with the menu on the left side, or with Ctrl-Shift-X. I installed the following extensions:

  • GitLens

    GitLens helps you to easily see who changed what in your source code and also gives you some graphical information on Git commits. It gives you the ability to open older versions of a file, open the same file on GitHub, or compare versions. It shows commits and annotations and a whole host of other items, more even than I currently know. So just install it.

  • Gitignore
    This plugin downloads .gitignore files for your specific type of project. Very helpful, but usually only needed once.

  • Language-stylus

    If you use Stylus for CSS, this add-in makes sure you get syntax coloring and checking.

  • Auto-Open Markdown Preview
    A very useful extension that automatically opens the preview of any Markdown file. Especially useful when editing open-source packages, which almost always require a README.md file.

  • ESLint

    ESLint promotes best practices and syntax checking, is very flexible and can include your own rules. However, I found it pretty annoying to set up and get working without a gazillion errors (or none at all). If you go this route, it’s best to follow the instructions on the website. It’s really quite good, but JSHint works out of the box, and ESLint doesn’t provide much value without changing the configuration file. See https://github.com/feross/eslint-config-standard/blob/master/eslintrc.json for an (overly complex) example. That said, it’s rapidly becoming THE linting tool of choice, so for future-proofing it might be your best bet.

    “To sum up, JSHint can offer a solid set of basic rules and relatively fast execution, whereas ESLint offers all that one can want from a linter and then some more as long as he’s willing to put an effort in setting it up.” – Quora

    An alternative to ESLint is JSHint. This will give very good warnings about JavaScript issues in your code. However, you will also need to install the npm module (we’ll get to npm later) with the command “npm install -g jshint”, which installs the actual syntax checker globally as a command-line tool. It can also be installed per project; see the website for more details.
    When using it, enable strict mode by inserting the following directive as the first line of your functions (or of the whole file):
    'use strict';
    If you use import and export statements, for instance in d3 plugins, put
    /*jshint esversion: 6 */
    as the first line of that JavaScript file.

    If you use JSHint, you’d better add the JSHint default config extension as well: using the command palette in VS Code (Ctrl+Shift+P) you can type “generate” and then generate a JSHint configuration file; a sample of what that configuration can look like follows below. Very nifty!
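
For reference, the generated configuration is a small JSON file (.jshintrc) in the root of your project. A minimal sketch could look like the one below; the option names are standard JSHint options, but tune them to your own project (and drop the comments if your JSHint version does not accept them):

{
  "esversion": 6,      // allow ES6 syntax such as import/export
  "undef": true,       // warn about the use of undeclared variables
  "unused": true,      // warn about variables that are never used
  "browser": true,     // predefine browser globals such as document and window
  "node": true,        // predefine Node.js globals such as require and module
  "jquery": true       // predefine the jQuery globals $ and jQuery
}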

That’s it for the VS Code plugins. If you need more plugins, visit the marketplace and type “@recommended” to see recommended plugins.

 

Git


Phew. Git. The mammoth of version control. If you need documentation, here is an extremely nice and well-done tutorial. I’m just going to put down some basic points and then leave this topic alone.

First, download and install Git. It installs itself as a command-line tool and comes with its own shell. If you’re into Unix shells, Git BASH is nice, and compatible with many open-source projects out there that use shell commands in their task runners (like Make). Personally I just use CMD from Windows, or PowerShell for special projects. Whatever you choose, after installing Git you have access to an updated version of the GNU core utilities, giving you tools such as wc, grep, cat, less, vi, rm -rf, and many more.

Each project has its own repository, because Git works per repository (and separating them prevents accidents). Creating one is easy: just type “git init” on a command line in the folder you want to have in Git. Git will create a subdirectory where it stores the repository. With a .gitignore file you can tell Git to ignore files and folders. The syntax for that file is all over the web, but for Firebase projects this is my .gitignore:

# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage

# nyc test coverage
.nyc_output

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (http://nodejs.org/api/addons.html)
build/Release

# Dependency directories
functions/
/node_modules/
jspm_packages/

# Typescript v1 declaration files
typings/

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variables file
.env

# firebase stuff
*.firebaserc

You can also get this from the Gitignore plugin in VS Code. Remember: .gitignore goes into the root folder of your project, not into the .git repository folder…

Git commands

There are some very good Git manuals out there. A nice PDF to have is the Atlassian Git cheat sheet. Atlassian also has a list of basic Git commands.
I recommend reading at least the basic manual if you haven’t worked with Git before, otherwise it will be difficult to understand what’s happening.
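
To give a feel for it, the everyday cycle boils down to just a few commands. A simplified sequence looks roughly like this (assuming you have already run “git init”; the repository URL is a placeholder you replace with your own):

git status                              # show which files changed since the last commit
git add .                               # stage all changed files (honours .gitignore)
git commit -m "Describe the change"     # record a snapshot in the local repository
git log --oneline                       # review the commit history
git remote add origin <repository-url>  # link to a remote repository once (replace <repository-url>)
git push -u origin master               # publish the commits to the remote (in 2017 the default branch is still master)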

GitKraken

Something that will make Git easier to use is GitKraken. Once downloaded and installed, you can use this tool to visualize the Git branches and maintain them. For instance, you can combine a large number of commits into one single commit, to make your commit history much clearer. You can also operate on branches and create pull requests for open source software. In general, once you get into publishing open source software on GitHub, you really want to use this. Yes, you can do everything on the commandline, but it’s a hassle and GitKraken makes it much easier. You will still need to know the Git commands in order to use GitKraken, though.

Web Development in 2017 – A Journey part I

A few weeks ago, after publishing my Collatz calculator, I decided I was going to develop a small web application to practice modern web development. And along the way I quickly discovered that a WebApp in 2017 is not nearly the same as that same WebApp in 2007, or even 2015.

Please note: work in progress. This article will be updated to reflect new insights.

Choices, choices… and Firebase

Web Development in 2017 is not your father’s web development anymore. For one thing, it’s now completely dominated by JavaScript. There are applications where even the static HTML and CSS are rendered through three different frameworks. This even extends to the backend with Node.js – even though it may not be optimal for your needs.

But that’s just the beginning. There are a ton of choices to be made – we are drowning in tools and frameworks, a surfeit of riches in fact. So we have to make choices fast and be prepared to change things around as we gain experience with the choices we made. The important part is to try and not paint ourselves into a corner. But… there are some choices that will have quite an impact.

The biggest impact is created by something I wanted to try for a while now, which is having my back-end not only hosted by another party, but developed and maintained as well: this is called Backend-As-A-Service (BaaS). With BaaS you don’t host your own back-end. You don’t even host it somewhere else. No, someone else is hosting a back-end for you somewhere, and you are allowed to use its standard functionality. This will usually include authorization and storage.


Facebook used to have a very nice BaaS-solution called Parse, but that one has been shut down because it was no longer part of Facebook’s strategy. They did offer a migration path to the open-sourced version though, and you can deploy that server on AWS or Heroku, so it is still a viable option. But I chose to go with a different platform.

Google is still in the BaaS business, with an offering called Firebase. I’m not going to detail Firebase, because extensive documentation is available on the Firebase website. I will however say that, just like Parse, it has (amongst others) the following functions:

  • Authentication
  • Database (with authorization rules)
  • Filestore
  • Message queues
  • Events

In the beginning I will limit myself to the use of Authentication and the Database.
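
To give an idea of what using just these two parts looks like, here is a rough sketch against the Firebase web SDK as it was in 2017. The configuration values, the credentials and the “scores” database path are all placeholders for the example; the real config comes from the Firebase console.

// Rough sketch of Authentication + Database with the Firebase web SDK (v3/v4 era).
// The config object below is a placeholder; copy the real values from the Firebase console.
firebase.initializeApp({
  apiKey: '<your-api-key>',
  authDomain: '<your-project>.firebaseapp.com',
  databaseURL: 'https://<your-project>.firebaseio.com'
});

// Sign in and, once authenticated, listen to a database path (example path).
firebase.auth().signInWithEmailAndPassword('user@example.com', 'secret')
  .catch(function (error) {
    console.error('Sign-in failed:', error.message);
  });

firebase.auth().onAuthStateChanged(function (user) {
  if (user) {
    firebase.database().ref('scores/' + user.uid).on('value', function (snapshot) {
      console.log('Current score:', snapshot.val());
    });
  }
});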

Having made the choice for Firebase, we are now stuck with some other choices as well. Developing for the web in 2017 needs suitable tooling, and you cannot just buy Visual Studio and expect it to work. Firebase is based on Node.js, JavaScript and Web APIs. You need suitable tooling for that.

JavaScript, Typescript and ES6 compliancy

Funny as it sounds, we have to discuss the language we use first. We can choose TypeScript or JavaScript, and in JS we can choose ES6/ES2015 or ES5. The ES stands for ECMAScript, which is the actual name of JavaScript, but no one calls it ECMAScript. If that sounds confusing it’s because it is, but here is a good explanation.

Typescript?

TypeScript is nice. It checks your data types at compile time, which prevents bugs. If you do back-end development in JavaScript, you should probably use TypeScript. But it also adds another compiler to an already unholy mess of libraries and supporting crutches. And it opens the door for things like the aptly named Babel: before you know it, you are targeting ES6 JavaScript and you need *another* compiler, Babel, just because you wanted to use templating and arrow functions. By then you have a lot of work to do before you can actually display “Hello, world!” on a page, and getting that investment back is pretty hard on small applications. So I avoid TypeScript for now.

ES6/ES2015?

ES6 gives us all kinds of nice things like arrow functions, templating, and a whole array of syntactic sugar. But using it means using Babel to first compile the ES6 JavaScript into ES5, which can be understood by browsers and Node.js. It’s a dependency I can do without.

So my choice is standard ES5 JavaScript. Having settled that, we move on to the tooling to support this choice.

Mandatory tooling

Some tools are so important that doing without them means drastically reducing, or even completely negating, your development speed or your ability to develop anything at all. These are what I call the mandatory tools, like a compiler and IDE for C++ development.

Text editor

So, you’re going to write a webpage. vi, Notepad or Notepad++ is what I used back in 1997. Actually, in 2015 as well. For application development in 2017 the choices are different though: it will be either Visual Studio Code, Atom or Sublime (paid software). Sure, you can try alternatives (try typing “code text editor” in Google for a pretty neat custom list), but chances are it’s one of those three. They all have integrated support for Git, syntax checking and coloring, starting terminals and debuggers from the editor, and extensive customization and plug-ins. I believe VS Code has about 12000 plugins and the other two are not far behind.

Node.js

You may not consider this a tool, but once you install Node.js you also get Node Package Manager and access to a gazillion extremely useful other packages. Like CSS precompilers, plugins, Firebase deployment software, task managers, source control software, and of course Node.js itself: a pretty powerful backend webserver. Install it, because frontend development in 2017 is impossible without it. And all web development is made easier once you have it.
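
Once installed, you can verify everything and pull in your first tooling with a few commands (firebase-tools is the npm package for the Firebase CLI; replace it with whatever you need):

node -v                        # print the installed Node.js version
npm -v                         # print the installed npm version
npm init -y                    # create a default package.json for a new project
npm install -g firebase-tools  # example: install the Firebase CLI globally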

Source Code Control System

Once you develop, you need source code control. You literally can’t do without. And while I do not particularly *like* Git, it’s so ubiquitous and integrated in almost any toolset nowadays, you have to have a really pressing argument to use something like Mercurial (which I like a lot better than Git, but sadly had to let go along with my teddybear when I grew up). Let’s not discuss TFS or Subversion – they’re dead except for special use cases. Web development is not a special use case. So, install Git.

Front-End JavaScript Development Libraries

One word: jQuery. Whatever you do, you probably want to include it. A lot of frameworks include it out of the box, but if they don’t, you still want it: tons of utility functions, loads of functions for manipulating the DOM, and they work as fast as the current browser will allow, without you having to worry about which browser you run on. Absolutely essential for fast development.

CSS Framework

To make page layout easier, you can use a library that gives you easy ways of making a page responsive to where and on what media it runs. This might look like an optional choice, but given the number of different browsers and mobile devices nowadays, it is quite impossible to hand-code everything for every platform yourself. That’s why choosing one of these frameworks is a must.

The classic package for this was bootstrap.js, but you can also choose foundation.js. They both provide widgets such as buttons, sliders, cards, dropdowns, tabs etcetera but also responsive and columnar layout, and often styling as well. Bootstrap is the most used and best supported library, but Foundation is a strong contender. Currently I will go with Foundation.

It is noteworthy that both options support Google’s new vision on how to design for the modern web, called Material Design: a design philosophy that ties the styling of all components you can use on the web together. Google has changed all its applications over to this design, and also has its own implementation to showcase how this works, called Material Design Lite. This can be used as a lightweight layout framework, but it is limited in application and styles. Since it is simple to use and looks very good, however, it is becoming quite popular. You can see it in action on the standard login screen of Firebase applications that use the default UI. For now, I go with Foundation when I need layout, because Material Design Lite is a bit *too* simple.

Optional tooling

There are also some tools that you can live without, but that have the potential to make your life a lot easier.

CSS precompiler

A CSS precompiler lets you write your styles in a slightly different language that compiles down to CSS; the source is smaller and easier to understand than the plain CSS it generates. If you have just one small stylesheet, you can do without. But once your styles get more complex, a CSS precompiler is very helpful. They provide loops, conditionals, functions for cross-browser compatibility and usually more readable styles. Choices here are Less, SASS and Stylus. All can be installed using NPM. Personally, I think Stylus provides the best and cleanest syntax, so I have chosen Stylus.
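
As a small illustration of why I like the Stylus syntax, the sketch below compiles a few lines of Stylus from Node with the stylus npm package (assuming “npm install stylus” has been run); note that the source needs no braces, colons or semicolons.

// Compile a small piece of Stylus to CSS from Node (assumes the stylus package is installed).
var stylus = require('stylus');

var source = [
  'brand = #2a5db0',
  '.button',
  '  color white',
  '  background brand',
  '  &:hover',
  '    background darken(brand, 10%)'
].join('\n');

stylus.render(source, {}, function (err, css) {
  if (err) throw err;
  console.log(css); // plain CSS, with the nesting and the variable resolved
});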

Task runner

A task runner is software that can take care of the precompiler step, then combine files as needed, minify them, uglify them, upload them to the server, and open and refresh a browser window. While this can be done with (GNU) Make or Node Package Manager scripts, it’s easier to do in tools like Grunt and Gulp. Tools like Bower and Webpack serve slightly different purposes, like combining files into one big JS include, but with HTTP/2 this may actually hurt performance more than it helps. This means there is a whole zoo of task runners and no clear winner in sight.

At the moment I use GNU Make (from the Cygwin project) to compile Stylus files and to deploy and run Firebase. NPM scripting wasn’t powerful enough without serious JavaScript coding, so I can’t recommend it. And yes, what I do could all be done by just starting the tools with the right options, but I find Make easier to use. Should I discover that I need something more powerful, I’ll try that and update this section.

Even more optionally optional tooling

And then we have the section with tools you don’t want or need to install unless you suddenly have a pressing need. And even then you should reconsider this until you have run out of alternatives. For most applications these are overkill. Come back when you are supporting something as complex as Facebook.

JavaScript libraries

jQuery is often combined with Underscore.js or Lodash.js for utility functions. Lodash seems to be faster and more agile. However, I consider it an optional library and you can choose whichever you like.

Another potentially useful library is Immutable.js. This provides you with enforced immutable data structures, which eliminate accidental side effects from functions, preventing errors and improving performance. However, I don’t use it currently.

Testing Frameworks

Mocha and Chai are frameworks that provide you with the ability to do easy unit and integration testing, with good reporting. However, I’m not developing a library used by dozens of people. And neither is my game in any way going to be mission critical for anyone. So while breaking things while fixing others does look unprofessional, I can live with that for now. My application will likely remain small and easy to bugfix, so I am not going to invest in these frameworks at this time.

Templating Libraries

Templating libraries help us with HTML templates that we can fill with data. Very useful if you want to display a list, for instance. However, I will skip this subject for now. Mustache.js and Handlebars.js are great libraries for this, but we can already get quite far with plain jQuery. If we ever get to a framework like Angular2.js, React.js or Vue.js, things will have to change again anyway. For now, I think jQuery will be fine. For more information, you may want to look into this overview.
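
By way of illustration, what I mean by getting quite far with plain jQuery is simply building small fragments from data, roughly like the sketch below (the data and the element id are made up for the example):

// Render a list from data with plain jQuery (hypothetical data and element id).
var highScores = [
  { name: 'Alice', score: 42 },
  { name: 'Bob', score: 27 }
];

var $list = $('#high-scores').empty();
$.each(highScores, function (index, entry) {
  $('<li>')
    .text(entry.name + ': ' + entry.score) // .text() also escapes the data for us
    .appendTo($list);
});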

JavaScript Frameworks

I haven’t yet discussed the elephant(s) in the room: Angular2.js, React.js and Vue.js. These are very popular frameworks that bring you everything from design to state machines, and the kitchen sink as well. The choice, however, can be difficult. I have not yet decided whether to actually use one, because it’s probably overkill for my needs. I do not currently intend to build a Single Page Application. However, it may well turn out to be a better option than building a lot of separate pages. In that case I intend to go with Vue.js. This is because Angular2.js has a Model-View-Controller architecture that I don’t think meshes particularly well with my application or Firebase. I’m much happier with a Model-View-ViewModel type of architecture with one-way data binding (updates flow from the model to the view, not vice versa). This would mean either React or Vue, since both support the Flux architecture with Redux and Vuex. React is a bit heavier than Vue and renders the HTML from JavaScript, which is something I’m not particularly fond of, so if it comes down to it, I’ll go with Vue. For now though, I will stick with Foundation and jQuery for layout and templating.

My choices

As this is a journey, I’m going to travel a bit. Currently I have packed the following tools for my journey:

  • Development environment: Node.js (+ Node Package Manager) + Cygwin on Windows
  • Language: JavaScript/ES5
  • Text editor: Visual Studio Code
  • Back-end: Firebase
  • Source code control system: Git
  • Front-End JavaScript Library: jQuery
  • CSS pre-compiler: Stylus
  • CSS layout framework: Foundation.js
  • Task runner: Make

That concludes my first post in the journey for now. My second post will detail my setup, including installation and configuration.

How to learn JavaScript

I’ve been busy with JavaScript for some time now – with varying degrees of success – and I thought it would be nice to list a few resources that I found both quite helpful and accessible.

Highly recommended, but not used by me because I only found out about it after the fact:

Once you know a bit more about JavaScript (or ECMAScript, as it is properly called) you probably want to use it in something interesting. I’ve built a few things with the JavaScript graphics library D3 that give immediate results in just a few lines of code, which is a great motivator.

If you have any suggestions for improvements or additions, feel free to let me know in the comments!