Category Archives: Publications

Articles, posters etc. in PDF format or otherwise, usually published in other media (such as print).

Legal form: a modelling journey, part II

If we want to understand more about how to model the entity legal form appropriately, we need to understand the stakeholders and various other aspects surrounding the legal form. There are several types of stakeholders involved in the administration of the legal form of an entity, and concerns for each may be different. These will be described here.

The company or legal entity itself

The first stakeholder for the legal form of a given legal entity, is the company/legal entity itself. The legal entity wants to have a very clear view of what its own legal form was at any given time. It also wants to communicate its current legal form to other stakeholders so they can estimate the risks attached to doing business and then proceed from there. Nobody wants to do business with a legal entity when you cannot tell if you have any recourse if you never get paid or if paid items are never delivered.

However, nobody will take the company’s word for their legal form, so you need an independent party, preferably embedded in a legal framework, to vouch for your current legal form. That party is called the registrar.

The registrar

The registrar of legal entities and legal forms for a given region is concerned with making sure that there is an official administration where you can examine the legal form of any given legal entity without having to rely on their word or examine the legal papers of their incorporation. Basically, the registrar wants businesses to be clear on each others legal form so they can do business with each other.

Registrars can be local, national or even international, such as the Global Legal Entity Identifier Foundation (GLEIF). The closer the registrar is located to the company, the more reliable will the registration be, in general. Often, the legal framework mandates a single registrar to be responsible for the registration of the company’s legal entity and legal form.

We enable smarter, less costly and more reliable decisions about who to do business with.
– Global Legal Entity Identifier Fondation


Banks and financial institutions

Banks and financial institutions are particularly interested in the legal form of any company that they are involved with, especially if they are providing a loan in any form to the legal entity. In the European Union, they have to report to the regulatory authorities on the loan, as well as on the counterparties, due to the fact that the bank now has a counterparty risk. Weighing those risks on a national and even systemwide level is the task of the central banks. The legal form is necessary if you want to estimate how much money you can recover in case of failure: can you recover from only the entity involved, or can you also recover money from the shareholders or owners?

Other vendors

Apart from financial services, a company may also contract other vendors for products or services. Most of the business of any company is done with these parties.

Vendors typically want to know the same thing as a bank, but are not usually required to report to regulatory authorities. They are mostly interested in the current legal form, because once the goods are delivered, the service is rendered or payments received, the interest in the legal form of the counterparty ends as well. Vendors are always other legal entities with their own legal form, and this means the company has an interest in their legal form as well.

Clients

Clients can be natural persons or legal entities. Natural persons are usually not overly concerned with the legal form of the company, as they are normally protected by consumer laws and the monetary amounts involved make it hard to recoup losses in court. However, clients can also be other large companies and they will certainly want to recoup any losses, for instance in case of malpractice or fraud. This means that they are very similar to a vendor in their interest in the current legal form of the company.

Tax authorities

Many legal forms have tax implications. When moving from one legal form to another, there can be tax implications as well. You can, for instance, have a “quiet” transfer from one form to another, or a “noisy” transition, where you pay everything off and basically start fresh. But even in setting up a company with a given legal form, there are often tax rules you have to follow. For instance, in the Netherlands the director of a limited liability company must have a certain minimum wage which must be approved by the tax authorities.

Other government agencies

While in many countries the legal forms are quite generic, there are also countries where certain professions or specific types of company have their own legal form. An example would be “trader” or “farm operator” in France (read more interesting details on French legal forms here). One can imagine that certain legal forms (like “chemical factory”) would come with a certain amount of paperwork and various stakeholders that would like to know more.

Local versus national versus international stakeholders

In a local environment (usually on the level of the ISO 3166-2 country subdivision such as a province or state), the legal form is known to both sides, as well as the approximate risks and rights that come with it. This usually (but perhaps not always) translates to the national legal framework in a way that makes a legal form in one area legally consistent with a similar legal form in another area, even if they have different names.

This does not always work however. The USA is well known for the way in which different states have set up legal forms with some very specific rights and obligations, such as in Delaware, which is known as a tax haven.

In any case, this translation breaks down whenever you cross national boundaries into other legal frameworks. To combat this inside the European Union, the EU has created a number of legal forms that are implemented exactly the same in any national legal framework, such as for instance the Societas Europaea (SE), the Societas Europaea Cooperativa (SCE) and the European Grouping of Territorial Cooperation (EGTC). They are defined in European law. But this mechanism does not extend to other legal frameworks, such as the one in the USA or China.

Note that it is possible to have a legal entity in one country with a given legal form, whih has a local company branch with a local legal form, that can be subtly different from the main one. The complexities of offices, branches and holdings are beyond the scope of this article however.

The next part

I hope you liked this part as well. In the next part, we are finally going to start modelling!

Legal form: a modelling journey, part I

Abstract vision of a legal form for a company, in a landscape of country and language.

The AnaCredit regulation is an interesting regulation. Having closely worked with the Dutch modeller who drafted the logical model for it at the DNB, it’s one of my favorite models to demonstrate. It compresses literally hundreds of pages of rules and regulations in a single model. To the dismay of most banks when they first saw it, but eventually to the benefit of all of its users.

However, one issue has always been a bit of a pain. And that is the concept of legal form. You are required to report it for the counterparties involved in loans. But at the time the regulation started, some banks had not used foreign legal forms but had mapped them onto their own home countries versions. Others had incomplete registrations. And most of them had issues with languages: in Belgium you can register a company in three different locations, with three different names for the legal form, but they are all the same. Even worse, you can potentially have the situation that you have the same code in different national languages but they mean something different. This means that using the legal form code without any other descriptive attributes is a problem.

Legal form is a bit of a weird duck in a way. Everyone knows it. Everyone uses it. But when asked to describe what it is, you get wildly varying answers. Obviously, the ECB doesn’t know what it is either, or they would not have made the list they use now (you can check out the ECB List of legal forms yourself).

Definition

On the internet it’s not that easy to find a definition of legal form. The ones I can find come close to the joke about an elephant, described by a number of blind people: they describe the visible attributes but not the core, the “ding an sich”. See for instance this definition: while it starts good with “the legal form (also known as legal status) is defined according to national legislation” it then goes on and on about its use.

So what is it then? The best I can come up with, is that the legal form under which a company operates, is defined in the national legislation. The name of the legal form and the abbreviation are sometimes given in that legislation, but not always. In the end, any legal form code or name is a shorthand for a referral to a particular paragraph in the national law of any given country or group of countries such as the EU.

For instance, in Dutch law we have a limited liability company. This is called a “besloten vennootschap” and abbreviated as “BV”. This is defined in het Burgerlijk Wetboek (civil law book), book nr. 2, article 175. This says that any BV is a legal person, with named shares, where the shareholders are not liable for any losses over the amount of capital they have put into the company. And that is the basic definition of this particular legal form (although there are more articles describing this legal form in more detail).

Do not make the mistake of assuming that this particular legal form is the same as the limited liability company in other countries: liabilities can be limited in very different ways and can be undone on very different grounds. That is why you really need to include the country when you define legal form.

A legal form is also unique not just by country but by official language in that country. In Belgium, as you can see in the ECB List of legal forms, they have a legal form that has three names (and corresponding acronyms):

  • Unité TVA – UTVA (French)
  • BTW-eenheid – BTWE (Dutch)
  • Mehrwertsteuereinheit – MWSE (German)

This particular legal form has a single surrogate key, meaning that to the ECB, it is just one single legal form. So far, so good though.

Complications

Now, so far things are relatively simple. But they’re getting a bit more difficult when you realize that some EU countries have no standardized set of legal forms. Try finding the ones for Portugal, for instance. On the official government website you can find nine legal forms. But the Anacredit list has twenty of them, and one even has no acronym. The thing the website does well, though, is that it refers you to the exact paragraph in the law that describes and regulates the legal form. Portugal is an example, but not an exception.

Things get funnier when you add the ISO standard into the mix. Yes, there is an official ISO standard (ISO 20275:2017 – Financial Services – Entity Legal Form) for legal form, because the mess has not escaped the attention of the standardization committee. The data model is described in the standard, but there is also a registrar for worldwide legal forms, the Global Legal Entity Identifier Foundation (GLEIF). I’ve described this standard in an older post.

Some of the differences between the lists are:

  • The GLEIF list contains legal forms on a country subdivision code level (ISO 3166-2), used in for instance the USA and Canada, where the ECB list does not. This becomes interesting in the case of Madeira, which is subdivided in the GLEIF list but part of Portugal in the ECB list. Where the GLEIF lists two legal forms for Madeira, none are mentioned in the ECB list;
  • The GLEIF list contains 31 legal forms for Portugal, the ECB list only has 20, and the Portuguese government lists 9. Have fun trying to find out which one is valid for your case. Portugal is just an example here, the deviations are similar for many countries;
  • The local name is nice, but do you also want it in the local script? This becomes a relevant question for cyrillic names in Europe, and for other non-Western scripts such as Chinese and Arabic that may not even be read from left to right. The GLEIF lists the local name in the local script and provides a transcription to the Western script as well. But the ECB list does not. If you are lucky you can get them from the GLEIF list, and otherwise, well, there’s always ChatGPT;
  • The GLEIF list sometimes lists an abbreviation, sometimes not. The abbreviations can be in the local script, in that case there are usually, but not always, transliterations into the Western script.
  • The GLEIF list sometimes lists multiple versions of abbreviations for the same legal form for the same language for the same country. The ECB list only does that for different languages in the same country. Good luck matching them up.

So how do we get out of this mess? Well, we don’t 🙂 The mess will remain. But you need to understand for what purpose you need the legal form of a legal entity. The main purpose is to determine the amount of risk and exposure you are taking on if the legal entity in question is your counterparty. If you do business globally, this matter becomes more urgent. A secondary purpose is to report this to the relevant authorities, notably the ECB with the AnaCredit reporting requirement, who does the same but on a higher (aggregate) level, except that exact legal forms are limited to European counterparties, and for global ones you are allowed to approximate them.

We can create a data model that will enable us to fulfill most of these requirements. Given the data issues I doubt it is possible to get full coverage for all countries and all legal forms, but we can certainly do much better than just add the nearest acronym to a legal entity, and hope for the best. How to do this data model is a subject for the next post, however.

Some thoughts on the Persistent Staging Area

Several people have over the last few years advocated for a persistent staging area (PSA) as part of a data warehouse (DWH), for instance in this article by Roelant Vos. Roelant is the most notable advocate of this approach so I will use his article as the main argument in favor of a PSA. rightly says that some people (vehemently) disagree with this, as it introduces a second data warehouse. However, there are more reasons why I personally think that it *may* be a good idea in some situations, but a pretty bad idea if you apply it to every data warehouse.

So what *is* a persistent staging area? Simply put, it is your staging area, except you make all data historical, and never throw anything away. This has several benefits, as described in Roelant’s article:
1 – It allows you to re-create and fix the central Data Vault or Kimball model in case of issues.
2 – You can have an independent verification of what’s happening in the source systems.
3 – It allows you to start right away without having a central model in place, deferring (some) modelling, architectural, and design decisions to a later stage while still capturing history from day one.
4 – You can virtualize your data warehouse on top of the PSA.

However, I disagree that these are benefits. They are mainly stopgaps for issues you should have mitigated in another way, or issues arising from applying methods in a way that I would not do (and therefore, would not have the same problems with).

Let’s first take a look at the point that you can already capture history that you might need later on, and start right away. The only reason for leaving out data is when the person who is responsible for the data (which can *never* be IT or a project) tells you the data is irrelevant. So implementing a PSA for this part means that you (a) don’t trust the business to get it right, (b) don’t make them responsible for their own problems and (c) the historical data you didn’t capture is so important that it needed to be captured when you came in (which is a holy date on the business calendar somehow? Why is that data important, but the data from two months earlier is not?) and (d) there are no options to restore a backup of that data and (e) the source itself already tossed the data out. The combination of these factors is both unlikely and also not the problem of the DWH to solve. So why a DWH business owner has to tack on additional expenses to mitigate this is beyond me. There is of course the gap between when you start the project, and when you can first load data after you have designed and architected the data warehouse, as well as modelled it. This bespoke approach to data warehousing is fast becoming a serious issue in the market, and as antiquated as a horse-and-carriage. Given proper data warehouse automation tooling, a database to store the DWH in, and access to sources, it is now possible to have a rudimentary DWH up and running in the first week. Yes, if you spend six months setting it up you have a window where you don’t have historical data over those six months, but I fail to see why the data from those six months are suddenly more relevant than the decade in lost data where there was no data warehouse in place at all. The whole anxiousness about losing data is a bit strange: you are NOT the person responsible for this, the business is. I suggest you ask the business about it first – the usual answer is “who cares”.

Second, the whole ability to recreate the central persistence layer of your DWH by having another copy in place is just strange. All data that is stored is always available. I have made mistakes as well, but it was always possible to restore the mistakes since the data was captured in the Data Vault. Sure, maybe you didn’t capture a table that had changes and now someone notices it. The question is: if they notice it after a year or so, how important was that data exactly? And why do you feel a need to implement a complete on-line copy of the entire data warehouse just to mitigate against the chance of that happening? Isn’t it better to focus on automated testing, use cases, and spending time with the business owners to discuss the data in depth? After all, we can only spend our time and money once.
You may hear the counter that adding another persistence layer in staging is a very simple thing to do, that will hardly cost anything. But this is only partially right and therefore wholly wrong. The up front cost is low, sure, as it adds only a simple mechanism to the staging layer ETL. The maintenance cost on the other hand is very high, as every change in the data in every source needs to be implemented in your staging layer while maintaining the older data as well. Doing this negates a large part of the argument for data vault in the first place, which was “resilience to change due to modeling the business instead of the sources”. The PSA is a throwback to the bad old days, in that respect, and adds a lot more work than you intend. Just imagine you add twenty sources to the DWH, that has a single business model (or in some banks I know, five). If each source changes once every year, you have more than one change a month, sometimes drastic ones, even when the business does not change, purely for technical reasons. But the central persistence layer only changes drastically when the business changes, which reduces the maintenance costs. A staging area that is not persistent can deal with a source change in a few hours. A persistent staging area either needs to get a new version that is maintained from then on, or a costly migration. The versioning strategy is relatively low cost, but begs the question “how online *is* your data actually, and why is this better than a tape backup?”.

Third, and last, is the virtualization argument. Yes, the argument is correct that if you virtualize your Data Vault, you are in trouble without a persistent data store. And then you’re left with the PSA as the only option. You could also just build a good data vault model, and have none of those issues. After all: if you need to store the data somewhere in every case, what use is virtualization then? The only reasons I see for virtualization are:
– you are prototyping the reporting area
or
– you are prototyping the persistent storage area
The reason for doing the first is obvious. But this can be done on top of a data vault with precisely the same effort. The reason for doing the second is less obvious, but still valid, as the business can often see what you mean in a model much better with actual data. However, you can do this on top of the source system directly if you use Denodo’s or TIBCO’s virtualization software. Even a single one-time restore of the source system into your DWH will work for prototyping just as well. So I don’t think this is a compelling argument.

You might ask if there is no situation where the PSA is useful. Of course not, the PSA *is* useful, just not as often as you think it is. If you have to run the older data in your persistent storage area side by side with realtime current data in a combined virtual solution you suddenly need to either integrate on the natural keys, or on the source system keys. If you integrate on a data vault now, you suddenly have more work as one part of your solution is persisted data in a format that is different from the realtime data. In this case a persistent staging area may be the best solution – for that particular part of the solution.

Another situation where the PSA might be useful is the case where your sources are very stable, but your business model is not. For instance, a startup that uses existing solutions as sold by established vendors. In that case, the persistent staging area would, for now, probably be the option that has the least maintenance and thus the right choice, because as a rule of thumb, as much as eighty percent of the cost of any data warehouse is maintenance. Reducing the necessity of maintenance efforts reduces the technical debt and cost and therefore increases the longevity of the solution.

In general, I find the persistent staging area to be a (misguided) technical solution to issues that are the responsibility of the business to fix on the one hand, and an overlarge drain on the maintenance budget on the other hand. There are situations where it may fit, but if you don’t have to use it, I suggest you should not use it.

N.B.: this is a braindump for now. I will probably write a more considered post later on.

Data warehouse database collation: case sensitive or not?

Various cases illustrated
Even today, database installation is not an entirely trivial task. Especially for data warehouse databases. All settings need to be considered before you start. Fortunately it gets a lot easier in the cloud, right? But where some Cloud vendors claim that their solution removes the need for extensive configuration of settings, what usually happens is that they provide you with a lot of defaults that you can no longer control. Below the waterline, the same settings can still play a role. A setting that can trip you up this way, is collation.

The database collation defines how the database sorts and compares text in the database. It determines how two strings compare with each other, are sorted, and when they are equal or not. Since data warehouses do a lot of key and attribute comparisons when loading data, this is a pretty big deal – much more so than in a standard OLTP system. Not all databases allow you to choose your own collation, but Microsoft SQL Server and Oracle do. On-premise, at least.

Which collation you choose is not a decision to make lightly. While it’s easy to accept the default settings, please note that once you install a database with a certain collation you cannot change it afterwards: you need to rebuild the database from scratch, and moving data from one to the other is easy when going from case insensitive (CI) to case sensitive (CS), but the reverse may be quite hard when “Jones” and “JONES” suddenly refer to the same person instead of two different ones.

Various databases handle collation in different ways. I’ll discuss the broader differences between Oracle and SQL Server here. If you have a different database you should check in advance how collation works for your server.

In Oracle, the metadata of the database is not case sensitive by default. Collation refers to the data, but not the metadata. This ensures that "SELECT * FROM myschema.table" is the same as "select * from MYSCHEMA.TABLE". However, the data in the tables *is* case-sensitive by default. You can change this on a query-by-query basis, and per session since Oracle 10g , but it’s rare to do so because there are some pretty horrendous implications for performance if you use the wrong settings. These default settings make a lot of sense for most applications, however.

Once you’re used to Oracle’s way of doing things, SQL Server will come as a bit of a shock. On SQL Server you can install the database to be either case insensitive, or case sensitive. You can fine-tune this collation even further to be sensitive to diacritical characters (accents) or not. The default settings for the database depend on the system locale settings: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms143508(v=sql.105). Microsoft recommends keeping the collation settings the same for all databases in the organisation, which is a wise recommendation. This recommendation fails to take into account the situation where the company has multiple database platforms, however, with one of them perhaps being an Oracle database. And even if you do not have on-premise databases other than SQL Server, you will still receive files from external vendors. These may come with their own expected collations, for instance when they identify records with a key that says that Microsoft and MICROSOFT are different companies. This is relatively easy to identify and deal with. It becomes a lot harder when you have hashed or encrypted fields where case turns out to be important. It may be a while before you spot the issue.

Going for a Cloud solution does not solve the issue. As indicated in https://azure.microsoft.com/en-us/blog/working-with-collations-in-sql-azure/, Azure sql server uses a default collation called "SQL_Latin1_General_CP1_CI_AS". That is, the western latin alphabet, case insensitive, but accent sensitive. If your own database was case sensitive, you are now in trouble.

Going case insensitive makes a lot of sense for a single application. It doesn’t have to deal with any other system than itself. If you need to import data, you can wrangle it in the desired shape. Users will probably expect that if they type “microsoft” it will also find “Microsoft” and “MICROSOFT”. But when you deal with a data warehouse, things change. You can and will receive input from various systems with their own ideas about the required collation, alphabets or treatment of accents. 

Case sensitive collations are also a requirement for any permanent data store with heterogeneous inputs (aka a data warehouse). They prevent hashed or encrypted fields from suddenly becoming “equal” when they are in fact different. When looking up hashed keys in a Data Vault 2.0 approach, this becomes a rather important point: Here the input to a hash function can be case sensitive, but the output can *also* be case sensitive. Your best option is to treat it as case sensitive and if it happens to be the same anyway, enforce the case either during loading or with a view.

Accent sensitive is easy to defend. It’s the default option most of the time, which makes good sense. You really don’t want to treat “la” the same as “là”, and that is not even going into the subject of customer names that you really don’t want to merge by accident.

So to summarize: for a data warehouse, I would use the case sensitive/accent sensitive collation wherever you have a choice. This prevents unforeseen data loss. But if you already know you will migrate to the cloud in the near future, you may have to check carefully how to deal with this issue in a case insensitive database.

If you have an opinion on this topic, feel free to leave a comment.

PowerDesigner do’s and don’ts

Many people consider PowerDesigner to be the de facto standard datamodelling tool. Many people are right. However, that does not mean the tool is perfect. As many users can testify, the version 16 release has been quite buggy in the beginning, only stabilizing a bit more with 16.5. And this is not exceptional. The repository is still buggy, projects are a recipe for pain, and let’s not start a discussion on license prices – we’d still be here next year.

However, if you avoid some practices and adopt others, using PowerDesigner is a breeze. Here is my take on things.

Do Not:

  • Use the repository
    The repository is a major cause of bugs. It looks nice, like a venus flytrap, and then it sucks you in and eats you for breakfast. Avoid it like the plague. You are better off spending some money on creating an extension to generate a model export to your own repository. You can buy this from I-Refact or other parties. The other functionality can be done better, cheaper and with less frustration and bugs by just using standard version control software (TFS, git, etc.). If you must compare models, you can do that from within PowerDesigner with very little effort – without losing parts of your model on the check-in/check-out.
    There is only one part of the repository that is actually semi-useful, which is the indication whether your model is out of date versus the repository version. As this functionality does not cooperate with replication or extensions that use that, there is little point in it once you evolve beyond the basics. Also, it is much better to split up your models so as to avoid getting in a situation with 10 people working on the same model. Even potentially. If this is a risk, appoint a designated datamodeller for such a model. The rest can get a read-only version.

  • Hide attributes on entities by hiding them
    Unless you use an extension to automate setting/un-setting this and also indicate this visually, it can create no end of trouble when the model shows tables and columns but leaves out certain columns that then get deployed anyway. It takes ages to debug that. If you must do this, make sure it’s an all or nothing proposition: either hide all standard attributes, or none.

  • Create shortcuts to other models
    While PowerDesigner does this automatically once you start creating mappings, there is no need to refer to models outside the scope of the folder, as this will render the models almost impossible to deploy without heaps of pop-ups asking about other models that you have not yet stored in the right location (and don’t even know where they should be located). Only consider this if you have an agreed-upon folder structure and even then I recommend you don’t do this.

  • Create Projects
    Sure, they’re good for having a dependency graph view. But you can create those anyway. And projects are buggy, especially when interacting with the repository. Half the bugs I found in PowerDesigner went away when I stopped using projects and moved to workspaces. No more disappearing models, or graphics. No more models that are impossible to check out or check in.

  • Work for long periods without saving
    The PowerDesigner auto-save function is nonexistent. After you work with PowerDesigner for a while, you will learn to save often. It becomes a reflex. Because it hurts when you lose hours of work through a crash. It’s not as bad as it was when you were still using version 16.5.0, with repository and projects, but still.

  • Use auto-layout without a current back-up
    Your gorgeous, handcrafted model could use a minor improvement and you used auto-layout. And then you pressed “save” automatically, because by now it’s a reflex. And when the screams died down, you realized you didn’t have a current backup. Ouch. Backup often. If you use Git: commit often.

  • Model the entire Logical Data Model as a hierarchy of subtypes
    I have seen them, with their entity types derived from the Object supertype and each other, six hierarchical layers deep. I dare you to try it with a non-trivial model and then generate a working physical model out of it. Go ahead, make my day…

  • Create a unique data domain for each attribute
    This sort of misses the point of data domains. Because while they are rather limited in PowerDesigner (no entity types or attribute groups), they are most useful when they provide a single point to change definitions of common attributes. Use them freely, but let the data architect decide which ones are available for use. It’s best to create a single model for this, that you can use as a template for the other models you create.

But Do:

  • Add metadata to your models
    Especially metadata that describes the following items: Title, Subject Area, Author, Version, Data (Model) Owner, Modified Date, Modifications, Validation Status

  • Add domains
    Create a list of standard attribute domains, then create a template model containing them. People can either copy the model file directly and use it as a template (this creates havoc in a repository though, because the internal model ID will be the same as that of the template model), or copy the attribute definitions into your own model. The definitions should be controlled by the data architect(s).

  • Add attribute groups
    If you create attribute groups of commonly grouped attributes in keyless entities, you can then inherit from multiple of these entities in order to combine them. Most useful when you have things like “firstname/lastname” pairs of attributes that you do not want to separate out to their own entity, for some reason. Use with caution.

  • Tie models together with separate workspaces for each project
    Workspaces are small files with zero overhead that tie different models together. They have no impact on the repository check-in/check-out, they are files that can be under source control, and they are pretty much bug-free. You can even edit them when necessary. Much better than projects.

  • Store your models in version control systems
    Seriously, I should NOT have to say this, but I keep meeting people who don’t seem to realize that MODELS ARE CODE. And with a VCS I do not mean that abortion they call the repository. I mean TFS, Git or even Subversion. Anything that works, basically.

  • Save often
    If you don’t, you’ll regret it.

  • Store backups
    Having version control is not the same as having backups, unless you commit often.

  • Create a folder structure that is the same for everyone and make it mandatory
    If you don’t, you’ll create unending pop-ups whenever someone opens a model they did not create themselves. If they check it in, it’s your turn the next time you open it from the repository.

ISO standards for Finance business data

When I define a business glossary to prepare for the high-level corporate data model, I try to incorporate as much of the relevant standards as I can. Because usually, knowing up front about a standard will make it much easier later on to integrate with other parties in the value chain, to report to regulatory authorities that use the same standards, and to apply Master Data Management. The more data that adheres to international standards, the less work you have in managing it.

Below, I have provided a list of ISO standards that can be used to aid in the governance of your business glossary and data models, standards that provide metadata specific to Finance and standards that provide identification schemes for key entities.

Note that there are more finance data and metadata standards than just the ISO standards. These will be listed in a different post that I will then link from here (and vice versa).

ISO standardAreaDescriptionID?
ISO 00639GeneralISO 639 defines language codes, as opposed to country codes. The standard consists of 6 parts, some more detailed than others. The preferred standard is ISO 639-3, which is the most comprehensive substandard. Usually, we restrict ourselves to a subset of supported languages.

See for more information: https://en.wikipedia.org/wiki/ISO_639
ISO 03166GeneralISO 3166 is a standard published by the International Organization for Standardization (ISO) that defines codes for the names of countries, dependent territories, special areas of geographical interest, and their principal subdivisions (e.g. provinces or states). The official name of the standard is "Codes for the representation of names of countries and their subdivisions". It consists of three parts:

  • ISO 3166-1 contains all codes currently in use
  • ISO 3166-2 contains all codes for subdivisions
  • ISO 3166-3 contains all codes no longer in use

The three standards contain several codes: alpha-2, alpha-3 and alpha-4. The alpha-2 code is the recommended code for general use.

See for more information: https://www.iso.org/iso-3166-country-codes.html
ISO 04217GeneralISO 4217 is the standard that defines codes for currencies, as well as funds and minor currency units. The codes can be represented as a 3 letter code, or a numerical code with 3 positions, which is usually the same as the numerical country code from ISO 3166-1. The minor currency is given as an exponent for the division, by 10. I.e. if the minor currency is "3", the currency can be divided into 1000 minor units. The name of the minor unit is not part of this standard.

The current version of the standard is ISO 4217:2015.

See for more information: https://www.iso.org/iso-4217-currency-codes.html
ISO 06166FinanceThe ISO 6166 standard is called "Securities and related financial instruments -- International securities identification numbering system (ISIN)". This standard describes and defines the International Securities Identification Number. The number applies to fungible and non-fungible securities and financial instruments.

ISINs consist of two alphabetic characters, which are the ISO 3166-1 alpha-2 code for the issuing country, nine alpha-numeric digits (the National Securities Identifying Number, or NSIN, which identifies the security), and one numeric check digit. The NSIN is issued by a national numbering agency (NNA) for that country. Regional substitute NNAs have been allocated the task of functioning as NNAs in those countries where NNAs have not yet been established.

ISINs are slowly being introduced worldwide. At present, many countries have adopted ISINs as a secondary measure of identifying securities, but as yet only some of those countries have moved to using ISINs as their primary means of identifying securities.

The current version of the standard is ISO 6166:2013.

See for more information: https://www.iso.org/standard/44811.html
http://www.isin.net/iso-6166/
Yes
ISO 08601GeneralISO 8601 is about "Data elements and interchange formats - information interchange - representation of dates and times". It details how to represent dates and times when exchanging them with other systems in an unambiguous way.

The current version of the standard is ISO 8601:2014.

See for more information: https://www.iso.org/iso-8601-date-and-time-format.html
ISO 09362FinanceThis ISO standard defines the Business Identifier Code (BIC). BIC is an international standard for identification of institutions within the financial services industry. BICs are used in automated processing. They unambiguously identify a financial institution or a non-financial institution. The ISO 9362 standard specifies the elements and the structure of a BIC. A BIC consists of either eight or eleven contiguous characters. These characters comprise either the first three, or all four, of the following components: party prefix, country code, party suffix, and branch identifier. The ISO has designated SWIFT as the BIC registration authority.

The EU regulation 260/2012, also known as the IBAN only rule, requires financial institutions to add the BIC code to IBAN payments.

The rule has applied to any domestic EURO payment since February 2014, to any cross-border EURO payment between EU countries since February 2016, and to any cross-border EURO payment from non-euro countries since October 2016.

See for more information: https://www.iso.org/obp/ui/#iso:std:iso:9362:ed-4:v1:en
https://www2.swift.com/sitesearch/#/?q=BIC
ISO 10383FinanceISO 10383 is called "Codes for exchanges and market identification (MIC)". It defines the Market Identifier Code (MIC).

This International Standard specifies a universal method of identifying exchanges, trading platforms, regulated or non-regulated markets and trade reporting facilities as sources of prices and related information in order to facilitate automated processing. Each such exchange, platform etc. receives a unique code from the registrar.

See for the current list: https://www.iso20022.org/10383/iso-10383-market-identifier-codes
Yes
ISO 10962FinanceISO 10962 defines the structure and format for classification of financial instruments approved by the International Organization for Standardization (ISO). There are many types of Financial Instruments used for saving, investing, trading, hedging and speculating. These instruments are generally organized in groups called "asset classifications." The most common asset classifications are generally described using terms like "Equities (Stocks)," "Debt (Bonds)," "Derivatives (Contracts)," "Currencies," and a few other generalized terms.

ISO 10962 provides a global standard for these classifications in the form of specific codes. Classification of financial instrument (CFI) Code is used to define and describe financial instruments as a uniform set of codes for all market participants. The code is issued by the members of ANNA, the Association of National Numbering Agencies. The group is currently working to simplify the structure so that it can be adopted more widely by non-governmental market participants.

The letters from the ISO basic Latin alphabet in each position of this 6 character code reflect specific characteristics intrinsic to the financial instruments that are defined at the issue of the instrument, and which in most cases remain unchanged during the lifetime of the instrument (or by the market on which the instrument trades).

See for more information: https://en.wikipedia.org/wiki/ISO_10962 or visit the registrar homepage
ISO 11179MetadataThe ISO/IEC 11179 Metadata Registry (MDR) standard) is an international standard for representing metadata for an organization in a metadata registry. ISO/IEC 11179 claims that it is (also) a standard for metadata-driven exchange of data in an heterogeneous environment, based on exact definitions of data.

The ISO/IEC 11179 model is a result of two principles of semantic theory, combined with basic principles of data modelling. The first principle from semantic theory is the thesaurus type relation between wider and more narrow (or specific) concepts, e.g. the wide concept "income" has a relation to the more narrow concept "net income". The second principle from semantic theory is the relation between a concept and its representation, e.g., "buy" and "purchase" are the same concept although different terms are used.

The standard consists of six parts:
ISO/IEC 11179-1:2015 Framework (referred to as ISO/IEC 11179-1)
ISO/IEC 11179-2:2005 Classification
ISO/IEC 11179-3:2013 Registry metamodel and basic attributes
ISO/IEC 11179-4:2004 Formulation of data definitions
ISO/IEC 11179-5:2015 Naming and identification principles
ISO/IEC 11179-6:2015 Registration

Part 1 explains the purpose of each part. Part 3 specifies the metamodel that defines the registry. The other parts specify various aspects of the use of the registry. An additional part, Part 7: Datasets is currently under development.

For use in the creation of data models, part 4 and especially part 5 provide common standards that could be used in data governance to govern the creation of data models.

See for more information: https://en.wikipedia.org/wiki/ISO/IEC_11179
ISO 13616FinanceThe International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors.

The ISO standard was split in two parts in 2007. ISO 13616-1:2007 "specifies the elements of an international bank account number (IBAN) used to facilitate the processing of data internationally in data interchange, in financial environments as well as within and between other industries" but "does not specify internal procedures, file organization techniques, storage media, languages, etc. to be used in its implementation". ISO 13616-2:2007 describes "the Registration Authority (RA) responsible for the registry of IBAN formats that are compliant with ISO 13616-1 [and] the procedures for registering ISO 13616-compliant IBAN formats".

The official IBAN registrar under ISO 13616-2:2007 is SWIFT.

The IBAN consists of up to 34 alphanumeric characters comprising: a country code; two check digits; and a number called the Basic Bank Account Number (BBAN) that includes the domestic bank account number, branch identifier, and potential routing information. The check digits enable a sanity check of the bank account number to confirm its integrity before submitting a transaction.

The current version of the standard is ISO 13616:2007

See for more information: https://en.wikipedia.org/wiki/International_Bank_Account_Number
Yes
ISO 15022Metadata - FinanceISO 15022 is the precursor to (and superseded by) ISO 20022.

See for more information: https://www.iso20022.org/15022/uhb
ISO 17442BusinessThe International Organization for Standardization (ISO) 17442 standard defines a set of attributes or legal entity reference data that are the most essential elements of identification. The Legal Entity Identifier (LEI) code itself is neutral, with no embedded intelligence or country codes that could create unnecessary complexity for users.

Four key principles underlie the LEI:

  • It is a global standard.
  • A single, unique identifier is assigned to each legal entity.
  • It is supported by high data quality.
  • It is a public good, available free of charge to all users.
Once a legal entity has obtained an LEI, it will be published together with the related LEI reference data by the organization that has issued the LEI. This means the full data on the entire LEI population is publicly available for unrestricted use by any interested party at all times, including the set of reference data for each LEI code.

The LEI code is structured as follows:

  • Characters 1-4: Prefix used to ensure the uniqueness among codes from LEI issuers (Local Operating Units or LOUs).
  • Characters 5-18: Entity-specific part of the code generated and assigned by LOUs according to transparent, sound and robust allocation policies. As required by ISO 17442, it contains no embedded intelligence.
  • Characters 19-20: Two check digits as described in the ISO 17442 standard.
The current version of the standard is ISO 17442:2012.

See for more information: https://www.iso.org/standard/59771.html or visit the homepage of the registrar
Yes
ISO 18774FinanceISO 18774 defines the Financial Instrument Short Name. The new standard for the Financial Instrument Short Name (ISO 18774) standardizes short names and descriptions for financial instruments. The standard was approved in September 2014.

As of July 1 2017, the FISN will be globally assigned concurrently with the ISIN (ISO 6166) and CFI (ISO 10962) at the time of issuance of a new financial instrument.

The ISO 18774 standard incorporates the issuer short name and the abbreviated characteristics for the financial instrument. It has a maximum length of 35 alphanumeric characters.

Unlike other ISO-standard financial instrument identification codes, the FISN is not meant to be machine-readable, but to provide a short format for essential information about a security for human use.

See for more information: http://www.anna-web.org/standards/fisn-iso-18774/
ISO 19773MetadataThis International Standard specifies small modules of data that can be used or reused in applications. These modules have been extracted from ISO/IEC 11179-3, ISO/IEC 19763, and OASIS EBXML, and have been refined further. These modules are intended to harmonize with current and future versions of the ISO/IEC 11179 series and the ISO/IEC 19763 series.

Part of the standard are, amongst others:

  • a data structure for UPU postal data
  • a data structure for ITU T E.164 phone number data
The current version of the standard is ISO/IEC 19773:2011.

See for more information: https://www.iso.org/standard/41769.html
ISO 20022Metadata - FinanceISO 20022 is an ISO standard for electronic data interchange between financial institutions. It describes a metadata repository containing descriptions of messages and business processes, and a maintenance process for the repository content. The standard covers financial information transferred between financial institutions that includes payment transactions, securities trading and settlement information, credit and debit card transactions and other financial information.

The repository contains a huge amount of financial services metadata that has been shared and standardized across the industry. The metadata is stored in UML models with a special ISO 20022 UML Profile. Underlying all of this is the ISO 20022 meta-model - a model of the models. The UML profile is the meta-model transformed into UML. The metadata is transformed into the syntax of messages used in financial networks. The first syntax supported for messages was XML Schema.

The standard contains a number of external reference code lists, that are available on the website in the form of spreadsheets and documentation. The data dictionary present in ISO 15022 is no longer available as a spreadsheet, but can be downloaded as a 96MB xml-file.

See for more information: https://www.iso20022.org/
ISO 20275FinanceThe ISO 20275 standard defines Entity Legal Form (ELF) worldwide. The latest (and first) version is ISO 20275:2017 (en). It covers the legal forms available per country (or country grouping), as long as that country has an ISO 3166-1 alpha-2 code.

The standard can be obtained from the ISO but the codelist itself is maintained by the Global Legal Entity Identifier Foundation (GLEIF) and can be obtained here.

Interesting to note is that where the AnaCredit list goes slightly off the rails with the European Legal Form "Societas Europaea", this list solves it in a nicer way by repeating the SE legal form for all countries involved. Although using EU is actually allowed by ISO 3166-1 standards as it is an alpha-2 code reserved for the special use of the EU, this way is cleaner as you now only deal with countries. Since the AnaCredit list goes off the rails in more ways than this, you may want to use this ISO standard as your main reference data set and add a mapping to the ECB's rather ragtag list of legal forms.

Currently there are legal forms for 55 countries in the list so locations that are currently missing need to be added through a feedback form.

The ELF for companies in the LEI register has been made public as of March 1st, 2018.

Integrating Twitter in WordPress

twitter large logo

Last year Twitter decided to change the way Twitter interacts with the rest of the world, by making it more difficult to integrate its twitter-streams with your own website. While you can get around this if you can deploy server-side software and go through the hassle of signing up for a developer key, a lot of folks run websites without being interested in having to program just to get their own tweets to display.

Twitter does have a solution, but this just dumps the stream on your site with the lay-out and styling of Twitter. While this is understandable from a branding and marketing point of view, it’s incredibly annoying to have your website look like a hash of different styles just because Twitter doesn’t like you changing the lay-out. So there are a lot of people looking for alternatives.

The best alternative I’ve found for my purpose is http://jasonmayes.com/projects/twitterApi/. Jason Mayes twitter API just takes the formatted twitter-feed, removes the formatting and provides the stream with normal tags to the page. Using standard CSS you can then style the stream and presto, you have a nice looking twitter feed.

How it works in WordPress is as follows:
– Download the software from http://jasonmayes.com/projects/twitterApi/
– Upload the javascript file “twitterFetcher_min.js” to your website. This could be as media but I chose to use FTP to upload it into a theme. As long as it’s on your website it’s okay though, the location is unimportant.
– Add a Text widget to the page where you want the tweets to show up.
– Include the following text in the widget:


<script src="/{path}/twitterFetcher_min.js"></script>
<div id="tweet-element">Tweets by Ronald Kunenborg</div>

<script>
var configProfile = {
"profile": {"screenName": '{yourtwittername}'},
"domId": 'tweet-element',
"maxTweets": 10,
"enableLinks": true,
"showUser": true,
"showTime": true,
"showImages": true
};
twitterFetcher.fetch(configProfile);
</script>

Replace “{yourtwittername}” with your own twitter name (of that of someone whose timeline you wish to show), and the {path} with the path of the uploaded javascript and you’re good to go. However, this looks pants. So we need to style it. In order to do that, include the following text in the widget before the script:
<style>
/*
* Tweet CSS - on Jason Mayes tweetgrabber (http://jasonmayes.com/projects/twitterApi/)
*/

div#tweet-element ul {
list-style: none;
}

div#tweet-element h2 {
clear:both;
}

div#tweet-element p {
font-size: 9pt;
margin: 0 0 0 0;
}

div#tweet-element ul li {
list-style:none;
overflow:hidden;
border-top:1px solid #dedede;
margin: 5px 0 10px 0;
padding: 0px;
}

div#tweet-element ul li:hover {
background-color:#f0f3fb;
}

/* tekst of tweet */
.tweet {
clear: left;
}

.user {
clear:left;
float:left;
}

.user a {
}

/* hide the @twittername, which is the 3rd span in the user class */
.user span:nth-child(3) {
display: none;
}

.user a > span {
margin-left:2px;
}

.user a > span {
display: table-cell;
vertical-align: middle;
margin: 5px;
padding: 5px;
}

.widget-text img,
.user a span img {
display: block;
float:left;
max-width: 40px;
margin: 2px 2px 2px 2px;
}

div#tweet-element p.timePosted {
clear: left;
font-style: italic;
}

div#tweet-element p.timePosted a {
color: #444;
}

.interact {
float:left;
margin-top:-7px;
width: 100%;
}

.interact a {
margin-left: 0px;
margin-right: 5px;
width: 30%;
}

.interact a.twitter_reply_icon {
float:left;
text-align: center;
}

.interact a.twitter_retweet_icon {
float:left;
text-align: center;
}

.interact a.twitter_fav_icon {
float:right;
text-align: center;
}

/* show media on front-page - hide it with display:none if you don't want to show media included by others. */
.media img {
max-width:100%;
}

#linkage {
position:fixed;
top:0px;
right:0px;
background-color:#3d3d3d;
color:#ffffff;
text-decoration:none;
padding:5px;
width:10%;
font-family:arial;
}
</style>

Make sure the <style> part is first in the Text widget.

Of course you can also put the style (without the <style> tags) in a stylesheet (.css) file, upload it and then refer to it, instead of pasting the stylesheet in the Text widget. In that case use the following command:

<link rel='stylesheet' id='twitter-css' href='/{path}/twitter-style.css' type='text/css' media='all' />

And please replace {path} with the desired path.

I hope this helps you as much as it helped me.

DataVault Cheat Sheet Poster v1.0.9

This poster displays the most important rules of the Data Vault modelling method version 1.0.9 on one A3-size cheat sheet. I decided to not add personal interpretation and keep the sheet as close to the original specs as possible.

You can find the rules that were used for this poster on the website of Dan Linstedt.

DataVault Cheat Sheet v109 (A3) PDF

A version where the Colors of the Data Vault have been used, is available as well:
DataVault Cheat Sheet v109 (A3, color) PDF

Creating brilliant visualizations of graph data with D3 and Neo4j

Okay, so someone recommended I spice up the titles a bit. I hope you’re happy now!

Anyway, it really is the truth: you can create brilliant visualizations of data with the D3 javascript library, and when you combine it with Neo4j and the REST API that gives you acccess to its data, you can create brilliant visualizations of graph data.

Examples of d3 visualizations

Examples of d3 visualizations, laid out in a hexadecimal grid

So what’s D3? Basically, D3 is a library that enables a programmer to construct and manipulate the DOM (Document Object Model) in your webbrowser. The DOM is what lives in the memory of your computer once a webpage has been read from the server and parsed by your browser. If you change anything in the DOM, it will be reflected on the webpage immediately.

There are more libraries that can manipulate the DOM (such as JQuery), but D3 is focused towards ease of use when using data as the driver for such manipulations, instead of having code based on mouseclicks do some alterations. There are commands to read CSV or other formats, parse them and then feed them to further commands that tell D3 how to change the DOM based on the data. This focus on using data to drive the shape of the DOM is gives D3.js its name: Data Driven Documents.

An example of what you can achieve with minimal coding is for instance the Neo4j browser itself, and the force-connected network that is shown as the output for a query returning nodes and/or relationships. However, another visualization of a network of nodes and relationships is the Sankey diagram:

An example of a Sankey diagram

An example of a Sankey diagram

The Sankey diagram as shown above was created using d3.js, a Sankey plug-in (javascript) and the lines of code that control d3: about 70 lines of Javascript in all.

To demonstrate how easy it is to use d3.js and Neo4j as database to create a nice visualization, I’m not going to use the Sankey example, however. It’s too complex to use as an example for that, although I will write an article about that particular topic in the near future.

No, we’re going to create a bar chart. We’ll use the previous article Using Neo4j CYPHER queries through the REST API as a basis on which to build upon.

The bar chart, when done, will look like this:

Barchart showing the number of players per movie

Barchart showing the number of players per movie

You will need some understanding of JavaScript (ECMAscript), but this can be obtained easily by reading the quite good book, Eloquent Javascript.

You will also need to understand at least some of the basics of D3, or this article will be incomprehensible. You can obtain such understanding from d3js.org, and I recommend this tutorial (building a bar chart) that goes into much more detail than I do here. An even better introduction is the book “D3 tips and tricks” that starts to build a graph from the ground up, explaining everything while it’s done.

Please note that I used the d3.js library while developing, and it ran fine from the development server. However, when I used d3 with the standard Microsoft webserver, it mangled the Greek alphabet soup in the code and it didn’t work. The minified version (d3.min.js) does not have that issue, so if you run into it, just use the minified version.

We will use nearly the same code as in the previous article, but with a few changes.

First, we add a new include: the D3 library needs to be included. We use the minified version here.

<html>
<head>
<title>Brilliant visualization of graph data with D3 and Neo4j</title>
<script src="scripts/jquery-2.1.3.js"></script>
<script src="scripts/d3.min.js"></script>
</head>
<body>

Next, we add the function “post_cypherquery()” again, to retrieve data from Neo4j. We use exactly the same routine we used the last time.

    <script type="text/javascript">
        function post_cypherquery() {
            // while busy, show we're doing something in the messageArea.
            $('#messageArea').html('<h3>(loading)</h3>');

            // get the data from neo4j
            $.ajax({
                url: "http://localhost:7474/db/data/transaction/commit",
                type: 'POST',
                data: JSON.stringify({ "statements": [{ "statement": $('#cypher-in').val() }] }),                
                contentType: 'application/json',
                accept: 'application/json; charset=UTF-8',
                success: function () { },
                error: function (jqXHR, textStatus, errorThrown) { $('#messageArea').html('<h3>' + textStatus + ' : ' + errorThrown + '</h3>') },
                complete: function () { }
            }).then(function (data) {

Once we have obtained the data, we display the query we used to obtain the result, and clear the “(Loading)” message.

                $('#outputArea').html("<p>Query: '"+ $('#cypher-in').val() +"'</p>");
                $('#messageArea').html('');

Then, we create an empty array to hold the attribute-value pairs we want and push the rows from the resultset into the d3 array. Basically, we make a copy of the resultset in a more practical form.

                var d3_data = [];
                $.each(data.results[0].data, function (k, v) { d3_data.push(v.row); });

Then we determine how big our chart should be. We will be using Mike Bostocks margin convention for this.

We create a barchart that has a margin of 40 pixels on top and bottom, and 200 pixels on the right – because I want to add the movienames on that side of the chart. Our graphic will occupy half the display, so the real area we can draw in is half the window size, minus the horizontal margin. The height of the graph will be scaled to 3/4 of the height of the window, minus the margins. We scale the bars to fit in that size.

                var margin = { top: 40, right: 200, bottom: 40, left: 40 },
                    width = ($(window).width()/2) - margin.left - margin.right,
                    height = ($(window).height()/2) - margin.top - margin.bottom, 
                    barHeight = height / d3_data.length;

Here we use our very first D3 function: d3.max. It will run over the d3_data array and apply our selector function to each element, then find the maximum value of the set.

This will give us the highest amount of players on any movie. Then we add a bit of margin to that so our barchart will look nicer later on, when we use this value to drive the size of the bars in the chart.

                var maxrange = d3.max(d3_data, function (d) { return d[1]; }) + 3;

Next, we use an important part of the D3 library: scales. Scales are used everywhere. Basically, they transform a range of values into another range. You can have all kinds of scales, logarithmic, exponential, etcetera, but we will stick to a linear scale for now. We will use one scale to transform the number of players into a size of the bar (scale_x), and another to transform the position of a movie in the array into a position on the barchart (scale_y).

We use rangeRound at the end, instead of range, to make sure our values are rounded to integers. Otherwise our axis ticks will be on fractional pixels and D3 will anti-alias them, creating very fuzzy axis tickmarks.

                var scale_x = d3.scale.linear()
                    .domain([0, maxrange])
                    .rangeRound([0, width]);

                var scale_y = d3.scale.linear()
                    .domain([d3_data.length, 0])
                    .rangeRound([0, height]);

And once we have the scales, we define our axes. Note that this doesn’t “draw” anything, we’re just defining functions here that tell D3 what they are like. An axis is defined by its scale, the number of ticks we want to see on the axis, and the orientation of the tickmarks.

                var xAxis = d3.svg.axis()
                    .scale(scale_x)
                    .ticks(maxrange)
                    .orient("bottom");

                var yAxis = d3.svg.axis()
                    .scale(scale_y)
                    .ticks(d3_data.length)
                    .orient("left");      

So far, we’ve just loaded our data, and defined the graph area we will use. Now, we’ll start to manipulate the Document Object Model to add tags where we need them. We will start with the most important one: the SVG tag. SVG stands for Scalable Vector Graphics, and it’s a web standard that allows us to draw in the browser page, inside the area defined by this tag. And that is what we will do now, inside the already existing element with id = “outputArea”. This allows us to place the graphics right where we want them to be on the page.

The preserveAspectRatio attribute defines how the chart will behave when the area is resized. See the definition of PreserveAspectRatioAttribute for more information.

                var chart = d3.select("#outputArea")
                    .append("svg")
                    .attr("width", (width + margin.left + margin.right) + "px")
                    .attr("height", (height + margin.top + margin.bottom) + "px")
                    .attr("version", "1.1") 
                    .attr("preserveAspectRatio", "xMidYMid")
                    .attr("xmlns", "http://www.w3.org/2000/svg");

Note that we assign this manipulation to a variable. This variable will hold the position in the DOM where the tag “svg” is placed and we can just add to it, to add more tags.

The first svg element in the svg should have a title and a description, as per the standard. So that is what we will do. After the <svg> tag, we will append a <title> tag with a text.

                chart.append("title")
                    .text("Number of players per movie");

                chart.append("desc")
                    .text("This SVG is a demonstration of the power of Neo4j combined with d3.js.");

Now, we will place a grouping element inside the svg tag. This element < g > will be placed at the correct margin offsets, so anything inside it has the correct margins on the left- and top sides.

                chart = chart.append("g")
                    .attr("transform", "translate(" + (+margin.left) + "," + (+margin.top) + ")");

Now we place the x- and y-axis that we defined earlier on, in the chart. That definition was a function – and now we come CALLing. Here we will also add a class-attribute, that will later allow us to style the x and y-axis separately. We put the x-axis on the bottom of the graph, and the y-axis on the left side.

Since the axes are composed of many svg-elements, it makes sense to define them inside a group-element, to make sure the entire axis and all its elements will be moved to the same location.

Please note that the SVG-coordinates have the (0,0) point at the top left of the svg area.

                chart.append("g")
                    .attr("class", "x axis")
                    .attr("transform", "translate(0," + (+height) + ")")
                    .call(xAxis);
                chart.append("g")
                    .attr("class", "y axis")
                    .attr("transform", "translate(" + (-1) + ",0)")
                    .call(yAxis);

Finally, we get to the point where we add the bars in the chart. Now, this looks strange. Because what happens is that we define a placeholder element in the SVG for every data element, and then D3 will walk over the data elements and call all of the functions after the “data” statement for each data-element.

So everything after the data-statement will be called for EACH element. And if it is a new data-element that wasn’t yet part of the DOM, it will be added to it. And all of the statements that manipulate the DOM, will be called for it.

So, we define the bar as an SVG-group, with a certain class (“bar”) and a position, that is based on the position in the array of elements. We just display the elements ordered in the way we received them. So adding an ORDER BY statement to the CYPHER query will change the order of the bars in the chart.

                var bar = chart.selectAll("g.bar")
                    .data(d3_data)
                    .enter().append("g").attr("class","bar")
                    .attr("transform", function (d, i) { return "translate(0," + i * barHeight + ")"; });

Then, still working with the bar itself, we define a rectangle of a certain width and height. We add the text “players: ” to it, for display inside the rectangle. We define the text as having class “info”. Then, we add the text with the name of the movie for display on the right of the bar, and give it class “movie”. And that concludes our D3 script.

                bar.append("rect")
                    .attr("width", function (d) { return scale_x(d[1]) + "px"; }) 
                    .attr("height", (barHeight - 1) + "px" );

                bar.append("text")
                    .attr("class", "info")
                    .attr("x", function (d) { return (scale_x(d[1]) - 3) + "px"; })
                    .attr("y", (barHeight / 2) + "px")
                    .attr("dy", ".35em")
                    .text(function (d) { return 'players: ' + d[1]; });

                bar.append("text")
                    .attr("class","movie")
                    .attr("x", function (d) { return (scale_x(d[1]) + 3) + "px"; })
                    .attr("y", (barHeight / 2) + "px")
                    .attr("dy", ".35em")
                    .text(function (d) { return d[0]; });
            });
        };
    </script>

All that remains is to define the HTML of the page itself that will display at first. This is the same HTML as before, but with a different CYPHER query.

<h1>Cypher-test</h1>
<p>
<div id="messageArea"></div>
<p>
<table>
  <tr>
    <td><input name="cypher" id="cypher-in" value="MATCH (n:Movie)-[:ACTED_IN]-(p:Person) return n.title as movietitle, count(p) as players" /></td>
    <td><button name="post cypher" onclick="post_cypherquery();">execute</button></td>
  </tr>
</table>
<p>
<div id="outputArea"></div>
<p>
</body>
</html>

Unfortunately, at this point our barchart will look like this:

Unstyled d3 barchart in black and white with blocky axes

Unstyled d3 barchart

What happened was that we didn’t use ANY styling at all. That doesn’t look very nice, so we will add a stylesheet to the page. Note that you can style SVG-elements just as you can style standard HTML elements, but there is one caveat: the properties are different. Where you can use the color attribute (style="color:red") on an HTML element, you would have to use the stroke and fill attributes for SVG elements. Just the text element alone has a lot of options, as shown in this tutorial.

So, we now add a stylesheet at the end of the <head> section. We start with the definitions of the bars – they will be steelblue rectangles with white text. The standard text will be white, right-adjusted text that stands to the left of the starting point. The movie-text will be left-adjusted and stand to the right of its starting position, in italic black font.

<style>
#outputArea {
  height: 50px;
}

#outputArea rect {
  fill: steelblue; 
}

#outputArea text {
  fill: white;
  font: 10px sans-serif;
  text-anchor: end;
  color: white;
}

#outputArea text.movie {
  fill: black;
  font: 10px sans-serif;
  font-style: italic;
  text-anchor: start;
}

Now we define the axes. They will be rendered with very small lines (crispEdges), in black. The minor tickmarks will be less visible than the normal tickmarks.

.axis {
  shape-rendering: crispEdges;
  stroke: black;
}

.axis text {
  stroke: none;
  fill: black;
  font: 10px sans-serif;
}

.y.axis text {
  display: none;
}

.x.axis path,
.x.axis line,
.y.axis path,
.y.axis line {
  fill: none;
  stroke: black;
  stroke-width: 1px;
  shape-rendering: crispEdges;
}

.x.axis .minor,
.y.axis .minor {
  stroke-opacity: .5;
}
</style>

And now, we get this:

Styled d3 barchart in color with crisp axes

Styled d3 barchart

We can add more bells and whistles, such as animations and nice gradients for the bars, but that’s something I’ll leave to you.

By the way: we can add SVG elements, but in the same manner we could also just add plain HTML elements and create a nicely styled tabular lay-out for the same data. Or we could create a Sankey diagram. But that’s something for another post.

Using Neo4j CYPHER queries through the REST API

Lately I have been busy with graph databases. After reading the free eBook “Graph Databases” I installed Neo4j and played around with it. Later I went as far as to follow the introduction course as well as the advanced graph modeling course at Xebia. This really helped me start playing around with Neo4j in a bit more structured manner than I was doing before the course.

I can recommend installing Neo4j and just starting to use it, as it has a great user interface with excellent help manuals. For instance, this is the startscreen:

Neo4j-startscreen

Easy, right?

One of the things that struck me was the ease with which you could access the data from ECMAscript (or Javascript if you’re very old and soon-to-be obsoleted). Using the REST API you can access the graph in several ways, reading and writing data from and to the database. It’s the standard interface, actually. There’s a whole section in the Neo4j help dedicated to using the REST API, so I’ll leave most of it alone for now.

What’s important, is that you can also fire CYPHER queries at the database, receiving an answer in either JSON or XML notation, or even as an HTML page. This is important because CYPHER queries are *very* easy to write and understand. As an example, the following query will search the sample database that is part of the Neo4j database, with Movies and Actors.

Suppose we want to show all nodes that are of type Movie. Then the statement would be:

MATCH (m:Movie) RETURN m

A standard query to discover what’s in the database is
MATCH (m) RETURN m LIMIT 100
This is limited to 100 items (nodes and/or relationships), because it does return the entire database otherwise and in the user interface this starts to slow things down. It’s gorgeous, but when your resultsets are getting big it does slow things down. Here’s how it looks:

Neo4j-results-1

Very nice. But not that useful if we want a particular piece of data. However, if we want to show only the actors that played in movies, we could say:

MATCH (p:Person)-[n:ACTED_IN]->(m:Movie) RETURN p

This returns all nodes of type Person that are related to a node of type Movie through an edge of type ACTED_IN.

While I won’t go into more detail on Cypher, let’s just say it is a very powerful abstraction layer for queries on graphs that would be very hard to do with SQL. It’s not as performant as actually giving Neo4J explicit commands using the REST API, which you want to do if you build an application where sub-second performance is an issue, but for most day-to-day queries it’s pretty awesome.

So how do we use the REST API? That’s pretty easy, actually. There are two options, and one of them is now deprecated – that is the old CYPHER endpoint. So we use the new http://localhost:7474/db/data/transaction/commit endpoint, which starts a transaction and immediately commits it. And yes, you can delete and create nodes through this endpoint as well so it’s highly recommended to not expose the database to the internet, unless you don’t mind everyone using your server as a public litterbox.

You have to POST requests to the endpoint. There are endpoints you can access with GET, like http://localhost:7474/db/data/node/1 which returns the node with id=1 on a HTML page, but the transactional endpoint is accessed using POST.

The easiest way to use a REST API is to start a simple webserver, create a simple HTML-page, add Javascript to it that responds to user input and that calls the Neo4j REST API.

Since we’re going to use Javascript, be smart and use JQuery as well. It’s pretty much a standard include.

How to proceed:

  • First, start the Neo4j software. This opens a screen where you can start the database server, and in the bottom left of the screen you can see a button labeled “Options…”. Click that, then click the “Edit…” button in the Server Configuration section. Disable authentication for now (and make very sure you don’t do this on a server connected to the internet) by changing the code to the following:

    # Require (or disable the requirement of) auth to access Neo4j
    dbms.security.auth_enabled=false

    This makes sure we don’t have the hassle of authentication for now. Don’t do this on a connected server though.

  • Now, we start the Neo4j database. Otherwise we get strange errors.
  • Then, proceed to build a new HTML-page (I suggest index.html) on your webserver, that looks like this:
    <html>
    <head>
    <title>Cypher-test</title>
    <script src="scripts/jquery-2.1.3.js"></script>
    </head>
    <body>
        <script type="text/javascript">
            function post_cypherquery() {
                $('#messageArea').html('<h3>(loading)</h3>');
    
                $.ajax({
                    url: "http://localhost:7474/db/data/transaction/commit",
                    type: 'POST',
                    data: JSON.stringify({ "statements": [{ "statement": $('#cypher-in').val() }] }),
                    contentType: 'application/json',
                    accept: 'application/json; charset=UTF-8'                
                }).done(function (data) {
                    $('#resultsArea').text(JSON.stringify(data));
                    /* process data */
                    // Data contains the entire resultset. Each separate record is a data.value item, containing the key/value pairs.
                    var htmlString = '<table><tr><td>Columns:</td><td>' + data.results[0].columns + '</td></tr>';
                    $.each(data.results[0].data, function (k, v) {
                        $.each(v.row, function (k2, v2) {
                            htmlString += '<tr>';
                            $.each(v2, function (property, nodeval) {
                                htmlString += '<td>' + property + ':</td><td>' + nodeval + '</td>';
                            });
                            htmlString += '</tr>';
                        });
                    });
                    $('#outputArea').html(htmlString + '</table>');
                })
                .fail(function (jqXHR, textStatus, errorThrown) {
                    $('#messageArea').html('<h3>' + textStatus + ' : ' + errorThrown + '</h3>')
                });
            };
        </script>
    
    <h1>Cypher-test</h1>
    <p>
    <div id="messageArea"></div>
    <p>
    <table>
      <tr>
        <td><input name="cypher" id="cypher-in" value="MATCH (n) RETURN n LIMIT 10" /></td>
        <td><button name="post cypher" onclick="post_cypherquery();">execute</button></td>
      </tr>
    </table>
    <p>
    <div id="outputArea"></div>
    <p>
    </body>
    </html>
    

    Make sure you don’t forget to download JQuery and put the downloaded file in the scripts subdirectory below the directory in which you place this file. The line where you need to change the corresponding filename if you rename the file or place it somewhere else is highlighted in red.

While this doesn’t look very pretty, it gets the job done. It executes an AJAX call to Neo4j, using the transactional endpoint. After receiving a success-response, it writes the raw answer (JSON) into the resultsArea over the input box. Then, it parses the result and writes the results to a table in the dataArea.

The resultset from neo4j is returned as a data-object that looks like this:

{
  "results" : [ {
    "columns" : [ "n" ],
    "data" : [ 
      {"row" : [{"name":"Leslie Zemeckis"}]}, 
      {"row" : [{"title":"The Matrix","released":1999,"tagline":"Welcome to the Real World"}]}, 
      {"row" : [{"name":"Keanu Reeves","born":1964}]} 
      ]
  } ],
  "errors" : [ ]
}

Note the different row-variants. Since we did not limit ourselves to a single type of node, we got both Movie- and Actor-nodes in the result. And even within a single node-type, not every node has the same properties. The neo4j manual has more information about the possible contents of the resultset.

Please note that ANY valid Cypher-statement will be executed, including CREATE and DELETE statements, so feel free to play around with this.

– Ronald Kunenborg.