Is data modeling the next job to be replaced by AI?

The more I read, the more I am convinced that data modeling is an activity that will soon be supported, perhaps even replaced, by deep-learning AI like Watson, and that the true focus of data governance should be the business conceptual model. All other models derive from it.

After posting this on LinkedIn, I received some interesting feedback. The main question was, of course, “how so?”. Well, here are the things I read and experienced that led me to this conclusion.

1. Data Modeling Essentials (Simsion/Witt, 3rd ed. 2005). It’s not very strict but very readable and has a nice division between the three layers: conceptual, logical, physical. This is where I started. It explained clearly that these different layers are distinct, but not exactly *how*. Or rather: I didn’t understand it at the time.

2. Applied Mathematics for Database Professionals (2007, Lex de Haan/Toon Koppelaars). This explains how the logical model is built and how it differs from the conceptual and physical models. It starts and ends with the math behind it all (predicate logic and set theory), yet is easy to understand for most data professionals. It made the distinction between the different layers MUCH clearer to me, especially how building the logical model is a fully mathematical activity. That draws a hard separation between the conceptual and the logical model, even though the semantic part of the logical model is still derived from the conceptual model: if you suddenly use different terms and verbs, it is obviously not a derivation but a different model.

3. The FCO-IM book by Bakema/Zwart/van der Lek, which explains how you can have a conceptual model that is fact-oriented and can then be transformed into a logical model. This means that when you can impose an ordering on the language of the conceptual model, you should be able to derive a logical model from it.

4. My own experience with fully verbalized data models, i.e. data models written down in sentences. Most data models have a small vocabulary: “One Client orders one or more Products. Our Company has a Relationship with a Party. A Client is a Party. A Legal Entity is a Party.” The vocabulary is in principle unlimited if you so desire, but in practice it can be boiled down so far that there is even a standard for it: the Semantics of Business Vocabulary and Rules (SBVR).

5. Very influential: The “ISO TR 9007 – Information processing systems – Concepts and terminology”. This defined the conceptual model and was created by a number of very well-known information modeling scientists. It influenced me heavily because it defines a conceptual model as essentially something that conveys semantic meaning, i.e.: words on paper. It starts, for instance, with the Helsinki principle.

6. Discussions with Martijn Evers (and numerous other colleagues at the Dutch Central Bank and outside it) about data modeling.

7. The sorry state of modeling tools in general. If PowerDesigner is the best we have, we’re in trouble. Sparx Enterprise Architect is actually pretty good, but you can’t program it, so you have to make do with what it is. E/R Studio simply crashed when I last tried to evaluate it. Another tool looks nice and is web-based, but it only does physical modeling. None of these tools does conceptual modeling anywhere near right. Collibra has the textual part down pat, but only relates business terms to other terms. An ontology, while nice, is the start of the process, not the end. Conceptual modeling is a text-based activity that encompasses the business glossary, the visualization, the business rules, et cetera; it is not just about drawing lines and boxes on a canvas. There is currently no tool that treats conceptual modeling as a text-based activity.

8. The rapidly advancing state of the art in deep learning. I took artificial intelligence courses in the ’90s, but the field wasn’t very advanced at the time. Nowadays, it is much further along. I’ve been looking into Watson (IBM) capabilities over the past week. See for instance where it says that Watson can: “Analyze text to extract meta-data from content such as concepts, entities, keywords, categories, relations and semantic roles.”

9. The experience of the Watson team that a combination of a skilled human and a decent AI will generally beat both a more skilled AI on its own and a more skilled human on their own.

10. The fact that the conceptual model is verbalized, and that we now have Wikipedia and Facebook. I think we can have “social modeling media” now. This should increase the speed of modeling immensely. Example: suppose that DNB owns the definitions in the Dutch part of a global Finance Business Glossary, that we can derive logical models from that glossary, and physical models from those logical models. That would enormously speed up building reports and applications that conform to regulations. It would also nail the nasty little business model of IBM and Teradata with its hide to the wall (and not by accident).

11. Finally, I can see as well as anyone the incredibly urgent need for data modeling and the serious and increasing lack of data modeling expertise in the labor market. We can either give up on modeling, or we can make it so easy that everyone *can* do it. Therefore, the latter *will* be done. Example: the car market was once thought to be limited by the number of available professional chauffeurs. So the need for professional chauffeurs was removed; otherwise, Henry Ford wouldn’t have had a market.
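To make items 3 and 4 a bit more concrete: here is a toy sketch of parsing verbalized facts into structured fact tuples. It is nowhere near SBVR-compliant; the mini-grammar, entity names, and cardinality notation are my own illustration, not any real tool’s.

```python
import re

# Toy grammar for sentences like "One Client orders one or more Products."
# Recognizes a determiner, a capitalized subject, a verb phrase, a
# cardinality phrase, and a capitalized object (with optional plural "s").
FACT = re.compile(
    r"(?:One|A|An)\s+(?P<subject>[A-Z]\w+)\s+"
    r"(?P<verb>\w+(?:\s\w+)*?)\s+"
    r"(?P<card>one or more|exactly one|a)\s+"
    r"(?P<object>[A-Z]\w+?)s?\."
)

def parse_fact(sentence: str):
    """Parse a verbalized fact into (subject, verb, object, cardinality)."""
    m = FACT.match(sentence)
    if not m:
        return None
    card = "1..*" if m.group("card") == "one or more" else "1..1"
    return (m.group("subject"), m.group("verb"), m.group("object"), card)

print(parse_fact("One Client orders one or more Products."))
# → ('Client', 'orders', 'Product', '1..*')
print(parse_fact("A Client is a Party."))
# → ('Client', 'is', 'Party', '1..1')
```

The point is not the regex; it is that once the vocabulary is constrained, the sentences become machine-readable, which is exactly what item 3 claims about ordering the language of the conceptual model.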

I cannot say with certainty that AI will be able to fully automate the entire modeling process (especially not reverse engineering). But what I *can* say is that a large part of what a modeler does can be automated even today. An even larger part can be done with AI, and the final part can be done by human and AI together.

In the comments, one person said that AI deep learning and modeling are (mathematically speaking) fundamentally different activities and that therefore you cannot automate the modeling. I agree that they are different, but I do not think it matters for the end result. Playing Go is fundamentally different from translating text, but both can be handled by the same type of algorithm. I foresee this will be similar to modeling, for most cases. The fact that you cannot handle all cases is irrelevant: see item 9.

What about source models without meaning, which humans can model today, another comment asked. I do not think that is a relevant issue: reverse engineering will remain difficult, but it is not the issue I am concerned with, because reverse engineering means turning an illegible set of fields and tables back into a conceptual, textual model. My point is that the forward process of turning text into tables and fields is easy to automate; the other way round is inherently as difficult as turning mayonnaise back into eggs: the arrow of time, entropy, just does not fly that way.
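To illustrate how mechanical that forward direction is, here is a naive generator that turns fact tuples of the kind sketched above into SQL DDL. The naming conventions and the one-table-per-entity mapping are simplifying assumptions of mine, not a real derivation engine.

```python
# Toy "text to tables" generator: given (subject, verb, object, cardinality)
# fact tuples, emit naive SQL DDL. For a 1..* relation, the "many" side
# gets a foreign key referencing the "one" side.
def facts_to_ddl(facts):
    entities = set()
    fks = []  # (child, parent) pairs derived from 1..* relations
    for subject, _verb, obj, cardinality in facts:
        entities.update([subject, obj])
        if cardinality == "1..*":
            fks.append((obj, subject))
    statements = []
    for entity in sorted(entities):
        name = entity.lower()
        columns = [f"  {name}_id INTEGER PRIMARY KEY"]
        for child, parent in fks:
            if child == entity:
                p = parent.lower()
                columns.append(f"  {p}_id INTEGER REFERENCES {p}({p}_id)")
        statements.append(
            f"CREATE TABLE {name} (\n" + ",\n".join(columns) + "\n);")
    return "\n".join(statements)

print(facts_to_ddl([("Client", "orders", "Product", "1..*")]))
```

Going the other way, from those bare `client`/`product` tables back to the sentence “One Client orders one or more Products”, is the hard, lossy direction: the semantics have already been boiled off.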

Martijn Evers stated that “given sufficient quality in models that represent semantic concerns, and sufficient quality in deriving logical models we can nowadays already generate all kinds of implementation models. No AI is needed here. This could change for very large and complex implementation models, but by and large we can do a lot without AI in this respect.” I agree. I just think that using AI reduces the necessary quality of the model to the point where even people with moderate amounts of training can handle this competently. Which is the point of item 11.

Most commenters agreed that a human-machine hybrid approach was in the works. One even pointed out an existing (and very good!) recent article about this topic: The question thus is “when” and “how”, not “if”.

Please note: this article was also posted on LinkedIn:

Image credits: created by Alejandro Zorrilal Cruz [public domain], via Wikimedia Commons. Source:

One thought on “Is data modeling the next job to be replaced by AI?”

  1. Arjan Bos

    Sir, Madam,

Thank you for your interesting post. I can, and do, agree with almost everything you state. When the semantics of the universe of discourse are clear, most models can be derived from them. The one thing that is still missing is that in the semantic layer – ontologies, taxonomies – you cannot, or should not, address the concerns pertinent to the logical, physical and implementation layers. For example, where do you put the information that a bitemporal solution is needed for the phrase “Martijn is the first name of @dm_unseen”?
    When we separate concerns, as we should because it makes a lot of things easier, we need the models on all the layers, and we need to put in the solutions for the concerns for that specific layer.

    But otherwise, great post!

