Model Driven Software Engineering for Data Warehousing  

Part 1: Code Generation 

“Everything should be made as simple as possible, but not simpler.” – Attributed to Albert Einstein 
April 2023, Maarten Manders

This article is the first in a series in which I want to give an overview of how I think Model Driven Software Engineering (MDSE) can be used for data warehousing. It covers the concepts of model driven software engineering and why you should use it when developing your Data Warehouse (DWH). Finally, the topics described in this article are illustrated with a Python code example. 

What is Model Driven (Software) Engineering? 

According to Wikipedia (https://en.wikipedia.org/wiki/Model-driven_engineering, 18 November 2022): 

“Model-driven engineering (MDE) is a software development methodology that focuses on creating and exploiting domain models, which are conceptual models of all the topics related to a specific problem. Hence, it highlights and aims at abstract representations of the knowledge and activities that govern a particular application domain, rather than the computing (i.e., algorithmic) concepts.”

I have used Model Driven Software Engineering throughout my professional career, and for the last 8 years in combination with data warehouse automation. In my view, model driven software engineering can be expressed in terms of (meta-)models, model-to-model transformations and code generation. 

Models describe objects in a (problem) domain. Objects in the model should capture all important concepts of the domain. To make sure models do not become too complex, care must be taken not to overdo it: models should not contain too many details. When I apply MDSE, I use what is generally called a meta-model (a model describing models) to define which objects can be used in models. The meta-model must be complete in the sense that it captures all important concepts of the domain, but not too extensive, as this tends to complicate matters. 

The concepts of model and meta-model translate directly to databases. One could say that the table structures are meta-models: they define what can be stored, for example, that an email address is a string of at most 128 characters. Primary key constraints and foreign key constraints are also part of the meta-model, as they limit the data that can be stored in a table. The data in the tables corresponds to models: it contains actual data such as the email address example@blog.com. 
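This database analogy can be made concrete with a small sketch (using SQLite purely for illustration): the CREATE TABLE statement plays the role of the meta-model, the rows play the role of the model, and the constraints reject data that does not fit the meta-model.

```python
import sqlite3

# Meta-model: the table definition constrains what *can* be stored,
# e.g. an email address is a string of at most 128 characters.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL CHECK (length(email) <= 128)
    )
""")

# Model: the actual data stored in the table.
con.execute("INSERT INTO customer (email) VALUES (?)", ("example@blog.com",))

# Data that violates the meta-model is rejected by the database.
try:
    con.execute("INSERT INTO customer (email) VALUES (?)", ("x" * 200,))
except sqlite3.IntegrityError:
    print("rejected by CHECK constraint")
```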

What you often see in Enterprise Architecture tools is that people develop logical data models. Eventually, these logical models must be implemented in a database, but to do this a number of extra steps are required to transform objects at the logical level into objects at the physical level. These tools support automation of this process. This is what is generally called model-to-model transformation, or M2M. For example, entities in the logical domain are transformed into tables in the physical domain, attributes into columns and relations into foreign key constraints. Often, the model in the source domain of the transformation lacks information to fully transform into the target domain. An example of this is that the type information of logical attributes is missing or not explicit enough to translate directly into column data types in the physical domain. The image below shows the conceptual picture of model-to-model transformations. 

Models are nice, but to use them it is often necessary to convert them into some textual format, such as code or documentation. To do this, I usually use templates. What this text is depends on my requirements. It can be executable code, such as ETL statements to load data into tables, or descriptive, like documentation about data lineage, or anything else that can be represented in a textual format. A template engine applies the template to the model and generates the textual result: it replaces parts of the provided template with parts of the provided model, producing one or more textual outputs. This approach is referred to as code generation. The last section of this article contains an example that makes this more concrete. It is good practice to separate functionality and form: the model contains the concepts and the templates define the output syntax. To keep templates simple, they should focus purely on generating text. The image below shows the conceptual picture of model-to-text transformations. 
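The principle can be sketched with Python's built-in string.Template, a very small template engine (the example later in this article uses the more powerful Jinja2): the template defines the output syntax, and the model supplies the values.

```python
from string import Template

# The template defines the form of the output...
template = Template("CREATE TABLE $schema.$table (...);")

# ...and the model supplies the concepts.
model = {"schema": "orders", "table": "Customer"}

# The template engine replaces the placeholders with model values.
print(template.substitute(model))  # CREATE TABLE orders.Customer (...);
```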

When to use Model Driven Software Engineering 

In my experience, the practical benefits of this methodology are threefold. 

  1. Reduce repetitive work 

I believe that it is important to work on ‘fun’ things. Doing the same thing over and over again does not qualify as fun (at least for me). When developing a data warehouse, many tasks are repetitive. Maybe the most common example is creating a staging layer in which data from source systems is stored before being loaded into the actual DWH. Often the structure of the source tables is copied and several common columns to store processing data are added. This activity has to be done for every relevant table in every source system. I already get bored writing about it! 

  2. Reduce the risk of errors 

When doing manual work there is a high risk of introducing errors. Typos, for instance. Often, finding and fixing these errors requires a lot of work (the later in the development process an error is found, the more it costs to fix). When applying MDSE, all generated text is based on the same models and templates. This means that it is enough to validate the correctness of your models and your templates to make sure that the result of the code generation process is correct. Especially when applied at large scale, this will save you a lot of headaches. 

  3. Increase efficiency 

One could argue that manual work is s-l-o-w. According to the Guinness World Records, since 2005 the official holder of the title of fastest typist in the world on a computer is Barbara Blackburn, with a top typing speed of 212 words per minute. Although this is way, way faster than I will ever type, it is far from the performance you get when generating your code. When applying MDSE at large scale, the time it takes to develop and maintain your models and templates is small compared to the time it would take to write the code manually. This means that developers have more time to work on fun things (or to drink coffee!). 

The points in the list above indicate that applying MDSE becomes profitable when you have workloads that are highly repetitive. As the setup for MDSE (defining your (meta-)models, transformations and templates) takes time, doing it for only a small number of applications is not efficient. 

Code Generation Example

This section illustrates the concepts described in the previous sections with a Python code example. In this example, I will show how to convert entities in a logical model to tables in a physical model (RDBMS) using a model-to-model transformation. The tables in the physical model are then transformed into actual T-SQL DDL statements. The code in this section can be found on GitHub: https://github.com/Maarten-gh/mdsedwhblog/. 

The Logical Domain 

The first step is to create the model that we need. The model contains the orders domain, which in turn contains three entities: Customer, Product and Order. Every entity has properties that define its state and relations to other entities. The classes LogicalModel, Domain, Entity, Property and Relation are custom classes that I created for readability. These classes form the meta-model of the logical domain.
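The class definitions themselves are not shown in this article. A minimal sketch of what they could look like, using Python dataclasses (the actual definitions in the repository may differ):

```python
from dataclasses import dataclass, field

# Minimal sketch of the logical meta-model classes used below.
# Collections default to empty lists so entities without relations
# (such as Customer and Product) can omit them.

@dataclass
class Property:
    name: str
    datatype: str

@dataclass
class Relation:
    role: str
    domain_name: str
    entity_name: str

@dataclass
class Entity:
    name: str
    properties: list[Property] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

@dataclass
class Domain:
    name: str
    entities: list[Entity] = field(default_factory=list)

@dataclass
class LogicalModel:
    domains: list[Domain] = field(default_factory=list)
```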

 

orders_domain_model = LogicalModel(
    domains=[
        Domain(
            name="orders",
            entities=[
                Entity(
                    name="Customer",
                    properties=[
                        Property(
                            name="address",
                            datatype="string",
                        ),
                    ],
                ),
                Entity(
                    name="Product",
                    properties=[
                        Property(
                            name="description",
                            datatype="string",
                        ),
                    ],
                ),
                Entity(
                    name="Order",
                    properties=[
                        Property(
                            name="orderTimestamp_utc",
                            datatype="timestamp",
                        ),
                        Property(
                            name="amount",
                            datatype="int",
                        ),
                    ],
                    relations=[
                        Relation(
                            role="orderedBy",
                            domain_name="orders",
                            entity_name="Customer",
                        ),
                        Relation(
                            role="orderFor",
                            domain_name="orders",
                            entity_name="Product",
                        ),
                    ],
                ),
            ],
        ),
    ],
)

 

Transforming from the Logical Domain to the Physical Domain 

Now, assume that we want to create a data model in a database to store data for this domain. To simplify this, we can transform the model above into a model from which DDL code can be generated. This means that we need to: 

  • convert domains to schemas, 
  • convert entities to tables, 
  • convert properties to columns with correct data types, 
  • add primary key constraints by introducing an extra id column to each table, and 
  • add foreign key constraints for each relation. 

We can define a model-to-model transformation (M2M) that implements this conversion. The Python function logical_model_to_physical_model(logical_model) below implements this transformation. The classes PhysicalModel, Schema, Table, Column, PrimaryKeyConstraint and ForeignKeyConstraint are convenience classes. These classes form the meta-model of the physical domain. The functions shown below define transformations from logical objects to physical objects using their meta-model definitions. 
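As with the logical domain, the physical meta-model classes are not shown in full in the article. A minimal sketch of what they could look like follows; note in particular that the fulltype property on Column is an assumption, inferred from its use in the template and the generated output later in this article.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal sketch of the physical meta-model classes (the actual
# definitions in the repository may differ).

@dataclass
class Column:
    name: str
    datatype: str
    nullable: bool = True
    length: Optional[int] = None
    scale: Optional[int] = None

    @property
    def fulltype(self) -> str:
        # Combine datatype, optional length/scale and nullability into
        # a full column type, e.g. "nvarchar(255) NULL". Assumed from
        # the template output; not shown in the article.
        t = self.datatype
        if self.length is not None and self.scale is not None:
            t += f"({self.length}, {self.scale})"
        elif self.length is not None:
            t += f"({self.length})"
        return f"{t} {'NULL' if self.nullable else 'NOT NULL'}"

@dataclass
class PrimaryKeyConstraint:
    name: str
    column_names: list[str]

@dataclass
class ForeignKeyConstraint:
    name: str
    column_names: list[str]
    foreign_schema_name: str
    foreign_table_name: str
    foreign_column_names: list[str]

@dataclass
class Table:
    name: str
    columns: list[Column] = field(default_factory=list)
    primary_key_constraint: Optional[PrimaryKeyConstraint] = None
    foreign_key_constraints: list[ForeignKeyConstraint] = field(default_factory=list)

@dataclass
class Schema:
    name: str
    tables: list[Table] = field(default_factory=list)

@dataclass
class PhysicalModel:
    schemas: list[Schema] = field(default_factory=list)
```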

 

from typing import Any


def logical_model_to_physical_model(logical_model: LogicalModel) -> PhysicalModel:
    return PhysicalModel(
        schemas=[domain_to_schema(d) for d in logical_model.domains]
    )


def domain_to_schema(domain: Domain) -> Schema:
    return Schema(
        name=domain.name,
        tables=[entity_to_table(e) for e in domain.entities],
    )


def entity_to_table(entity: Entity) -> Table:
    id_column = Column(
        name="id",
        datatype="uniqueidentifier",
        nullable=False,
    )

    primary_key_constraint = PrimaryKeyConstraint(
        name=f"pk_{entity.name}",
        column_names=["id"],
    )

    return Table(
        name=entity.name,
        columns=[id_column] + [property_to_column(p) for p in entity.properties] + [
            relation_to_column(r) for r in entity.relations],
        primary_key_constraint=primary_key_constraint,
        foreign_key_constraints=[
            relation_to_foreign_key_constraint(r) for r in entity.relations],
    )


def property_to_column(property: Property) -> Column:
    return Column(
        name=property.name,
        nullable=True,
        **property_datatype_to_column_datatype(property),
    )


def relation_to_column(relation: Relation) -> Column:
    return Column(
        name=f"{relation.role}_{relation.entity_name}_id",
        datatype="uniqueidentifier",
        nullable=False,
    )


def relation_to_foreign_key_constraint(relation: Relation) -> ForeignKeyConstraint:
    return ForeignKeyConstraint(
        name=f"fk_{relation.role}_{relation.entity_name}",
        column_names=[f"{relation.role}_{relation.entity_name}_id"],
        foreign_schema_name=relation.domain_name,
        foreign_table_name=relation.entity_name,
        foreign_column_names=["id"],
    )


def property_datatype_to_column_datatype(property: Property) -> dict[str, Any]:
    """
    Returns a column datatype, length and scale based on the data type of the
    provided property. If the type cannot be resolved, the datatype of the
    property is used and length and scale are both `None`.
    """
    result = dict(
        datatype=property.datatype,
        length=None,
        scale=None,
    )

    if property.datatype == "string":
        result["datatype"] = "nvarchar"
        result["length"] = 255
    elif property.datatype == "timestamp":
        result["datatype"] = "datetime2"

    return result

The transformation consists of several functions that convert different components of the logical data model into corresponding components of the physical data model. Here is a brief description of each function:  

  • logical_model_to_physical_model(logical_model: LogicalModel) -> PhysicalModel:  
    This function takes a LogicalModel object as input and returns a PhysicalModel object. It does so by iterating through each domain in the logical model, calling the domain_to_schema() function on each domain, and collecting the resulting schemas in a list. The list of schemas is then used to create the PhysicalModel object. 
  • domain_to_schema(domain: Domain) -> Schema: 
    This function takes a Domain object as input and returns a Schema object. It does so by creating a Schema object with the name of the domain, and by iterating through each entity in the domain, calling the entity_to_table() function, adding the resulting list of tables to the schema. 
  • entity_to_table(entity: Entity) -> Table: 
    This function takes an Entity object as input and returns a Table object. It creates an id Column object and a PrimaryKeyConstraint object. It iterates over the properties of the entity to create Column objects by calling the property_to_column function on each property. Moreover, it creates a Column object and a ForeignKeyConstraint object for each relation of the entity by calling the functions relation_to_column and relation_to_foreign_key_constraint respectively. 
  • property_to_column(property: Property) -> Column: 
    This function takes a Property object and transforms it to a Column object. It uses the name of the property as the name of the column. The created column will be nullable, and type information is added by calling the property_datatype_to_column_datatype function.  
  • relation_to_column(relation: Relation) -> Column: 
    This function takes a Relation object and returns a Column object. It does so by creating a column name based on the entity and role of the relation. The datatype of the created column is uniqueidentifier and the column is not nullable. 
  • relation_to_foreign_key_constraint(relation: Relation) -> ForeignKeyConstraint: 
    This function takes a Relation object as input and returns a ForeignKeyConstraint object. The name of the foreign key is based on the entity and the role of the relation. The column names of the foreign key object are the same as the name of the column created by calling the relation_to_column function on the provided Relation object. The foreign schema, table and columns of the foreign key object refer to the corresponding schema, table and id column respectively. 
  • property_datatype_to_column_datatype(property: Property) -> dict[str, Any]: 
    This function takes a Property object as input and returns a dictionary with the keys datatype, length and scale. By default, the datatype is set to the datatype of the input Property object and length and scale are None. If the datatype of the Property object is string or timestamp, the values for datatype and length are changed. Note that this function adds extra datatype information. For example, the length of nvarchar columns is fixed at 255. 
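As a quick sanity check, the datatype mapping described in the last bullet can be exercised in isolation. The snippet below restates the mapping with a minimal stand-in for the Property class, purely for illustration:

```python
from dataclasses import dataclass

# Minimal stand-in for the Property class, for illustration only.
@dataclass
class Property:
    name: str
    datatype: str

def property_datatype_to_column_datatype(property: Property) -> dict:
    # Same mapping as in the transformation above: 'string' becomes
    # nvarchar(255), 'timestamp' becomes datetime2, anything else
    # passes through unchanged with no length or scale.
    result = dict(datatype=property.datatype, length=None, scale=None)
    if property.datatype == "string":
        result["datatype"] = "nvarchar"
        result["length"] = 255
    elif property.datatype == "timestamp":
        result["datatype"] = "datetime2"
    return result

print(property_datatype_to_column_datatype(Property("address", "string")))
print(property_datatype_to_column_datatype(Property("amount", "int")))
```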

 

Code Generation 

Now that the database model is created, we can easily generate DDL code. The code below shows a simple Jinja2 template that is used to generate the code. There are a number of important things to notice (more details can be found here: https://jinja.palletsprojects.com/en/3.0.x/templates/). 

  • {% for value in sequence %} … {% endfor %} loops through all values in sequence and makes each element available as value in the part between for and endfor. The text outside these delimiters is written in the host language (here SQL), i.e., it is not part of the template language, but it can contain other template parts. 
  • {{ expression }} outputs the value of expression. 
  • {# comment #} is a comment; its contents will not appear in the output. 

For example, the following template… 

{# Loop through the list of names and output a greeting. #}
{% for name in ['World', 'Maarten', 'Reader'] %}
Hello {{ name }}!
{% endfor %}

results in the following output. 

Hello World!

Hello Maarten!

Hello Reader!

To generate DDL code for the physical model that was created in the previous section, the code shown below can be used. 

dml_template_text = """
{# Define functionality to quote and join lists of names #}
{% macro q(names) -%}
[{{ names|join('].[') }}]
{%- endmacro %}

{# Define a macro to put commas ',' before items in a loop,
   except before the first item #}
{% macro c(loop) -%}
{{ '  ' if loop.first else ', ' }}
{%- endmacro %}

{# Loop through all schemas and create SQL DDL statements -#}
{% for schema in schemas %}
CREATE SCHEMA {{ q([schema.name]) }};
GO

{# Loop through all tables in the schema and output a CREATE TABLE statement -#}
{% for table in schema.tables %}
CREATE TABLE {{ q([schema.name, table.name]) }} (
{% for column in table.columns -%}
{{ c(loop) }}{{ q([column.name]) }} {{ column.fulltype }}
{% endfor -%}
, CONSTRAINT {{ q([table.primary_key_constraint.name]) }}
    PRIMARY KEY ({{ q(table.primary_key_constraint.column_names) }})
);
GO
{% endfor -%}

{# Loop through tables with foreign keys and output FOREIGN KEY constraints -#}
{% for table in schema.tables|selectattr('foreign_key_constraints') %}
ALTER TABLE {{ q([schema.name, table.name]) }}
ADD
{% for fkey in table.foreign_key_constraints -%}
{{ c(loop) }}CONSTRAINT {{ q([fkey.name]) }}
    FOREIGN KEY
      ({{ q(fkey.column_names) }})
    REFERENCES {{ q([fkey.foreign_schema_name, fkey.foreign_table_name]) }}
      ({{ q(fkey.foreign_column_names) }})
{% endfor -%}
;
GO
{% endfor %}

{% endfor %}
"""

  

  

def generate_code(template_text: str, model: object) -> str:
    """Generates code based on a model object and a template string."""
    from jinja2 import Environment, BaseLoader
    template = Environment(loader=BaseLoader()).from_string(template_text)

    return template.render(**model.__dict__).strip() + '\n'


def generate_dml_code(physical_model: PhysicalModel) -> str:
    """Generates DDL code based on a database model."""
    return generate_code(dml_template_text, physical_model)


orders_database_model = logical_model_to_physical_model(orders_domain_model)
dml_code = generate_dml_code(orders_database_model)

The first step of the code above is to define the template text. This template text might look a little bit daunting, but I’ll try to explain it. Note that the SQL fragments are the literal output text; the rest is Jinja template code. First, two macros are defined: q, which is used throughout the template to quote names, and c, which places a comma before every item in a loop except the first. Then the actual work begins. The template loops through the schemas and creates a CREATE SCHEMA statement for each schema. Then the template loops over each table in the schema and creates a CREATE TABLE statement that contains a column definition for each column in the table and a primary key constraint definition. The last part of the template loops through the tables in the schema and generates an ALTER TABLE statement that adds the foreign key constraints to each table that has foreign keys defined. 

The function generate_code takes a template_text string and a model object as input and returns a string. The result is created by building a Jinja2 template object from the provided template_text and applying it to the provided model object using the render method. This is the basic implementation of our template engine. 

The generate_dml_code function applies the generate_code function to the dml_template_text and a PhysicalModel object. Executing the code results in the following T-SQL output. 

CREATE SCHEMA [orders];
GO

CREATE TABLE [orders].[Customer] (
  [id] uniqueidentifier NOT NULL
, [address] nvarchar(255) NULL
, CONSTRAINT [pk_Customer]
    PRIMARY KEY ([id])
);
GO

CREATE TABLE [orders].[Product] (
  [id] uniqueidentifier NOT NULL
, [description] nvarchar(255) NULL
, CONSTRAINT [pk_Product]
    PRIMARY KEY ([id])
);
GO

CREATE TABLE [orders].[Order] (
  [id] uniqueidentifier NOT NULL
, [orderTimestamp_utc] datetime2 NULL
, [amount] int NULL
, [orderedBy_Customer_id] uniqueidentifier NOT NULL
, [orderFor_Product_id] uniqueidentifier NOT NULL
, CONSTRAINT [pk_Order]
    PRIMARY KEY ([id])
);
GO

ALTER TABLE [orders].[Order]
ADD
  CONSTRAINT [fk_orderedBy_Customer]
    FOREIGN KEY
      ([orderedBy_Customer_id])
    REFERENCES [orders].[Customer]
      ([id])
, CONSTRAINT [fk_orderFor_Product]
    FOREIGN KEY
      ([orderFor_Product_id])
    REFERENCES [orders].[Product]
      ([id])
;
GO

Conclusion 

Based on the code in the example, you might wonder why you should go through all the trouble of defining a logical model, a model-to-model transformation and a model-to-text transformation just to generate some 45 lines of SQL code. Hand-coding the SQL would have been more efficient. 

Of course, you are right. For this example. In the next article I will use the MDSE principles to generate actual data warehousing artifacts such as DML and ETL code and lineage documentation. The fact that the same models will be the source for all these artifacts shows that the benefits mentioned in this article become more apparent as the number of model objects and output objects increases.