Standards and Conventions
These standards are opinionated and designed to ensure consistency, governance, and automation across an organisation. They largely reflect an Azure Databricks environment but can be adapted to other platforms; organisations should tailor them to fit their existing internal conventions.
Mesh
Domain Names
All lower case: {optional:organisation_}{functional area/domain}_{subdomain}
e.g: intuitas_corporate
Platform
Environment
- Environment name: dev/test/prod/sandbox/poc (pat, for production acceptance testing, may be added if preferred)
VNET
- Name:
vn-{organisation_name}-{domain_name} - e.g: vn-intuitas-corporate
Resource Groups
- Name:
rg-{organisation_name}-{domain_name} - e.g: rg-intuitas-corporate
Databricks workspace
- Name:
ws-{organisation_name}-{domain_name} - e.g: ws-intuitas-corporate
Key vault
- Name:
kv-{organisation_name}-{domain_name} - e.g: kv-intuitas-corporate
Secrets
- Name:
{secret_name}
Entra Group Names
- Name:
eg-{organisation_name}-{domain_name} - e.g: eg-intuitas-corporate
Azure Data Factory (ADF)
- Name:
adf-{organisation_name}-{domain_name} - e.g: adf-intuitas-corporate
SQL Server
- Name:
sql-{organisation_name}-{domain_name} - e.g: sql-intuitas-corporate
SQL Database
- Name:
sqldb-{purpose}-{organisation_name}-{domain_name}-{optional:environment} - e.g: sqldb-metadata-intuitas-corporate
Storage
This section describes naming standards and conventions for cloud storage resources.
Lakehouse storage
- Lakehouse storage account name:
dl{organisation_name}{domain_name}
Lakehouse storage containers
- Name:
{environment} (dev/test/preprod/prod)
Lakehouse storage folders
- Level 1 Name: {zone} (raw/edw/infomart), or (bronze/silver/gold) if using the medallion approach
- Level 2 Name: {layer_name}
e.g:
- raw/landing
- raw/ods
- raw/pds
- raw/schema (for autoloader metadata)
- raw/checkpoints (for autoloader metadata)
- edw/ (automatically determined by Unity Catalog)
- infomart/ (automatically determined by Unity Catalog)
Generic Blob storage
Generic Blob storage can be used for all non-lakehouse data; alternatively, such data can be stored within the lakehouse storage account in an appropriate container and folder.
- Resource: ADLSGen2
- Generic storage account name: sa{organisation_name}{domain_name}{functional_description}
- Tier: Standard/Premium (depends on workload)
- Redundancy:
  - Minimum ZRS or GRS for prod
  - Minimum LRS for poc, dev, test and preprod
Generic Blob files and folders
No standard naming conventions for files and folders.
Databricks
This section provides naming standards and conventions for Databricks.
Workspace and cluster names
- Workspace name: ws-{organisation_name}-{domain_name}
- Cluster name: {personal_name/shared_name} Cluster
- Workflow name: {dev/test} {workflow_name}
Jobs and Pipelines
Job names
- Job names: {domain}__{zone}__{purpose}__{source}{optional: __target}{optional: __schedule}{optional: __version}__{env}
- e.g. clinical__raw__ingest__fhircdr__dev
For Delta Live Table (DLT) Pipelines
- Pipeline names: {domain_name}__{zone}__pipeline__{dataset}{optional: __schedule}{optional: __version}__{env}
- e.g. clinical__raw__pipeline__fhircdr__dev
- e.g. supplychain__infomart__pipeline__inventorymart__prod
Note on zone terminology:
- This framework uses raw/edw/infomart terminology (aligns with enterprise data warehousing concepts)
- Databricks documentation uses bronze/silver/gold terminology (medallion architecture)
- Both are valid; choose one and use consistently:
- raw = bronze (raw ingested data)
- edw = silver (cleaned, conformed data)
- infomart = gold (business-ready aggregated data)
- Note on {zone}: If the 'business outcome' is Infomart, call it Infomart, even if it produces Raw + EDW on the way, i.e. "This is the production DLT pipeline in the supply chain domain, which builds and maintains the curated Infomart zone dataset called Inventory Mart."
- Include "pipeline" in the name so it is distinguishable from ad hoc jobs.
- Dataset can be a logical grouping (e.g., patient, encounter, claims).
Orchestration job names
- Orchestration job names: {domain}__orchestration__{workflow_name}{optional: __schedule}{optional: __version}__{env}
- e.g. clinical__orchestration__fhirworkflow__daily__dev
which then orchestrates:
- clinical__raw__pipeline__fhircdr__prod (or clinical__bronze__pipeline__fhircdr__prod if using medallion terminology)
- clinical__edw__pipeline__fhirclean__prod (or clinical__silver__pipeline__fhirclean__prod if using medallion terminology)
- clinical__infomart__pipeline__clinicalmart__prod (or clinical__gold__pipeline__clinicalmart__prod if using medallion terminology)
Job logging
- Event log catalog: {domain}__audit__{env}
- Event log schema: audit__event_log (see the example query below)
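A minimal sketch of querying a pipeline event log under this convention, assuming the log has been published to the audit catalog and schema above as a table named event_log (the table name and columns mirror the DLT event log layout and are assumptions here):

```sql
-- Query recent flow-progress events from the published event log.
-- Catalog and schema follow the convention above; the event_log table name
-- and the origin/details columns are assumptions based on the DLT event log layout.
SELECT
  timestamp,
  event_type,
  origin.pipeline_name AS pipeline_name,
  details
FROM clinical__audit__dev.audit__event_log.event_log
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
LIMIT 100;
```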
Optional
- Versioning (if needed): add v1, v2 if a job is redesigned but old one stays around.
- Scheduling frequency (optional): suffix with _hourly, _daily, _weekly if relevant.
Catalog naming and conventions
Refer to Data zones and layers for further context and definitions applicable to this section.
Catalog name:
The choice of granularity depends on domain topology, zone and layer conventions, and the desired level of segregation for access and sharing controls (i.e. catalog or layer level).
- Minimum granularity (domain level): {domain_name}{__environment (dev/test/pat/prod)} (prod is implied if omitted) e.g: corporate__dev
- Optional granularity (domain-zone level): {domain_name}{_zone (raw/edw/infomart)}{__environment (dev/test/pat/prod)} e.g: corporate_raw__dev
- Optional granularity (domain-zone-layer level): {domain_name}{_zone (raw/edw/infomart)}{_layer (ods/pds/stg/ref/mart)}{__environment (dev/test/pat/prod)} e.g: corporate_raw_ods__dev
- Optional granularity (domain-layer level): {domain_name}{_layer (ods/pds/stg/ref/mart)}{__environment (dev/test/pat/prod)} e.g: corporate_ods__dev
- Optional granularity (subdomain-zone level): {domain_name}{_descriptor (subdomain/subject/project*)}{_zone (raw/edw/infomart)}{__environment (dev/test/pat/prod)} e.g: corporate_finance_raw__dev
In the examples provided, we have opted for the domain level, with schema separation for the lower levels of grain via prefixes, i.e. engineering__dev.ods__fhirhouse__dbo__lakeflow.encounter.
Note that projects are temporary constructs and hence are not recommended for use in names.
- Catalog storage root:
abfss://{environment}@dl{organisation_name}{domain_name}.dfs.core.windows.net/{domain_name}_{environment}_catalog
Externally mounted (lakehouse federation) Catalog Names
- Foreign catalog name: {domain_name (owner)}__fc__{source_system}{optional: __other_useful_descriptors e.g. __environment}
- e.g: corporate__fc__sqlonpremsource
- e.g: corporate__fc__sqlonpremsource__prod
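A minimal sketch of registering such a foreign catalog, assuming a Unity Catalog connection named sqlonpremsource_connection (hypothetical) already exists for the on-premises SQL Server:

```sql
-- Create a federated (foreign) catalog over an existing UC connection.
-- The connection name and source database name are illustrative assumptions.
CREATE FOREIGN CATALOG IF NOT EXISTS corporate__fc__sqlonpremsource
USING CONNECTION sqlonpremsource_connection
OPTIONS (database 'SourceDb');
```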
Catalog Metadata tags:
The following metadata should be added when creating a catalog (see the example below):
- Key: domain (owner): {domain_name}
- Key: environment: {environment}
- Key: managing_domain: {domain_name} e.g. if delegating to the engineering domain
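A minimal sketch of creating a domain-level catalog with the storage root convention and the tags above (the storage account name and tag values are illustrative):

```sql
-- Create the catalog with its storage root per the convention above.
CREATE CATALOG IF NOT EXISTS corporate__dev
MANAGED LOCATION 'abfss://dev@dlintuitascorporate.dfs.core.windows.net/corporate_dev_catalog';

-- Apply the recommended metadata as catalog tags.
ALTER CATALOG corporate__dev
SET TAGS ('domain' = 'corporate', 'environment' = 'dev', 'managing_domain' = 'engineering');
```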
Schema and object conventions
Refer to Data zones and layers for further context and definitions applicable to this section.
Refer to Physical Models for the standard relating to column naming, types, and conventions.
Terminology Hierarchy:
Data is organized in a three-level hierarchy:
- Zone = Top-level organizational boundary (Raw, EDW, Infomart); within each zone, data flows through layers
- Layer = Processing layers within a zone (landing, ods, pds, stg, ref, mart, etc.); each layer typically maps to a schema
- Schema = Unity Catalog schema that stores objects (ods, pds, edw_stg, edw_ref, edw, im_stg, im)
Example: The Raw zone contains the ODS layer, which maps to the ods schema.
Schema level external storage locations
Recommendations:
- For managed tables (default): do nothing. Let dbt create schemas without additional configuration; Databricks will manage storage and metadata. Objects will then be stored in the catalog storage root, e.g: abfss://dev@dlintuitas{domain}.dfs.core.windows.net/engineering__dev_catalog/__unitystorage/catalogs/catalog-guid/tables/object-guid
- For granular control over schema-level storage locations: pre-create schemas with LOCATION mapped to external paths, or configure the catalog-level location (see the sketch below).
- Ensure dbt's dbt_project.yml and environment variables align with storage locations.
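A minimal sketch of pre-creating a schema with an explicit storage location (the path is illustrative and must correspond to a registered external location):

```sql
-- Pre-create the ODS schema with a managed location so objects land in a
-- predictable path rather than the catalog storage root.
CREATE SCHEMA IF NOT EXISTS corporate__dev.ods
MANAGED LOCATION 'abfss://dev@dlintuitascorporate.dfs.core.windows.net/corporate_dev_catalog/ods';
```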
Metadata Schemas and Objects
Refer to Data zones and layers for further context and definitions applicable to this section.
This contains metadata that supports engineering and governance; the exact objects will vary depending on engineering and governance toolsets.
Engineering - ingestion framework:
- Schema naming convention: meta{optional: __function}
- Object naming convention: {function/descriptor}
- e.g: corporate__dev.meta.ingestion_control
- e.g: corporate__dev.meta__ingestion.ingestion_control
Raw (aka Bronze) (Data according to systems)
The Raw zone stores raw, immutable data as it is ingested from source systems. See Data zones and layers for definitions and context.
The Raw zone uses two primary schemas:
- ods - Operational Data Store (current state)
- pds - Persistent Data Store (historical snapshots)
In the examples provided, we have opted for domain-level catalogs, with schema separation for the lower levels of grain via prefixes, i.e. engineering__dev.ods__fhirhouse__dbo__lakeflow.encounter.
Note: Using consistent double underscore (__) delimiters throughout ensures compatibility with Databricks and avoids parsing issues. Parentheses notation (e.g., dbo(adf)) should be avoided in schema and object names.
Persistent Landing:
Persistent Landing uses Unity Catalog Volumes for storing raw files and unstructured data as they arrive from source systems.
Volume naming convention: [domain][__env].[layer][__source_system][__source_schema][__channel].[source_object/volume_name]
- [domain][__env]: Domain and environment identifier (e.g., corporate__dev, engineering__prod)
- [layer]: Data layer identifier within the Raw zone (e.g., landing)
- [__source_system]: Source system identifier (e.g., workdayapi, saphr)
- [__source_schema]: Optional source schema or subsystem identifier
- [__channel]: Optional ingestion channel/method (e.g., __adf, __fivetran, __api)
- [source_object/volume_name]: Descriptive volume name or source object identifier
- e.g: corporate__dev.landing__workdayapi.schedule_volume
- e.g: engineering__prod.landing__fhirapi__patients.raw_data
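A minimal sketch of creating a managed landing volume per the convention above (names follow the first example; the comment text is illustrative):

```sql
-- Create a managed volume for raw files landed from the Workday API.
CREATE VOLUME IF NOT EXISTS corporate__dev.landing__workdayapi.schedule_volume
COMMENT 'Raw schedule files landed from the Workday API';
```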
Operational Data Store (ODS):
The objective of Raw zone conventions is to provide clarity over which zone, layer, and schema an object belongs to, what the data relates to, where it was sourced from, and via which channel it arrived (as there may be nuances in data depending on its channel).
ODS data can be replicated from source systems, or prepared for use from semi/unstructured data via hard transformation, and hence has the following associated conventions:
Database replicated ODS (structured sources like SQL Server):
Format: [domain][__env].[layer][__source_system][__source_schema][__channel].[source_object/volume_name]
- [domain][__env]: Domain and environment (e.g., clinical__dev)
- [layer]: Data layer within the Raw zone (e.g., ods)
- [__source_system]: Source system identifier
- [__source_schema]: Source schema (if applicable, e.g., __dbo, __reporting)
- [__channel]: Optional ingestion channel (e.g., __adf, __fivetran, __lakeflow)
- [source_object/volume_name]: Table name as per source
- e.g: clinical__dev.ods__patientflowmanager01__reporting.patients
- e.g: clinical__dev.ods__fhirhouse__dbo__adf.encounter
Prepped semi/unstructured ODS data:
Format: [domain][__env].[layer][__source_system][__source_descriptor][__channel].[source_object/volume_name]
- [domain][__env]: Domain and environment (e.g., clinical__dev)
- [layer]: Data layer within the Raw zone (e.g., ods)
- [__source_system]: Source system identifier
- [__source_descriptor]: Source descriptor or subsystem identifier
- [__channel]: Optional ingestion channel (e.g., __kafka, __databricks, __api)
- [source_object/volume_name]: Named as per source or a unique assigned name (e.g., topic/folder name)
- e.g: clinical__dev.ods__ambosim__confluent__kafka.encounter
- e.g: corporate__dev.ods__workdayapi__employees.raw_data
Persistent Data Store (PDS):
PDS conventions will mirror ODS conventions:
Database replicated PDS (structured sources like SQL Server):
Format: [domain][__env].[layer][__source_system][__source_schema][__channel].[source_object/volume_name]
- [domain][__env]: Domain and environment (e.g., clinical__dev)
- [layer]: Data layer within the Raw zone (e.g., pds)
- [__source_system]: Source system identifier
- [__source_schema]: Source schema (if applicable, e.g., __dbo, __reporting)
- [__channel]: Optional ingestion channel (e.g., __adf, __fivetran, __lakeflow)
- [source_object/volume_name]: Table name as per source
- e.g: clinical__dev.pds__patientflowmanager01__reporting.patients
- e.g: clinical__dev.pds__fhirhouse__dbo__adf.encounter
Prepped semi/unstructured PDS data:
Format: [domain][__env].[layer][__source_system][__source_descriptor][__channel].[source_object/volume_name]
- [domain][__env]: Domain and environment (e.g., clinical__dev)
- [layer]: Data layer within the Raw zone (e.g., pds)
- [__source_system]: Source system identifier
- [__source_descriptor]: Source descriptor or subsystem identifier
- [__channel]: Optional ingestion channel (e.g., __kafka, __databricks, __api)
- [source_object/volume_name]: Named as per source or a unique assigned name (e.g., topic/folder name)
- e.g: clinical__dev.pds__ambosim__confluent__kafka.encounter
- e.g: corporate__dev.pds__workdayapi__employees.raw_data
EDW (aka Silver) (Data according to business entities)
The EDW (Enterprise Data Warehouse) zone focuses on transforming raw data into cleaned, enriched, and validated datasets that are the building blocks for downstream consumption and analysis.
Refer to Data zones and layers for further context and definitions applicable to this section.
The EDW zone uses three primary schemas:
- edw_stg - Source-centric and entity-centric staging objects (transformations, cleaning, normalization, integration)
- edw_ref - Reference data (curated, conformed reference entities and code sets intended for broad reuse)
- edw - Base marts (full-featured dimensions with SCD logic, facts - curated entities)
These marts are aligned to business entities and broad requirements; hence they must contain source-specific objects at the lowest grain. Further enrichment and joins may be applied across sources.
In the examples provided, we have opted for domain-level catalogs, with schema separation for the lower levels of grain via prefixes, i.e. engineering__dev.edw.dim_customer.
- All schemas may optionally be prefixed with additional descriptors if not already decomposed at the domain level, i.e. edw_stg__, edw_ref__, edw__
- All entity names which align to facts should be named in the plural.
- All entity names which align to dims should be named in the singular.
EDW Staging Objects: Staging models serve as intermediary models that transform source data into EDW models. According to dbt best practices, there is a distinction between Staging and Intermediate models. Under this blueprint the use of Intermediate models is optional. Reference
Two Complementary Staging Approaches:
EDW staging supports two patterns that work together:
- Source-Centric Staging - organized by source system:
  - Cleans and normalizes data from each source independently
  - May feed data quality test results to reports for actioning
  - Prepares reference data for broad use and maps it for mart conformance
  - Creates mappings of keys to resolve system keys to universal surrogate or business keys
- Entity-Centric Staging - organized by business entity:
  - Integrates data across multiple source-centric staging models
  - Applies cross-source deduplication and conflict resolution
  - Establishes unified entity keys and relationships
  - Creates a single, conformed view of each business entity
Typical Flow:
Raw Sources → Source-Centric Staging → Entity-Centric Staging → Base Marts (Dimensions/Facts)
Source-Centric Staging Approach:
Staging objects are organized around source systems and source objects, cleaning and normalizing data according to expectations/standards. This approach:
- Cleans and normalizes data from each source independently
- May feed data quality test results to reports for actioning
- Prepares reference data for broad use and maps it for mart conformance
- Creates mappings of keys to resolve and map system keys to universal surrogate keys or business keys, which can then be reused downstream for integration
Naming Convention:
[domain][__env] . [schema] . stg__[source_system](optional__[source_channel])__[source_object](optional__[other_descriptors])__[id]__[transformation_descriptors]
Where:
- [domain][__env] - Domain and environment (e.g., corporate__prod, corporate__dev)
- [schema] - EDW schema (edw_stg for staging, edw_ref for reference staging)
- stg__ - Staging object prefix
- [source_system] - Source system identifier (e.g., sap, workday, finance_system)
- (optional__[source_channel]) - Optional source channel (e.g., __adf, __api)
- [source_object] - Source object/table name
- (optional__[other_descriptors]) - Optional additional descriptors
- [id] - Unique identifier suffix
- [transformation_descriptors] - Numbered transformation step(s)
Standard Transformation Steps (optional, applied as needed):
- 00_keyed - Key assignment/generation
- 01_renamed_and_typed - Column renaming and type casting
- 02_deduped - Deduplication
- 03_cleaned - Data cleaning
- 04_filtered or 04_split - Filtering or splitting data
- 05_column_selected - Column selection
- 06_business_validated - Business rule validation
- 07_desensitised - Sensitive data masking/removal
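A minimal dbt sketch of the 01_renamed_and_typed step for a source-centric staging model (the source definition, column names, and config are illustrative assumptions):

```sql
-- models/edw/source/workday/stg/stg__workday__employees__01_renamed_and_typed.sql
-- Renames source columns and casts types; nothing else happens at this step.
{{ config(materialized='view', schema='edw_stg') }}

select
    cast(emp_id as bigint)   as employee_id,
    trim(preferred_name)     as employee_name,
    cast(hire_dt as date)    as hire_date
from {{ source('workday', 'employees') }}
```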
Reference Data Staging:
Reference data follows the same two-stage approach as marts when coming from multiple sources.
Use Entity-Centric Staging when:
- Multiple source systems provide the same business entity/code set, and you need a single conformed view
- You need cross-source mapping, hierarchy standardisation, or conflict resolution (e.g., code collisions, different naming, different rollups)
Source-Centric Staging (with transformations in edw_stg schema):
Cleans individual source system reference data:
- e.g: corporate__prod.edw_stg.stg__sap__facilities__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__sap__facilities__02_cleaned
- e.g: corporate__prod.edw_stg.stg__finance_system__adf__account_codes__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__workday__locations__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__workday__locations__02_deduped
Entity-Centric Staging (when unifying across sources in edw_stg schema):
Integrates reference data from multiple sources:
- e.g: corporate__prod.edw_stg.stg__facility__01_unified (combines SAP and Workday facilities)
- e.g: corporate__prod.edw_stg.stg__facility__02_mapped (standardizes codes/hierarchies)
- e.g: corporate__prod.edw_stg.stg__account_code__01_unified (combines old/new finance systems)
- e.g: corporate__prod.edw_stg.stg__location__01_unified (integrates location references)
Final Reference Data (cleaned and conformed in edw_ref schema):
Reference data in edw_ref is materialized as curated, conformed, reusable reference entities and code sets.
Typically, edw_ref models:
- Are keyed by natural/business keys (and may also include stable internal IDs where useful)
- Are lightweight and broadly reusable (avoid mart-specific denormalisation and calculations)
- Are often current-state, but may include effective dating / limited historisation where the reference itself changes over time (e.g., organisational hierarchies)
- e.g: corporate__prod.edw_ref.ref_facility (facility_id, facility_name, facility_code, location_key - simple key-value)
- e.g: corporate__prod.edw_ref.ref_location (location_id, location_name, country_code - basic attributes)
- e.g: corporate__prod.edw_ref.ref_account_code (account_code, account_name, hierarchy_level - minimal structure)
- e.g: corporate__prod.edw_ref.ref_country_codes (country_code, country_name - simple lookup)
Note on Reference Data vs Dimensions:
Reference data in edw_ref serves as a foundation and can be shaped into dimensions in the edw (marts) schema:
- edw_ref.ref_* - Conformed reference entities/code sets intended for broad reuse (generally natural/business keys, minimal mart-specific logic)
- edw.dim_* - Analytics-facing dimensional models that may add:
  - Surrogate keys (and persisted natural/business keys as attributes)
  - SCD Type 1 and/or Type 2 behaviour as required (e.g., valid_from, valid_to, is_current)
  - Denormalisation, role-playing dimensions, and mart-specific hierarchies/business rules
Examples of the relationship:
- edw_ref.ref_facility (conformed reference) → edw.dim_facility (often Type 1 if facility attributes are treated as current-state)
- edw_ref.ref_account_code (conformed codes/hierarchy) → edw.dim_account (often Type 2 if names/rollups change over time)
- edw_ref.ref_product_category (conformed categories) → edw.dim_product (often Type 2 if classifications change over time)
Base Mart (EDW) Staging:
EDW staging can follow two complementary patterns:
- Source-Centric Staging: Cleans and transforms data from individual sources
- Entity-Centric Staging: Integrates and unifies data across multiple sources for a business entity
Source-Centric Staging (with transformations in edw_stg schema):
Focused on cleaning individual source systems:
- e.g: corporate__prod.edw_stg.stg__sap__employees__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__workday__employees__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__workday__employees__02_cleaned
- e.g: corporate__prod.edw_stg.stg__crm__adf__customers__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__crm__adf__customers__02_cleaned
- e.g: corporate__prod.edw_stg.stg__new_finance_system__adf__payments__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__old_finance_system__adf__payments__01_renamed_and_typed
- e.g: corporate__prod.edw_stg.stg__finance_system__api__accounts__01_renamed_and_typed
Entity-Centric Staging (with transformations in edw_stg schema):
Focused on integrating and unifying data across sources for a specific business entity. These models:
- Union, join, or merge data from multiple source-centric staging models
- Apply cross-source deduplication and conflict resolution
- Establish entity-level keys and relationships
- Prepare unified entity views for final base marts
Naming Convention for Entity-Centric Staging:
[domain][__env] . [schema] . stg__[entity]__[transformation_descriptors]
Examples:
- e.g: corporate__prod.edw_stg.stg__employee__01_unified (unions SAP and Workday employees)
- e.g: corporate__prod.edw_stg.stg__employee__02_deduped (resolves duplicates across sources)
- e.g: corporate__prod.edw_stg.stg__employee__03_keyed (applies unified business keys)
- e.g: corporate__prod.edw_stg.stg__customer__01_unified (merges multiple CRM sources)
- e.g: corporate__prod.edw_stg.stg__customer__02_conflict_resolved (resolves conflicting attributes)
- e.g: corporate__prod.edw_stg.stg__payment__01_unified (combines old and new finance systems)
- e.g: health__prod.edw_stg.stg__patient__01_unified (integrates patient data across clinical systems)
These entity-centric staging models then feed into the final base marts (dimensions and facts).
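A minimal dbt sketch of an entity-centric 01_unified model that unions two source-centric staging models (model and column names are illustrative):

```sql
-- models/edw/mart/employee/stg/stg__employee__01_unified.sql
-- Unions SAP and Workday employees, tagging each row with its source system.
{{ config(materialized='view', schema='edw_stg') }}

select employee_id, employee_name, hire_date, 'sap' as source_system
from {{ ref('stg__sap__employees__01_renamed_and_typed') }}

union all

select employee_id, employee_name, hire_date, 'workday' as source_system
from {{ ref('stg__workday__employees__01_renamed_and_typed') }}
```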
Base Mart Keysets:
Keysets are created in the edw schema to resolve and map system keys to universal surrogate keys or business keys. These can be reused downstream for integration and conforming/deduplicating entities across multiple sources:
- e.g: corporate__prod.edw.keys__employee (conforms SAP and Workday employee keys)
- e.g: corporate__prod.edw.keys__customer (conforms multiple CRM source keys)
- e.g: corporate__prod.edw.keys__account (conforms finance system keys)
- e.g: health__prod.edw.keys__patient (conforms clinical system keys)
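A minimal sketch of a keyset model; deriving the universal key by hashing the source system and source key is one possible approach and an assumption here, not a mandated standard:

```sql
-- models/edw/mart/employee/keys__employee.sql
-- Maps each source system key to a universal surrogate key for downstream reuse.
{{ config(materialized='table', schema='edw') }}

select
    md5(concat(source_system, ':', employee_id)) as employee_key,  -- universal surrogate key
    source_system,
    employee_id                                  as source_employee_id
from {{ ref('stg__employee__02_deduped') }}
```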
Base Marts (Final Dimensional Models):
Base marts represent full-featured dimensional models with SCD logic, business rules, and complete attributes. These may be built from entity-centric staging, source-centric staging, or by shaping reference data from edw_ref.
Source-specific dimensions (preserving source identity):
- e.g: corporate__prod.edw.dim_employee__sap
- e.g: corporate__prod.edw.dim_employee__workday
- e.g: corporate__prod.edw.dim_account__old_finance_system
- e.g: corporate__prod.edw.dim_account__new_finance_system
Conformed dimensions (unified, non-source specific):
- e.g: corporate__prod.edw.dim_employee (Type 1 SCD - current state only)
- e.g: corporate__prod.edw.dim_employee__type2 (Type 2 SCD - with historical tracking)
- e.g: corporate__prod.edw.dim_customer
- e.g: corporate__prod.edw.dim_account (unified across finance systems)
Dimensions built from reference data (shaped with SCD logic):
- e.g: corporate__prod.edw.dim_facility (built from edw_ref.ref_facility, adds Type 1 or Type 2 SCD logic)
- e.g: corporate__prod.edw.dim_product_category (built from edw_ref.ref_product_category, with hierarchies)
- e.g: corporate__prod.edw.dim_location (built from edw_ref.ref_location, with geographic hierarchies)
Fact tables:
- e.g: corporate__prod.edw.fact_payment
- e.g: corporate__prod.edw.fact_transaction
- e.g: health__prod.edw.fact_encounter
Common Transformation Suffixes:
- __01_renamed or __01_typed
- __02_cleaned
- __03_deduped
- __04_filtered
- __05_split
- __06_validated
- __07_desensitised
Additional Notes:
For further context on reference data modeling, refer to Reference Data Standards and Conventions.
Raw Vault: Optional warehousing construct.
- Schema naming convention: edw_rv
- Object naming convention: {vault object named as per data vault standards}
- e.g: corporate__dev.edw_rv.hs_payments__finance_system__adf
Business Vault: Optional warehousing construct.
- Schema naming convention: edw_bv
- Object naming convention: {vault object named as per data vault standards}
- e.g: corporate__dev.edw_bv.hs_late_payments__finance_system__adf
Infomart (aka Gold) (Data according to requirements)
The Infomart zone focuses on requirement-aligned products (datasets, aggregations, and reporting structures). Products are predominantly source-agnostic; however, source-specific views can be produced when needed.
Refer to Data zones and layers for further context and definitions applicable to this section.
The Infomart zone uses two primary schemas:
- im_stg - Product mart staging (intermediate transformations)
- im - Information marts (final business-aligned products)
Naming Convention:
[domain][__env] . [schema] . [object_type][product_name/descriptor](__source)(__transformation)
Where:
- [domain][__env] - Domain and environment (e.g., corporate__prod, health__dev)
- [schema] - Infomart zone schema (im_stg for staging, im for final products)
- [object_type] - Object type prefix (e.g., stg_, fact_, dim_, obt_)
- [product_name/descriptor] - Product or business-aligned name
- (__source) - Optional source system identifier (only when a source-specific view is needed)
- (__transformation) - Optional transformation step (staging only)
Naming Guidelines:
- All entity names which align to facts should be named in the plural
- All entity names which align to dimensions should be named in the singular
- Product marts are business/requirement-aligned, not technical constructs
Infomart Staging Models:
Staging models serve as intermediary models that transform base marts (EDW) into requirement-aligned product marts. According to dbt best practices, there is a distinction between Staging and Intermediate models. Under this blueprint the use of Intermediate models is optional. Reference
These models exist in the im_stg schema to stage information marts with business-specific transformations:
Common Transformations:
- Pivoting
- Aggregation
- Joining across multiple base marts
- Business rule application
- Desensitization
- Complex calculations
Examples:
- e.g: corporate__prod.im_stg.stg_downtime_by_region__01_pivoted
- e.g: corporate__prod.im_stg.stg_downtime_by_region__02_desensitised
- e.g: corporate__prod.im_stg.stg_late_payments__01_aggregated
- e.g: corporate__prod.im_stg.stg_late_payments__02_joined
- e.g: corporate__prod.im_stg.stg_fte_calculations__01_pivoted
- e.g: corporate__prod.im_stg.stg_fte_calculations__02_aggregated
- e.g: health__prod.im_stg.stg_patient_outcomes__01_joined
- e.g: health__prod.im_stg.stg_patient_outcomes__02_desensitised
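A minimal sketch of the 01_pivoted step for an infomart staging model (the upstream fact_downtime model and its column names are illustrative assumptions):

```sql
-- models/infomart/mart/downtime_by_region/stg/stg_downtime_by_region__01_pivoted.sql
-- Pivots downtime hours by region into columns for reporting.
{{ config(materialized='view', schema='im_stg') }}

select
    event_month,
    sum(case when region = 'north' then downtime_hours else 0 end) as downtime_north,
    sum(case when region = 'south' then downtime_hours else 0 end) as downtime_south
from {{ ref('fact_downtime') }}
group by event_month
```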
Information Marts (Final Products):
Final business-aligned data products ready for consumption by analytics, reporting, and BI tools in the im schema.
Examples:
Fact tables (aggregated/business-focused):
- e.g: corporate__prod.im.fact_downtime_by_region
- e.g: corporate__prod.im.fact_late_payments
- e.g: corporate__prod.im.fact_regional_account_payments
- e.g: health__prod.im.fact_patient_outcomes
Dimension tables (product-specific):
- e.g: corporate__prod.im.dim_region
- e.g: corporate__prod.im.dim_payment_category
One Big Tables (OBT) - Denormalized aggregates:
- e.g: corporate__prod.im.obt_fte_calculations
- e.g: corporate__prod.im.obt_financial_summary
- e.g: corporate__prod.im.obt_executive_dashboard
Source-specific products (when needed):
- e.g: corporate__prod.im.fact_account_payments__old_finance_system
- e.g: corporate__prod.im.fact_account_payments__new_finance_system
- e.g: corporate__prod.im.obt_employee_metrics__sap
- e.g: corporate__prod.im.obt_employee_metrics__workday
Delta Sharing
- Share names: {domain_name}__{optional: subdomain_name}__{optional: purpose}__{schema_name or description}__{object_name or description}__{share_purpose and/or target_audience}
- e.g: intuitas_corporate__finance__reporting__account_payments__payments
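A minimal sketch of creating the share and adding an object to it (the table added is illustrative):

```sql
-- Create the share and add a business-ready infomart table to it.
CREATE SHARE IF NOT EXISTS intuitas_corporate__finance__reporting__account_payments__payments
COMMENT 'Account payments shared for finance reporting consumers';

ALTER SHARE intuitas_corporate__finance__reporting__account_payments__payments
ADD TABLE corporate__prod.im.fact_regional_account_payments;
```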
Azure Data Factory
- Linked service names: ls_{database_name} (if not already included in database_name, append {organisation_name}_{domain_name}) e.g: ls_financedb_intuitas_corporate
- Dataset names: ds_{database_name}_{object_name}
- Pipeline names: pl_{description} e.g: pl_copy_{source_name}_to_{destination_name}
- Trigger names: tr_{pipeline_name}_{optional: start_time/frequency}
Streaming
- Cluster name: {domain_name}__cluster__{optional: environment}
- Topic names: {domain_name}__{object/entity}__{optional: source_system}__{optional: source_channel}__{optional: environment}
- Consumer group names: {domain_name}__{unique_group_name}__{optional: environment}
dbt
The following standards and conventions relate to dbt projects.
Documentation and model metadata
Within each respective model folder (as needed)
- md: _{path to model folder using _ separators}__docs.md
  - e.g: models/edw/ambo_sim__kafka__local/_edw__ambo_sim__kafka__local__docs.md
- model yml: _{path to model folder using _ separators}__models.yml
  - e.g: models/edw/ambo_sim__kafka__local/_edw__ambo_sim__kafka__local__models.yml
Sources
- Folder: models/sources/{raw/edw/infomart}
- yml: {schema}__sources.yml (one for each source schema)
- e.g: ods__ambo_sim__kafka__sources.yml
Model and Folder Names
dbt model names are verbose (inclusive of schema and domain) to ensure global uniqueness and better traceability to folders. Actual object names should be aliased to match object naming standards.
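A minimal sketch of aliasing a verbose dbt model name back to the published object name (the file path and names are illustrative):

```sql
-- models/edw/source/fhirhouse__adf/stg/corporate__edw__stg__fhirhouse__adf__encounter__01_renamed_and_typed.sql
-- The verbose file name keeps the model globally unique within dbt, while the
-- alias controls the object name created in Unity Catalog.
{{ config(
    schema='edw_stg',
    alias='stg__fhirhouse__adf__encounter__01_renamed_and_typed'
) }}

select * from {{ source('fhirhouse', 'encounter') }}
```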
Raw
The Raw zone contains an operational data store (ODS) for current state and a persistent data store (PDS) for historical tracking.
ODS (Operational Data Store) - Current State:
- Folder: seeds/{optional: domain name}{optional: __subdomain name(s)}/ → materializes to [domain__env].ods
- Folder: sources/raw/{optional: domain name}{optional: __subdomain name(s)}/ → references external sources to [domain__env].ods
Seed Naming Convention:
Seeds (CSV files) should follow a descriptive naming pattern and materialize to the ODS schema:
- ref_[entity]_[descriptor].csv for reference/lookup data
- config_[descriptor].csv for configuration data
- e.g: ref_country_codes.csv → corporate__dev.ods.ref_country_codes
- e.g: ref_currency_rates.csv → corporate__dev.ods.ref_currency_rates
- e.g: config_holidays.csv → corporate__dev.ods.config_holidays
PDS (Persistent Data Store) - Historical Tracking:
- Folder: models/raw/{optional: domain name}{optional: __subdomain name(s)}/pds/ → materializes to [domain__env].pds
- Models in PDS capture type 2 SCD (slowly changing dimensions) or snapshots of ODS data
Type 2 Snapshot Model Examples:
- e.g: models/raw/pds/ref_country_codes_snapshot.sql → corporate__dev.pds.ref_country_codes_snapshot
- e.g: models/raw/pds/ref_currency_rates_snapshot.sql → corporate__dev.pds.ref_currency_rates_snapshot
- e.g: models/raw/pds/config_holidays_snapshot.sql → corporate__dev.pds.config_holidays_snapshot
These models typically reference their corresponding seeds from ODS and add historical tracking columns (valid_from, valid_to, is_current, etc.), as in the sketch below.
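A minimal sketch of such a model expressed as a dbt snapshot (dbt conventionally keeps snapshots in the snapshots/ folder; the check strategy and columns are illustrative assumptions):

```sql
-- ref_country_codes_snapshot.sql
-- Tracks Type 2 history over the ODS seed; dbt manages dbt_valid_from/dbt_valid_to.
{% snapshot ref_country_codes_snapshot %}

{{
    config(
        target_schema='pds',
        unique_key='country_code',
        strategy='check',
        check_cols=['country_name']
    )
}}

select country_code, country_name
from {{ ref('ref_country_codes') }}

{% endsnapshot %}
```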
EDW
EDW staging follows two complementary patterns:
- Source-Centric Staging: Organized by source system, cleaning individual sources
- Entity-Centric Staging: Organized by business entity, integrating across sources
Staging Source-Centric:
Focused on cleaning and transforming data from individual source systems:
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/source/{source_system}{optional: __{source_channel}}/stg/
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __edw__}stg__{source_system}__{source_channel}__{source_object}__{ordinal}_{transformation description}
e.g:
- edw/source/new_finance_system__adf/stg/corporate__edw__stg__new_finance_system__adf__accounts__01_renamed_and_typed.sql
- or edw/source/new_finance_system__adf/stg/stg__new_finance_system__adf__accounts__01_renamed_and_typed.sql
- materialises to: corporate__dev.edw_stg.stg__new_finance_system__adf__accounts__01_renamed_and_typed
Staging Entity-Centric:
Focused on integrating and unifying data across multiple sources for a specific business entity:
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity}/stg/
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __edw__}stg__{entity}__{ordinal}_{transformation description}
These models:
- Union, join, or merge data from multiple source-centric staging models
- Apply cross-source deduplication and conflict resolution
- Establish unified entity keys
e.g:
- edw/mart/employee/stg/stg__employee__01_unified.sql (unions SAP and Workday)
- edw/mart/employee/stg/stg__employee__02_deduped.sql (resolves duplicates)
- edw/mart/employee/stg/stg__employee__03_keyed.sql (applies unified keys)
- materialises to: corporate__dev.edw_stg.stg__employee__01_unified
Mart (Entity-Centric Base Marts):
Base marts are organized by business entity. Entity-centric staging (if needed) lives in a nested stg/ folder within each entity folder.
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity}/
- Entity-centric staging subfolder: models/edw/.../mart/{entity}/stg/
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __edw__}{dim_/fact_}{entity/object_description}{optional: __{source_system}__{source_channel}}
Source-specific marts (when preserving source identity):
- Models: {optional: __edw__}dim_{entity}__{source_system}
- e.g: dim_employee__sap.sql, dim_employee__workday.sql
Non-source specific (unified) marts:
- Models: {optional: __edw__}dim_{entity} or fact_{entity_plural}
- e.g: dim_employee.sql, fact_payments.sql
Reference Data:
Reference data follows the same pattern as marts - organized by reference entity with optional entity-centric staging.
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/ref/{reference_entity}/
- Entity-centric staging subfolder: models/edw/.../ref/{reference_entity}/stg/ (when unifying multiple sources)
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __edw__}ref_{reference data set name}{optional: __{source_system}__{source_channel}}
Examples:
- e.g: models/edw/ref/facility/ref_facility.sql → corporate__dev.edw_ref.ref_facility
- e.g: models/edw/ref/facility/stg/stg__facility__01_unified.sql → corporate__dev.edw_stg.stg__facility__01_unified (combines SAP + Workday)
- e.g: models/edw/ref/account_code/ref_account_code.sql → corporate__dev.edw_ref.ref_account_code
Raw Vault:
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/edw/rv
- Models: edw_rv__{vault object named as per data vault standards}
Business Vault:
- Folder: models/edw/{optional: domain name}{optional: __subdomain name(s)}/edw/bv
- Models: edw_bv__{vault object named as per data vault standards}
Infomart
Staging:
- Folder: models/infomart/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity / product description}/stg
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __im__}stg_{entity / product description}__{ordinal}_{transformation description}{optional: __{source_system}__{source_channel}}
Dimensions:
- Folder: models/infomart/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity / product description}
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __im__}dim_{entity / product description}{optional: __{source_system}__{source_channel}}
Facts:
- Folder: models/infomart/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity / product description}
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __im__}fact_{entity / product description}{optional: __{source_system}__{source_channel}}
Denormalized (One Big Table):
- Folder: models/infomart/{optional: domain name}{optional: __subdomain name(s)}/mart/{entity / product description}
- Models: {optional: domain name}{optional: __subdomain name(s)}{optional: __im__}obt_{entity / product description}{optional: __{source_system}__{source_channel}}
Example dbt model structure:
The model structure below reflects a single catalog for domain+environment and schema separation for zones and layers:
{{domain/enterprise} _project_name}
├── README.md
├── analyses
├── seeds <ods>
│ └── ref_country_codes.csv <e.g: corporate__dev.ods.ref_country_codes>
│ └── config_holidays.csv <e.g: corporate__dev.ods.config_holidays>
├── dbt_project.yml
├── macros
│ └── custom_macro.sql
│ ├── utilities
│ └── all_dates.sql
├── models/raw
│ /{optional: domain and subdomains}
│ ├── _raw.md
│ ├── ods <ods>
│ | └── (typically empty - ODS populated by seeds/sources)
│ └── pds <pds>
│ └── ref_country_codes_snapshot.sql <e.g: corporate__dev.pds.ref_country_codes_snapshot>
│ └── config_holidays_snapshot.sql <e.g: corporate__dev.pds.config_holidays_snapshot>
├── models/edw/
│ ├── _edw.md
│ ├── source (source-centric staging - organized by source system) <edw_stg>
│ | /{optional: domain and subdomains}
│ | ├── source_system_1
│ | | ├── _source_system_1__docs.md
│ | | ├── _source_system_1__models.yml
│ | | └── stg
│ | | ├── stg__source_system_1__object__01_renamed_and_typed.sql <e.g: corporate__dev.edw_stg.stg__source_system_1__object__01_renamed_and_typed>
│ | | └── stg__source_system_1__object__02_cleaned.sql <e.g: corporate__dev.edw_stg.stg__source_system_1__object__02_cleaned>
│ | └── source_system_2__adf
│ | ├── _source_system_2__adf__docs.md
│ | ├── _source_system_2__adf__models.yml
│ | └── stg
│ | ├── stg__source_system_2__adf__accounts__01_renamed_and_typed.sql <e.g: corporate__dev.edw_stg.stg__source_system_2__adf__accounts__01_renamed_and_typed>
│ | └── stg__source_system_2__adf__accounts__02_cleaned.sql <e.g: corporate__dev.edw_stg.stg__source_system_2__adf__accounts__02_cleaned>
│ ├── mart (entity centric base marts and entity-centric staging) <edw> and <edw_stg>
│ | /{optional: domain and subdomains}
│ | └── employee
│ | ├── _employee__docs.md
│ | ├── _employee__models.yml
│ | ├── dim_employee.sql <e.g: corporate__dev.edw.dim_employee (built from stg__employee__03_keyed)>
│ | └── stg (entity-centric staging) <edw_stg>
│ | ├── stg__employee__01_unified.sql <e.g: corporate__dev.edw_stg.stg__employee__01_unified (unions source_system_1 & source_system_2)>
│ | ├── stg__employee__02_deduped.sql <e.g: corporate__dev.edw_stg.stg__employee__02_deduped>
│ | └── stg__employee__03_keyed.sql <e.g: corporate__dev.edw_stg.stg__employee__03_keyed>
│ | └── account
│ | └── dim_account.sql <e.g: corporate__dev.edw.dim_account>
│ | └── date
│ | └── dim_date.sql <e.g: corporate__dev.edw.dim_date>
│ ├── ref (reference data with optional entity-centric staging) <edw_ref> and <edw_stg>
│ | ├── _reference_data__docs.md
│ | ├── facility
│ | | ├── _facility__models.yml
│ | | ├── ref_facility.sql <e.g: corporate__dev.edw_ref.ref_facility>
│ | | └── stg (entity-centric staging when unifying multiple sources) <edw_stg>
│ | | ├── stg__facility__01_unified.sql <e.g: corporate__dev.edw_stg.stg__facility__01_unified>
│ | | └── stg__facility__02_mapped.sql <e.g: corporate__dev.edw_stg.stg__facility__02_mapped>
│ | └── country_code
│ | ├── _country_code__models.yml
│ | └── ref_country_code.sql <e.g: corporate__dev.edw_ref.ref_country_code (single source, no staging)>
│ └── sources
│ └── {optional: domain}
│ └── {optional: raw/edw/infomart}
│ └── _source_system_1__sources.yml
├── models/infomart <im>
│ /{optional: domain and subdomains}
│ ├── _infomart.md
│ └── domain_name e.g: finance
│ └── mart
│ ├── _finance__models.yml
│ ├── orders.sql <e.g: corporate__dev.im.orders>
│ └── payments.sql <e.g: corporate__dev.im.payments>
│ └── stg <im_stg>
│ └── stg_payments_pivoted_to_orders.sql <e.g: corporate__dev.im_stg.stg_payments_pivoted_to_orders>
├── packages.yml
├── snapshots
└── tests
└── assert_positive_value_for_total_amount.sql
dbt_project.yml
The yml structure below reflects a single catalog for domain+environment and schema separation for zones and layers:
```yml
models:
  health_lakehouse__engineering__dbt:
    +persist_docs: # enables injection of metadata into unity catalog
      relation: true
      columns: true
    raw:
      ods:
        +description: "Operational data store - current state"
        +schema: ods
        +materialized: table
      pds:
        +description: "Persistent data store - historical snapshots (Type 2 SCD)"
        +schema: pds
        +materialized: table
    edw:
      ref:
        +description: "Reference data (conformed lookups, master data)"
        +schema: edw_ref
        +materialized: table
        stg:
          +description: "Entity-centric reference staging (when unifying multiple sources)"
          +schema: edw_stg
          +materialized: view
      source:
        +description: "Source-centric staging organized by source system"
        +schema: edw_stg
        +materialized: view
      mart:
        +description: "Entity-centric base marts (dimensions, facts)"
        +schema: edw
        +materialized: table
        stg:
          +description: "Entity-centric staging (integrates across sources)"
          +schema: edw_stg
          +materialized: view
    infomart:
      +materialized: view # default for speed
      mart:
        +description: "Information marts (business-aligned products)"
        +schema: im
        +materialized: table
        stg:
          +description: "Information mart staging (intermediate transformations)"
          +schema: im_stg
          +materialized: view

seeds:
  health_lakehouse__engineering__dbt:
    +schema: ods
    +description: "Static reference and configuration data loaded from CSV files to ODS"
```
CI/CD
The following standards and conventions relate to Continuous Integration and Continuous Delivery (CI/CD) constructs.
Repository naming
- All lowercase with hyphens as separators
- Format:
{org}-{domain}-{purpose}-{optional:descriptor}
Examples:
- intuitas-corporate-dbt
- intuitas-corporate-ingestion-framework
- intuitas-corporate-cicd-templates
- intuitas-corporate-infrastructure
Branch naming
- All lowercase with hyphens as separators
- Naming convention:
{type}-{optional:ticket-id}-{description}
Types:
- feature: New functionality
- bugfix: Bug fixes
- hotfix: Critical fixes for production
- release: Release branches
- docs: Documentation updates only
- refactor: Code improvements with no functional changes
- test: Test-related changes
Examples:
- feature-eng123-add-new-data-source
- bugfix-eng456-fix-null-values
- hotfix-prod-outage-fix
- release-v2.1.0
- docs-update-readme
- refactor-optimize-transforms
- test-add-integration-tests
Branch lifecycle
Simple branch lifecycle:
- main/master: Primary branch
- branch: Short-lived development branches, merged or rebased to main/master
Comprehensive team branch lifecycle:
- Main/master: Primary branch
- Development: Active development branch
- Feature/bugfix: Short-lived branches merged to development
- Release: Created from development, merged to main/master
- Hotfix: Created from main/master for urgent fixes
Databricks Asset Bundles
Databricks asset bundles are encouraged for all Databricks projects.
- Project/bundle name:
  - {domain_name}__databricks (for general Databricks projects)
  - {domain_name}__dbt (for dbt Databricks bundles)
- Bundle tags:
  - Key: environment: {environment}
  - Key: manager: {team_name} and/or {email_address}
  - Key: managing_domain: {domain_name} e.g. if delegating to the engineering domain
  - Key: owner: {owner_name}
  - Key: owning_domain: {domain_name}
  - Key: dab: {bundle_name}
  - Key: project: {project_name}
e.g:
```yml
tags:
  environment: ${bundle.target}
  project: health-lakehouse
  dab: health_lakehouse__engineering__databricks
  owning_domain: intuitas_engineering
  owner: engineering-admin@intuitas.com
  manager: engineering-engineer@intuitas.com
  managing_domain: intuitas_engineering
```
- Resources:
  - Folder level 1: {meaningful sub-project name}
  - Folder level 2:
    - notebooks
    - workflows
- databricks.yml
  - For both dev and prod: root_path: /Workspace/Users/engineering-engineer@intuitas.com/.bundle/${bundle.name}/${bundle.target}
Example databricks.yml
```yml
# This is a Databricks asset bundle definition for health_lakehouse__engineering.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: health_lakehouse__engineering__databricks

variables:
  default_cluster_id:
    value: "-----"

include:
  - resources/*.yml
  - resources/**/*.yml
  - resources/**/**/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://------.15.azuredatabricks.net
      root_path: /Workspace/Users/engineering-engineer@intuitas.com/.bundle/${bundle.name}/${bundle.target}

  prod:
    mode: production
    workspace:
      host: https://------.15.azuredatabricks.net
      root_path: /Workspace/Users/engineering-engineer@intuitas.com/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - user_name: engineering-engineer@intuitas.com
        level: CAN_MANAGE
    run_as:
      user_name: engineering-engineer@intuitas.com
```
Security
The security standards and conventions provided here are a starter set; however, existing organisational and applicable industry standards should take precedence. Consult your cybersecurity advisor.
Entra
Under development. (Contact us to know more).
Most organisations will already have an established set of groups and conventions. Where there are gaps, the following can still be considered.
Recommended areas to align to organisational governance and cyber requirements:
- Naming conventions for admin, service, and user groups
- Role-based access alignment (least privilege, separation of duties)
- Alignment to domains (cross-domain vs. domain-specific group patterns)
Entra Group Names:
- Pattern: grp-<org>-<domain>-<plat>-<scope>-<role>-<env>[-<region>][-ext-<partner>][-idx]
- Lowercase, hyphen-separated; no spaces.
- Keep to ≤ 120 chars total.
- No PII in names.
- Use Security groups (not M365) for RBAC; enable PIM where appropriate, e.g. for Admins.
role:
- owner — full control of the named scope
- admin — administrative (non-ownership) rights
- contrib — create/modify within scope
- editor — modify data/artifacts, not permissions
- reader — read-only
- steward — governance/metadata rights
- custodian — key/secret/storage control
- operator — run/ops rights (pipelines, jobs)
- viewer — read dashboards/reports
plat:
- dbx (Databricks), uc (Unity Catalog), pbi (Power BI), adf (Data Factory),
- dlk (Data Lake), sql (Azure SQL), kva (Key Vault), syn (Synapse)
scope (or object):
- Databricks Workspace: ws-<workspace>
- Unity Catalog: uc-meta (metastore), uc-cat-<catalog>, uc-sch-<catalog>.<schema>, uc-obj-<catalog>.<schema>.<object>
- Power BI: pbi-ws-<workspace>
- Data Lake: dlk-path-<path> e.g. dlk-path-/curated/claims/
Examples:
- grp-intuitas-clin-dbx-ws-analytics-admin-prd
- grp-intuitas-clin-uc-uc-cat-claims-owner-prd
- grp-intuitas-clin-uc-uc-sch-claims.curated-reader-uat
- grp-intuitas-fin-pbi-pbi-ws-execdash-viewer-prd
- grp-intuitas-ent-kva-kva-keys-custodian-prd
- grp-intuitas-clin-dlk-dlk-path-/curated/claims/-reader-prd-aue
- grp-intuitas-clin-dbx-ws-partnerlake-reader-prd-ext-acme
Policies
Under development. (Contact us to know more).
Recommended areas to align to non-functional requirements:
- Data retention (duration, archival, legal hold)
- Key retention and rotation cycles
- Backup and recovery standards
- Incident response and escalation procedures
- Access review and recertification
Frameworks
Under development. (Contact us to know more).
Recommended areas to align to industry and cyber compliance:
- Engineering standards (e.g., code repositories, CI/CD security, IaC policies)
- Security frameworks (e.g., NIST, ISO 27001, CIS Benchmarks, Zero Trust)
- Compliance mappings (HIPAA, GDPR, SOC2, local regulatory obligations)