dbt style guide - Data Voyager 📊

## Project structure Our projects are structured in a way that the following principles are easy to observe: - Limit references to raw data - Rename and recast fields once - Group models in directories - Add tests to your models - Consider the information architecture of your data warehouse - Separate source-centric and business-centric transformations More principles can be found in dbt’s [Best practices for workflows | dbt Developer Hub](https://docs.getdbt.com/best-practices/best-practice-workflows) guide #### Data Transformation 101 The data in our projects have three distinct checkpoints: 1. **Sources**: Schemas and tables in a source-conformed structure (i.e. tables and columns in a structure based on what an API returns), loaded by a third party tool. 2. **Staging models**: The atomic unit of data modeling. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed, recast, or usefully reconsidered into a consistent format. 3. **Marts models**: Models that represent business processes and entities, abstracted from the data sources that they are based on. In a simple project, these may be the only models you build; more complex projects may have a number of intermediate models that help along this journey, as well as accessories to these models. ## Folder structure In our dbt project, we organise our `models` directory into three distinct folders, each representing one of the three layers. Below is the file tree of an example project. ``` models/ ├── intermediate/ │ ├── _int__models.yml │ ├── fct_payments.sql │ ├── fct_stock_movements.sql │ ├── dim_t1_warehouses.sql │ └── dim_t1_customers.sql ├── marts/ │ ├── finance/ │ │ ├── _finance__models.yml │ │ └── payments.sql │ └── supply-chain/ │ ├── _supply-chain__models.yml │ └── stock_movements.sql └── staging/ ├── manhattan/ │ ├── _manhattan__docs.md │ ├── _manhattan__models.yml │ ├── _manhattan__sources.yml │ ├── base/ │ │ ├── base_manhattan__stock_movements.sql │ │ └── base_manhattan__deleted_stock_movements.sql │ ├── stg_manhattan__stock_movements.sql │ └── stg_manhattan__warehouses.sql └── infopoint/ ├── _infopoint__docs.md ├── _infopoint__models.yml ├── _infopoint__sources.yml ├── stg_infopoint__payments.sql └── stg_infopoint__customers.sql ``` ## Staging layer #### Folders - ✅ Subdirectories based on the source system - ❌ Subdirectories based on loader - ❌ Subdirectories based on business grouping #### File names - ✅ `stg_{source}__{entity}s.sql` - the double underscore between source system and entity helps visually distinguish the separate parts in the case of a source name having multiple words - ❌ `stg_{entity}.sql` - adding the source system into the file name aids in discoverability, and allows understanding where a component model came from even if you aren't looking at the file tree - ✅ `base_{source}__{entity}s.sql` - base models are prefixed with `base_` - ✅ Sources are located inside their source directory and adhere to the following naming convention - `_{source}__sources.yml` #### Models - ✅ One-to-one relationship with source tables - Standard types of staging model transformations: - ✅ Renaming - ✅ Type casting - ✅ Basic computations (e.g. cents to dollars) - ✅ Categorizing (using conditional logic to group values into buckets or booleans) - ❌ Joins - create immediate duplicated computation and confusing relationships that ripple downstream — there are occasionally exceptions though (refer to [base models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging#staging-other-considerations "https://docs.getdbt.com/best-practices/how-we-structure/2-staging#staging-other-considerations") for more info) - ❌ Aggregations - if we start changing the grain of our tables by grouping in this layer, we’ll lose access to source data that we’ll likely need at some point - ✅ Materialized as views ## Intermediate layer #### Folders - ✅ Subdirectories based on business groupings #### File names - ✅ `int_{entity}s_{verb}s.sql` The variety of transformations that can happen inside of the intermediate layer makes it harder to dictate strictly how to name them. The best guiding principle is to think about _verbs_ (e.g. `pivoted`, `aggregated_by_user`, `customer_and_location_joined`, `fanned_out_by_quantity`, `funnel_created`, etc.) in the intermediate layer. #### Models - **Fact** (`fct`) refers to any data that represents something that occurred or is occurring - **Dimension** (`dim`) refers to data that represents a person, place or thing - ✅ Built on staging models rather than directly on sources to leverage the data cleaning that was done in staging - ✅ Materialized as views ## Mart layer #### Folders - ✅ Group by department or area of concern #### File names - ✅ `{entity}s.sql` - marts are categorized between fact (immutable, verbs) and dimensions (mutable, nouns) with a prefix that indicates either, such as: `payments`or `stock_movements` #### Models - ✅ Materialized as view then tables or incremental models as needed (sorted by size small to big) A good general rule of thumb regarding materialization is to always start with a view (as it takes up essentially no storage and always gives you up-to-date results), once that view takes too long to practically _query_, build it into a table, and finally once that table takes too long to _build_ and is slowing down your runs, [configure it as an incremental model](https://docs.getdbt.com/docs/build/incremental-models/ "https://docs.getdbt.com/docs/build/incremental-models/"). As always, start simple and only add complexity as necessary. The models with the most data and compute-intensive transformations should absolutely take advantage of dbt’s excellent incremental materialization options, but rushing to make all your marts models incremental by default will introduce superfluous difficulty. We recommend reading this [classic post from Tristan on the limits of incremental modeling](https://discourse.getdbt.com/t/on-the-limits-of-incrementality/303 "https://discourse.getdbt.com/t/on-the-limits-of-incrementality/303"). [Data modeling techniques for more modularity](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique) If your end user will be writing SQL to pull the data they need, or if they’re using an analysis tool that joins tables for them, then your final data modeling output can be generalized, standardized fact and dimension tables. They can then freely mix and match these to analyze various aspects of the business, without you needing to pre-model the answer to every question in your transformation project. ## General naming and field conventions - `{entity}s` - all entities should be plural - Schema, table and column names should be in `snake_case` - Use names based on the _business_ terminology, rather than the source terminology. For example, if the source database uses `user_id` but the business calls them `customer_id`, use `customer_id` in the model - Each model should have a primary key - The primary key of a model should be named `{entity}_id` e.g. `account_id` - this makes it easier to know what `id` is being referenced in downstream joined models - For base/staging models, fields should be ordered in categories, where identifiers are first and technical timestamps at the end. - Timestamp columns should be named `{event}_at`, e.g. `created_at`, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.g `created_at_pt`. - Booleans should be prefixed with `is_` or `has_` - Price/revenue fields should be in decimal currency (e.g. `19.99` for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with suffix, e.g. `price_in_cents`. - Avoid reserved words as column names - Consistency is key! Use the same field names across models where possible, e.g. a key to the `customers` table should be named `customer_id` rather than `user_id` - Do not use abbreviations or aliases. Emphasize readability over brevity. For example, do not use `cust` for `customer` or `o` for `orders` ## YAML #### Folders It’s important to note that while the top-level YAML files (`dbt_project.yml`, `packages.yml`) need to be specifically named and in specific locations, the files containing your `sources` and `models` dictionaries can be named, located, and organized however you want Model-level configuration settings override project-level settings for the specific models where they are defined - Config per folder - `_{directory}__models.yml` per directory (source or domain) in your models folder that configures all the models in that directory. For staging folders, also include a `_{directory}__sources.yml` per directory - The leading underscore ensures your YAML files will be sorted to the top of every folder - ❌ Config per project - ❌ Config per model only for specific cases (for ex in the mart layer if we have different materialization strategy than the default view) #### Style - Indents should be two spaces - List items should be indented - Use a new line to separate list items that are dictionaries where appropriate - Lines of YAML should be no longer than 80 characters ## Model configuration - Model-specific attributes (like sort/dist keys) should be specified in the model - If a particular configuration applies to all models in a directory, it should be specified in the `dbt_project.yml` file. - In-model configurations should be specified like this: `{{ config( materialized = 'view', alias = 'customers' ) }} select * from ...` - Marts should always be configured as tables ## dbt conventions - Only `stg_` and `base_` models should select from `source` data - All other models should only select from other models ## Testing - At a minimum, `unique` and `not-null` tests should be applied to the primary key of the model. ## CTEs - All `{{ ref('...') }}` statements should be placed in CTEs at the top of the file - Where performance permits, CTEs should perform a single, logical unit of work. - CTE names should be as verbose as needed to convey what they do - CTEs with confusing or noteable logic should be commented - CTEs that are duplicated across models should be pulled out into their own models - create a `final` or similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!) - Limit the data scanned by CTEs as much as possible. Where possible, only select the columns you're actually using and use `where` clauses to filter out unneeded data - CTEs should be formatted like this: `with events as ( ... ), -- CTE comments go here filtered_events as ( ... ) select * from filtered_events` ## SQL style guide - Use trailing commas - Indents should be four spaces (except for predicates, which should line up with the `where` keyword) - Lines of SQL should be no longer than 80 characters - Field names and function names should all be lowercase - An ambiguous field name such as `id`, `name`, or `type` should always be prefixed by what it is identifying or naming (eg. `account_id`, `account_name`, `account_type`) - Dates should end with `_date` - The `as` keyword should be used when aliasing a field or table - Fields should be stated before aggregates / window functions - Aggregations should be executed as early as possible before joining to another table. - Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see [this rant](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/ "https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/") for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design. - Prefer `union all` to `union` [*](http://docs.aws.amazon.com/redshift/latest/dg/c_example_unionall_query.html "http://docs.aws.amazon.com/redshift/latest/dg/c_example_unionall_query.html") - Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers". - If joining two or more tables, _always_ prefix your column names with the table alias. If only selecting from one table, prefixes are not needed. - Be explicit about your join (i.e. write `inner join` instead of `join`). `left joins` are normally the most useful, `right joins` often indicate that you should change which table you select `from` and which one you `join` to. - When making single line comments in a model use the `--` syntax - When making multi-line comments in a model use the `/* */` syntax - _**DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE**_ ## Jinja Style Guide - When using Jinja delimiters, use spaces on the inside of your delimiter, like `{{ this }}` instead of `{{this}}` - Use newlines to visually indicate logical blocks of Jinja ### References - [dbt Style Guide](https://asadventure.atlassian.net/wiki/spaces/B2/pages/3634921534)