dbt: Data Build Tool for Analytics Engineering
Transform raw data into analytics-ready datasets using dbt's modular SQL-based workflow. Build, test, and document your data transformations with version control and CI/CD.
- Step 1
Install Python and pip
dbt requires Python 3.8 or higher. Verify your installation before proceeding.
python --version pip --version - Step 2
Install dbt-core and adapter
Install dbt-core along with the appropriate database adapter. For Databricks, use dbt-databricks. Other popular adapters include dbt-postgres, dbt-snowflake, dbt-bigquery, and dbt-redshift.
# For Databricks pip install dbt-core dbt-databricks # For PostgreSQL pip install dbt-core dbt-postgres # For Snowflake pip install dbt-core dbt-snowflake # For BigQuery pip install dbt-core dbt-bigquery - Step 3
Verify dbt installation
Check that dbt is installed correctly and view the version.
dbt --version - Step 4
Initialize a new dbt project
Create a new dbt project with a starter directory structure. This creates models/, tests/, macros/, and configuration files.
dbt init my_analytics_project cd my_analytics_project - Step 5
Configure database connection (Databricks)
Edit profiles.yml to configure your Databricks connection. This file is typically located at ~/.dbt/profiles.yml. You'll need your Databricks workspace URL, HTTP path, and personal access token.
my_analytics_project: target: dev outputs: dev: type: databricks host: <your-workspace>.cloud.databricks.com http_path: /sql/1.0/warehouses/<warehouse-id> token: <your-personal-access-token> schema: analytics_dev threads: 4 prod: type: databricks host: <your-workspace>.cloud.databricks.com http_path: /sql/1.0/warehouses/<warehouse-id> token: <your-personal-access-token> schema: analytics_prod threads: 8⚠ Heads up: Never commit your profiles.yml file with credentials. Use environment variables in CI/CD. - Step 6
Configure database connection (PostgreSQL)
Example configuration for PostgreSQL. Adjust the profiles.yml for your database type.
my_analytics_project: target: dev outputs: dev: type: postgres host: localhost port: 5432 user: <your-username> password: <your-password> dbname: analytics schema: analytics_dev threads: 4 - Step 7
Test your connection
Verify that dbt can connect to your database successfully.
dbt debug - Step 8
Understand the project structure
A dbt project contains models (SQL transformations), tests, macros (reusable SQL), and configuration files. The dbt_project.yml file configures project-level settings.
my_analytics_project/ ├── dbt_project.yml # Project configuration ├── models/ # SQL transformation files │ ├── staging/ # Raw data cleaning │ ├── intermediate/ # Business logic │ └── marts/ # Final analytics tables ├── tests/ # Custom data tests ├── macros/ # Reusable SQL snippets ├── seeds/ # CSV files to load ├── snapshots/ # SCD Type 2 tracking └── analyses/ # Ad-hoc queries - Step 9
Create your first model
Models are SELECT statements saved as .sql files. dbt will materialize them as views or tables in your database. Create a simple staging model.
-- models/staging/stg_customers.sql with source as ( select * from raw.customers ), cleaned as ( select id as customer_id, lower(email) as email, first_name, last_name, created_at from source where email is not null ) select * from cleaned - Step 10
Configure model materialization
Use the config block to specify how dbt should materialize your model. Options include view (default), table, incremental, or ephemeral.
-- models/marts/fct_orders.sql {{ config( materialized='table', tags=['daily'] ) }} select order_id, customer_id, order_date, total_amount, status from {{ ref('stg_orders') }} where status != 'cancelled' - Step 11
Run your models
Execute all models in your project. dbt will build them in dependency order based on ref() calls.
# Run all models dbt run # Run a specific model dbt run --select stg_customers # Run models matching a tag dbt run --select tag:daily # Run models in a directory dbt run --select models/staging - Step 12
Use the ref() function for dependencies
The ref() function references other models and creates a dependency graph. This ensures models run in the correct order.
-- models/marts/customer_orders.sql select c.customer_id, c.email, count(o.order_id) as total_orders, sum(o.total_amount) as lifetime_value from {{ ref('stg_customers') }} c left join {{ ref('stg_orders') }} o on c.customer_id = o.customer_id group by 1, 2 - Step 13
Add schema tests
Tests validate your data quality. Built-in tests include unique, not_null, accepted_values, and relationships. Define tests in a schema.yml file.
# models/staging/schema.yml version: 2 models: - name: stg_customers description: "Cleaned customer records from raw data" columns: - name: customer_id description: "Primary key" tests: - unique - not_null - name: email tests: - not_null - unique - name: created_at tests: - not_null - Step 14
Run tests
Execute all configured tests to validate data quality.
# Run all tests dbt test # Test a specific model dbt test --select stg_customers # Run models and tests together dbt build - Step 15
Create incremental models
Incremental models only process new or updated records, improving performance for large datasets. Use the is_incremental() macro to add a where clause on subsequent runs.
-- models/marts/fct_events.sql {{ config( materialized='incremental', unique_key='event_id' ) }} select event_id, user_id, event_type, event_timestamp from {{ ref('stg_events') }} {% if is_incremental() %} where event_timestamp > (select max(event_timestamp) from {{ this }}) {% endif %} - Step 16
Add documentation
Document your models, columns, and tests using descriptions in schema.yml files. Generate a documentation website with dbt docs.
# Generate documentation dbt docs generate # Serve documentation locally dbt docs serve - Step 17
Create reusable macros
Macros are Jinja functions that generate SQL. They promote DRY principles and can accept arguments.
-- macros/cents_to_dollars.sql {% macro cents_to_dollars(column_name) %} ({{ column_name }} / 100.0)::decimal(10,2) {% endmacro %} -- Usage in a model: select order_id, {{ cents_to_dollars('amount_cents') }} as amount_dollars from {{ ref('stg_orders') }} - Step 18
Load seed data
Seeds are CSV files in the seeds/ directory. Use them for small reference datasets like country codes or status mappings.
# Load all CSV files from seeds/ dbt seed # Reference in models using ref() select * from {{ ref('country_codes') }} - Step 19
Create snapshots for SCD Type 2
Snapshots track historical changes using slowly changing dimension (SCD) Type 2 logic. They add dbt_valid_from and dbt_valid_to timestamps.
-- snapshots/customers_snapshot.sql {% snapshot customers_snapshot %} {{ config( target_schema='snapshots', unique_key='customer_id', strategy='timestamp', updated_at='updated_at' ) }} select * from {{ source('raw', 'customers') }} {% endsnapshot %} - Step 20
Run snapshots
Execute snapshots to capture the current state and track changes over time.
dbt snapshot - Step 21
Use sources for raw data
Define sources in schema.yml to reference raw tables and add freshness checks. Use source() instead of hardcoding table names.
# models/staging/sources.yml version: 2 sources: - name: raw database: production schema: raw_data tables: - name: customers freshness: warn_after: {count: 12, period: hour} error_after: {count: 24, period: hour} loaded_at_field: _loaded_at - name: orders - Step 22
Check source freshness
Verify that your source data is up-to-date based on the freshness rules defined in sources.yml.
dbt source freshness - Step 23
Set up dbt Cloud (optional)
dbt Cloud provides a hosted IDE, scheduler, and job orchestration. Sign up at getdbt.com and connect your repository.
1. Create account at cloud.getdbt.com 2. Connect your git repository 3. Configure database credentials 4. Schedule jobs for production runs 5. Set up CI checks on pull requests - Step 24
Set up CI/CD (GitHub Actions example)
Automate dbt runs and tests in your CI/CD pipeline. This example runs tests on pull requests.
# .github/workflows/dbt-ci.yml name: dbt CI on: pull_request: branches: [main] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - uses: actions/setup-python@v4 with: python-version: '3.11' - name: Install dbt run: pip install dbt-databricks - name: Run dbt tests env: DBT_DATABRICKS_TOKEN: ${{ secrets.DBT_DATABRICKS_TOKEN }} DBT_DATABRICKS_HOST: ${{ secrets.DBT_DATABRICKS_HOST }} run: | dbt deps dbt build --target ci - Step 25
Install dbt packages
Extend dbt with community packages for common patterns. Packages are defined in packages.yml and installed with dbt deps.
# packages.yml packages: - package: dbt-labs/dbt_utils version: 1.1.1 - package: calogica/dbt_expectations version: 0.10.0 - Step 26
Install packages
Download and install dbt packages defined in packages.yml.
dbt deps - Step 27
Common dbt commands reference
Quick reference of frequently used dbt commands for daily workflows.
dbt run # Run all models dbt test # Run all tests dbt build # Run models + tests dbt compile # Compile SQL without running dbt clean # Delete target/ and dbt_packages/ dbt debug # Test database connection dbt deps # Install packages dbt seed # Load CSV seeds dbt snapshot # Run snapshots dbt source freshness # Check source freshness dbt docs generate # Generate docs dbt docs serve # Serve docs locally # Selection syntax dbt run --select model_name dbt run --select models/staging dbt run --select tag:daily dbt run --select +model_name # model + upstream dbt run --select model_name+ # model + downstream
Feature requests
Sign in to suggest features or vote on existing ones.
No feature requests yet.
Discussion
Sign in to join the discussion.
No comments yet.