DBT and BigQuery: How to gain visibility and control over your data cost (Episode 2)

In Episode 1, we covered foundational BigQuery costs and how to use dbt metadata to improve visibility and cloud cost management. In this second episode, we’ll explore BigQuery’s built-in cost optimization techniques, which are essential for FinOps professionals looking to efficiently manage cloud spending on BigQuery workloads. We’ll dive into Partitioning and Clustering, Incremental Processing with dbt, and BigQuery Statistics for Continuous Improvement—all powerful tools for controlling costs in the cloud.


1. BigQuery Partitioning and Clustering for Cost Optimization

Partitioning and clustering are two of the most effective BigQuery cost optimization techniques for managing large data workloads. By structuring data to limit scans to relevant segments, these techniques allow for faster queries and lower costs.

Partitioning BigQuery Tables

Partitioning a BigQuery table divides its data into segments based on a column, such as a date, so queries scan only the relevant subsets. This approach can significantly reduce BigQuery query costs.

To optimize partitioning:

  • Identify Key WHERE Clauses: Look for columns frequently used in WHERE clauses, especially in large tables, as candidates for partitioning.
  • Match Partitions with Data Update Patterns: Align partitions with natural data intervals, such as daily or monthly updates, for further efficiency.

At Costory, we partition billing tables by billing_date (usually the first of each month) to simplify monthly updates. For tables with frequent user queries, we partition by usage_date to streamline usage-based data filtering.

Example of setting up partitioning in dbt:

{{ config(
    materialized = 'table',
    partition_by = {"field": "usage_date", "data_type": "DATE"}
) }}
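
Once a table is partitioned, downstream queries that filter on the partition column scan only the matching partitions. A minimal sketch of such a query, assuming a hypothetical billing table partitioned by usage_date (table and column names are illustrative):

SELECT
  customer_id,
  SUM(cost) AS total_cost
FROM your_project.your_dataset.billing  -- illustrative partitioned table
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)  -- prunes to the last ~30 daily partitions
GROUP BY customer_id

Without the filter on usage_date, the same query would be billed for the referenced columns across the entire table.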

Clustering BigQuery Tables for Cost Savings

Clustering groups data within partitions based on specified columns, reducing the amount of data a query has to scan. Rows that share the same clustering column values are stored next to each other. It’s especially useful for smaller tables, where partitioning might do more harm than good.

Example dbt setup for clustering:

{{ config(
    materialized = 'table',
    partition_by = {"field": "event_date", "data_type": "DATE"},
    cluster_by = ['customer_id', 'event_type']
) }}
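
Queries that filter on the clustering columns then read fewer blocks within each partition. A hedged sketch of a query that would benefit from the configuration above (table and filter values are illustrative):

SELECT
  event_type,
  COUNT(*) AS events
FROM your_project.your_dataset.events  -- illustrative partitioned and clustered table
WHERE event_date = '2024-06-01'        -- partition pruning
  AND customer_id = 'customer_123'     -- block pruning thanks to clustering
GROUP BY event_type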

Note: While clustering optimizes costs, the estimated data scan size shown in the BigQuery console UI may not reflect it immediately; the accurate figure appears in the actual bytes processed once the query has run.


2. Incremental Processing with dbt for Cost Control in BigQuery

With dbt incremental processing, only new or modified records are processed on each run, which minimizes BigQuery compute costs. This cost-saving technique is highly valuable for resource-intensive models and frequently updated tables.

Incremental processing is ideal when:

  • Processing Costs Justify It: Only apply incremental updates to models that are large or complex enough to benefit from reduced compute costs.
  • Source Data Allows Incremental Queries: The source data must allow querying recent changes, like newly added records.

Here’s a sample dbt configuration for incremental processing:

{{ config(
    materialized = 'incremental',
    unique_key = 'record_id'
) }}

SELECT *
FROM source_table
{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what is already in the target table
WHERE updated_at >= (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Tip: Refer to dbt’s incremental strategy documentation for additional strategies to optimize incremental models.
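
One strategy worth highlighting for cost: on partitioned models, the insert_overwrite strategy replaces only the affected partitions instead of merging on a unique key. A minimal sketch, assuming a model partitioned by usage_date where the last three days of data are reprocessed on each run:

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {"field": "usage_date", "data_type": "date"}
) }}

SELECT *
FROM source_table
{% if is_incremental() %}
-- Reprocess only recent data; the matching partitions are overwritten in place
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}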


3. Using BigQuery Statistics for Continuous Cloud Cost Management

BigQuery’s built-in statistics and cost metrics allow you to monitor and optimize cloud cost management over time. Regularly reviewing metrics like data size, query duration, and resource usage reveals areas for improvement and tracks cost optimization progress.

Key BigQuery Metrics for Cost Management

  • Query Duration & Cost per Query: Identify long-running, expensive queries that could be optimized with BigQuery’s cost-saving features, like partitioning and dbt’s incremental processing.
  • Resource Utilization: Use total_bytes_processed and total_slot_ms metrics to track compute resource usage and identify high-cost workloads for further optimization.

The following query helps identify high-cost tables or models by checking job data over the past day:

SELECT
  job_id,
  query,
  referenced_tables,
  total_slot_ms,
  total_bytes_billed / POW(1024, 3) AS gb_billed
FROM `your_project.your_region.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) AND CURRENT_TIMESTAMP()
ORDER BY total_bytes_billed DESC
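
To track progress over time rather than per query, the same view can be aggregated by day. A sketch, assuming on-demand pricing where cost scales with bytes billed:

SELECT
  DATE(creation_time) AS query_date,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `your_project.your_region.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY query_date
ORDER BY query_date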

With these insights, FinOps teams can fine-tune configurations and adjust query patterns for efficient cost management in the cloud.


Key Takeaways

BigQuery’s features like partitioning, clustering, and incremental processing with dbt are essential for effective FinOps and cloud cost management. Leveraging these tools along with BigQuery’s cost metrics and regular usage reviews creates a solid framework for keeping data processing costs under control.

By implementing these strategies, you make your BigQuery and dbt workloads more efficient, scalable, and budget-friendly, aligning cloud operations with business goals.

In our next episode, we’ll dive into integrating cost management tools with BigQuery and dbt for real-time visibility into costs, so stay tuned!