BigQuery - How to efficiently join 2 large tables every 5 minutes without incurring a large cost

17:50 30 Aug 2024

This question is a follow up question to Why is there such a difference in performance between these two queries?.

I am having difficulty finding an efficient way to frequently query and join two large BigQuery tables without running up a large cost. I am running in On Demand mode. My goal is to load data into a target_table in near-realtime. Every 5 minutes if possible, or even more frequently.

Here is the use case:

I have two tables: table_a and table_b.
Both tables are partitioned evenly over 10,000 partitions on the record_id field (reference key used to join both tables).
record_id is unique in table_a, but not table_b.
Both tables contain more than 20 million records.
Both tables are ingesting new data constantly (a few thousand per minute).
Every 5 minutes (or more frequently if possible), I need to query by joining the tables by record_id, and filter to locate only records that were added in the last 5 minutes. The resulting records need to be inserted into a target_table.
Both tables contain approximately 30 fields, and all of these fields need to be inserted into the target_table (target_table will contain 60 fields).
I have a mechanism that captures the record_id for records as they are ingested in both tables (table_a_new_record_ids and table_b_new_record_ids). These tables get purged periodically. At any given point in time, they contains only the ids that were added since the last time the query ran. So these tables can be used to identify changed records which need to be loaded into the target_table.

Now there is a caveat:

It is possible that a record gets inserted into table_a that joins to a table_b record which was inserted a long time ago.
It is also possible that a record gets inserted into table_b that joins to a table_a record which was inserted a long time ago.
If a record exists in table_b, a matching record must exist in table_a, but the reverse is not true: table_a might not have any matching record in table_b.

Consequently, it is not possible to simply filter each table based on the associated new_record_ids table and join the resulting data. In other words, we might get a new record added to table_b but nothing new added to table_a (but table_a already contains a matching record). In that case we need to retrieve the new record from table_b and the matching record from table_a even though nothing new was added to table_a. For this reason, date partitioning won't work (because we need to fetch records in one of the tables regardless of the date ingested).

Instead, what is needed is something like this:

select *
from table_a as a
   left join table_b as b
   on a.record_id = b.record_id
where a.record_id in (select record_id from table_a_new_record_ids)
   OR b.record_id in (select record_id from table_b_new_record_ids)

Writing a query which retrieves the correct results is not difficult (as shown above). The problem I'm having is that when I run this query I'm getting 800MB billed. Doing the math, if I run this query every 5 minutes its going to incur a cost of $43 per month. If I need to do this for 100 tables, it's going to cost $4300 per month.

So my question is how can this be achieved without running up such a large bill?

It's possible to query both of those tables manually in a series of steps which results in less than 1MB in usage, so I'm thinking it should be possible to achieve the same thing in a single query (or several queries built up as temporary tables).

Here is an example of what I mean:

create temporary table tmp_a as
select * from table_a where record_id in (1,2,3,4,5);

create temporary table tmp_b as
select * from table_a where record_id in (2,3,4);

select *
from tmp_a as a
   left join tmp_b as b
   on a.record_id = b.record_id;

That query will run very efficiently and process less than 1MB of data. This is because each of the queries used to populate the tmp tables is able to efficiently prune the partitions according to this question. But that only works if I use a constant in the IN clause and only if I have a limited number of values. If I try to rewrite those queries like this this I end up with 800MB again:

create temporary table tmp_a as
select * from table_a where record_id in (select record_id from table_a_new_record_ids);

create temporary table tmp_b as
select * from table_a where record_id in (select record_id from table_b_new_record_ids);

select *
from tmp_a as a
   left join tmp_b as b
   on a.record_id = b.record_id;

So I'm at a loss as to how I can keep updating the target_table with newly added data every few minutes without incurring a very large cost. It seems like there must be a better way to do this.

performance google-bigquery query-optimization partition

Your Answer

Privacy & Cookie Consent