You are here

BigQuery: My "Best Practices" List

Corey Pennycuff's picture
BigQuery.png
BigQuery, part of the
Google Cloud Platform

If you're not familiar with it, BigQuery is an indispensable tool in a Data Warehouse. When you are responsible for databases which grow by 10's of gigs of data per day, the traditional relational database best practices are thrown out the window.

I recently posted about my personal list of best practices in a Hacker News thread, but I want to repeat and expand on that here, so that hopefully it's easier for someone to find and benefit from.

Overview: How is BigQuery different from traditional, relational databases?

Both can have relational data. Both use SQL. There is a lot of overlap, to be sure, so I will focus on the biggest difference, and that is that BigQuery does not use indexes.

That's right. No indexes at all. "But," you say, "indexes are how relational databases are fast!" Yes, but that is a tradeoff, too. Try indexing every column in your database and see how much it slows down on inserts and how much additional disk space it uses.

BigQuery uses partitions, which is a database column that is a number (integer, IIRC) or date. Data may be stored separately (physically or logically) by their different partition. You can think of it as "everything from April 2 is in this bin". If it has to look for something in that bin, then it must do a full scan through all of the data. If it needs to scan multiple partitions, then it can do so in parallel. The paritioning strategy makes writes faster and organization simpler. There is also clustering, which is a further optimization within a partition. In essence, clustering allows the grouping of similar values within a partition.

One other big difference is that BigQuery has nested fields. Some databases also have nested fields, but you can use them in BigQuery to denormalize data and avoid costly ($$$) joins.

Lastly, BigQuery uses columnar storage, traditional RDBMS usually use row-oriented storage. It can be helpful to know when trying to analyze why a query is slow.

Having said that, here are my rules:

Rule #1: BigQuery is not a standard database. If you use it like one, it will cost a fortune.

Pricing is often hard to judge on any cloud service, and you'll probably just have to build something and see the cost trends. BigQuery should not be used as a daily, quick-retrieval system. You would never run a website that executes 20 queries per page load. First it would be slow, and second, the data is probably too small to benefit from the infrastructure that BigQuery brings to the table. It's like using a big rig tractor trailer for your daily commute around town. It could do it, but it's the wrong tool for the job.

BigQuery is meant to crunch large amounts of data (for standard analysis or machine learning tasks). It's good at it. It's fast for that application. It scales well, but has a larger "startup" cost. It's simply the wrong tool for small datasets if you are looking to keep costs down.

Rule #2. BigQuery is amazing for being able to churn through and analyze massive amounts of data, and can very well be the best option in some use cases.

I include this because, quite frankly, some people will dismiss BigQuery because they have come from a traditional database background and they don't understand the scale of the problems that BigQuery solves. I have seen it analyze tons of data at blistering speeds, and, in 30 seconds, produce an analysis that took a week to complete using ElasticSearch. It's all in how the data is structured in order to fit the tool.

Rule #3. Letting "just anyone" run queries is as dangerous as casually handing a credit card to your drug-addicted cousin. Just wait until you get the bill!

GCP has built-in cost controls, and there are limits that you can enforce. If you don't, then you are at fault. There are many foot guns in the pricing of any cloud service, and BigQuery is no exception. You need to manage your costs, and that is a job that requires constant vigilance. It's going to have to be part of someone's responsibilities, and it takes time, and it must be done often (even daily, depending on the circumstances).

Rule #4: Partition and cluster your data wisely. You don't have indexes.

SELECT * FROM `table` LIMIT 1;

I saw this query executed on a very large table that did not have a partition. While the exact amount escapes me, this single query cost over $130.00 USD. Most of our queries only cost pennies, because the data was structured well and the queries were optimized to the storage.

I have also seen a developer execute a $25 query a dozen times in an attempt to reduce the cost. We did get it down to ~$5 a query (it did a lot of processing), and it must run monthly for multiple clients, so it was worth the time to optimize it. But, again, most of our queries cost nowhere near this.

Rule #5: Duplicate data. Throw all of the normal forms out the window. Storage is cheap, computation is expensive.

In a traditional database, suppose you have three tables:

TABLE `person`
ID | Name
---------
1  | Alice
2  | Bob
 
TABLE `pet`
ID | Name
---------
A  | Rover
B  | Chewy
C  | Otis
 
TABLE `owner`
person | pet
------------
1      | A
1      | B
2      | C
2      | A

This follows good normalization practice. But it is not the best way to store it in BiqQuery. A BigQuery table would look like this (although this is not a formal syntax):

TABLE `person`
ID | Name  | Pet<ID, Name>
--------------------------
1  | Alice | <A, Rover>, <B, Chewy>
2  | Bob   | <C, Otis>, <A, Rover>

Yes, it duplicates data, but it is much faster to query over when hundreds of millions of rows are needed. Yes, there are problems of duplication (such as what happens when you need to change data), but those issues must be mitigated in other ways. Understand that traditional databases came into popularity when storage was a scarce resource. BigQuery treats storage as plentiful, and processing is more expensive. It's a tradeoff that pays of big for some workloads.

Rule #6: BigQuery is not meant to be used like MySQL. It's "spin up" time is too slow, but you would be hard-pressed to beat its performance on truly large data sets.

SELECT 1;

This query is not instant. It must hand off the query to an instance to process the request. If the query is large, it will spin up many, many instances so that it can process the query in a reasonable time. But it is not instantaneous. It's a tradeoff to know about and account for.

Rule #7: BigQuery is still being actively developed. You need to stay on top of the changes.

Recently, BigQuery announced a change to your choice of how billing is done. If you didn't hear about it, then you might be paying more than you need to. You are in control of the change, however, and you must initiate it.

Tomorrow they may announce a new feature or option to save you money... or cost you more money. You have to manage it. You must have someone whose job is to monitor the tool itself to watch for unexpected cost/usage spikes, billing advancements, and feature improvements. Moreover, you often won't know the final costs involved until you experiment. It costs time and money, and is just part of the development process.

Conclusion

BigQuery is an important tool that fits a specific need in Data Warehouse environments and workloads. It is a semi-truck in a world of sedans. It can be used to great effect, and it can be misused to great disaster. We use it at my company because it is an invaluable tool for what we do. Knowing whether or not it's good for you, though, will require experimentation and work, and perhaps also ask someone who has experience with it.

As always, have fun!

Tags: