Most web applications use a data store of some kind, often a relational database. When a web app becomes successful, it can become all too easy to start "hoarding" data in the database. But hoarding data leads to the unbounded growth of database tables (both row count and stored data size).
While this works fine to a certain point, it is very useful to prevent some data bloat — or, if you cannot prevent it, to plan for your infrastructure ahead of time to adequately manage growth.
Before we dive in, let's see how we might end up with bloated applications.
More Data is Not Always Great
Most of the applications we work on tend to get bigger with time.
If you use a cloud provider for your database, you might hit your allocated storage limit. Once that happens, you will need to upgrade to a different instance type. Heroku PostgreSQL databases have a limit, for example, a "hobby" tier instance is limited to 1GB of data.
Having more data also has implications for query speeds. What used to be possible without an index becomes impossible with larger tables. Some row range scans will get slower. More locks will be acquired to do database UPDATE
and DELETE
operations.
How Database Tables Grow
Data hoarding happens gradually. What is not a problem today can easily become a problem in a month or six months. The most treacherous thing about data hoarding is that it is so easy to miss. Consider a few very classic scenarios:
- For compliance reasons, you adopt a gem like
paper_trail
and get anaudit_log_entries
table. Every operation with a mode of significance in your application creates a row in theaudit_log_entries
table. Those audit log entries never get archived. - You accept uploads and are using ActiveStorage. You never delete the uploads, so your
activestorage_blobs
table keeps getting bigger and bigger. - You run a shared CMS for publishing on the Web, and you allow article segments to be stored in your database. Your platform becomes successful, but most of your authors write book-sized articles. The
pages
table becomes very sizeable even though it only contains a few thousand pages. - You allow user-uploaded content, but you do not delete the data for compliance reasons, and instead use something like
paranoia
to delete via a flag. Youruser_items
table grows indefinitely, and without you knowing it, passes 10M rows.
These patterns have implications if not caught in time. If you have a good view of the size of your tables, you can anticipate when you will have to upgrade and schedule maintenance at off-peak times to perform upgrades with less user impact.
Making projections also becomes much easier. For example:
- Today our
events
table fits in memory. At the current growth rate, it won't fit in memory in seven months. - We are at 30% storage use for our RDS instance type. We will hit 90% in January next year.
- We have a full table scan query on
payments
to compute a function over every row where input depends on the query. We know that thepayments
table will exceed two million rows in three weeks and 20 million by January next year.
These all have the potential to present as an incident or an outage. But if you attack early enough, you can mitigate against this fairly easily. For example:
- Set up archiving for all data older than 30 days in the
events
table. - Upgrade your RDS instance in four months.
- To limit the full table scan, add an additional
WHERE
condition to our query to compute the function on a much smaller subset of rows instead.
Gain Visibility Over Your Database Growth
To keep an eye on your database growth, you have two possibilities:
- Install special tooling (like stats for MySQL) and connect it to your metrics collection engine via Prometheus, Telegraph, or other tools.
- Use AppSignal, especially if you're already using it for your application.
AppSignal can store several metric types, and one of the metric types it supports is called a gauge
. A gauge
is a single time series per environment (like production
, staging
, or development
) that you can update from time to time. AppSignal metrics also allow tags, so we can automatically create some tagged gauge
metrics for our database tables. Let's do that:
db.row_count
for the approximate number of rows in our tables, per table (we'll get to approximates later)db.data_size_bytes
for the number of bytes a table is usingdb.index_size_bytes
for the number of bytes a table's indices are using
Note that I am suffixing the metric name with the value type — this helps later on when we define how the metric will be displayed. The separate "data" and "indices" metrics are also important. If a table contains many rows and more than a few indices, the size of those indices might be two or three times the size of the stored data (since every index stores derived data for itself, creating some storage overhead).
Counting Database Table Rows Quickly
The "approximate" part of the number of rows is essential. Normally, if you want a precise number of rows in a table, you can do a query like SELECT COUNT(1) FROM my_table
. However, it might be slower than we would like.
When you COUNT
, the database ensures that the number of rows in the table does not change during the query — and so will lock the table or create a divergent transaction while the query runs. The larger the table, the slower the query — more rows will be scanned, and more locks will accumulate.
Therefore, when all we want is a "close enough" approximation of the number of rows (for performance estimation, the precision of 10-30 thousand rows less or more is good enough), we can use the internal database engine statistics to query for this type of data.
Most databases have optimized access to the table data since rows are written out in "pages". The database engine keeps track of which rows are assigned to which pages, roughly in insertion order:
- Rows 1-100 are on page 1
- Rows 101-200 are on page 2
- Rows 201-300 are on page 3
and so on.
As the engine knows how many pages it has per table and the rough row count per page it uses, it can count the pages it has allocated and then multiply this by the page size (rows per page) to give you that estimate. This has a few advantages: the count is very fast, and the table does not need to be locked while the query runs.
Record Table Sizes for MySQL
What we then need to do differs per database since we need to query the database engine for the table statistics. Let's start with MySQL. Here's the query we need to run:
SHOW TABLE STATUS
Then, grab a few columns from its output. The ones we are interested in are Data_length
, Index_length
, and Rows
. The size of a table is defined as the sum of Data_length + Index_length
, and the approximate row count based on pages is in Rows
.
Let's package it into a code block that collects this data for your entire database. Since we are not working with any ActiveModel
classes, we are going to use the query methods directly provided by ActiveRecord:
table_information_rows = ActiveRecord::Base.connection.select_all("SHOW TABLE STATUS").to_ary table_information_rows.each do |table_info| table_properties = { data_size_bytes: table_info.fetch('Data_length'), index_size_bytes: table_info.fetch('Index_length'), approximate_row_count: table_info.fetch('Rows'), } table_properties.each_pair do |metric_name, metric_value| Appsignal.set_gauge("db.#{metric_name}", metric_value.to_i, table: table_info.fetch('Name')) end end
Record Table Sizes for PostgreSQL
We need a more elaborate query for PostgreSQL since we do not have a shortcut query like SHOW TABLE STATUS
. We have to query internal PostgreSQL tables:
query = <<~EOS SELECT table_name AS Name, full_table_name AS Full_name, pg_table_size(full_table_name) AS Data_length, pg_indexes_size(full_table_name) AS Index_length, row_count_estimate AS Rows FROM ( SELECT t.table_name AS table_name, (t.table_schema || '.' || t.table_name ) AS full_table_name, c.reltuples::bigint AS row_count_estimate FROM information_schema.tables AS t INNER JOIN pg_class AS c ON c.relname = t.table_name INNER JOIN pg_namespace AS n ON n.oid = c.relnamespace AND n.nspname = t.table_schema WHERE t.table_schema = 'public' ) AS all_tables EOS table_information_rows = ActiveRecord::Base.connection.select_all(query).to_ary table_information_rows.each do |table_info| table_properties = { data_size_bytes: table_info.fetch('data_length'), index_size_bytes: table_info.fetch('index_length'), approximate_row_count: table_info.fetch('rows'), } table_properties.each_pair do |metric_name, metric_value| Appsignal.set_gauge("db.#{metric_name}", metric_value.to_i, table: table_info.fetch('name')) end end
Note that this only accounts for the public
schema (the default one you are likely using). If you want to include other schemas, you will need to remove the WHERE t.table_schema = 'public'
condition and replace table_info.fetch('name')
with table_info.fetch('full_table_name')
.
Make Sure Metrics Update Regularly
This block has to run at regular intervals, so it is a good idea to put it into the Sidekiq scheduler or a Rake task that runs from cron. For example, if you use good_job, you can add it to your "cron" section like so:
config.good_job = { # ... your other configuration cron: { record_database_metrics: { cron: "@hourly", class: "RecordDatabaseMetricsJob" } } }
Create a Dashboard
Once you have data coming in, build a dashboard. You can copy the following dashboard for your needs:
{ "title": "Database Tables", "description": "", "visuals": [ { "title": "Row Counts", "line_label": "%name% %table%", "display": "LINE", "format": "number", "draw_null_as_zero": false, "metrics": [ { "name": "db.approximate_row_count", "fields": [{ "field": "GAUGE" }], "tags": [{ "key": "table", "value": "*" }], }, ], "type": "timeseries", }, { "title": "Table Sizes", "line_label": "%table%", "display": "LINE", "format": "size", "format_input": "byte", "draw_null_as_zero": false, "metrics": [ { "name": "db.data_size_bytes", "fields": [{ "field": "GAUGE" }], "tags": [{ "key": "table", "value": "*" }], }, ], "type": "timeseries", }, ], }
Click Add dashboard
and then Import dashboard
in the modal dialog that appears. The dashboard will give you graphs like this one:
And this table sizes graph:
Note how we use wildcard tags to automatically build graphs for every table in the database.
Interpreting the Data
When looking at your data, look out for exponential or linear growth in either row count or size — in other words, a table getting bigger and bigger.
If you see this, you have a couple of options. One is to architect for this unbounded growth — know when to upgrade, and whether you can upgrade to the next best size. Another is to set up a regular deletion task — my former colleague Wander Hillen has written a great article on the topic.
For example, in the screenshot above, some tables shrink regularly — this is when the regular cleanup tasks run. You can see that the data accumulates until a certain point, but while the ingress of rows is fairly constant, sizes and row counts drop.
As long as there are these drops and the amount of data in your tables grows at a steady rate, your database won't surprise you with a sudden limit bust.
Sum Up: Avoid Data Bloat from Database Tables
In this post, we looked at how you can gain visibility over the growth of your database and keep your data bloat down.
Data bloat is a real risk for applications that become a success in the market. By setting up metrics for your database tables, you can better anticipate when to scale your database vertically and whether you need to install regular stale data cleanups. AppSignal gauges with tags, combined with a little SQL, can get you this data in a convenient and pleasant format.
Don't let your database surprise you!