Jun 25, 2024

Data Growth, Statistics, Indexes, and Keeping Our Dialer Humming

{frontmatter.author}
by Stefan Wójcik

A Brief Introduction to Our Dialer System

We handle a fairly large amount of calls at Close, helping our customers connect with their customers ~9 million times a month. The core service that we’ve built to manage the lifecycle of each call is called the Dialer. We’ve designed it pretty well, so for quite a few years – and a couple bumps in the order of magnitude of call volume – it carried on with only minuscule tweaks.

And then, one fateful day, it started becoming much slower, its database’s CPU almost pegged.

This is the story as old as web apps themselves, isn’t it? Your data keeps growing, you missed some indexes, and eventually the queries that used to be fast enough are downgraded to unacceptably slow.

Except that this didn’t seem to be the case here! Our RDS Performance Insights dashboard was indeed showing a lot of time spent cumulatively querying our main table, dialer_call, but each individual query used an index and we didn’t have any N+1 query issues either.

While the solution we eventually arrived at might seem obvious, it taught us a lot about how PostgreSQL works internally, so we thought we’d share our journey, not just the conclusion.

What’s Better Than an Index? An Even Better Index!

Our dialer_call table contains the latest information about each call leg that we’ve initiated with the help of our friends at Twilio. It covers things like when the call started ringing, when the callee picked up, when the call ended (aka terminated), which Close user was connected to that call, etc.

For every active Dialer session that’s going on at a given time (where a group of Close users dial together through a list of contacts), our system checks whether we have any calls which are currently ringing and don’t yet have a particular Close user assigned to them:

SELECT
    -- Tell me how many calls there are...
    count(dialer_call.id) as count_1
FROM
    dialer_call
WHERE
    -- ...belonging to a particular Dialer session...
    dialer_call.dialer_id = 'SOME_DIALER_UUID'::uuid
    -- ...where the call is ongoing / has not finished...
    AND dialer_call.date_terminated IS NULL
    -- ...and it hasn't been assigned to any user yet.
    AND dialer_call.date_assigned IS NULL
    AND dialer_call.dialer_user_id IS NULL;

We already had an index on date_terminated and confirmed it was being used… So why was this query showing up in our Performance Insights as taking the most time cumulatively? Well, with the rise of many simultaneous active Dialer sessions, we were running this query A LOT. While still decently fast – taking ~20-80ms – it wasn’t as fast as it could be. We’ve decided to make it even more selective with the following index:

CREATE INDEX CONCURRENTLY ix_dialer_call_unassigned_ringing_calls_count 
    ON dialer_call (dialer_id)
    WHERE (
        date_assigned IS NULL
        AND date_terminated IS NULL
        AND dialer_user_id IS NULL
    )

Takeaway from this chapter: Even if an operation is already pretty fast, make sure it’s as fast as it can be if it’s on the critical path.

Great, we created a new index, were about to proclaim victory, and then… The index had absolutely no effect on the performance of the query. What gives?!

Lies, Damned Lies, and (PostgreSQL) Statistics

Out of the box, PostgreSQL doesn’t offer a way to “force” a given index to be chosen for a given query (what many databases call “hints”). Although 3rd party modules like pg_hint_plan offer this capability – and are worth considering as a way to buy yourself some extra time during a crisis – they’re typically not needed. PostgreSQL’s query planner is really, really good at picking the right index for the right query.

To make the decision though, PG first has to gather various statistics about various characteristics of the data that exists in your table. Then, it uses these statistics to drive its decisions. You can look up what statistics currently exist for your table like so:

SELECT * FROM pg_stats WHERE tablename = 'your_table';

And you can control when statistics are gathered either by configuring PG to analyze your tables as part of automatic vacuuming (see config params like autovacuum_analyze_threshold and autovacuum_analyze_scale_factor) or manually by running:

ANALYZE VERBOSE your_table;

But, as we know all too well from various political & cultural spats around the world, statistics can lie! Or, to put it less dramatically, they may overlook some important details.

Takeaway from this chapter: It’s important to understand PostgreSQL statistics, what data is gathered, how it’s used, and how you can calibrate that process.

The Big Difference Between “Zero” and “Approximately Zero”

We have eventually discovered two problems with our dialer_call statistics.

First, the null_frac – i.e. the fraction of a given column’s entries that are null – for the date_terminated column was 0.0. This doesn’t actually mean exactly zero, but rather approximately zero. It’s as if Postgres said: “We’ve looked at a few rows in your table and found no nulls. This doesn’t mean there aren’t any, but we haven’t run across them during our sampled analysis.” Most of the calls in our database have already happened, so it makes sense that the calls which are still active – i.e. their date_terminated is null – is minuscule in comparison.

The problem with date_terminated’s null_frac being 0.0 is that PG’s query planner will pretty much never consider any other index than the already-existing-but-somewhat-imperfect (date_terminated). After all, if a query contains the date_terminated is null clause and there’s barely any values in the table with nulls in them, then the index should already be perfectly selective.

There’s a way to alleviate this temporarily by increasing the statistics target. That is, you tell PostgreSQL to look at more rows when gathering statistics. This is what we did, bumping the default of 100 to 10,000:

ALTER TABLE dialer_call ALTER date_terminated SET STATISTICS 10000;

This command makes it so that, instead of scanning a sample of 30,000 rows (the default value of 100 multiplied by a magical constant of 300 present here and explained here), our PostgreSQL will look at 3,000,000 Dialer calls to gather the updated statistics.

The more rows PG looks at, the likelier it is to find some calls that are currently ringing and thus have date_terminated set to null. Of course, it makes the ANALYZE call run longer and pull more data off the disk, but it’s a worthy tradeoff, at least temporarily.

As soon as we bumped our statistics target and re-ran the analysis of the table, the right index got chosen and our database took a well-deserved breather…

Until the next morning!

Takeaway from this chapter: When PG statistics overlook an important aspect of your data, consider bumping the statistics target as temporary relief.

Not Only “What”, But “When”

The next morning we woke up to a critical alert that our DB’s CPU is on fire yet again… But why?! We’ve bumped the statistics’ target, we ran ANALYZE, we gathered enough rows to find some ringing calls, so what gives?

The problem is, due to issues with some other tables & use cases, and to overall save the database from excessive load at peak times, we set up our recurring vacuuming & analysis of tables to run at 5am UTC, before our European & African customers start working and while our American customers are fast asleep.

The problem is – you guessed it! – almost nobody makes calls at that time, so even with the increased statistics target, PG found approximately zero rows with null date_terminated, and reverted back to using the suboptimal index.

We re-ran ANALYZE at a better time and things calmed down again, but it was clear we needed to prioritize a more permanent solution.

Takeaways from this chapter:

  • Sometimes saving your database from a bit of extra work during peak load is the wrong thing to do.
  • If you have tables in which the characteristics of the data vary by the time of day, think wisely about when / how you’re analyzing that table.

Don’t Let Cold Data Mix With Hot Data

The core issue from the get-go was our database schema design.

We were mixing together two very different datasets in the same table:

  • The “hot” dataset, i.e. the rows that represent currently active calls. They receive lots of updates & lookups as they progress through various states in the calling state machine. All / most of them have date_terminated set to null and they’re critical to the correct running of the Dialer.
  • The “cold” dataset, representing calls that have happened in the past. These were once active calls that we used to handle, but now they've reached their final state and lie dormant. All of them have date_terminated set to some timestamp in the past. These rows are really only used for debugging purposes. For presentation purposes in our app, we have another data structure to represent a call activity, stored in our MongoDB.

This Pareto type data distribution means trouble! Over time, your ever-growing "cold" dataset skews all of PostgreSQL’s statistics, affecting the “hot” dataset negatively.

This is the biggest takeaway from this whole adventure: Do not put the “hot” dataset and the “cold” dataset – which have different characteristics and use cases – in the same table. You can either partition the table or – if you don’t need the data past some expiration date – you can set up automatic purging of the old entries (leaving enough buffer for forensics).

As soon as we purged older data, PG started gathering the right statistics, using the right indexes, and overall humming under a much more manageable load.

Extra Tips

Some extra bits of advice that didn’t fit neatly into the narrative above:

  • Don’t underestimate a “small fraction” of dead rows. At one point during the investigation, we considered running an additional manual vacuum to remove some dead tuples from PG. At first we dismissed this because these dead rows represented a tiny fraction of the overall table size – less than 1-3%. BUT this was the important, “hot” 1%! That is, these dead rows corresponded to the calls that were actively receiving updates and were actively being queried. Even though their number was not huge, they still needed to be loaded and then discarded when queried. When we finally ran a vacuum, it helped a lot.
  • Don’t keep your transactions open while making external calls. Another thing we noticed is that in some places we kept the PG transactions open while making API calls to Twilio. Given that any request to an external service can take a while and even time out, keeping the transaction open during this time meant adding extra load to our database.
  • Speaking of transactions, just because you don’t delete data doesn’t mean you won’t get a lot of dead tuples. In PostgreSQL, an update is essentially a delete of an old tuple and an insert of new one. It has to work this way to support various transaction isolation levels through its multi-version concurrency control (MVCC) mechanism. Even though we weren’t deleting any dialer_call rows, we were generating quite a few dead tuples through all the updates to the calling state machine.