Slow database? It might not be your fault
<rant>
Okay, it usually is your fault. If you logged the SQL your ORM was generating, or saw how you are doing joins in code, or realised what that indexed UUID does to your insert rate etc youâd probably admit it was all your fault. And the fault of your tooling, of course.
In my experience, most databases are tiny. Tiny tiny.  Tables with a few thousand rows. If your web app is slow, its going to all be your fault. Stop building something webscale with microservices and just get things done right there in your database instead. Etc.
But, quite often, each company has one or two databases that have at least one or two large tables. Tables with tens of millions of rows.  I work on databases with billions of rows.  They exist.  And thatâs the kind of database where your database server is underserving you. There could well be a metric ton of actual performance improvements that your database is leaving on the table. Areas where your database server hasnât kept up with recent (as in the past 20 years) of regular improvements in how programs can work with the kernel, for example.
Over the years Iâve read some really promising papers that have speeded up databases. But as far as I can tell, nothing ever happens.  What is going on?
For example, your database might be slow just because its making a lot of syscalls. Back in 2010, experiments with syscall batching improved MySQL performance by 40% (and lots of other regular software by similar or better amounts!).  That was long before spectre patches made the costs of syscalls even higher.
So where are our batched syscalls? I canât see a downside to them.  Why isnât linux offering them and glib using them, and everyone benefiting from them? Itâll probably speed up your IDE and browser too.
Of course, your database might be slow just because you are using default settings. The historic defaults for MySQL were horrid.  Pretty much the first thing any innodb user had to do was go increase the size of buffers and pools and various incantations they find by googling. I havenât investigated, but Iâd guess that a lot of the performance claims Iâve heard about innodb on MySQL 8 is probably just sensible modern defaults.
I would hold tokudb up as being much better at the defaults. That took over half your RAM, and deliberately left the other half to the operating system buffer cache.
That mention of the buffer cache brings me to another area your database could improve. Historically, databases did âdirectâ IO with the disks, bypassing the operating system.  These days, that is a metric ton of complexity for very questionable benefit. Take tokudb again: that used normal buffered read writes to the file system and deliberately left the OS half the available RAM so the file system had somewhere to cache those pages. It didnât try and reimplement and outsmart the kernel.
This paid off handsomely for tokudb because they combined it with absolutely great compression. It completely blows the two kinds of innodb compression right out of the water.  Well, in my tests, tokudb completely blows innodb right out of the water, but then teams who adopted it had to live with its incomplete implementation e.g. minimal support for foreign keys. Things that have nothing to do with the storage, and only to do with how much integration boilerplate they wrote or didnât write.  (tokudb is being end-of-lifed by percona; donât use it for a new project đ)Â
However, even tokudb didnât take the next step: they didnât go to async IO. Iâve poked around with async IO, both for networking and the file system, and found it to be a major improvement. Think how quickly you could walk some tables by asking for pages breath-first and digging deeper as soon as the OS gets something back, rather than going through it depth-first and blocking, waiting for the next page to come back before you can proceed.
Iâve gone on enough about tokudb, which I admit I use extensively. Tokutek went the patent route (no, it didnât pay off for them) and Google released leveldb and Facebook adapted leveldb to become the MySQL MyRocks engine. Thatâs all history now.
In the actual storage engines themselves there have been lots of advances. Fractal Trees came along, then there was a SSTable+LSM renaissance, and just this week I heard about a fascinating paper on B+ + LSM beating SSTable+LSM. A user called Jules commented, wondered about B-epsilon trees instead of B+, and that got my brain going too. There are lots of things you can imagine an LSM tree using instead of SSTable at each level.
But how invested is MyRocks in SSTable? And will MyRocks ever close the performance gap between it and tokudb on the kind of workloads they are both good at?
Of course, what about Postgres? TimescaleDB is a really interesting fork based on Postgres that has a âhypertableâ approach under the hood, with a table made from a collection of smaller, individually compressed tables. In so many ways it sounds like tokudb, but with some extra finesse like storing the min/max values for columns in a segment uncompressed so the engine can check some constraints and often skip uncompressing a segment.
Timescaledb is interesting because its kind of merging the classic OLAP column-store with the classic OLTP row-store. I want to know if TimescaleDBâs hypertable compression works for things that arenât time-series too?  Iâm thinking âif we claim our invoice line items are time-series dataâŚâ
Compression in Postgres is a sore subject, as is out-of-tree storage engines generally. Saying the file system should do compression means nobody has big data in Postgres because which stable file system supports decent compression? Postgres really needs to have built-in compression and really needs to go embrace the storage engines approach rather than keeping all the cool new stuff as second class citizens.
Of course, I fight the query planner all the time. If, for example, you have a table partitioned by day and your query is for a time span that spans two or more partitions, then you probably get much faster results if you split that into n queries, each for a corresponding partition, and glue the results together client-side! There was even a proxy called ShardQuery that did that.  Its crazy.  When people are making proxies in PHP to rewrite queries like that, it means the database itself is leaving a massive amount of performance on the table.
And of course, the client library you use to access the database can come in for a lot of blame too. For example, when I profile my queries where I have lots of parameters, I find that the mysql jdbc drivers are generating a metric ton of garbage in their safe-string-split approach to prepared-query interpolation. It shouldnât be that my insert rate doubles when I do my hand-rolled string concatenation approach.  Oracle, stop generating garbage!
This doesnât begin to touch on the fancy cloud service you are using to host your DB. Youâll probably find that your laptop outperforms your average cloud DB server. Between all the spectre patches (I really donât want you to forget about the syscall-batching possibilities!) and how you have to mess around buying disk space to get IOPs and all kinds of nonsense, its likely that you really would be better off perforamnce-wise by leaving your dev laptop in a cabinet somewhere.
Crikey, what a lot of complaining! But if you hear about some promising progress in speeding up databases, remember it's not realistic to hope the databases you use will ever see any kind of benefit from it. The sad truth is, your database is still stuck in the 90s.  Async IO?  Huh no.  Compression?  Yeah right.  Syscalls?  Okay, thatâs a Linux failing, but still!
Right now my hopes are on TimescaleDB. I want to see how it copes with billions of rows of something that arenât technically time-series. That hybrid row and column approach just sounds so enticing.
Oh, and hopefully MyRocks2 might find something even better than SSTable for each tier?
But in the meantime, hopefully someone working on the Linux kernel will rediscover the batched syscalls idea� ;)








