Optimising Redshift tables (for storage)
In this post, I'll share some tried and tested techniques for optimising your Redshift tables, with a focus on storage.
5. (Bonus) Monitor performance
Redshift's documentation says that the vacuum command "Resorts rows and reclaims space in either a specified table or all tables in the current database." That's exactly right, and you really need to run Vacuums regularly.
When you update or delete rows in a Redshift table, you don't actually physically remove the old/deleted record until a 'vacuum' (or a 'vacuum delete only') command is run. Redshift tries to perform this operation automatically, but in my experience, this doesn't always work. I have seen tables with billions of undeleted records - chances are, Redshift's automated 'vacuum delete' fails to complete before new records are inserted/deleted/updated due to the sheer volume of the data being processed.
Run this command on your larger tables to see if they have many deleted rows waiting to be vacuumed (changing the parameters with your own schema and table name)
SELECT tbl_rows - (SELECT count(*) FROM public.users) FROM svv_table_info WHERE "schema" = '<YOUR SCHEMA>' AND "table" = '<YOUR TABLE>';
A further reason you want to run vacuum regularly is so that your table gets re-sorted. Any insertion or modification of records causes your table to become imperfectly sorted. The less well sorted a table is, the slower it will be to query it. This is super important for query optimisation, but this post is about storage, so I'll leave this bit for another post.
Lastly, I'd like to mention that VACUUM does take compute resource from your cluster, and it can take a long time to run on some bigger tables. That's why I'd recommend you schedule vacuum commands to run during off-peak times.
Encode. Every. Single. Column. (Apart from your sort-key columns, don't encode those.)
This is the single greatest advice I can give for reducing table storage sizes. Picking the correct compressions for your columns can render up to 70% reduction in size (and with Redshift, storage size equals cost).
There are many articles written around how to pick your best column encodings but I recommend starting simple - https://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
Redshift will handle tables with billions of records easily. That is until you need to change a dist/sort key or alter an encoding. At that point, you need to create a new table with the added modifications and move your data from the old table to the new one. And this isn't a pleasant operation at all, especially if you have tons of ELT running on your platform at the same time, and even more so if your table isn't optimised in the first place. Similar story with running full VACUUM commands on a really big table - it just takes forever and takes away a valuable compute resource.
The solution - partition your data into multiple tables. Take for example daily clickstream data from a website. We might be talking hundreds of millions of click records per day. The most optimal approach here would be to have one table per day's worth of data (e.g. clickstream_data_2019_01_01, then clickstream_data_2019_01_02 etc). This makes you a lot more flexible to:
1) schema changes over time,
2) ability to do changes on your tables without risk of all your commands timing out or taking forever and
3) vacuum more easily (you'd only need to vacuum once each daily table).
This partitioning does require you to join multiple tables together, however, you can easily create a VIEW that unions all the tables, reducing the complexity of your queries.
Note: if your daily data isn't that big, you can partition it monthly, or weekly, depending on what makes sense.
Do your reports and analytics really only query the last month of data in a table? Then just unload the rest of the data to S3! In general, there is no point keeping data in Redshift that you don't query very often - that's what data lakes are for. You can always bring it back into Redshift later or, preferably, use Redshift Spectrum to query it directly on S3.
5. (Bonus) Monitor performance
If you don't already, you should make sure to have these Redshift Admin views (https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminViews) set up on your Redshift cluster. They are very, very valuable for monitoring tables that need vacuum'ing, checking for transaction locks, and obtaining various other handy details about your cluster. In fact, I'd dedicate a single WLM slot to a process that periodically selects from some of these views and presents the results via a dashboard, raising an alert on various thresholds (e.g. a lock on my main reporting tables, a non-vacuumed table, etc.)
Sales spiel: Infinite Lambda helps you optimise your data warehouse while reducing your operational costs significantly. Not only can we help you save dozens of thousands of pounds on storage and compute costs, we can also offer you ongoing maintenance and data engineering support at very affordable rates. Reach out via the contact form below.