PostgreSQL Date Partitioning and a Stored Procedure
PostgreSQL | 2017-03-21 |
A decade ago, when I was a Django early adopter, I remember being very anti-ORM. I dug my heels in hard and used raw SQL for longer than I want to admit. It wasn't that the Django ORM (or other framework ORMs) were particularly hard to understand. They weren't even really that buggy. (Or were they? It was a long time ago, maybe I'm romanticizing a bit.)
The problem was that I'd spent the previous *mumblemumble* years learning to become an expert in relational databases. Not too many people know this, but at one time I was seriously thinking about pursuing MySQL certification and eventually becoming a DBA. I'm so glad I didn't go down that road, but at the time I didn't want all that study to be for nothing either, and I was convinced it would be if ORMs were to become the way of the future.
Fast-forward to 2017, and I'm working for a great company that gives me a lot of flexibility in terms of tools and even languages, and a lot of autonomy when it comes to designing my own part of a project. I'm currently working on an API that bridges a large set of performance data with a reporting UI. For the most part, I'm using Flask and SQLAlchemy on top of Postgres, and the writing of raw SQL has been minimal.
But of course, some requests are slow, as they are wont to be with a large dataset. I'll skip the rest of the boring details and just say that we decided to partition one of our tables by date. No one else on the team is a Postgres expert, so I volunteered, and needless to say it required knocking a lot of rust off. It took me about three days of Googling, Stack Overflowing, lazy-webbing, and testing a lot of variations to come up with this. I was really surprised that I didn't just find code out there that someone had already written - date partitioning seems like such a common need. Anyway, that's why I decided to post this here - I needed it, maybe someone else will.
In a nutshell, I wrote a trigger that would take any new insertions to the table and pass them off to a function; the function identifies and creates the appropriate partition and inserts the record to it. Using a stored procedure prevented us having to make any changes to the loader code that normally inserts records - any new inserts will automatically be handled by the trigger and function. Updating existing records was a snap too; I'll explain below. Let's take a look at the SQL and then I'll talk about what's in it:
-- PostgreSQL 9.6.1 -- write the function/trigger CREATE OR REPLACE FUNCTION metrics_partition_function() RETURNS TRIGGER AS $$ DECLARE year VARCHAR(4); DECLARE week VARCHAR(2); DECLARE partition_table VARCHAR(25); DECLARE name_index VARCHAR(25); DECLARE interval_days VARCHAR(1); DECLARE start_date VARCHAR(20); DECLARE end_date VARCHAR(20); BEGIN year := to_char(NEW.recorded_date::date, 'YYYY'); week := to_char(NEW.recorded_date::date, 'IW'); partition_table := 'metrics' || '_' || 'y' || year || 'w' || week; name_index := 'met_name_idx' || '_' || 'y' || year || 'w' || week; interval_days := to_char(NEW.recorded_date::date, 'ID'); start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-DD'); IF interval_days::int > 1 THEN interval_days := to_char(NEW.recorded_date::date - INTERVAL '1 day', 'ID'); start_date := to_char(NEW.recorded_date::date - (interval_days || ' days')::INTERVAL, 'YYYY-MM-DD'); END IF; end_date := to_char(start_date::date + INTERVAL '6 days', 'YYYY-MM-DD'); EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table || ' (CHECK (recorded_date::date BETWEEN ''' || start_date || '''::date AND ''' || end_date || '''::date)) INHERITS (metrics)'; EXECUTE 'CREATE INDEX IF NOT EXISTS ' || name_index || ' ON ' || partition_table || ' (name)'; EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)' USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date; RAISE NOTICE 'Inserted into %', partition_table; RETURN NULL; END; $$ LANGUAGE plpgsql; -- write the trigger to call the function CREATE TRIGGER metrics_partition_trigger BEFORE INSERT ON metrics FOR EACH ROW EXECUTE PROCEDURE metrics_partition_function();
- The first couple of lines of the function are perfunctory - what's notable is that every variable you use in the body of the function must be declared here. All of mine happen to be VARCHARs, but you can use any data type that's appropriate for your value.
-
We decided initially to partition by year/week, based on a date value in the record. I was able to use the to_char() formatting function and a few different date/time patterns to extract the year and week of a record's date (
NEW
represents the incoming record object). Those values are then concatenated to generate the name of the partition and the index, e.g.metrics_y2016w49
. New to me here was the use of double-pipes for concatenating strings. -
Creating date-based partitions meant setting a date range for each one, which meant I had to have a start and end date to pass in. The lines beginning with
interval_days
and ending withEND IF;
are where I calculated the start date. The patternID
returns an ISO 8601 day of the week number, starting with 1 for Monday, ending with 7 for Sunday. Thatinterval_days
var represents the number of days I would have to subtract to get the date at the beginning of the week. I had to do a little juggling, because if that date number comes back as a1
, you already have the start date for that week. And once thestart_date
is calculated correctly, theend_date
is easy - just add 6 days to get to the date at the end of the week. -
The next two
EXECUTE
lines are where the partition table and index are actually created. Note that this all assumes the parent table,metrics
, has already been created. I used more pipes for concatenating, cast thestart_date
andend_date
strings back to dates, and had to use triple single quotes to escape the single quotes that would ultimately need to be around the dates (don't ask me how long I stumbled before I finally figured *that* one out). -
The final
EXECUTE
line actually inserts the new record into the partition. This line passes in dynamic values from the record in a method similar to Python string formatting. - Finally, there's a trigger that calls the function whenever a new record is being inserted into the table.
I want to note that I ultimately did not use by-week partitions, but I like how I calculated the start_date
up there, so I decided to leave it in for this example. We're dealing with several years' worth of data, and the number of weekly partitions that generated actually made queries slower as during the planning phase Postgres had to iterate over that long list of partitions to determine which ones to ignore based on date range. Fewer tables, with more rows, ended up being a better solution for us. Oh, and I ditched the index creation in the function because it should have been inherited from the parent table.
-- by year/month with no indexes: CREATE OR REPLACE FUNCTION metrics_partition_function() RETURNS TRIGGER AS $$ DECLARE year VARCHAR(4); DECLARE month VARCHAR(2); DECLARE partition_table VARCHAR(25); DECLARE start_date VARCHAR(20); DECLARE end_date VARCHAR(20); BEGIN year := to_char(NEW.recorded_date::date, 'YYYY'); month := to_char(NEW.recorded_date::date, 'MM'); partition_table := 'metrics' || '_' || 'y' || year || 'm' || month; start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-01'); end_date := to_char(start_date::date + INTERVAL '1 month', 'YYYY-MM-DD'); EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table || ' (CHECK (recorded_date::date >= ''' || start_date || '''::date AND recorded_date::date < ''' || end_date || '''::date)) INHERITS (metrics)'; EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)' USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date; RAISE NOTICE 'Inserted into %', partition_table; RETURN NULL; END; $$ LANGUAGE plpgsql;
During testing, I used RAISE NOTICE
statements in the SQL, much as you might use print()
in Python. I'd make adjustments to the function, replace it, then insert a single record to test the output/insertion.
After you create your partitions, if you want to see what planning and execution times and costs look like when you query against them, you can use EXPLAIN
thusly:
SET constraint_exclusion = on; EXPLAIN (ANALYZE, TIMING on) SELECT * FROM my_table WHERE recorded_date::date BETWEEN '2016-12-01'::date AND '2016-12-25'::date;
Using this stored trigger/function combo meant not having to make any changes to the Python code that manages loading new data, but re-partitioning existing records in the database was also a fairly simple procedure.
This isn't a particularly risky process, but always create a backup just in case:
$ pg_dump --host={address} --port={maybe 5432?} --dbname={database name} --username={user} -f my_backup.sql
-- get a record count on the original table SELECT COUNT(*) FROM metrics; 1407383 -- write the function CREATE OR REPLACE FUNCTION metrics_partition_function() RETURNS TRIGGER AS $$ DECLARE year VARCHAR(4); DECLARE month VARCHAR(2); DECLARE partition_table VARCHAR(25); DECLARE start_date VARCHAR(20); DECLARE end_date VARCHAR(20); BEGIN year := to_char(NEW.recorded_date::date, 'YYYY'); month := to_char(NEW.recorded_date::date, 'MM'); partition_table := 'metrics' || '_' || 'y' || year || 'm' || month; start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-01'); end_date := to_char(start_date::date + INTERVAL '1 month', 'YYYY-MM-DD'); EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table || ' (CHECK (recorded_date::date >= ''' || start_date || '''::date AND recorded_date::date < ''' || end_date || '''::date)) INHERITS (metrics)'; EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)' USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date; RAISE NOTICE 'Inserted into %', partition_table; RETURN NULL; END; $$ LANGUAGE plpgsql; -- create the trigger CREATE TRIGGER metrics_partition_trigger BEFORE INSERT ON metrics FOR EACH ROW EXECUTE PROCEDURE metrics_partition_function(); -- copy the original `metrics` table to a backup CREATE TABLE metrics_backup AS TABLE metrics; -- get a record count on the backup metrics table SELECT COUNT(*) FROM metrics_backup; 1407383 -- truncate the original `metrics` table TRUNCATE TABLE metrics RESTART IDENTITY; -- re-insert to `metrics` from the backup table INSERT INTO metrics SELECT * FROM metrics_backup; -- get a record count on the original `metrics` table SELECT COUNT(*) FROM metrics; 1407383
If you're using psql
and run \d
, you should see a lovely list of partitions that have been created:
List of relations Schema | Name | Type | Owner --------+-----------------------------+----------+-------- public | metrics | table | myuser public | metrics_id_seq | sequence | myuser public | metrics_y2016m10 | table | myuser public | metrics_y2016m11 | table | myuser public | metrics_y2016m12 | table | myuser public | metrics_y2017m01 | table | myuser public | metrics_y2017m02 | table | myuser ...