2017-03-21
A decade ago, when I was a Django early adopter, I remember being very anti-ORM. I dug my heels in hard and used raw SQL for longer than I want to admit. It wasn't that the Django ORM (or other framework ORMs) were particularly hard to understand. They weren't even really that buggy. (Or were they? It was a long time ago, maybe I'm romanticizing a bit.)
The problem was that I'd spent the previous *mumblemumble* years learning to become an expert in relational databases. Not too many people know this, but at one time I was seriously thinking about pursuing MySQL certification and eventually becoming a DBA. I'm so glad I didn't go down that road, but at the time I didn't want all that study to be for nothing either, and I was convinced it would be if ORMs were to become the way of the future.
Fast-forward to 2017, and I'm working for a great company that gives me a lot of flexibility in terms of tools and even languages, and a lot of autonomy when it comes to designing my own part of a project. I'm currently working on an API that bridges a large set of performance data with a reporting UI. For the most part, I'm using Flask and SQLAlchemy on top of Postgres, and the writing of raw SQL has been minimal.
But of course, some requests are slow, as they are wont to be with a large dataset. I'll skip the rest of the boring details and just say that we decided to partition one of our tables by date. No one else on the team is a Postgres expert, so I volunteered, and needless to say it required knocking a lot of rust off. It took me about three days of Googling, Stack Overflowing, lazy-webbing, and testing a lot of variations to come up with this. I was really surprised that I didn't just find code out there that someone had already written - date partitioning seems like such a common need. Anyway, that's why I decided to post this here - I needed it, maybe someone else will.
In a nutshell, I wrote a trigger that takes any new insert on the table and passes it off to a function; the function identifies and creates the appropriate partition and inserts the record into it. Using a stored function meant we didn't have to make any changes to the loader code that normally inserts records - any new inserts are automatically handled by the trigger and function. Re-partitioning existing records was a snap too; I'll explain below. Let's take a look at the SQL and then I'll talk about what's in it:
```sql
-- PostgreSQL 9.6.1

-- write the function/trigger
CREATE OR REPLACE FUNCTION metrics_partition_function()
RETURNS TRIGGER AS $$
DECLARE
    year VARCHAR(4);
    week VARCHAR(2);
    partition_table VARCHAR(25);
    name_index VARCHAR(25);
    interval_days VARCHAR(1);
    start_date VARCHAR(20);
    end_date VARCHAR(20);
BEGIN
    -- pull the year and ISO week number out of the new record's date
    year := to_char(NEW.recorded_date::date, 'YYYY');
    week := to_char(NEW.recorded_date::date, 'IW');

    -- build the partition and index names, e.g. metrics_y2016w49
    partition_table := 'metrics' || '_' || 'y' || year || 'w' || week;
    name_index := 'met_name_idx' || '_' || 'y' || year || 'w' || week;

    -- work out the Monday that starts this record's week
    interval_days := to_char(NEW.recorded_date::date, 'ID');
    start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-DD');
    IF interval_days::int > 1 THEN
        interval_days := to_char(NEW.recorded_date::date - INTERVAL '1 day', 'ID');
        start_date := to_char(NEW.recorded_date::date - (interval_days || ' days')::INTERVAL, 'YYYY-MM-DD');
    END IF;
    end_date := to_char(start_date::date + INTERVAL '6 days', 'YYYY-MM-DD');

    -- create the partition and its index if they don't already exist
    EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table
        || ' (CHECK (recorded_date::date BETWEEN ''' || start_date || '''::date AND ''' || end_date
        || '''::date)) INHERITS (metrics)';
    EXECUTE 'CREATE INDEX IF NOT EXISTS ' || name_index || ' ON ' || partition_table || ' (name)';

    -- insert the new record into the partition rather than the parent table
    EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)'
        USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date;

    RAISE NOTICE 'Inserted into %', partition_table;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- write the trigger to call the function
CREATE TRIGGER metrics_partition_trigger
BEFORE INSERT ON metrics
FOR EACH ROW EXECUTE PROCEDURE metrics_partition_function();
```
The first few lines of the function pull the year and ISO week number out of the incoming record's date (`NEW` represents the incoming record object). Those values are then concatenated to generate the name of the partition and the index, e.g. `metrics_y2016w49`. New to me here was the use of double pipes (`||`) for concatenating strings.
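If you're curious what that concatenation actually produces, a quick standalone check in `psql` looks something like this (the date is just an arbitrary example):

```sql
-- Sanity-check the generated partition name for a sample date
SELECT 'metrics' || '_' || 'y' || to_char(DATE '2016-12-07', 'YYYY')
                 || 'w' || to_char(DATE '2016-12-07', 'IW') AS partition_table;
--  partition_table
-- ------------------
--  metrics_y2016w49
```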
The lines starting with `interval_days` and ending with `END IF;` are where I calculated the start date. The pattern `ID` returns an ISO 8601 day-of-the-week number, starting with 1 for Monday and ending with 7 for Sunday. The `interval_days` var represents the number of days I would have to subtract to get the date at the beginning of the week. I had to do a little juggling, because if that day number comes back as a `1`, you already have the start date for that week. And once the `start_date` is calculated correctly, the `end_date` is easy - just add 6 days to get to the date at the end of the week.
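If you want to convince yourself the week math works, here's a standalone check (the date is an example; subtracting `ID - 1` days is equivalent to the `IF`/`END IF` juggling above, and `date_trunc('week', ...)` gets you the same Monday directly):

```sql
-- 'ID' is the ISO day of the week: 1 = Monday ... 7 = Sunday.
-- Subtracting (ID - 1) days from any date should land on the Monday of that week.
SELECT to_char(DATE '2016-12-07', 'ID') AS iso_dow,
       DATE '2016-12-07' - (to_char(DATE '2016-12-07', 'ID')::int - 1) AS week_start,
       date_trunc('week', DATE '2016-12-07')::date AS week_start_builtin;
--  iso_dow | week_start | week_start_builtin
-- ---------+------------+--------------------
--  3       | 2016-12-05 | 2016-12-05
```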
The first two `EXECUTE` lines are where the partition table and index are actually created. Note that this all assumes the parent table, `metrics`, has already been created. I used more pipes for concatenating, cast the `start_date` and `end_date` strings back to dates, and had to use triple single quotes to escape the single quotes that ultimately need to wrap the dates (don't ask me how long I stumbled before I finally figured *that* one out).
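As an aside, Postgres's built-in `format()` function can handle that quoting for you - `%I` quotes an identifier and `%L` quotes a literal, including the surrounding single quotes. It's not what I used above, but here's a sketch of the same statement built that way (sample values filled in so you can run it on its own):

```sql
-- format() builds the dynamic SQL string without hand-escaped quotes;
-- inside the trigger function, this string would be handed to EXECUTE.
SELECT format(
    'CREATE TABLE IF NOT EXISTS %I (CHECK (recorded_date::date BETWEEN %L::date AND %L::date)) INHERITS (metrics)',
    'metrics_y2016w49', '2016-12-05', '2016-12-11'
);
```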
The last `EXECUTE` line actually inserts the new record into the partition. It passes in dynamic values from the record via the `USING` clause, in a method similar to Python string formatting.
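If it helps to see that mechanism in isolation, here's a minimal, self-contained example (the temp table and values are made up for the demo):

```sql
-- EXECUTE ... USING binds $1/$2 to the values after USING,
-- so there's no quoting or concatenation needed for the data itself.
DO $$
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS demo_metrics (name TEXT, count INT);
    EXECUTE 'INSERT INTO demo_metrics (name, count) VALUES ($1, $2)'
        USING 'requests', 42;
END;
$$;
```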
I want to note that I ultimately did not use by-week partitions, but I like how I calculated the `start_date` up there, so I decided to leave it in for this example. We're dealing with several years' worth of data, and the number of weekly partitions that generated actually made queries slower: during the planning phase, Postgres had to iterate over that long list of partitions to determine which ones to exclude based on the date range. Fewer tables, with more rows in each, ended up being a better solution for us. Oh, and I ditched the index creation in the function. (Worth noting: indexes are not actually inherited from the parent table by `INHERITS` children, so any partition that needs its own index has to have one created explicitly.)
```sql
-- by year/month with no indexes:
CREATE OR REPLACE FUNCTION metrics_partition_function()
RETURNS TRIGGER AS $$
DECLARE
    year VARCHAR(4);
    month VARCHAR(2);
    partition_table VARCHAR(25);
    start_date VARCHAR(20);
    end_date VARCHAR(20);
BEGIN
    -- build the partition name, e.g. metrics_y2016m12
    year := to_char(NEW.recorded_date::date, 'YYYY');
    month := to_char(NEW.recorded_date::date, 'MM');
    partition_table := 'metrics' || '_' || 'y' || year || 'm' || month;

    -- the partition covers the first of this month (inclusive) through the first of next month (exclusive)
    start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-01');
    end_date := to_char(start_date::date + INTERVAL '1 month', 'YYYY-MM-DD');

    EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table
        || ' (CHECK (recorded_date::date >= ''' || start_date || '''::date AND recorded_date::date < ''' || end_date
        || '''::date)) INHERITS (metrics)';

    EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)'
        USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date;

    RAISE NOTICE 'Inserted into %', partition_table;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```
During testing, I used `RAISE NOTICE` statements in the SQL, much as you might use `print()` in Python. I'd make adjustments to the function, replace it, then insert a single record to test the output/insertion.
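A single-record test might look like the following - the values are made up, and I'm assuming the column types you'd expect from the function above (`name` text, `res_id` and `count` integers, `recorded_date` a date). The `RAISE NOTICE` tells you which partition the row landed in:

```sql
-- One-row smoke test: the trigger should route this row to metrics_y2016m12
INSERT INTO metrics (name, res_id, count, recorded_date)
VALUES ('requests', 42, 100, '2016-12-07');
-- NOTICE:  Inserted into metrics_y2016m12

-- Confirm it went to the partition and not the parent table itself
SELECT COUNT(*) FROM ONLY metrics;      -- the new row is not in the parent
SELECT COUNT(*) FROM metrics_y2016m12;  -- it's here
```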
After you create your partitions, if you want to see what planning and execution times and costs look like when you query against them, you can use `EXPLAIN` thusly:
```sql
SET constraint_exclusion = on;

EXPLAIN (ANALYZE, TIMING on)
SELECT * FROM my_table
WHERE recorded_date::date BETWEEN '2016-12-01'::date AND '2016-12-25'::date;
```
Using this stored trigger/function combo meant not having to make any changes to the Python code that manages loading new data, and re-partitioning existing records in the database was also a fairly simple procedure.
This isn't a particularly risky process, but always create a backup just in case:
```bash
$ pg_dump --host={address} --port={maybe 5432?} --dbname={database name} --username={user} -f my_backup.sql
```
```sql
-- get a record count on the original table
SELECT COUNT(*) FROM metrics;
-- 1407383

-- write the function
CREATE OR REPLACE FUNCTION metrics_partition_function()
RETURNS TRIGGER AS $$
DECLARE
    year VARCHAR(4);
    month VARCHAR(2);
    partition_table VARCHAR(25);
    start_date VARCHAR(20);
    end_date VARCHAR(20);
BEGIN
    year := to_char(NEW.recorded_date::date, 'YYYY');
    month := to_char(NEW.recorded_date::date, 'MM');
    partition_table := 'metrics' || '_' || 'y' || year || 'm' || month;
    start_date := to_char(NEW.recorded_date::date, 'YYYY-MM-01');
    end_date := to_char(start_date::date + INTERVAL '1 month', 'YYYY-MM-DD');

    EXECUTE 'CREATE TABLE IF NOT EXISTS ' || partition_table
        || ' (CHECK (recorded_date::date >= ''' || start_date || '''::date AND recorded_date::date < ''' || end_date
        || '''::date)) INHERITS (metrics)';

    EXECUTE 'INSERT INTO ' || partition_table || '(name, res_id, count, recorded_date) VALUES ($1, $2, $3, $4)'
        USING NEW.name, NEW.res_id, NEW.count, NEW.recorded_date;

    RAISE NOTICE 'Inserted into %', partition_table;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- create the trigger
CREATE TRIGGER metrics_partition_trigger
BEFORE INSERT ON metrics
FOR EACH ROW EXECUTE PROCEDURE metrics_partition_function();

-- copy the original `metrics` table to a backup
CREATE TABLE metrics_backup AS TABLE metrics;

-- get a record count on the backup metrics table
SELECT COUNT(*) FROM metrics_backup;
-- 1407383

-- truncate the original `metrics` table
TRUNCATE TABLE metrics RESTART IDENTITY;

-- re-insert to `metrics` from the backup table
INSERT INTO metrics SELECT * FROM metrics_backup;

-- get a record count on the original `metrics` table
SELECT COUNT(*) FROM metrics;
-- 1407383
```
If you're using `psql` and run `\d`, you should see a lovely list of partitions that have been created:
```
                 List of relations
 Schema |            Name             |   Type   | Owner
--------+-----------------------------+----------+--------
 public | metrics                     | table    | myuser
 public | metrics_id_seq              | sequence | myuser
 public | metrics_y2016m10            | table    | myuser
 public | metrics_y2016m11            | table    | myuser
 public | metrics_y2016m12            | table    | myuser
 public | metrics_y2017m01            | table    | myuser
 public | metrics_y2017m02            | table    | myuser
 ...
```