Before jumping straightly into practical examples of row-level locks in Postgres, let's summarize which lock types Postgres supports:
- Table-level locks
- Row-level locks
- Page-level locks
- Advisory locks
- Deadlocks
Each of these types has most common use-cases where it can be used, but deadlocks. Deadlocks are always to avoid and any occurence of these points to serious design or implementation errors in the system. In this post, there is an intention to give explanation of row-level locks in practical examples.
What is a row-level lock and what is an example of using it
Row-level locks are set of locks which control concurent access to same rows in a single table. In Postgres, there are four types of these, each having a certain degree of mutual exclusivennes with other ones, meaning if you use one type of lock with another one, after the lock is acquired within the first transaction, the second transaction which is trying to acquire a lock will have to wait until the first one makes a commit or a rollback. These four types are:
- FOR UPDATE
- FOR NO KEY UPDATE
- FOR SHARE
- FOR KEY SHARE
There are multiple use-cases of using these family of locks, but I'll name one example just in order to give a sense behind using it:
Consider we have a cronjob which runs every 10 minutes. That cronjob has a task to read from a certain table and upload the data to some other data source (i.e Google BigQuery) for some Big Data analysis or Open Data. Therefore, any missing or duplicated data can be misleading to the consumers of uploaded data, causing serious business problems. The cronjob is designed in the way that it reads the latest data from the table based on checkpoint i.e when the cronjob runs, the table has 20 records at the moment of the run, so the cronjob will read these records and upload them and then set the checkpoint to 20, so in the future run it doesn't include already processed rows. The information about the checkpoint must be stored somewhere, so there is another table which keeps track of the latest processed row. Now, imagine there is a following situation:
- cronjob starts the execution, it reads the checkpoint and it turns out there are many new rows since the last run
- as there are many rows since the last run, the fetching of the data takes longer than 10 minutes i.e 15 minutes, so the checkpoint will be updated after 15 minutes, at the end of the run
- as cronjob runs every 10 minutes, another execution will start and now we have to cronjob running parallely
- the second cronjob will read the old checkpoint as it starts before the first one is finished, processing already processed data by the first cronjob
- as the result, this will result in duplicated data
Some cronjob systems do have a prevention of automatic run of more than one instance of the cronjob at the same time, but this will not prevent the somebody of manually triggering it (even multiple times) because of not knowing the design of it.
Certainly, we need safeguard around this and an elegant solution could be using row-level locks on the table which keeps track on checkpoints. When the first cronjob starts the execution, it will read a row in the checkpoint table and it will lock that row, releasing it when it ends the execution, so if the second instance of the cronjob start execution before the end of the first one, it will need to wait until the first one is finished, solving the possible duplication problem.
The question is now, which row-level lock from mentioned ones to use in postgres and the answer cannot be universal, it depends on many factors - are there any other parts of the system using the checkpoints or inserting the data into the table which is uploaded to the external data source, is the lock also needed for primary key etc.
Below are some examples of what happens in different scenarious with different row-level locks.
Examples
For the analysis of row-level locks, I'll be using Postgres extension called pgrowlocks. It will enable real-time analysis of row-level locks in the system.
create extension if not exists pgrowlocks;
The table and sample data used for examples:
create table table_1 (id serial primary key, name varchar not null);
insert into table_1 select i, 'name' || i from generate_series(1, 200) AS i;
At the end, clean up the table:
drop table table_1;
All examples can be found here.
First example
First transaction acquires a "for update" lock. The second one will be waiting for "for update" lock until it's
released. Even if only one row is locked from all requested ones, it must wait for the lock release.
Second example
Here we can see that the "pure" select statement in the second transaction can read the data without waiting
for the lock acquired in the first one.
Third example
Here we can see that if we try to acquire "for key share" lock after the first first transaction acquired "for update"
lock, as these two are mutually exclusive. From this we can conclude that different types of row-level locks can
be mutually exclusive to each other. The "for update" lock is the most aggressive one, it's mutually exclusive to
all row-level locks.
The table of mutual exlusivennes:
Fourth example
In this example we can see the same thing as in the third one, if we acquire firstly "for key share" in the first
transaction and we try to acquire the "for update" in the second one, the second one needs to wait for the lock
acquired by the first one, again, because of mutual exclusiveness between these two types of row-level locks.
Fifth example
In here the first transaction is acquiring "for key share" lock, the second one the "for share" lock. As they are
not mutual exclusive ones, both transactions can continue without blocking each other.
Sixth example
Whenever an insert operation happens and it inserts a foreign key, it needs to acquire "for key share" lock. This
means that if first transaction acquires a "for update" lock and a second executes an insert statement referencing
the primary key from the row which is locked inside of the first one, the insert statement will be blocked until
the first one is commited or rollbacked. This is crucial to understand, as one table can block another one (and so
completely unrelated task to the original one) just because of locks. Therefore, if primary key doesn't need to be
locked (if it will not be changed or a row deleted inside of a transaction), there is no reason to use "for update"
lock, but "for no key update". As shown in seventh example, if using this row-level lock, the insert statement
referencing the primary key will not be blocked.
Seventh example
How to check row-level locks in real-time
Tip
Use "select for update" if you want to:
- delete a row at some point later in transaction
- change primary or unique key on the row
The reason is that if we want to insert to another table which references the locked row, it will not be able to
acquire the lock. Use "for no key update" instead in other cases.
Keep in mind
- row-level locks are saved in the row metadata
- even if only taking a row-level lock and not doing any update or deletion of the row, the row metadata will be
changed, so it will be persisted anyways, causing slightly higher I/O. So, if using row-level locking extensively,
expect some increase in I/O numbers
Comments
Post a Comment