riley's blog

Thoughts, baseball, engineering

Is Bi-Directional Replication (BDR) in Postgres Transactional?

Recently, in the #general channel of the Postgres Slack, I asked whether BDR conflict resolution applies at the transaction (TX) level or the table level. For those of you unfamiliar with BDR, it's short for bi-directional replication, and it enables multi-master setups using Postgres. It is particularly suited to geographically distributed applications that have to alter data but don't want to pay the speed-of-light latency penalty.

A contrived example is a large retail operation with brick-and-mortar stores in various places throughout the USA or the world. Store A might be in Los Angeles with Store B in New York. Employees in Store A shouldn't have to wait for the 70ms round-trip time to the other coast in order to complete a checkout (assuming the master database is in NY). This wait time is also compounded by larger transactions that span multiple tables. Imagine you had something like this:

BEGIN;

INSERT INTO orders (partnum, quantity) VALUES (1, 1);
UPDATE parts SET quantity = quantity - 1 WHERE partnum = 1;

COMMIT;

If these were coded in something like Java using individual Statements, you would be paying that 70ms cost four separate times, once per statement. Now imagine you have a store in Australia or Japan with the master database in New York.

Given that these transactions are unlikely to collide (stores operate at different hours, and shoppers are unlikely to buy the same part in different stores at the same time), this sets up nicely for multiple master databases that all synchronize with each other, and this is where BDR comes in.

BDR allows us to run multiple writable databases (up to 48) that all replicate local commits to each other as those commits occur. This opens the door to conflicts when different masters operate on the same row of data within the replication window. The BDR documentation lists many types of conflicts.

This leaves some ambiguity in the implementation. If, according to the docs, conflicts are resolved and conflict handlers are defined at the table level, does a conflict apply per table or per transaction? I would not want to get into a situation where I have a conflict on the parts table above and yet the INSERT INTO orders goes through just fine. I might have run out of stock on that part!

Due to this ambiguity, I set about testing it.

I followed the quick start guide at the BDR site to set up my BDR-enabled databases, then I did this on node1:

bdrdemo=# create table thingy (id INT, value TEXT, PRIMARY KEY(id));
CREATE TABLE
bdrdemo=# create table tx_log(id INT, msg TEXT, PRIMARY KEY(id));
CREATE TABLE
bdrdemo=# insert into thingy (id, value) VALUES (1, 'insert from node1');
INSERT 0 1

Then I checked that replication was working on node2:

bdrdemo=# select * from thingy ;
 id |       value
----+-------------------
  1 | insert from node1
(1 row)
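
Before moving on, it's worth a sanity check that both nodes actually joined the BDR group. This is my guess at the relevant catalog based on the BDR docs (bdr.bdr_nodes, with a node_status of 'r' meaning ready), so treat the column names as approximate:

-- Assumed BDR catalog table; on a healthy two-node setup this should show
-- both nodes with node_status = 'r' (ready).
SELECT node_name, node_status FROM bdr.bdr_nodes;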

Now we can test whether a conflict applies per table or per TX. Again, on node1:

bdrdemo=# begin;
BEGIN
bdrdemo=# update thingy set value='update from node1' where id=1;
UPDATE 1
bdrdemo=# insert into tx_log (id, msg) values (1, 'tx log insert from node1');
INSERT 0 1

Notice: no COMMIT yet. Now we do the same thing on node2:

bdrdemo=# begin;
BEGIN
bdrdemo=# update thingy set value='update from node2' where id=1;
UPDATE 1
bdrdemo=# insert into tx_log (id, msg) values (1, 'tx log insert from node2');
INSERT 0 1

Then I used a very useful iTerm2 feature to type COMMIT;<cr> into both terminal windows at the same time. While the two commits are not guaranteed to execute exactly simultaneously, they landed close enough together, within the replication window, to cause a conflict. node1's copy of the thingy table said “update from node1” while node2's claimed the update from its namesake. The update is not the part we care about, though. We want to ensure that the tx_log table contains only one of the two inserts, and specifically the insert that matches the winning update from the same node. That way we know the entire TX was either replicated or rejected as a unit.

bdrdemo=# select * from tx_log ;
 id |           msg
----+--------------------------
  1 | tx log insert from node2
(1 row)

bdrdemo=# select * from thingy ;
 id |       value
----+-------------------
  1 | update from node2
(1 row)

As you can see, only node2's update and insert applied. The database logs confirm this:

d= p=58952 a=LOG:  23000: CONFLICT: remote UPDATE on relation public.thingy originating at node 6235918664568746738:1:16384 at ts 2016-01-04 14:26:23.192016+00; row was previously updated at node 0:0. Resolution: last_update_wins_keep_local; PKEY: id[int4]:1 value[text]:update from node2
d= p=58952 a=LOCATION:  bdr_conflict_log_serverlog, bdr_conflict_logging.c:561
d= p=58952 a=LOG:  23000: CONFLICT: remote UPDATE on relation public.tx_log originating at node 6235918664568746738:1:16384 at ts 2016-01-04 14:26:23.192016+00; row was previously updated at node 0:0. Resolution: last_update_wins_keep_local; PKEY: id[int4]:1 msg[text]:tx log insert from node2
d= p=58952 a=LOCATION:  bdr_conflict_log_serverlog, bdr_conflict_logging.c:561

However, as you can see in the log, the insert used the same primary key (1) on both nodes. This caused an insert_insert conflict as shown in the bdr.bdr_conflict_history table:

bdrdemo=# select conflict_id, local_conflict_xid, object_name, remote_txid, conflict_type, local_tuple, remote_tuple, conflict_resolution from bdr.bdr_conflict_history where object_name = 'tx_log';
 conflict_id | local_conflict_xid | object_name | remote_txid | conflict_type |                local_tuple                |               remote_tuple                |     conflict_resolution
-------------+--------------------+-------------+-------------+---------------+-------------------------------------------+-------------------------------------------+-----------------------------
           3 |                774 | tx_log      |         757 | insert_insert | {"id":1,"msg":"tx log insert from node2"} | {"id":1,"msg":"tx log insert from node1"} | last_update_wins_keep_local
(1 row)
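
One configuration note: whether conflicts get recorded in bdr.bdr_conflict_history at all is governed by a setting. If the table comes up empty on your install, something like the following should enable it. I'm assuming the bdr.log_conflicts_to_table GUC described in the BDR documentation, so double-check the name against your version:

-- Assumed BDR setting; records conflicts in bdr.bdr_conflict_history
-- in addition to the server log.
ALTER SYSTEM SET bdr.log_conflicts_to_table = on;
SELECT pg_reload_conf();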

But what if we had used distinct PKs for the inserts? Let's try again, this time more correctly. On node1:

bdrdemo=# begin;
BEGIN
bdrdemo=# update thingy set value='update from node1' where id=1;
UPDATE 1
bdrdemo=# insert into tx_log (id, msg) values (3, 'node1');
INSERT 0 1
bdrdemo=# commit;
COMMIT

And node2:

bdrdemo=# begin;
BEGIN
bdrdemo=# update thingy set value='update from node2' where id=1;
UPDATE 1
bdrdemo=# insert into tx_log (id, msg) values (2, 'node2');
INSERT 0 1
bdrdemo=# commit;
COMMIT

Results:

bdrdemo=# select * from tx_log ;
 id |           msg
----+--------------------------
  1 | tx log insert from node2
  3 | node1
  2 | node2
(3 rows)

This is unexpected. There is still a conflict on the thingy table, but both tx_log inserts go through unmolested.

It appears that conflict resolution in BDR (which is built on logical decoding) operates at the table or tuple level, not at the transaction level. This makes keeping a consistent view of the data impossible, IMO. Imagine I had a bank account with a debit_credit table and an account_balance table. If I did the following on node1:

BEGIN;
INSERT INTO debit_credit (id, amount) VALUES (1, -100);
UPDATE account_balance SET balance = balance - 100 WHERE id=1;
COMMIT;

And at the same time on node2:

BEGIN;
INSERT INTO debit_credit (id, amount) VALUES (2, -200);
UPDATE account_balance SET balance = balance - 200 WHERE id=1;
COMMIT;

If node2 wins the race, I end up with an account_balance that is lower by only 200, yet I have debit_credit entries worth 300, and the data is no longer consistent. To be fair, you can implement a custom conflict handler, but to my knowledge you cannot invalidate an entire transaction. Since the math commutes here, you could take both account_balance changes and merge them into a single change of -300, but what if that creates an overdraw?
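
A quick way to see the resulting inconsistency (purely illustrative, using the tables from the example above) is to compare the stored balance against the ledger once replication settles:

-- Assuming the account started at some known opening balance, the stored
-- balance should equal opening balance + SUM(amount). After the conflict
-- above it will be off by 100, because only one UPDATE survived while
-- both INSERTs did.
SELECT ab.balance AS stored_balance,
       (SELECT SUM(amount) FROM debit_credit) AS total_debits
  FROM account_balance ab
 WHERE ab.id = 1;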

You might be able to get around these shortcomings by stamping the related rows with a shared GUID or version and creating a UNIQUE index on that column. When multiple inserts land in tx_log, one will be rejected by the unique constraint, and you can pair this with a conflict strategy that always chooses the remote value. In this scheme, if node1 sees no conflict, its update and insert apply locally as usual. When node2 gets the conflict on the thingy table, it always applies the remote value (node1's value), and the INSERT INTO tx_log also conflicts because of the unique constraint on the new version column, so we again apply the remote value. This forces node1 to win in both cases.
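
To make that idea concrete, here is a rough sketch of the schema I have in mind. It's untested, the column and index names are made up for illustration, and the "always pick the remote value" conflict handler itself is not shown:

-- Hypothetical version scheme (untested). Every transaction that touches a
-- thingy row bumps its version and writes a tx_log row carrying the same
-- version, so two conflicting transactions also collide in tx_log.
ALTER TABLE thingy ADD COLUMN version INT;
ALTER TABLE tx_log ADD COLUMN thingy_id INT, ADD COLUMN version INT;
UPDATE thingy SET version = 0 WHERE version IS NULL;

-- The unique index forces an insert_insert conflict in tx_log whenever two
-- nodes bump the same thingy row concurrently.
CREATE UNIQUE INDEX tx_log_thingy_version_idx ON tx_log (thingy_id, version);

BEGIN;
UPDATE thingy
   SET value = 'update from node1',
       version = version + 1
 WHERE id = 1;

INSERT INTO tx_log (id, msg, thingy_id, version)
SELECT 4, 'node1', id, version FROM thingy WHERE id = 1;
COMMIT;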

I think there might be a race condition even in this case, so more investigation is necessary.

The bottom line: do not use, or expect to use, BDR if you have strong transactional requirements for your data. It's better to think of BDR like other eventually consistent data stores, where conflicts are possible at the ROW level.