Managing Exceptions during Replication
The processes documented on this page apply to users of Spock extension version 4.0 and higher. You can use the pgEdge Platform UM module to upgrade your installed version of the Spock extension.
Spock extension version 4.0 (and later) includes functionality that allows you to write some exceptions to the exception log table. This functionality can prevent interruptions in replication, and allow you to gracefully handle replication errors at a time that best suits your schedule and the server traffic. The exception table also provides visibility into the history of replication errors, and provides enough information to later remediate the errors with tools like pgEdge ACE.
pgEdge Platform clusters running Spock extension 4.0 and later also include a mechanism that allows you to make changes on one node, while not replicating those changes to other nodes (a repair mode). You can use repair mode when fixing data anomalies on a single node, to ensure the modifications don’t propagate to other nodes in the cluster.
Protective Pause Mode

If you do not configure your cluster to log errors to the exception log table and the cluster encounters a replication error, the Spock extension enters a protective pause mode. During a protective pause, replication stops and the extension continues to retry the failing operation until the issue is resolved. No other transactions are replicated to other nodes while the conflict remains.
One example of a replication error that can induce a protective pause is an UPDATE of a row whose primary key value exists on only one node. The Spock extension attempts to update the row on the other nodes until it is successful, and does not replicate any operations that come after the update until the conflict is resolved. When the UPDATE is successful (whether the conflict is resolved automatically or manually with repair mode), the protective pause is lifted, the Spock extension self-heals, and replication continues.
Enabling Exception Logging
You can customize exception logging related to a failed INSERT, UPDATE, or DELETE statement for your pgEdge Platform cluster by setting the GUC values that follow. These GUCs allow you to specify behaviors at a granular level, applying the actions you choose to individual operations within a transaction rather than to the entire transaction. Note that changes to these GUCs require a server restart to take effect.
To simplify setting the values, you can use the pgEdge CLI db guc-set command:
./pgedge db guc-set guc_name guc_value
For example, to set spock.exception_behaviour to discard, navigate into the pgedge directory and invoke the command:
./pgedge db guc-set spock.exception_behaviour discard
The default values for these GUCs are transdiscard and all; by default, if any operation within a transaction returns an ERROR, all operations in that transaction are rolled back and recorded in spock.exception_log, regardless of whether each individual operation succeeded or failed.
spock.exception_behaviour

Use this GUC to specify the commit behavior of the pgEdge Platform when it encounters an ERROR within a transaction:

- transdiscard (the default) - Set spock.exception_behaviour to transdiscard to instruct the server to discard a transaction if any operation within that transaction returns an ERROR.
- discard - Set spock.exception_behaviour to discard to log failed operations and continue processing (and replicating) without interrupting server use. The server commits any operation (any portion of a transaction) that does not return an ERROR.
spock.exception_logging

Use this GUC to specify which operations/transactions are written to the exception log table:

- all (the default) - Set spock.exception_logging to all to instruct the server to record all transactions that contain one or more failed operations in the spock.exception_log table. Note that this setting logs all operations that are part of a transaction if one operation fails.
- discard - Adds a row to the spock.exception_log table for any discarded operation; successful transactions are not logged.
- none - Instructs the server to not log any operations or transactions to the exception log table.
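If you prefer to manage these settings with SQL rather than the pgEdge CLI, a standard PostgreSQL ALTER SYSTEM statement can set the same GUCs. This is a sketch that assumes superuser access; a server restart is still required before the new values take effect:

```sql
-- Persist both GUC values (written to postgresql.auto.conf):
ALTER SYSTEM SET spock.exception_behaviour = 'discard';
ALTER SYSTEM SET spock.exception_logging = 'discard';

-- After restarting the server, confirm the active values:
SHOW spock.exception_behaviour;
SHOW spock.exception_logging;
```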
The Table Structure
You can use information stored in the exception log table to identify operations or transactions that need to be repaired or rolled back. The exception log table contains the following information:
Column | Type |
---|---|
remote_origin | oid NOT NULL |
remote_commit_ts | timestamptz NOT NULL |
command_counter | integer NOT NULL |
remote_xid | bigint NOT NULL |
local_origin | oid |
local_commit_ts | timestamptz |
table_schema | text |
table_name | text |
operation | text |
local_tup | jsonb |
remote_old_tup | jsonb |
remote_new_tup | jsonb |
ddl_statement | text |
ddl_user | text |
error_message | text NOT NULL |
retry_errored_at | timestamptz NOT NULL |
PRIMARY KEY | Compound: remote_origin, remote_commit_ts, command_counter |
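The local_tup, remote_old_tup, and remote_new_tup columns each store a tuple as a jsonb array of objects with attname, atttype, and value keys. As a hedged sketch (standard PostgreSQL jsonb functions; no Spock-specific API), you can flatten remote_new_tup into a single column-to-value object to make entries easier to read when planning a repair:

```sql
-- Flatten each logged tuple into {"column": value, ...} form.
-- Rows where remote_new_tup is NULL (e.g., DDL entries) yield NULL.
SELECT table_schema,
       table_name,
       operation,
       error_message,
       (SELECT jsonb_object_agg(elem->>'attname', elem->'value')
          FROM jsonb_array_elements(remote_new_tup) AS elem) AS new_row
FROM spock.exception_log
ORDER BY retry_errored_at DESC;
```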
Using spock.repair_mode() when Repairing Transactions
Spock extension versions 4.0 (and later) include a function that you can invoke at the psql command line to place your pgEdge Platform cluster into repair mode while you modify data. The function is useful when fixing errors recorded in the exception log table, but can also be used in other situations where you need to interrupt replication.
Before you start repairing a transaction, set spock.repair_mode to true to interrupt replication to other nodes:
SELECT spock.repair_mode(true);
When you COMMIT or ROLLBACK a transaction, spock.repair_mode is automatically returned to false.
In the following example, the row that contains (3, Smith, sales) already exists on n2, but does not exist on n1. To remediate the issue, you can perform the following steps on n1 to add this row without replicating it to n2:
demo=# BEGIN;
BEGIN
demo=*# SELECT spock.repair_mode('True');
-[ RECORD 1 ]----------
repair_mode | 0/1C5AFD0
demo=*# INSERT INTO employeeid VALUES (3, 'Smith', 'sales');
INSERT 0 1
demo=*# COMMIT;
Logged Exception Types
The following examples demonstrate the types of replication errors that might be logged to the exception table. This is not a comprehensive list of logged errors.
Missing Primary Key Exception
In the following example:

- Node n1 has two records in our sample table (foo) with primary key values of 1 and 2.
- Node n2 has three records in our sample table (foo) with primary key values of 1, 2, and 3.
Both tables are added to replication, and then the row with a primary key value of 3 is updated on n2. This row is successfully updated on n2, but when the change is replicated to n1, a row with a primary key value of 3 is not found.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log table:
demo=# SELECT * FROM spock.exception_log;
-[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------
remote_origin | 26863
remote_commit_ts | 2024-07-02 19:10:04.317346+00
command_counter | 1
remote_xid | 756
local_origin |
local_commit_ts |
table_schema | public
table_name | foo
operation | UPDATE
local_tup |
remote_old_tup |
remote_new_tup | [{"value": 3, "attname": "a", "atttype": "int4"}, {"value": 2, "attname": "b", "atttype": "int4"}]
ddl_statement |
ddl_user |
error_message | logical replication did not find row to be updated in replication target relation (public.foo)
retry_errored_at | 2024-07-02 19:10:05.058789+00
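Using the values captured in remote_new_tup, one possible remediation is to recreate the missing row on n1 inside repair mode so the fix does not replicate back to n2. This is a sketch; the column names a and b and their values are taken from the sample foo table above — substitute your own table and the values from your exception log entry:

```sql
BEGIN;
-- Suspend replication of this transaction's changes to other nodes:
SELECT spock.repair_mode('true');
-- Recreate the row the UPDATE expected to find, using remote_new_tup values:
INSERT INTO public.foo (a, b) VALUES (3, 2);
COMMIT;  -- repair_mode reverts to false automatically
```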
Duplicate Primary Key Exception
In the following example:

- Node n1 has one record in our sample table (foo) with a primary key value of 1.
- Node n2 has two records in our sample table (foo) with primary key values of 1 and 2.
After adding both tables to the replication set, we insert a row with a primary key value of 2 into the foo table on n1. The row is successfully inserted on n1, but when replicated to n2, the addition results in a duplicate primary key error.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log table:
remote_origin | 49708
remote_commit_ts | 2024-07-02 19:07:09.602248+00
command_counter | 1
remote_xid | 747
local_origin |
local_commit_ts |
table_schema | public
table_name | foo
operation | INSERT
local_tup |
remote_old_tup |
remote_new_tup | [{"value": 2, "attname": "a", "atttype": "int4"}, {"value": 3, "attname": "b", "atttype": "int4"}]
ddl_statement |
ddl_user |
error_message | duplicate key value violates unique constraint "foo_pkey"
retry_errored_at | 2024-07-02 19:07:10.298847+00
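To remediate a duplicate key exception, first compare the existing local row with the values in remote_new_tup, then apply whichever version should win on n2 inside repair mode. The following is a sketch assuming the sample foo table, where the replicated values (a = 2, b = 3) are the ones you want to keep — verify the correct version for your data before updating:

```sql
-- On n2, inspect the existing row that blocked the replicated INSERT:
SELECT * FROM public.foo WHERE a = 2;

-- If the replicated values from remote_new_tup should win:
BEGIN;
SELECT spock.repair_mode('true');   -- keep this change local to n2
UPDATE public.foo SET b = 3 WHERE a = 2;
COMMIT;
```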
DDL Exception
In the following example:

- Node n1 has one row in our sample table (foo) with columns a, b, and c.
- Node n2 has one row in our sample table (foo) with columns a and b.
If we then enable Auto DDL (to replicate DDL statements through the default replication set) and drop column c from the table on n1, the ALTER TABLE statement is replicated to n2. Since the table (foo) on n2 does not have this column, the statement returns an ERROR: “column "c" of relation "foo" does not exist”.
If this error occurs while the exception behavior GUCs are set to their default values, the following entry is captured in the spock.exception_log table:
demo=# select * from spock.exception_log;
-[ RECORD 1 ]----+-------------------------------------------------------------------------
remote_origin | 49708
remote_commit_ts | 2024-07-02 17:01:32.837093+00
command_counter | 1
remote_xid | 744
local_origin |
local_commit_ts |
table_schema | spock
table_name | queue
operation | INSERT
local_tup |
remote_old_tup |
remote_new_tup | [{"value": "2024-07-02T17:01:32.83675+00:00", "attname": "queued_at", "atttype": "timestamptz"}, {"value": "pgedge", "attname": "role", "atttype": "name"}, {"value": ["default_insert_only"], "attname": "replication_sets", "atttype": "_text"}, {"value": "D", "attname": "message_type", "atttype": "char"}, {"value": "SET search_path TO \"$user\", public; alter table foo drop column c", "attname": "message", "atttype": "json"}]
ddl_statement |
ddl_user |
error_message |
retry_errored_at | 2024-07-02 17:01:33.609419+00
-[ RECORD 2 ]----+-------------------------------------------------------------------------
remote_commit_ts | 2024-07-02 17:01:32.837093+00
command_counter | 2
remote_xid | 744
local_origin |
local_commit_ts |
table_schema |
table_name |
operation | DDL
local_tup |
remote_old_tup |
remote_new_tup |
ddl_statement | SET search_path TO "$user", public; alter table foo drop column c
ddl_user | pgedge
error_message | column "c" of relation "foo" does not exist
retry_errored_at | 2024-07-02 17:01:33.610043+00
Exception Log Table Limitations
The following cases are not currently captured by the exception log table:
- A mismatched number of columns: If n1 has 3 columns in a table, while n2 has 2 columns in the same table, an INSERT into the table on n1 will cause an uncaptured exception on n2.
- Data anomalies resulting from columns with different data types.