Postgres Makes Transactions Atomic, Hacker News

Atomicity (in the sense of “ACID”) states that for a series of operations performed against a database, either every one of them commits together, or they’re all rolled back; no in between states are allowed. For code that needs to be resilient to the messiness of the real world, it’s a godsend. ********

Instead of bugs that make it to production changing data and then leaving it permanently corrupt, those changes are reverted. The long tail of connections that are dropped midway from intermittent problems and other unexpected states while handling millions of requests might cause inconvenience, but won’t scramble your data.

Postgres’s implementation in particular is known to provide powerful transaction semantics with little overhead. And while I’ve used it for years, it’s never been something that I’ve understood. Postgres works reliably enough that I’ve been able to treat it as a black box – wonderfully useful, but with inner workings that are a mystery.

This article looks into how Postgres keeps the books on its transactions, how they’re committed atomically, and some concepts that are key to understanding it all.

Managing concurrent access

Say you build a simple database that reads and writes from an on-disk CSV file. When a single client comes in with a request, it opens the file, reads some information, and writes the changes back. Things are mostly working fine, but then one day you decide to enhance your database with a sophisticated new feature, multi-client support!

Unfortunately, the new implementation is immediately plagued by problems that seem to especially apparent when two clients are trying to access data around the same time. One opens the CSV file, reads, modifies, and writes some data, but that change is immediately clobbered by another client trying to do the same.

Data loss from contention between two clients.

This is a problem of concurrent access and it’s addressed by introducingconcurrency control. There are plenty of naive solutions. We could ensure that any process takes out an exclusive lock on a file before reading or writing it, or we could push all operations through a single flow control point so that they only run one at a time. Not only are these workarounds slow, but they won’t scale up to allow us to make our database fully ACID-compliant. Modern databases have a better way, MVCC (multi-version concurrency control).

Under MVCC, statements execute inside of atransaction, and instead of overwriting data directly, they create new versions of it. The original data is still available to other clients that might need it, and any new data stays hidden until the transaction commits. Clients are no longer in direct contention, and data stays safely persisted because they’re not overwriting each other’s changes.

When a transaction starts, it takes asnapshotthat captures the state of a database at that moment in time. Every transaction in the database is applied inserialorder, with a global lock ensuring that only one is being confirmed committed or aborted at a time. A snapshot is a perfect representation of the database’s state between two transactions.

To avoid the neverending accumulation of rows that have been deleted and hidden, databases will eventually remove obsolete data by way of avacuum (process) or in some cases, opportunistic “microvacuums” that happen in band with other queries), but they’ll only do so for information that’s no longer needed by open snapshots.

Postgres manages concurrent access with MVCC. Lets take a look at how it works.

Transactions, tuples, and snapshots

Here’s the data structure that Postgres uses to represent a transaction (fromproc.c):

typedef struct PGXACT {     TransactionId xid; / * id of top-level transaction currently being                           * executed by this proc, if running and XID                           * is assigned; else InvalidTransactionId * /      TransactionId xmin; / * minimal running XID as it was when we were                           * starting our xact, excluding LAZY VACUUM:                           * vacuum must not remove tuples deleted by                           * xid>=xmin! * /      ... } PGXACT;

Transactions are identified with axid(transaction, or “Xact” ID). As an optimization, Postgres will only assign a transaction axidif it starts to modify data because it’s only at that point where other processes need to start tracking its changes. Readonly transactions can execute happily without ever needing axid.

xminis always set immediately to the smallestxidof any transactions that are still running when this one starts. Vacuum processes calculate the minimum boundary of data that they need to keep by taking the minimum of thexmins of all active transactions

Lifetime-aware tuples

Rows of data in Postgres are often referred to astuples. While Postgres uses common lookup structures like B-trees to make retrievals fast, indexes don’t store a tuple’s full set of data or any of its visibility information. Instead, they store atid(tuple ID) that can be used to retrieve a row from physical storage, otherwise known as “the heap”. TheTIDgives Postgres a starting point where it can start scanning the heap until it finds a tuple that satisfies the current snapshot’s visibility. ********

Here’s the Postgres implementation for aheap tuple(as opposed to anindex tuplewhich is the structure found in an index), along with a few other structs that represent its header information (from (htup.h)andhtup_details.h):

typedef struct HeapTupleData {     uint 32 t_len; / * length of * t_data * /     ItemPointerData t_self; / * SelfItemPointer * /     Oid t_tableOid; / * table the tuple came from * /     HeapTupleHeader t_data; / * ->tuple header and data * / } HeapTupleData;  / * referenced by HeapTupleData * / struct HeapTupleHeaderData {     HeapTupleFields t_heap;      ... }  / * referenced by HeapTupleHeaderData * / typedef struct HeapTupleFields {     TransactionId t_xmin; / * inserting xact ID * /     TransactionId t_xmax; / * deleting or locking xact ID * /      ... } HeapTupleFields;

Like a transaction, a tuple tracks its ownxmin, except in the tuple’s case it’s recorded to represent the first transaction where the tuple becomes visible (i.e. the one that created it). It also tracksxmaxto be thelasttransaction where the tuple is visible (i.e. the one that deleted it).

A heap tuple’s lifetime being tracked with xmin and xmax.

xminand (xmax) are internal concepts, but they can be revealed as hidden columns on any Postgres table. Just select them explicitly by name:

# SELECT *, xmin, xmax FROM names;   id | name | xmin | xmax ----   ----------   -------   -------   1 Hyperion | 27926 | 27928   2 Endymion | 27927 | 0

Snapshots: xmin, xmax, and xip

Here’s the snapshot structure (from snapshot.h):

typedef struct SnapshotData {     / *      * The remaining fields are used only for MVCC snapshots, and are normally      * just zeroes in special snapshots. (But xmin and xmax are used      * specially by HeapTupleSatisfiesDirty.)      *      * An MVCC snapshot can never see the effects of XIDs>=xmax. It can see      * the effects of all older XIDs except those listed in the snapshot. xmin      * is stored as an optimization to avoid needing to search the XID arrays      * for most tuples.      * /     TransactionId xmin; / * all XID=xmax are invisible to me * /      / *      * For normal MVCC snapshot this contains the all xact IDs that are in      * progress, unless the snapshot was taken during recovery in which case      * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.      * it contains * committed * transactions between xmin and xmax.      *      * note: all ids in xip [] satisfy xmin

A snapshot’sxminis calculated the same way as a transaction’s (ie the lowestxidamongst running transactions when the snapshot is created), but for a different purpose. Thisxminis a lower boundary for data visibility. Tuples created by a transaction withxidare visible to the snapshot.

It also defines anxmax, which is set to the last committedxidplus one.Xmaxtracks the upper bound of visibility; transactions withxid>=xmaxare invisible to the snapshot.

Lastly, a snapshot defines* xip, an array of all of theXIDs of transactions that were in progress when the snapshot was created.* XIPis needed because even though there’s already a visibility boundary withxmin, there may still be some transactions that are already committed withXIDs greater than (xmin) , butalsogreater than aXIDof an in-progress transaction (so they couldn’t be included inxmin).

We want the results any committed transactions withxid>xminto be visible, but the results of any that were in flight hidden.* XIPstores the list of transactions that were active when the snapshot was created so that we can Tell which is which. ********

Transactions executing against a database and a snapshot capturing a moment in time.

Beginning a transaction

When you execute aBEGIN, Postgres puts some basic bookkeeping in place, but it will defer more expensive operations as long as it can. For example, the new transaction isn’t assigned axiduntil it starts modifying data to reduce the expense of tracking it elsewhere in the system.

The new transaction also won’t immediately get a snapshot. It will when it runs its first query, whereuponexec_simple_query(inpostgres.c) will push one onto a stack. Even a simpleSELECT 1;is enough to trigger it:

static void exec_simple_query (const char * query_string) {     ...      / *      * Set up a snapshot if parse analysis / planning will need one.      * /     if (analyze_requires_snapshot (parsetree))     {         PushActiveSnapshot (GetTransactionSnapshot ());         snapshot_set=true;     }      ... }

Creating the new snapshot is where the machinery really starts coming to life. Here’sGetSnapshotData(inprocarray.c):

Snapshot GetSnapshotData (Snapshot snapshot) {     / * xmax is always latestCompletedXid   1 * /     xmax=ShmemVariableCache->latestCompletedXid;     Assert (TransactionIdIsNormal (xmax));     TransactionIdAdvance (xmax);      ...      snapshot->xmax=xmax; }

This function does a lot of initialization, but like we talked about, some of its most important work is set to the snapshot’sxmin,xmax, and* xip. The easiest of these isxmax, which is retrieved from shared memory managed by the postmaster. Every transaction that commits notifies the postmaster that it did, andlatestCompletedXidwill be updated if thexidis higher than what it already holds. (more on this later).

Notice that it’s the function’s responsibility to add one to the lastxid. This isn’t quite as trivial as incrementing it because transaction IDs in Postgres are allowed to wrap. A transaction ID is defined as a simple unsigned 32 – bit integer (from ch):

typedef uint 32 TransactionId;

Even thoughxids are assigned only opportunistically (as mentioned above, reads don’t need one), a system doing a lot of throughput can easily hit the bounds of 32 bits, so the system needs to be able to wrap to “reset” thexidsequence as necessary. This is handled by some preprocessor magic (intransam.h):

# define InvalidTransactionId ((TransactionId) 0) #define BootstrapTransactionId ((TransactionId) 1) #define FrozenTransactionId ((TransactionId) 2) #define FirstNormalTransactionId ((TransactionId) 3)  ...  / * advance a transaction ID variable, handling wraparound correctly * / #define TransactionIdAdvance (dest)      do {         (dest)   ;          if ((dest)

The first few IDs are reserved as special identifiers, so we always skip those and start at3.

Back inGetSnapshotData, we getxminand (xip) by iterating over all running transactions (again, seeSnapshotsabove for an explanation of what these do):

/ *  * Spin over procArray checking xid, xmin, and subxids. The goal is  * to gather all active xids, find the lowest xmin, and try to record  * subxids.  * / for (index=0; indexxmin; / * fetch just once * /      / *      * If the transaction has no XID assigned, we can skip it; it      * won't have sub-XIDs either. If the XID is>=xmax, we can also      * skip it; such transactions will be treated as running anyway      * (and any sub-XIDs will also be>=xmax).      * /     if (! TransactionIdIsNormal (xid)         || ! NormalTransactionIdPrecedes (xid, xmax))         continue;      if (NormalTransactionIdPrecedes (xid, xmin))         xmin=xid;      / * Add XID to snapshot. * /     snapshot->xip [count  ]=xid;      ... }  ...  snapshot->xmin=xmin;

Committing a transaction

Transactions are committed throughCommitTransaction(inxact.c). This function is monstrously complex, but here are a few of its important parts:

static void CommitTransaction (void) {     ...      / *      * We need to mark our XIDs as committed in pg_xact. This is where we      * durably commit.      * /     latestXid=RecordTransactionCommit ();      / *      * Let others know about no transaction in progress by me. Note that this      * must be done _before_ releasing locks we hold and _after_      * RecordTransactionCommit.      * /     ProcArrayEndTransaction (MyProc, latestXid);      ... }

Durability and the WAL

Postgres is entirely designed around the idea of durability, which dictates that even in extreme events like a crash or power loss, committed transactions should stay committed. Like many good systems, it uses awrite-ahead log(WAL, or “xlog”) to achieve this durability. All changes are written and flushed to disk, and even in the event of a sudden termination, Postgres can replay what it finds in the WAL to recover any changes that didn’t make it into its data files.

RecordTransactionCommitfrom the snippet above handles getting a change in transaction state to the WAL:

static TransactionId RecordTransactionCommit (void) {     bool markXidCommitted=TransactionIdIsValid (xid);      / *      * If we haven't been assigned an XID yet, we neither can, nor do we want      * to write a COMMIT record.      * /     if (! markXidCommitted)     {         ...     } else {         XactLogCommitRecord (xactStopTimestamp,                             nchildren, children, nrels, rels,                             nmsgs, invalMessages,                             RelcacheInitFileInval, forceSyncCommit,                             MyXactFlags,                             InvalidTransactionId / * plain commit * /);          ....     }      if ((wrote_xlog && markXidCommitted &&          synchronous_commit>SYNCHRONOUS_COMMIT_OFF) ||         forceSyncCommit || nrels>0)     {         XLogFlush (XactLastRecEnd);          / *          * Now we may update the CLOG, if we wrote a COMMIT record above          * /         if (markXidCommitted)             TransactionIdCommitTree (xid, nchildren, children);     }      ... }

The commit log

Along with the WAL, Postgres also has acommit log(or “Clog” or “pg_xact”) which summarizes every transaction and whether it committed or aborted. This is whatTransactionIdCommitTreeis doing above – the bulk of the information is written out to WAL first, thenTransactionIdCommitTreegoes through and sets the transaction’s status in the commit log to “committed”. ********

Although the commit log is called a “log”, it’s really more of a bitmap of commit statuses split across a number of pages in shared memory and on disk. In an example of the kind of frugality rarely seen in modern programming, the status of a transaction can be recorded in only two bits, so we can store four transactions per byte, or 32, 768 in a standard 8k page. ********

Fromclog.handclog.c:

# define TRANSACTION_STATUS_IN_PROGRESS 0x 00 #define TRANSACTION_STATUS_COMMITTED 0x 01 #define TRANSACTION_STATUS_ABORTED 0x 02 #define TRANSACTION_STATUS_SUB_COMMITTED 0x 03  #define CLOG_BITS_PER_XACT 2 #define CLOG_XACTS_PER_BYTE 4 #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

All sizes of optimization

While durability is important, performance is also a value that’s core to the Postgres philosophy. If a transaction was never assigned axid, Postgres skips writing it to the WAL and commit log. If a transaction was aborted, we still write its aborted status to the WAL and commit log, but don’t bother to immediately flush (fsync) because even in the event of a crash, we wouldn’t lose any information. During crash recovery, Postgres would notice the unflagged transactions, and assume that they were aborted.

Defensive programming

TransactionIdCommitTree(intransam.c, and its implementationTransactionIdSetTreeStatusinclog.c) commits a “tree” because a commit may have subcommits. I won’t go into subcommits in any detail, but it’s worth nothing that becauseTransactionIdCommitTreecannot be guaranteed to be atomic, each subcommit is recorded as committed separately, and the parent is recorded as a final step. When Postgres is recovering after a crash, subcommit records aren’t considered to be committed (even if they’re marked as such) until the parent record is read and confirmed committed.

Once again this is in the name of atomicity; the system could have successfully recorded every subcommit, but then crashed before it could write the parent.

Here’s what that looks likeinclog.c(***********:

/ *  * Record the final state of transaction entries in the commit log for  * all entries on a single page. Atomic only on this page.  *  * Otherwise API is same as TransactionIdSetTreeStatus ()  * / static void TransactionIdSetPageStatus (TransactionId xid, int nsubxids,                            TransactionId * subxids, XidStatus status,                            XLogRecPtr lsn, int pageno) {     ...      LWLockAcquire (CLogControlLock, LW_EXCLUSIVE);      / *      * Set the main transaction id, if any.       *      * If we update more than one xid on this page while it is being written      * out, we might find that some of the bits go to disk and others don't.      * If we are updating commits on the page with the top-level xid that      * could break atomicity, so we subcommit the subxids first before we mark      * the top-level commit.      * /     if (TransactionIdIsValid (xid))     {         / * Subtransactions first, if needed ... * /         if (status==TRANSACTION_STATUS_COMMITTED)         {             for (i=0; ishared->page_number [slotno]==TransactionIdToPage (subxids [i]));                 TransactionIdSetStatusBit (subxids [i],                                           TRANSACTION_STATUS_SUB_COMMITTED,                                           lsn, slotno);             }         }          / * ... then the main transaction * /         TransactionIdSetStatusBit (xid, status, lsn, slotno);     }      ...      LWLockRelease (CLogControlLock); }

With the transaction recorded to commit log, it’s safe to signal its completion to the rest of the system. This happens in the second call inCommitTransactionabove (into procarray.c):

void ProcArrayEndTransaction (PGPROC * proc, TransactionId latestXid) {     / *      * We must lock ProcArrayLock while clearing our advertised XID, so      * that we do not exit the set of "running" transactions while someone      * else is taking a snapshot. See discussion in      * src / backend / access / transam / README.      * /     if (LWLockConditionalAcquire (ProcArrayLock, LW_EXCLUSIVE))     {         ProcArrayEndTransactionInternal (proc, pgxact, latestXid);         LWLockRelease (ProcArrayLock);     }      ... }  static inline void ProcArrayEndTransactionInternal (PGPROC * proc, PGXACT * pgxact,                                 TransactionId latestXid) {     ...      / * Also advance global latestCompletedXid while holding the lock * /     if (TransactionIdPrecedes (ShmemVariableCache->latestCompletedXid,                               latestXid))         ShmemVariableCache->latestCompletedXid=latestXid; }

You may be wondering what a “proc array” is. Unlike many other daemon-like services, Postgres uses a process forking model to handle concurrency instead of threading. When it accepts a new connection, the Postmaster forks a new backend (inpostmaster.c). Backends are represented by thePGPROC (structure)inproc.h), and the entire set of active processes is tracked in shared memory, thus “proc array”.

Now remember how when we created a snapshot we set itsXmaxtolatestCompletedXid 1? By settinglatestCompletedXidin global shared memory to the (xid) of the transaction that just committed, we’ve just made its results visible to every new snapshot that starts from this point forward across any backend.

Take a look at the lock acquisition and release calls on the lines withLWLockConditionalAcquireandLWLockRelease. Most of the time, Postgres is perfectly happy to let processes do work in parallel, but there are a few places where locks need to be acquired to avoid contention, and this is one of them. Near the beginning of this article we touched on how transactions in Postgres commit or abort in serial order, one at a time.ProcArrayEndTransactionacquires an exclusive lock so that it can updatelatestCompletedXidwithout having its work negated by another process.

Responding to the client

Throughout this entire process, a client has been waiting synchronously for its transaction to be confirmed. Part of the atomicity guarantee is that false positives where the databases signals a transaction as committed when it hasn’t been aren’t possible. Failures can happen in many places, but if there is one, the client finds out about it and has a chance to retry or otherwise address the problem.

Checking visibility

We covered earlier how visibility information is stored on heap tuples.heapgettup(in (heapam.c) ) is the method responsible for scanning the heap for tuples that meet a snapshot’s visibility criteria:

static void heapgettup (HeapScanDesc scan,            ScanDirection dir,            int nkeys,            ScanKey key) {     ...      / *      * advance the scan until we find a qualifying tuple or run out of stuff      * to scan      * /     lpp=PageGetItemId (dp, lineoff);     for (;;)     {         / *          * if current tuple qualifies, return it.          * /         valid=HeapTupleSatisfiesVisibility (tuple,                                              snapshot,                                              scan->rs_cbuf);          if (valid)         {             return;         }             lpp; / * move forward in this page's ItemId array * /            lineoff;     }      ... }

HeapTupleSatisfiesVisibilityis a preprocessor macro that will call into “satisfies” function likeHeapTupleSatisfiesMVCC(intqual.c):

bool HeapTupleSatisfiesMVCC (HeapTuple htup, Snapshot snapshot,                        Buffer buffer) {     ...      else if (XidInMVCCSnapshot (HeapTupleHeaderGetRawXmin (tuple), snapshot))         return false;     else if (TransactionIdDidCommit (HeapTupleHeaderGetRawXmin (tuple)))         SetHintBits (tuple, buffer, HEAP_XMIN_COMMITTED,                     HeapTupleHeaderGetRawXmin (tuple));      ...      / * xmax transaction committed * /      return false; }

XidInMVCCSnapshotdoes an initial check to see whether the tuple’sxidis visible according to the snapshot’sxmin,xmax, andxip. Here’s a simplified implementation that shows the checks on each (fromtqual.c):

static bool XidInMVCCSnapshot (TransactionId xid, Snapshot snapshot) {     / * Any xidxmin))         return false;     / * Any xid>=xmax is in-progress * /     if (TransactionIdFollowsOrEquals (xid, snapshot->xmax))         return true;      ...      for (i=0; ixcnt; i   )     {         if (TransactionIdEquals (xid, snapshot->xip [i]))             return true;     }      ... }

Note the function’s return value is inverted compared to how you’d think about it intuitively – afalsemeans that thexidisvisible to the snapshot. Although confusing, you can follow what it’s doing by comparing the return values to where it’s invoked.

After confirming that thexidis visible, Postgres checks its commit status withTransactionIdDidCommit(fromtransam.c):

bool / * true if given transaction committed * / TransactionIdDidCommit (TransactionId transactionId) {     XidStatus xidstatus;      xidstatus=TransactionLogFetch (transactionId);      / *      * If it's marked committed, it's committed.      * /     if (xidstatus==TRANSACTION_STATUS_COMMITTED)         return true;      ... }

Further exploring the implementation ofTransactionLogFetchwill reveal that it works as advertised. It calculates a location in the commit log from the given transaction ID and reaches into it to get that transaction’s commit status. Whether or not the transaction committed is used to help determine the tuple’s visibility.

The key here is that for purposes of consistency, the commit log is considered the canonical source for commit status (and by extension, visibility). The same information will be returned regardless of whether Postgres successfully committed a transaction hours ago, or seconds before a crash that the server is just now recovering from.

Hint bits

HeapTupleSatisfiesMVCCfrom above does one more thing before returning from a visibility check:

SetHintBits (tuple, buffer, HEAP_XMIN_COMMITTED,             HeapTupleHeaderGetRawXmin (tuple));

Checking the commit log to see whether a tuple’s (xmin) orXmaxtransactions are committed is an expensive operation. To avoid having to go to it every time, Postgres will set special commit status flags called “hint bits” for a heap tuple that is scanned. Subsequent operations can check the tuple’s hint bits and are saved a trip to the commit log themselves. ********

The box’s dark walls

When I run a transaction against a database:

BEGIN;  SELECT * FROM users WHERE email='[email protected]';  INSERT INTO users (email) VALUES ('[email protected]')     RETURNING *;  COMMIT;

I don’t stop to think about what’s going on. I’m given a powerful high level abstraction (in the form of SQL) which I know will work reliably, and as we’ve seen, Postgres does all the heavy lifting under the covers. Good software is a black box, and Postgres is an especially dark one (although with pleasantly accessible internals).

Thank you toPeter Geogheganfor patiently answering all my amateur questions about Postgres transactions and snapshots, and giving me some pointers for finding relevant code.

Article
How Postgres Makes Transactions Atomic

Published
(August) , 2017

Location
San Francisco

Find me on Twitter at@ Brandur

Please post comments and discussion toHacker News.

Did I make a mistake? Please considersending a pull request.