layout: post
title: Multi-Version Concurrency Control
date: 2019-04-28
categories: DB
tags: concurrency
theme: dark

Multi-version concurrency control, or MVCC for short, is a well-known and widely used concurrency control method in DBMSs and in some programming languages (for transactional memory). Like many other concepts and algorithms in computer science, it is old: it was introduced in the 1970s.

Before we start, I assume you are familiar with transaction processing. Also, as a heads-up: since MVCC is a huge topic and far too big for a single blog post, I'll split it into several posts. In this post we're going to take an overview of MVCC.

What is Concurrency Control?

Concurrency control is the procedure that data-oriented systems, such as a DBMS or a programming language, use to manage simultaneous operations so that they do not conflict with one another. Concurrent access is quite easy if everyone just wants to read data: in a read-only environment there is no way for read operations to interfere with one another. But the purpose of every system in this world is to process some data and make changes to the world. Write operations are an important part of every system, and concurrency control is all about handling simultaneous write operations in a conflict-free way.

MVCC is one of the most popular and widely used concurrency control methods. For more on concurrency control, check out this Wikipedia page.

MVCC

In MVCC, the system (a DBMS or a programming language) maintains multiple physical versions of each logical object, where an object is anything under the system's control: a tuple in a relational DBMS, or some data in memory managed by a programming language. In short:

  • When a transaction writes to an object, the system creates a new version of that object.
  • When a transaction reads an object, it reads the newest version that existed when the transaction started.
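
The two rules above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `VersionedStore`, its logical clock, and the timestamp-per-write scheme are hypothetical names invented for illustration, and every write is treated as immediately committed.

```python
import itertools


class VersionedStore:
    """Toy multi-version store: a write never overwrites, it appends."""

    def __init__(self):
        self._clock = itertools.count(1)  # hypothetical logical clock
        self._versions = {}               # key -> [(write_ts, value), ...]

    def now(self):
        """Hand out the next logical timestamp."""
        return next(self._clock)

    def write(self, key, value):
        """Create a new version of the object instead of updating in place."""
        ts = self.now()
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, start_ts):
        """Return the newest version that existed at start_ts."""
        visible = [v for ts, v in self._versions.get(key, []) if ts <= start_ts]
        if not visible:
            raise KeyError(key)
        return visible[-1]  # versions are appended in timestamp order


store = VersionedStore()
store.write("x", "v1")
reader_start = store.now()           # a reader starts here
store.write("x", "v2")               # a later write creates a second version
old = store.read("x", reader_start)  # the reader still sees "v1"
new = store.read("x", store.now())   # a reader starting now sees "v2"
```

The reader that started before the second write keeps seeing `"v1"`, because its start timestamp is older than the version holding `"v2"`.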

We'll see how MVCC works in a minute, but first let's discuss why to use it.

There are lots of benefits to using MVCC as the concurrency control method, but the main ones are:

  • Writers don't block readers: With MVCC, a write creates a new version instead of overwriting the old one, so no reader is blocked by a write operation. This is not the case in two-phase locking, where a reader has to wait for a writer's lock.

  • Lock-free read operations via consistent snapshots: Read-only transactions don't have to acquire locks anymore, because they are provided with a snapshot of the current state of the system to operate on.

  • Time-travel operations: Because all versions of an object are stored in the system, we can easily operate on the version of an object that was current at a given time. For example, in a DBMS we can run a query against the state of the database from 2 years ago (as long as the old versions have not been garbage collected).
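
As a rough illustration of the time-travel benefit, here is a sketch of an "as of" lookup over the stored versions of one object. The `History` class, the `price` object, and the year-based timestamps are all made up for this example; a real system would use its own commit timestamps.

```python
import bisect


class History:
    """All versions of one logical object, ordered by commit timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted commit timestamps
        self._values = []      # value committed at the same index

    def put(self, ts, value):
        # Assumption for this sketch: versions arrive in timestamp order.
        if self._timestamps and ts <= self._timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self._timestamps.append(ts)
        self._values.append(value)

    def as_of(self, ts):
        """Return the value that was current at time ts, or None."""
        i = bisect.bisect_right(self._timestamps, ts)
        return self._values[i - 1] if i else None


price = History()
price.put(2017, 10)
price.put(2019, 15)
price.put(2021, 20)

two_years_earlier = price.as_of(2019)  # 15
between_versions = price.as_of(2020)   # still 15
before_any_version = price.as_of(2016) # None
```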

MVCC is useful not just for concurrency control. It can shine when it comes to multi-version data control as well.

Snapshot Isolation (SI)

To understand how MVCC works, we first need to know about snapshot isolation (SI). MVCC and SI have a two-way relationship: to implement MVCC we need to implement SI, and if we want SI in our system we need MVCC as well.

Basically, when a transaction starts, the system provides it with a consistent snapshot of the current state of the system. By current, I mean the exact state of the system just before the transaction started, and by consistent I mean that the snapshot does not contain any uncommitted data from a running transaction. So if transaction T1 is running when T2 is about to start, the system does not include T1's changes in the snapshot used for T2. Simple as that.
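
The T1/T2 scenario above can be sketched as follows. This is a toy model, not how any particular DBMS does it: the `Database` and `Transaction` classes are hypothetical, a snapshot is represented as the set of transaction IDs that had committed when the transaction began, and write conflicts are ignored entirely.

```python
class Database:
    def __init__(self):
        self.next_txid = 1
        self.committed = set()  # IDs of committed transactions
        self.versions = {}      # key -> [(creator_txid, value), ...] in write order

    def begin(self):
        # The snapshot freezes the set of transactions committed so far.
        tx = Transaction(self, self.next_txid, frozenset(self.committed))
        self.next_txid += 1
        return tx


class Transaction:
    def __init__(self, db, txid, snapshot):
        self.db, self.txid, self.snapshot = db, txid, snapshot

    def write(self, key, value):
        self.db.versions.setdefault(key, []).append((self.txid, value))

    def read(self, key):
        # Newest version written by us or by a transaction in our snapshot.
        for txid, value in reversed(self.db.versions.get(key, [])):
            if txid == self.txid or txid in self.snapshot:
                return value
        return None

    def commit(self):
        self.db.committed.add(self.txid)


db = Database()
setup = db.begin()
setup.write("balance", 100)
setup.commit()

t1 = db.begin()
t1.write("balance", 50)           # T1 is running, not yet committed
t2 = db.begin()                   # T2 starts while T1 is in flight
seen_before = t2.read("balance")  # 100: T1's change is not in T2's snapshot
t1.commit()
seen_after = t2.read("balance")   # still 100: the snapshot does not move
t3 = db.begin()
seen_by_t3 = t3.read("balance")   # 50: T3 started after T1 committed
```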

This way we never see torn writes from a running transaction (for example, when a write operation that is supposed to update two objects has so far written only the first one).

Also, the important rule here is that if two transactions want to update the same object, the first one wins and the second one has to retry.
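
One way to read this rule is "first committer wins": at commit time, a transaction aborts if someone else has committed a newer version of an object it wants to write. The sketch below assumes that interpretation. `Store`, `Tx`, and `ConflictError` are hypothetical names, and for brevity the store keeps only the latest committed version of each key.

```python
class ConflictError(Exception):
    """Raised when another transaction committed the same object first."""


class Store:
    def __init__(self):
        self.clock = 0  # logical commit counter
        self.data = {}  # key -> (commit_ts, value) of the latest committed version

    def begin(self):
        return Tx(self, self.clock)


class Tx:
    def __init__(self, store, start_ts):
        self.store, self.start_ts = store, start_ts
        self.writes = {}  # buffered until commit

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validate: nobody may have committed our keys after we started.
        for key in self.writes:
            commit_ts, _ = self.store.data.get(key, (0, None))
            if commit_ts > self.start_ts:
                raise ConflictError(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.data[key] = (self.store.clock, value)


store = Store()
a, b = store.begin(), store.begin()  # two concurrent transactions
a.write("x", 1)
b.write("x", 2)
a.commit()                           # the first one wins
try:
    b.commit()
    second_aborted = False
except ConflictError:
    second_aborted = True            # the second one has to retry

retry = store.begin()                # retry on a fresh snapshot
retry.write("x", 2)
retry.commit()
final_value = store.data["x"][1]     # 2
```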

Snapshots may be physical or logical, depending on the system. For example, in a DBMS it does not make sense to copy the database state for each transaction (a physical snapshot), because it would obviously be huge. Instead, a DBMS uses logical snapshots that share the same physical data. But in a programming language, it might be much faster to just take a physical snapshot of some data in memory than to pay the overhead of the bookkeeping needed for a logical snapshot.

It's important to bear in mind that SI is not serializable by default: it still permits anomalies such as write skew, where two transactions read overlapping data and then update different objects. If you need serializable isolation on top of snapshots, you have to do extra work to detect such cases.
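
A classic way to see that SI is weaker than serializable isolation is write skew. The sketch below is a toy model with hypothetical names (`SnapshotStore`, `SnapshotTx`, and the on-call scenario are invented for illustration). Two transactions each check that two people are on call and then take a different person off call. Their write sets do not overlap, so the write-write conflict check passes for both, yet the combined result is one that no serial order could produce.

```python
class SnapshotStore:
    def __init__(self, initial):
        self.clock = 0
        # key -> [(commit_ts, value), ...] in commit order
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return SnapshotTx(self, self.clock)


class SnapshotTx:
    def __init__(self, store, start_ts):
        self.store, self.start_ts, self.writes = store, start_ts, {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Newest version committed at or before our start timestamp.
        return [v for ts, v in self.store.versions[key] if ts <= self.start_ts][-1]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # SI only checks for write-write conflicts on the same object.
        for key in self.writes:
            if self.store.versions[key][-1][0] > self.start_ts:
                raise RuntimeError(f"write-write conflict on {key}")
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions[key].append((self.store.clock, value))


def go_off_call(tx, me):
    """Invariant the application wants: at least one person stays on call."""
    on_call = sum(tx.read(k) for k in ("alice_on_call", "bob_on_call"))
    if on_call >= 2:
        tx.write(me, False)


store = SnapshotStore({"alice_on_call": True, "bob_on_call": True})
t1, t2 = store.begin(), store.begin()  # both see two people on call
go_off_call(t1, "alice_on_call")
go_off_call(t2, "bob_on_call")
t1.commit()
t2.commit()                            # no conflict: different objects

final = store.begin()
nobody_on_call = not (final.read("alice_on_call") or final.read("bob_on_call"))
```

Run one after the other, the second transaction would have seen only one person on call and left the data alone. Under SI both commit and `nobody_on_call` ends up `True`.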

Design of MVCC

To implement MVCC in a system, we need to make design decisions about several aspects of the system that MVCC touches. The most crucial ones are:

  • Bookkeeping of the data we need to store
  • Concurrency control protocol
  • Index Management
  • Garbage Collection
  • Storage

Data bookkeeping

Depending on the concurrency control protocol we want to use, we have to manage some extra data about every object in our system. In general, we need to keep track of the following information about each object:

  • Transaction ID (TxID)
  • Lifetime of each object:
    • When the transaction that operates on this object began: BEGIN-TS
    • When the transaction that operates on this object ended: END-TS
  • A link to the previous/next version of the same object

We may need some other information as well, depending on the protocol we use for concurrency control. It's crucial to decide how to manage and store this data in your system, and that depends entirely on the nature of your system. Is it a disk-oriented, single-node database management system? Is it a programming language operating in a single-threaded environment? Or maybe it's an in-memory, distributed database management system?

Whatever it is, keep in mind that computer science is about tradeoffs; there is no ultimate answer. For example, storing this kind of data alongside the object itself can increase your storage usage but save you lots of computation time. That can be a wise choice in a DBMS but not in a programming language implementing STM.
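
To make the per-object metadata listed above concrete, here is one possible layout that stores it alongside the value. This is a sketch under assumptions, not a prescribed design: the `Version` record and the `install` and `visible_at` helpers are hypothetical, BEGIN-TS and END-TS are interpreted as the interval during which a version is the current one, and versions are chained newest to oldest.

```python
from dataclasses import dataclass
from typing import Any, Optional

INFINITY = float("inf")  # END-TS of a version that has not been superseded


@dataclass
class Version:
    value: Any
    txid: int                         # TxID of the transaction that created it
    begin_ts: float                   # BEGIN-TS
    end_ts: float = INFINITY          # END-TS
    prev: Optional["Version"] = None  # link to the previous (older) version


def install(head, value, txid, ts):
    """Close the current head version at ts and chain a new one in front."""
    if head is not None:
        head.end_ts = ts
    return Version(value, txid, ts, prev=head)


def visible_at(head, ts):
    """Walk the chain and return the version whose lifetime contains ts."""
    version = head
    while version is not None:
        if version.begin_ts <= ts < version.end_ts:
            return version
        version = version.prev
    return None


head = install(None, "a", txid=1, ts=10)
head = install(head, "b", txid=2, ts=20)

at_15 = visible_at(head, 15).value  # "a"
at_20 = visible_at(head, 20).value  # "b"
at_5 = visible_at(head, 5)          # None: nothing existed yet
```

Keeping the metadata inside each version costs extra space per version but makes the visibility check a simple chain walk, which is the kind of tradeoff described above.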

Concurrency control protocols for MVCC

  • Multi Version Timestamp Ordering (MVTO)
  • Multi Version Optimistic Concurrency Control (MVOCC)
  • Multi Version 2 Phase Locking (MV2PL)
  • Serializable Snapshot Isolation (SSI)