---
layout: post
title:  "Multi-Version Concurrency Control"
date:   2019-04-28
categories: DB
tags: concurrency
theme: dark
---

Multi-version concurrency control, or **MVCC** for short, is a well-known and widely used concurrency
control method in [DBMS](https://en.wikipedia.org/wiki/Database#Database_management_system)s
and in some programming languages (for [Transactional Memory](https://en.wikipedia.org/wiki/Transactional_memory)).
Like many other concepts and algorithms in computer science, it is old (introduced in the 1970s).

Before we start, I presume you are familiar with
[transaction processing](https://en.wikipedia.org/wiki/Transaction_processing). As a heads-up: since
MVCC is a huge topic, far too big for a single blog post, I'll split it into several posts. This post
is an overview of MVCC.

## What is Concurrency Control?
Concurrency control is how a data-oriented system, such as a DBMS or a programming language, manages
simultaneous operations so that they do not conflict with one another. Concurrent access is easy if everyone
only wants to read data: in a read-only environment, read operations cannot interfere with one
another. But the purpose of nearly every system is to process some data and make changes to the world. Write
operations are an important part of every system, and concurrency control is mostly about handling simultaneous
operations that involve writes in a conflict-free way.

**MVCC** is one of the most popular and widely used concurrency control methods. For more on concurrency control,
check out [this Wikipedia page](https://en.wikipedia.org/wiki/Concurrency_control).

## MVCC
Under MVCC, the system (a DBMS or a programming language) maintains multiple physical versions of a single
logical object. An object is anything under the system's control: a tuple in a relational DBMS, or some data
in memory managed by a programming language. Two rules follow:

* When a transaction writes to an object, the system creates a new version of that object.
* When a transaction reads an object, it reads the newest version that existed when the transaction started.
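
The two rules above can be sketched in a few lines of Python. This is only an illustration: `VersionedStore`, its logical clock and its method names are hypothetical and do not come from any real system, and commits are ignored entirely so that every write is immediately visible to later readers.

```python
import itertools


class VersionedStore:
    """Toy multi-version store: each key maps to a list of (timestamp, value) versions."""

    def __init__(self):
        self._clock = itertools.count(1)   # hypothetical logical clock
        self._versions = {}                # key -> [(write_ts, value), ...], ascending

    def now(self):
        """Hand out the next timestamp; a transaction calls this when it starts."""
        return next(self._clock)

    def write(self, key, value):
        """A write never overwrites in place; it appends a new version."""
        ts = self.now()
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, start_ts):
        """Return the newest version written at or before start_ts."""
        for write_ts, value in reversed(self._versions.get(key, [])):
            if write_ts <= start_ts:
                return value
        raise KeyError(key)


store = VersionedStore()
store.write("balance", 100)        # first version
reader_start = store.now()         # a reader starts here
store.write("balance", 250)        # second version, created after the reader started

old_view = store.read("balance", reader_start)   # the early reader still sees 100
new_view = store.read("balance", store.now())    # a later reader sees 250
```

The reader that started before the second write keeps seeing the old value without waiting on anything, which is the property the rest of this post builds on.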

We'll see how MVCC works in a minute, but first let's discuss why to use it.

MVCC has many benefits as a concurrency control method. The main ones
are:

* Writers don't block readers:
  With MVCC, a write creates a new version instead of overwriting the old one, so no reader is blocked by a
  concurrent write. Under [Two Phase Locking](https://en.wikipedia.org/wiki/Two-phase_locking), by contrast,
  a reader has to wait until the writer releases its lock.

* Lock-free read operations via consistent snapshots:
  Read-only transactions don't have to acquire locks anymore, because they are provided with a snapshot
  of the current state of the system to operate on.

* Time-travel operations:
  If the system keeps all the versions of an object, we can easily operate on the version
  of an object as of a given time. For example, in a DBMS we can run a query against the state
  of the database from two years ago.

MVCC is useful for more than concurrency control. It also shines when it comes to multi-version data control.

## Snapshot Isolation (SI)
To understand how MVCC works, we first need to know about snapshot isolation (SI). MVCC and
SI are closely tied: SI needs old versions of objects to serve snapshots, so in practice it is built on top of
MVCC, and SI is the isolation level that an MVCC system provides most naturally.

When a transaction starts, the system provides it with a consistent
snapshot of the current state of the system. By current, I mean the exact state of the system just before
the transaction started; by consistent, I mean that the snapshot does not contain any uncommitted data
from a running transaction. So if transaction T1 is running when T2 is about to start,
the system does not include T1's changes in the snapshot that T2 will use. Simple as that.

This way a transaction never sees torn writes from a running transaction (for example, a write operation that
is supposed to update two objects but has so far written only the first one).

Another important rule is that if two concurrent transactions update the same object, the first one
wins and the second one has to abort and retry.
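
Here is a minimal sketch of snapshot reads plus the conflict rule, assuming the "first committer wins" variant and a single-threaded toy with no real concurrency. The `Database`, `Transaction` and `ConflictError` names are made up for this post.

```python
class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class Database:
    def __init__(self):
        self.commit_ts = 0      # timestamp of the latest commit
        self.versions = {}      # key -> [(commit_ts, value), ...], ascending

    def begin(self):
        # The snapshot is identified by the latest commit timestamp at start.
        return Transaction(self, snapshot_ts=self.commit_ts)


class Transaction:
    def __init__(self, db, snapshot_ts):
        self.db = db
        self.snapshot_ts = snapshot_ts
        self.writes = {}        # private buffer, invisible to other transactions

    def read(self, key):
        if key in self.writes:                   # read your own writes
            return self.writes[key]
        for ts, value in reversed(self.db.versions.get(key, [])):
            if ts <= self.snapshot_ts:           # committed before we started
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First committer wins: abort if any key we wrote already has a
        # committed version newer than our snapshot.
        for key in self.writes:
            history = self.db.versions.get(key, [])
            if history and history[-1][0] > self.snapshot_ts:
                raise ConflictError(key)
        self.db.commit_ts += 1
        for key, value in self.writes.items():
            self.db.versions.setdefault(key, []).append((self.db.commit_ts, value))


db = Database()
setup = db.begin()
setup.write("x", 1)
setup.commit()

t1 = db.begin()
t2 = db.begin()
t1.write("x", 10)
seen_by_t2 = t2.read("x")    # T1's uncommitted write is not in T2's snapshot
t2.write("x", 20)
t1.commit()                  # T1 commits first and wins

try:
    t2.commit()
    t2_outcome = "committed"
except ConflictError:
    t2_outcome = "aborted"   # T2 has to retry on a fresh snapshot
```

Buffering writes privately until commit is just one way to keep uncommitted data out of other snapshots. Real systems often write new versions in place and tag them with the writer's transaction ID instead.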

Snapshots can be physical or logical, depending on the system. For example, in a DBMS it does not make
sense to copy the database state for each transaction (a physical snapshot), because it would be
huge. Instead, a DBMS uses logical snapshots that share the same physical data. But in a programming language,
it might be much faster to just take a physical snapshot of some data in memory than to pay the overhead
of the bookkeeping that a logical snapshot requires.

It's important to bear in mind that SI is not serializable. It permits anomalies such as write skew, so if you need
serializable isolation on top of snapshots, the system has to do extra work to detect or prevent them.
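
One well-known anomaly that SI permits is write skew: two transactions read overlapping data from their snapshots, then update different objects, so the write-write conflict check never fires. The sketch below reproduces it with a toy store. The `Store` and `Txn` classes and the on-call example data are hypothetical, and the toy assumes every key exists in the initial state.

```python
class Store:
    def __init__(self, initial):
        self.ts = 0
        # Every key starts with one committed version at timestamp 0.
        self.versions = {key: [(0, value)] for key, value in initial.items()}


class Txn:
    def __init__(self, store):
        self.store = store
        self.snapshot_ts = store.ts
        self.writes = {}

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        history = reversed(self.store.versions[key])
        return next(value for ts, value in history if ts <= self.snapshot_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # SI only checks write-write conflicts on the keys this transaction wrote.
        for key in self.writes:
            if self.store.versions[key][-1][0] > self.snapshot_ts:
                return False
        self.store.ts += 1
        for key, value in self.writes.items():
            self.store.versions[key].append((self.store.ts, value))
        return True


# Invariant the application wants: at least one doctor stays on call.
store = Store({"alice_on_call": True, "bob_on_call": True})
alice, bob = Txn(store), Txn(store)

# Each transaction checks the invariant on its own snapshot, then updates a different key.
if alice.read("bob_on_call"):
    alice.write("alice_on_call", False)
if bob.read("alice_on_call"):
    bob.write("bob_on_call", False)

alice_ok = alice.commit()    # no conflict: nobody else wrote alice_on_call
bob_ok = bob.commit()        # no conflict: nobody else wrote bob_on_call

final = Txn(store)
on_call = [final.read("alice_on_call"), final.read("bob_on_call")]
```

Both commits succeed and nobody is left on call, an outcome that no serial order of the two transactions could produce.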

## Design of MVCC
To implement MVCC in a system, we need to make design decisions about several aspects of the system
that MVCC touches. The most crucial ones are:

* Bookkeeping of the data we need to store
* Concurrency control protocol
* Index management
* Garbage collection
* Storage

### Data bookkeeping
Depending on the concurrency control protocol we want to use, we have to manage some extra data about
every object in our system. In general, we need to keep track of the following information for each
version of an object:

* Transaction ID (`TxID`)
* Lifetime of the version:
  * The timestamp from which the version is visible: `BEGIN-TS`
  * The timestamp from which the version is no longer visible: `END-TS`
* A link to the previous/next version of the same object
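
As a rough illustration, these fields can be modeled as a small record with a visibility check. The sketch assumes that `BEGIN-TS` and `END-TS` bound the interval in which a version is visible and that versions are chained from newest to oldest. The field and function names are my own, and `txid` is only carried along here because its exact use depends on the protocol.

```python
from dataclasses import dataclass
from typing import Any, Optional

INFINITY = float("inf")


@dataclass
class Version:
    value: Any
    txid: int = 0                       # protocol-specific; unused in this sketch
    begin_ts: float = 0                 # first timestamp at which the version is visible
    end_ts: float = INFINITY            # first timestamp at which it is no longer visible
    prev: Optional["Version"] = None    # link to the next-older version


def is_visible(version, ts):
    """A version is visible to timestamp ts when begin_ts <= ts < end_ts."""
    return version.begin_ts <= ts < version.end_ts


def install(head, value, commit_ts):
    """Close the current head at commit_ts and chain a new version in front of it."""
    head.end_ts = commit_ts
    return Version(value=value, begin_ts=commit_ts, prev=head)


def read_at(head, ts):
    """Walk the chain from newest to oldest until a visible version is found."""
    version = head
    while version is not None:
        if is_visible(version, ts):
            return version.value
        version = version.prev
    return None


head = Version(value="v1", begin_ts=1)
head = install(head, "v2", commit_ts=5)
head = install(head, "v3", commit_ts=9)

at_3 = read_at(head, 3)      # falls in v1's lifetime [1, 5)
at_5 = read_at(head, 5)      # falls in v2's lifetime [5, 9)
at_20 = read_at(head, 20)    # falls in v3's lifetime [9, infinity)
```

Chaining newest to oldest keeps reads of recent data cheap, at the cost of a longer walk for old snapshots. The opposite order is equally valid and trades those costs the other way.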

The protocol we choose for concurrency control may require some other information as well. It's crucial to
decide how to manage and store this data in your system, and that depends entirely on the nature of
your system. Is it a disk-oriented, single-node database management system? Is it a programming
language operating in a single-threaded environment? Or maybe it's an in-memory, distributed
database management system?

Whatever it is, keep in mind that computer science is about tradeoffs. There is no
ultimate answer. For example, storing this kind of data alongside the object itself
increases your storage usage but can save you a lot of computation time. It can be wise to do that
in a DBMS but not in a programming language implementing STM.

### Concurrency control protocols for MVCC

* Multi-Version Timestamp Ordering (MVTO)
* Multi-Version Optimistic Concurrency Control (MVOCC)
* Multi-Version Two-Phase Locking (MV2PL)
* Serializable Snapshot Isolation (SSI)