MVCC part 1 has been added

Sameer Rahmani 2019-04-28 18:51:24 +01:00
parent 9b14d4215b
commit 411fbda7af
2 changed files with 131 additions and 6 deletions

123
_drafts/mvcc.md Normal file

@ -0,0 +1,123 @@
---
layout: post
title: "Multi-Version Concurrency Control"
date: 2019-04-28
categories: DB
tags: concurrency
theme: dark
---
Multi-version concurrency control, or **MVCC** for short, is a well-known and commonly used concurrency
control method in [DBMS](https://en.wikipedia.org/wiki/Database#Database_management_system)s
and in some programming languages (for [Transactional Memory](https://en.wikipedia.org/wiki/Transactional_memory)).
Like lots of other concepts and algorithms in computer science, it is old (introduced in the 70s).

Before we start, I presume you are familiar with
[transaction processing](https://en.wikipedia.org/wiki/Transaction_processing). Also, as a heads-up: since
MVCC is a huge topic and far beyond a single blog post, I'll split it into several posts. In this post
we're going to have an overview of MVCC.

## What is Concurrency Control?

Concurrency control is the procedure that data-oriented systems, such as a DBMS or a programming language, use to manage
simultaneous operations so that they don't conflict with one another. Concurrent access is quite easy if all everyone
wants is to read data. In a read-only environment there is no way that read operations can interfere with one
another. But the purpose of every system in this world is to process some data and make changes to the world. Write
operations are an important part of each system, and concurrency control is all about handling simultaneous write
operations in a conflict-free way.

**MVCC** is one of the most popular and widely used concurrency control methods. For more on concurrency control,
check out [this Wikipedia page](https://en.wikipedia.org/wiki/Concurrency_control).

## MVCC

Under MVCC, the system (a DBMS or a programming language) maintains multiple physical versions of a single
logical object. An object here is anything under the control of the system, such as a tuple in a relational
DBMS or some data in memory managed by a programming language:

* When a transaction writes to an object, the system creates a new version of that object.
* When a transaction reads an object, it reads the newest version that existed when the transaction started.
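
The two rules above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular DBMS does it: the names `Version`, `VersionedStore`, `begin`, `write` and `read` are made up for this post, the clock is a simple counter, and every write is assumed to commit immediately.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Version:
    value: Any
    created_at: int  # logical timestamp at which this version became visible


@dataclass
class VersionedStore:
    """Toy store (hypothetical): each key maps to a list of versions, oldest first."""
    clock: int = 0
    versions: Dict[str, List[Version]] = field(default_factory=dict)

    def begin(self) -> int:
        """Start a transaction; its start timestamp is the current clock value."""
        return self.clock

    def write(self, key: str, value: Any) -> None:
        """A write never overwrites in place: it appends a new version."""
        self.clock += 1
        self.versions.setdefault(key, []).append(Version(value, self.clock))

    def read(self, key: str, start_ts: int) -> Optional[Any]:
        """Return the newest version that existed when the reader started."""
        for version in reversed(self.versions.get(key, [])):
            if version.created_at <= start_ts:
                return version.value
        return None


store = VersionedStore()
store.write("x", "v1")
reader_ts = store.begin()                  # the reader starts after v1 exists
store.write("x", "v2")                     # a later write creates a second version
old_view = store.read("x", reader_ts)      # the reader still sees v1
new_view = store.read("x", store.begin())  # a new reader sees v2
```

Both versions of `x` stay in the store, which is exactly what lets the older reader keep working with `v1`.
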
We'll see how MVCC works in a minute, but first let's discuss why we would use it.
There are lots of benefits to using MVCC as the concurrency control method, but the main ones
are:

* Writers don't block readers:
With MVCC, a write operation can be done in a way that does not block any reader, whereas under
[Two Phase Locking](https://en.wikipedia.org/wiki/Two-phase_locking) a reader has to wait for a
conflicting writer to release its lock.
* Lock-free read operations via consistent snapshots:
Read-only transactions don't have to acquire a lock anymore, because they are provided with a snapshot
of the current state of the system to operate on.
* Time-traveling operations:
Because the system stores all the versions of an object, we can easily operate on the version
an object had at a given time (as long as that version has not been garbage collected). For example,
in the case of a DBMS, we can run a query against the state of the database as it was two years ago.

MVCC is useful not just for concurrency control. It can shine when it comes to multi-version data control
as well.
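
To make the time-traveling idea a bit more concrete, here is a minimal Python sketch of an "as of" lookup. The `History` class and its `put` and `get_as_of` methods are hypothetical names used only for this illustration, and plain years stand in for real commit timestamps.

```python
import bisect
from typing import Any, Dict, List, Optional, Tuple


class History:
    """Hypothetical store that keeps every version of every key, ordered by timestamp."""

    def __init__(self) -> None:
        # key -> (sorted timestamps, values written at those timestamps)
        self._data: Dict[str, Tuple[List[int], List[Any]]] = {}

    def put(self, key: str, value: Any, ts: int) -> None:
        stamps, values = self._data.setdefault(key, ([], []))
        if stamps and ts <= stamps[-1]:
            raise ValueError("timestamps must be strictly increasing per key")
        stamps.append(ts)
        values.append(value)

    def get_as_of(self, key: str, ts: int) -> Optional[Any]:
        """Return the value the key had at time `ts`, or None if it did not exist yet."""
        stamps, values = self._data.get(key, ([], []))
        idx = bisect.bisect_right(stamps, ts)  # number of versions with stamp <= ts
        return values[idx - 1] if idx else None


history = History()
history.put("price", 100, ts=2017)  # years are used as timestamps for readability
history.put("price", 120, ts=2018)
history.put("price", 150, ts=2019)

two_years_ago = history.get_as_of("price", 2017)
before_any = history.get_as_of("price", 2016)
latest = history.get_as_of("price", 2019)
```

Since the versions are kept sorted by timestamp, a binary search is enough to find the one that was current at any point in the past.
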

## Snapshot Isolation (SI)

In order to understand how MVCC works, we first need to know about snapshot isolation (SI). MVCC and
SI have a two-way relationship. By two-way relationship I mean that in order to implement MVCC we need
to implement SI, and if we want to have SI in our system we need to have MVCC as well (does it make sense?).

Basically, when a transaction starts, the system provides the transaction with a consistent
snapshot of the current state of the system. By current, I mean the exact state of the system just before
the transaction started, and by consistent, I mean that the snapshot does not contain any uncommitted data
from a running transaction. So if at any given time transaction T1 is running and T2 is about to start,
the system does not include T1's changes in the snapshot that is going to be used for T2. Simple as that.
This way we never see torn writes from a running transaction (for example, a write operation that is
supposed to write two objects in the state but has written only the first one so far).

Another important rule here is that if two transactions want to update the same object, the first one
wins and the second one has to retry.

Snapshots can be physical or logical, depending on the system. For example, in a DBMS it does not make
sense to copy the database state for each transaction (a physical snapshot), because it would obviously be
huge. Instead, a DBMS uses logical snapshots that share the same physical data. But in a programming language,
it might be much faster to just take a physical snapshot of some data in memory than to pay the overhead
of the bookkeeping a logical snapshot needs.

It's important to bear in mind that SI is not serializable isolation by default: it still permits anomalies
such as write skew. If you need serializable isolation for the snapshots in your system, you have to take
care of some extra work.
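
Here is a rough, single-threaded Python sketch of the snapshot rules in this section. It is an illustration under assumptions, not a reference implementation: `Database`, `Transaction` and `ConflictError` are invented names, snapshots are logical (a start timestamp), and "the first one wins" is interpreted here as first-committer-wins.

```python
from typing import Any, Dict, List, Optional, Tuple


class ConflictError(Exception):
    """Raised when another transaction committed a write to the same key first."""


class Database:
    def __init__(self) -> None:
        self.commit_ts = 0
        # key -> list of (commit timestamp, value), oldest first
        self.versions: Dict[str, List[Tuple[int, Any]]] = {}

    def begin(self) -> "Transaction":
        return Transaction(self, self.commit_ts)


class Transaction:
    def __init__(self, db: Database, start_ts: int) -> None:
        self.db = db
        self.start_ts = start_ts          # defines the logical snapshot
        self.writes: Dict[str, Any] = {}  # private, uncommitted changes

    def read(self, key: str) -> Optional[Any]:
        if key in self.writes:  # a transaction sees its own writes
            return self.writes[key]
        for ts, value in reversed(self.db.versions.get(key, [])):
            if ts <= self.start_ts:  # only versions committed before we started
                return value
        return None

    def write(self, key: str, value: Any) -> None:
        self.writes[key] = value

    def commit(self) -> None:
        # Assumed rule: abort if any key we wrote has a newer committed version.
        for key in self.writes:
            chain = self.db.versions.get(key, [])
            if chain and chain[-1][0] > self.start_ts:
                raise ConflictError(key)
        self.db.commit_ts += 1
        for key, value in self.writes.items():
            self.db.versions.setdefault(key, []).append((self.db.commit_ts, value))


db = Database()
setup = db.begin()
setup.write("balance", 100)
setup.commit()

t1 = db.begin()
t2 = db.begin()
t1.write("balance", 150)
seen_by_t2 = t2.read("balance")        # T1 is uncommitted, so T2 does not see 150
t1.commit()
still_seen_by_t2 = t2.read("balance")  # T2's snapshot does not move after T1 commits
t2.write("balance", 90)
try:
    t2.commit()
    t2_outcome = "committed"
except ConflictError:
    t2_outcome = "retry"               # T1 committed first, so T2 has to retry
```

Note that the conflict check only looks at keys a transaction wrote, which is why this kind of scheme is not serializable on its own: two transactions that read the same data but write different keys will both commit.
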

## Design of MVCC

In order to implement MVCC in a system, we need to make design decisions about several aspects of the
system that are involved with MVCC. The most crucial aspects are:

* Bookkeeping of the data we need to store
* Concurrency control protocol
* Index Management
* Garbage Collection
* Storage

### Data bookkeeping

Depending on the concurrency control protocol we want to use, we have to manage some extra data about
every object in our system. In general, we need to keep track of the following information about each
object:

* Transaction ID (`TxID`)
* Lifetime of each object:
    * When the transaction that operates on this object began: `BEGIN-TS`
    * When the transaction that operates on this object ended: `END-TS`
* A link to the previous/next versions of the same object

We may also need some other information, depending on the protocol we use for concurrency control. It's crucial to
decide how to manage and store this data in your system, and that totally depends on the nature of
your system. Is it a disk-oriented, single-node database management system? Is it a programming
language operating in a single-threaded environment? Or maybe it's an in-memory, distributed
database management system?

Whatever it is, you have to keep in mind that computer science is about tradeoffs. There is no
ultimate answer. For example, storing this kind of data alongside the object itself can
increase your storage usage but can save you lots of computation time. It can be wise to do that
in a DBMS, but not in a programming language implementing software transactional memory (STM).
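
As a sketch of what this per-version metadata could look like when it is stored alongside the object, consider the Python below. Everything in it is an assumption made for illustration: the `VersionRecord` layout, the helper names, the choice to link only to the previous version, and the reading of `BEGIN-TS`/`END-TS` as the half-open interval during which a version is visible.

```python
from dataclasses import dataclass
from typing import Any, Optional

INFINITY = float("inf")  # END-TS of a version that has not been superseded yet


@dataclass
class VersionRecord:
    tx_id: int        # TxID: the transaction that created this version (assumed meaning)
    begin_ts: float   # BEGIN-TS: start of this version's lifetime
    end_ts: float     # END-TS: end of this version's lifetime
    value: Any
    prev: Optional["VersionRecord"] = None  # link to the previous (older) version

    def visible_at(self, ts: float) -> bool:
        return self.begin_ts <= ts < self.end_ts


def install_new_version(head: VersionRecord, value: Any, tx_id: int, ts: float) -> VersionRecord:
    """Close the lifetime of `head` at `ts` and return the new head of the chain."""
    head.end_ts = ts
    return VersionRecord(tx_id=tx_id, begin_ts=ts, end_ts=INFINITY, value=value, prev=head)


def find_visible(head: Optional[VersionRecord], ts: float) -> Optional[VersionRecord]:
    """Walk the chain from newest to oldest and return the version visible at `ts`."""
    node = head
    while node is not None:
        if node.visible_at(ts):
            return node
        node = node.prev
    return None


v1 = VersionRecord(tx_id=1, begin_ts=1, end_ts=INFINITY, value="A")
v2 = install_new_version(v1, "B", tx_id=2, ts=5)
at_3 = find_visible(v2, 3)   # falls inside v1's lifetime [1, 5)
at_5 = find_visible(v2, 5)   # falls inside v2's lifetime [5, inf)
at_0 = find_visible(v2, 0)   # before any version existed
```

The tradeoff mentioned above is visible here: every version carries four extra fields, but finding the right version for a reader is just a short walk along the chain.
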
### Concurrency control protocols for MVCC
* Multi-Version Timestamp Ordering (MVTO)
* Multi-Version Optimistic Concurrency Control (MVOCC)
* Multi-Version Two-Phase Locking (MV2PL)
* Serializable Snapshot Isolation (SSI)


@ -23,12 +23,14 @@ have to book 4 flights from, `C1 -> CA -> CB -> C2`. The process of booking each
is a transaction by itself and the whole process is a transaction too.
* Bulk updates
Let's say we want to update billion tuples. What if the very last tuple fails to update. Then we
need to revert our changes to a billion tuples.
Let's say we want to update billion tuples. What if the very last tuple fails to update and cause
the transaction to abort. Then we need to revert the changes made by the transaction and revert
a billion tuples which obviously is a huge task.
## Savepoints transactions
These transactions are similar to save point transaction but they have one extra thing which is
save points So any where in there transaction users case ask for a save point and again they can
## Transaction Savepoints
These transactions are similar to flat transaction with addition of one extra thing which is
save points. So any where in there transaction users case ask for a save point and again they can
rollback to a save point or rollback the entire transaction.
```sql
@ -38,7 +40,7 @@ BEGIN
SAVEPOINT 1
WRITE(B)
SAVEPOINT 2
ROLLBACK 2
ROLLBACK TO 1
COMMIT
```