Learn Databases: B+Tree internal working functionality

BTree internally working functionality
--------------------------------------------------
MySQL allows for different types of indexes to be used. Most common is a B-Tree index, although this name is also used for other types of indexes: T-Tree in MySQL Cluster and B+Tree in InnoDB. It is called as Balanced Tree and not Binary Tree.

Requirement:
- Linear search is very slow, complexity is O(n)
- Indexes improve search performance. But add extra cost to INSERT/UPDATE/DELETE
- InnoDB uses a B+Tree structure for its indexes.

B+Tree Characteristics:
- Every node can have p – 1 key values and p node pointers (p is called the order of the tree)
- The leaf node contains data, internal nodes are only used as guides
- The leaf nodes are connected together as doubly linked list
- Keys are stored in the nodes in sorted order
- All leaf nodes are at the same height, that’s why it’s called a balanced tree

Let’s look at how a B+Tree index is designed:

In the diagram above, we have a root node and four leaf nodes. Each entry in a leaf node has a pointer to a row related to this index entry. Leaf nodes are also connected together - each leaf node has a pointer to the beginning of another leaf node (brown square) - this allows for a quick range scan of the index. Index lookup is also pretty efficient. Let’s say we want to find data for entry ‘9’. Instead of scanning the whole table, we can use an index. A quick check in the root node and it’s clear we need to go to a leaf node which covers entries 5<= x < 11. So, we go to the second leaf node and we have the entry for ‘9’.

In the presented example, some of the nodes are not full yet. If we’d insert a row with indexed value of, let’s say 20, it can be inserted into the empty space in the third leaf node.

But what about a row with indexed value of ‘50’? Things are going to be a bit more complex as we do not have enough space in the fourth leaf node. It has to be split.

So, it wasn’t just a single write, i.e., adding an entry to a leaf node. Instead, a couple of additional write operations have to be performed. Please bear in mind that we are still on two levels of a B-Tree index, we have a full root node though. Another split will require adding an additional level of nodes which leads to more write operations. As you may have noticed, indexes make write operations much more expensive than they normally are - a lot of additional work goes into managing indexes. This is very important concept to keep in mind - more indexes is not always better as you are trading quicker selects for slower inserts.

Cost Calculations: - h is the height of the tree
- p is the branching factor of the tree
- n is the number of rows in a table
- p = (page size in bytes/key length in bytes) + 1
- h > log n / log p

Search cost for a single row:
- S = h I/O ops
Update cost for a single row:
- U = search cost + rewrite data page = h + 1 I/O ops
Insert cost for a single row:
- I = search cost + rewrite index page + rewrite data page
- I = h + 1 + 1 = h + 2 I/O ops
Delete cost for a single row:
- D = search cost + rewrite index page + rewrite data page
- D = h + 1 + 1 = h + 2 I/O ops

B+Tree levels and Tree Depth:
B+trees are normally structured in such a way that the size of a node is chosen according to the page size. Why? Because whenever data is accessed on disk, instead of reading a few bits, a whole page of data is read, because that is much cheaper.
Let us look at an example:
Consider InnoDB whose page size is 16KB and suppose we have an index on a integer column of size 4bytes, so a node can contain at most 16 * 1024 / 4 = 4096 keys, and a node can have at most 4097 children.
So for a B+tree of height 1, the root node has 4096 keys and the nodes at height 1 (the leaf nodes) have 4096 * 4097 = 16781312 key values.
This goes to show the effectiveness of a B+tree index, more than 16 million key values can be stored in a B+tree of height 1 and every key value can be accessed in exactly 2 lookups.

Points to Consider:
- Updates are in place only if the new data is of the same size, otherwise its delete plus insert
- Inserts may require splits if the leaf node is full
- Occasionally the split of a leaf node necessitates split of the next higher node
- In worst case scenarios the split may cascade all the way up to the root node
- Deletions may result in emptying a node that necessitates the consolidation of two nodes

Advantages:
- Reduced I/O
- Reduced Rebalancing
- Extremely efficient range scans
- Implicit sorting

Reduced I/O:
- Height of a B+Tree is very small (and has a very large branching factor)
- Generally every node in a tree corresponds to a page of data (page size ranges from 2(11) to 2(14) bytes)
- A node read = read a page = 1 random I/O
- So to reach leaf node, we need to read h pages
- No matter if requested row is at the start or end of table, same number of I/O is needed

Reduced Rebalancing:
- A tree needs rebalancing after an insertion or deletion
- B+Tree is wide, more keys can fit in node, so rebalancing needed few times on insertions and deletions
- Note that rebalancing means extra I/O, so rebalancing saved is I/O saved

Efficient Range Scans:
- Leaf nodes are linked together as doubly linked list
- So need to traverse from root -> leaf just once
- Move from leaf -> leaf until you reach the end of range
- Entire tree may be scanned without visiting the higher nodes at all

Implicit Sorting:
- Nodes contain keys sorted in key-order
- Therefore records can be implicitly returned in sorted order
- No external sorting needed hence memory and CPU cycles saved
- Sometimes sorted data cannot fit into buffer, and data needs to be sorted in passes, needing I/O, which can be avoided if you need data in key order

B+Tree Index in InnoDB:
- B+Tree Index in InnoDB is a typical B+Tree structure, no strings attached!
- Leaf nodes contain the data (what the data is depends whether it’s a Primary Index or a Secondary Index)
- Root nodes and internal nodes contain only key values

Primary and Secondary B+Tree Indexes:
- Primary index holds the entire row data in its leaf nodes
- Primary index can also be called a clustered index, because data is clustered around PK values
- A single PK per table means, a single clustered index per table
- Secondary Indexes have the key values and PK values in the index and no row data
- PK values stored in the leaf nodes of a secondary index act as pointer to the data
- This means secondary index lookups are two lookups
: Cost of secondary index lookup
: C = Height of Secondary Index B+Tree + Height of Primary Index B+Tree

Characteristics of an Ideal Primary Index:
- Create primary index on column(s) that are not updated too often
- Keep the size of the primary index as small as possible
- Select the column(s) to create primary index on, that have sequentially increasing value
- Random value columns, such as those that store UUID, are very bad candidates for primary index

Speeding up Secondary Indexes:
- Remember secondary indexes only store PK pointers meaning two index lookups
- Performance can be dramatically improved if we avoid extra PK lookups
- The trick is to include all the columns queried, in the definition of the secondary index

Secondary Index Optimization:
- Secondary index lookups are more expensive than primary key lookups, because any secondary index lookup only results in a tuple that can be used to navigate the primary key.
- So there’s a cost to secondary indexes in InnoDB. There’s an optimization too. Once a query navigates to the leaf node of a secondary index, it knows two things: the values it used to navigate the index, and the primary key values of that row in the table.

Example:
create table test(
   variety varchar(10) primary key,
   note varchar(50),
   price int,
   key(price)
) engine=InnoDB;
Query> SELECT variety from test where price=5;

The query takes the value 5 and navigates the price index. When it gets to the leaf node, it finds the value, which it can use to navigate the primary key. But why does it need to do that? It already has the value it was looking for. In fact, if the query only refers to values in the secondary and clustered index, it doesn’t need to leave the secondary index.

This is a fantastic optimization. It means each secondary index is like another table, clustered index-first. In this example, the secondary index is like a table containing just price and variety, clustered in that order.

Noticed many people have a tendency to write SELECT * FROM... queries. If you don’t need all the columns, don’t select all the columns, because it can make the difference between a fast and a slow query. If you only select the columns you need, your query might be able to use one of the optimizations.

Tips and Take Away:
- Indexing should always be used to speed up access
- Index trade-off analysis can be done easily using the cost estimation formulae discussed
- Select optimal data types for columns, especially ones that are to be indexed – int vs bigint
- When selecting columns for PK, select those that would make the PK short, sequential and with few updates
- Avoid using UUID style PK definitions
- Insert speed is best when you insert in PK order
- When creating index on string columns, you don’t need to index the entire column, you can index a prefix of the column – idx(str_col(4))
- B+Tree indexes are only suitable for columns with good selectivity
- Don’t shy away from creating composite indexes

Note:
The InnoDB storage engine creates a clustered index for every table. If the table has a primary key, that is the clustered index. If not, InnoDB internally assigns a six-byte unique ID to every row and uses that as the clustered index.

References:
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/
https://www.percona.com/files/presentations/percona-live/london-2011/PLUK2011-b-tree-indexes-and-innodb.pdf
http://severalnines.com/blog/become-mysql-dba-blog-series-database-indexing
http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.V7Lk5vl94lI

Learn Databases

Tuesday, August 16, 2016

B+Tree internal working functionality

No comments:

Post a Comment