btrfs: The Classic B-Tree Paper (Repost)

March 16, 2019. Source: B树 (B-Tree)

B-Trees: Balanced Tree Data Structures

Tree structures support various basic dynamic set operations including Search, Predecessor, Successor, Minimum, Maximum, Insert, and Delete in time proportional to the height of the tree. Ideally, a tree will be balanced and the height will be log n where n is the number of nodes in the tree. To ensure that the height of the tree is as small as possible and therefore provide the best running time, a balanced tree structure like a red-black tree, AVL tree, or b-tree must be used.

When working with large sets of data, it is often not possible or desirable to maintain the entire structure in primary storage (RAM). Instead, a relatively small portion of the data structure is maintained in primary storage, and additional data is read from secondary storage as needed. Unfortunately, a magnetic disk, the most common form of secondary storage, is significantly slower than random access memory (RAM). In fact, the system often spends more time retrieving data than actually processing data.

B-trees are balanced trees that are optimized for situations when part or all of the tree must be maintained in secondary storage such as a magnetic disk. Since disk accesses are expensive (time consuming) operations, a b-tree tries to minimize the number of disk accesses. For example, a b-tree with a height of 2 and a branching factor of 1001 can store over one billion keys but requires at most two disk accesses to search for any node (Cormen 384).
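The "over one billion keys" figure can be checked with a few lines of arithmetic. The sketch below assumes every node is full (1001 children and therefore 1000 keys) and that height counts the levels below the root; the function name `max_keys` is made up for this illustration.

```python
def max_keys(height: int, branching: int) -> int:
    """Maximum keys in a tree of the given height when every node is full.

    A full node has `branching` children and `branching - 1` keys.
    Height counts edges below the root, so the root sits at depth 0.
    """
    keys_per_node = branching - 1
    nodes = sum(branching ** depth for depth in range(height + 1))
    return nodes * keys_per_node


print(max_keys(2, 1001))  # 1003003000
```

The root is normally kept in memory, which is why only the two lower levels cost a disk access.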

The Structure of B-Trees

Unlike a binary tree, each node of a b-tree may have a variable number of keys and children. The keys are stored in non-decreasing order. Each key has an associated child that is the root of a subtree containing all nodes with keys less than or equal to the key but greater than the preceding key. A node also has an additional rightmost child that is the root for a subtree containing all keys greater than any keys in the node.

A b-tree has a minimum number of allowable children for each node known as the minimization factor (Cormen et al. call it the minimum degree). If t is this minimization factor, every node must have at least t – 1 keys. Under certain circumstances, the root node is allowed to violate this property by having fewer than t – 1 keys. Every node may have at most 2t – 1 keys or, equivalently, 2t children.
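These key-count rules are easy to state as a predicate. The following is a minimal sketch, not part of the original algorithms; `node_is_valid` and its parameters are hypothetical names, and it assumes a leaf is represented by a child count of zero.

```python
def node_is_valid(num_keys: int, num_children: int, t: int, is_root: bool = False) -> bool:
    """Check the key-count rules for one node of a b-tree with minimum degree t.

    num_children is 0 for a leaf; an internal node must have exactly one
    more child than it has keys.
    """
    # The root may hold fewer than t - 1 keys (down to zero in an empty tree).
    min_keys = 0 if is_root else t - 1
    max_keys = 2 * t - 1
    if not (min_keys <= num_keys <= max_keys):
        return False
    return num_children == 0 or num_children == num_keys + 1


print(node_is_valid(2, 3, t=3))                 # True: t - 1 = 2 keys is the minimum
print(node_is_valid(1, 2, t=3))                 # False: too few keys for a non-root node
print(node_is_valid(1, 2, t=3, is_root=True))   # True: the root is exempt
```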

Since each node tends to have a large branching factor (a large number of children), it is typically necessary to traverse relatively few nodes before locating the desired key. If access to each node requires a disk access, then a b-tree will minimize the number of disk accesses required. The minimization factor is usually chosen so that the total size of each node corresponds to a multiple of the block size of the underlying storage device. This choice simplifies and optimizes disk access. Consequently, a b-tree is an ideal data structure for situations where all data cannot reside in primary storage and accesses to secondary storage are comparatively expensive (or time consuming).
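As a rough illustration of how the minimization factor might be derived from the block size, the sketch below assumes one node per block, fixed-size keys and child pointers, and an optional fixed header. All of these are assumptions made for the example, as is the function name `min_degree_for_block`.

```python
def min_degree_for_block(block_size: int, key_size: int, pointer_size: int,
                         header_size: int = 0) -> int:
    """Largest t such that a full node fits in one block.

    A full node holds 2t - 1 keys and 2t child pointers, so it needs
    header_size + (2t - 1) * key_size + 2t * pointer_size bytes.
    Solving that for t gives the integer division below.
    """
    t = (block_size - header_size + key_size) // (2 * (key_size + pointer_size))
    if t < 2:
        raise ValueError("block too small for a b-tree node with t >= 2")
    return t


# 4096-byte block, 8-byte keys, 8-byte child pointers, 16-byte node header.
print(min_degree_for_block(4096, 8, 8, 16))  # 127
```

With t = 127 a full node occupies 16 + 253 * 8 + 254 * 8 = 4072 bytes, while t = 128 would need 4104 bytes and spill into a second block.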

Height of B-Trees

For n greater than or equal to one, the height h of an n-key b-tree T with a minimum degree t greater than or equal to 2 satisfies

$$h \le \log_t \frac{n + 1}{2}$$

For a proof of the above inequality, refer to Cormen, Leiserson, and Rivest pages 383-384.

The worst case height is O(log n). Since the “branchiness” of a b-tree can be large compared to many other balanced tree structures, the base of the logarithm tends to be large; therefore, the number of nodes visited during a search tends to be smaller than required by other tree structures. Although this does not affect the asymptotic worst case height, b-trees tend to have smaller heights than other trees with the same asymptotic height.
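To see how much the base of the logarithm matters, the sketch below computes the worst-case height directly. It relies on the fact that a b-tree of height h with minimum degree t holds at least 2t^h – 1 keys, so h is the largest integer with 2t^h ≤ n + 1. The function name `max_height` is an illustrative choice.

```python
def max_height(n: int, t: int) -> int:
    """Largest height an n-key b-tree of minimum degree t can have.

    A tree of height h holds at least 2 * t**h - 1 keys, so h is the
    largest integer with 2 * t**h <= n + 1. Integer arithmetic avoids
    floating-point rounding in the logarithm.
    """
    if n < 1 or t < 2:
        raise ValueError("requires n >= 1 and t >= 2")
    h = 0
    while 2 * t ** (h + 1) <= n + 1:
        h += 1
    return h


print(max_height(10**9, 2))     # 28
print(max_height(10**9, 1001))  # 2
```

Both results are O(log n), but for one billion keys the tree with the larger minimum degree is at most 2 levels deep instead of 28.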

Operations on B-Trees

The algorithms for the search, create, and insert operations are shown below. Note that these algorithms are single pass; in other words, they do not traverse back up the tree. Since b-trees strive to minimize disk accesses and the nodes are usually stored on disk, this single-pass approach will reduce the number of node visits and thus the number of disk accesses. Simpler double-pass approaches that move back up the tree to fix violations are possible.

Since all nodes are assumed to be stored in secondary storage (disk) rather than primary storage (memory), all references to a given node must be preceded by a read operation denoted by Disk-Read. Similarly, once a node is modified and it is no longer needed, it must be written out to secondary storage with a write operation denoted by Disk-Write. The algorithms below assume that all nodes referenced in parameters have already had a corresponding Disk-Read operation. New nodes are created and assigned storage with the Allocate-Node call. The implementation details of the Disk-Read, Disk-Write, and Allocate-Node functions are operating system and implementation dependent.

B-Tree-Search(x, k)

    i <- 1
    while i <= n[x] and k > key_i[x]
        do i <- i + 1
    if i <= n[x] and k = key_i[x]
        then return (x, i)
    if leaf[x]
        then return NIL
        else Disk-Read(c_i[x])
             return B-Tree-Search(c_i[x], k)

The search operation on a b-tree is analogous to a search on a binary tree. Instead of choosing between a left and a right child as in a binary tree, a b-tree search must make an n-way choice. The correct child is chosen by performing a linear search of the values in the node. After finding the first value greater than or equal to the desired value, the child pointer to the immediate left of that value is followed. If all values are less than the desired value, the rightmost child pointer is followed. Of course, the search can be terminated as soon as the desired key is found. Since the number of nodes visited depends upon the height of the tree, B-Tree-Search performs O(log_t n) disk accesses; counting the linear scan within each node, its running time is O(t log_t n).
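The search procedure translates almost line for line into Python. This is a minimal in-memory sketch under stated assumptions: the `Node` class is invented for the example, indices are 0-based (the pseudocode is 1-based), and the point where a real implementation would issue Disk-Read is only marked with a comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    keys: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def btree_search(x: Node, k: int) -> Optional[Tuple[Node, int]]:
    """Return (node, index) locating key k, or None if k is absent."""
    i = 0
    # Linear scan for the first key that is >= k.
    while i < len(x.keys) and k > x.keys[i]:
        i += 1
    if i < len(x.keys) and k == x.keys[i]:
        return x, i
    if not x.children:
        return None
    # A disk-backed tree would call Disk-Read on the child here.
    return btree_search(x.children[i], k)


root = Node(
    keys=[10, 20],
    children=[Node([2, 5]), Node([12, 17]), Node([21, 30])],
)
node, index = btree_search(root, 21)
print(node.keys, index)          # [21, 30] 0
print(btree_search(root, 13))    # None
```

Because the keys within a node are sorted, the linear scan could be swapped for a binary search; that lowers the CPU cost per node but does not change the number of nodes visited.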

B-Tree-Create(T)

    x <- Allocate-Node()
    leaf[x] <- TRUE
    n[x] <- 0
    Disk-Write(x)
    root[T] <- x

The B-Tree-Create operation creates an empty b-tree by allocating a new root node that has no keys and is a leaf node. Only the root node is permitted to have these properties; all other nodes must meet the criteria outlined previously. The B-Tree-Create operation runs in time O(1).

B-Tree-Split-Child(x, i, y)

    z <- Allocate-Node()
    leaf[z] <- leaf[y]
    n[z] <- t - 1
    for j <- 1 to t - 1
        do key_j[z] <- key_{j+t}[y]
    if not leaf[y]
        then for j <- 1 to t
                 do c_j[z] <- c_{j+t}[y]
    n[y] <- t - 1
    for j <- n[x] + 1 downto i + 1
        do c_{j+1}[x] <- c_j[x]
    c_{i+1}[x] <- z
    for j <- n[x] downto i
        do key_{j+1}[x] <- key_j[x]
    key_i[x] <- key_t[y]
    n[x] <- n[x] + 1
    Disk-Write(y)
    Disk-Write(z)
    Disk-Write(x)

If a node becomes “too full,” it is necessary to perform a split operation. The split operation moves the median key of node y into its parent x, where y is the i-th child of x. A new node, z, is allocated, and all keys in y right of the median key are moved to z. The keys left of the median key remain in the original node y. The new node, z, becomes the child immediately to the right of the median key that was moved to the parent x, and the original node, y, becomes the child immediately to the left of the median key that was moved into the parent x.

The split operation transforms a full node with 2t – 1 keys into two nodes with t – 1 keys each. Note that one key is moved into the parent node. The B-Tree-Split-Child algorithm will run in time O(t) where t is constant.
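A split is compact to express with list slicing. The sketch below is an in-memory illustration only: the `Node` class and the function `split_child` are assumptions for the example, indices are 0-based, the full child is looked up from the parent instead of being passed in, and the Disk-Write calls are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    keys: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def split_child(x: Node, i: int, t: int) -> None:
    """Split the full child x.children[i] (2t - 1 keys) around its median key."""
    y = x.children[i]
    assert len(y.keys) == 2 * t - 1, "child must be full"
    median = y.keys[t - 1]
    # z takes the t - 1 keys right of the median (and the right t children).
    z = Node(keys=y.keys[t:], children=y.children[t:])
    # y keeps the t - 1 keys left of the median (and the left t children).
    y.keys = y.keys[: t - 1]
    y.children = y.children[:t]
    # The median moves up; z becomes the child immediately to its right.
    x.keys.insert(i, median)
    x.children.insert(i + 1, z)


t = 2
parent = Node(keys=[50], children=[Node([10, 20, 30]), Node([60])])
split_child(parent, 0, t)
print(parent.keys)                         # [20, 50]
print([c.keys for c in parent.children])   # [[10], [30], [60]]
```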

B-Tree-Insert(T, k)

    r <- root[T]
    if n[r] = 2t - 1
        then s <- Allocate-Node()
             root[T] <- s
             leaf[s] <- FALSE
             n[s] <- 0
             c_1[s] <- r
             B-Tree-Split-Child(s, 1, r)
             B-Tree-Insert-Nonfull(s, k)
        else B-Tree-Insert-Nonfull(r, k)

B-Tree-Insert-Nonfull(x, k)

    i <- n[x]
    if leaf[x]
        then while i >= 1 and k < key_i[x]
                 do key_{i+1}[x] <- key_i[x]
                    i <- i - 1
             key_{i+1}[x] <- k
             n[x] <- n[x] + 1
             Disk-Write(x)
        else while i >= 1 and k < key_i[x]
                 do i <- i - 1
             i <- i + 1
             Disk-Read(c_i[x])
             if n[c_i[x]] = 2t - 1
                 then B-Tree-Split-Child(x, i, c_i[x])
                      if k > key_i[x]
                          then i <- i + 1
             B-Tree-Insert-Nonfull(c_i[x], k)

To perform an insertion on a b-tree, the appropriate node for the key must be located using an algorithm similar to B-Tree-Search. Next, the key must be inserted into the node. If the node is not full prior to the insertion, no special action is required; however, if the node is full, the node must be split to make room for the new key. Since splitting the node results in moving one key to the parent node, the parent node must not be full or another split operation is required. This process may repeat all the way up to the root and may require splitting the root node. This approach requires two passes. The first pass locates the node where the key should be inserted; the second pass performs any required splits on the ancestor nodes.

Since each access to a node may correspond to a costly disk access, it is desirable to avoid the second pass by ensuring that the parent node is never full. To accomplish this, the presented algorithm splits any full nodes encountered while descending the tree. Although this approach may result in unnecessary split operations, it guarantees that the parent never needs to be split and eliminates the need for a second pass up the tree. Since a split runs in time linear in t, it has little effect on the O(t log_t n) running time of B-Tree-Insert.

Splitting the root node is handled as a special case since a new root must be created to contain the median key of the old root. Observe that a b-tree will grow from the top.
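Putting the pieces together, a single-pass insertion might look like the following in-memory sketch. The `Node` and `BTree` classes, the 0-based indexing, and the iterative descent are choices made for this example; disk reads and writes are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    keys: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


class BTree:
    def __init__(self, t: int) -> None:
        self.t = t            # minimum degree, t >= 2
        self.root = Node()    # an empty leaf, as in B-Tree-Create

    def _split_child(self, x: Node, i: int) -> None:
        t, y = self.t, x.children[i]
        z = Node(keys=y.keys[t:], children=y.children[t:])
        x.keys.insert(i, y.keys[t - 1])       # the median key moves up
        x.children.insert(i + 1, z)
        y.keys, y.children = y.keys[: t - 1], y.children[:t]

    def insert(self, k: int) -> None:
        if len(self.root.keys) == 2 * self.t - 1:
            # Full root: the tree grows by one level from the top.
            self.root = Node(children=[self.root])
            self._split_child(self.root, 0)
        self._insert_nonfull(self.root, k)

    def _insert_nonfull(self, x: Node, k: int) -> None:
        while x.children:
            # Index of the child whose key range contains k.
            i = sum(1 for key in x.keys if key <= k)
            if len(x.children[i].keys) == 2 * self.t - 1:
                # Split full nodes on the way down so no second pass is needed.
                self._split_child(x, i)
                if k > x.keys[i]:
                    i += 1
            x = x.children[i]
        # x is now a non-full leaf: place k in sorted position.
        x.keys.insert(sum(1 for key in x.keys if key <= k), k)

    def in_order(self, x: Optional[Node] = None) -> Iterator[int]:
        x = self.root if x is None else x
        for i, key in enumerate(x.keys):
            if x.children:
                yield from self.in_order(x.children[i])
            yield key
        if x.children:
            yield from self.in_order(x.children[-1])


tree = BTree(t=2)
keys = [(i * 37) % 101 for i in range(101)]   # a fixed permutation of 0..100
for k in keys:
    tree.insert(k)
print(list(tree.in_order()) == sorted(keys))  # True
```

An in-order traversal returning the keys in sorted order is a quick sanity check that the splits preserved the ordering property.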

B-Tree-Delete

Deletion of a key from a b-tree is possible; however, special care must be taken to ensure that the properties of a b-tree are maintained. Several cases must be considered. If the deletion reduces the number of keys in a node below the minimum of t – 1, this violation must be corrected by combining several nodes and possibly reducing the height of the tree. If the key lies in an internal node, so that it has children, the children must be rearranged. For a detailed discussion of deleting from a b-tree, refer to Section 19.3, pages 395-397, of Cormen, Leiserson, and Rivest or to another reference listed below.

Examples

[Figure: Sample B-Tree]

[Figure: Searching a B-Tree for Key 21]

[Figure: Inserting Key 33 into a B-Tree (w/ Split)]

Applications

Databases

A database is a collection of data organized in a fashion that facilitates updating, retrieving, and managing the data. The data can consist of anything, including, but not limited to, names, addresses, pictures, and numbers. Databases are commonplace and are used every day. For example, an airline reservation system might maintain a database of available flights, customers, and tickets issued. A teacher might maintain a database of student names and grades.

Because computers excel at quickly and accurately manipulating, storing, and retrieving data, databases are often maintained electronically using a database management system. Database management systems are essential components of many everyday business operations. Database products like Microsoft SQL Server, Sybase Adaptive Server, IBM DB2, and Oracle serve as a foundation for accounting systems, inventory systems, medical recordkeeping systems, airline reservation systems, and countless other important aspects of modern businesses.

It is not uncommon for a database to contain millions of records requiring many gigabytes of storage. For example, TELSTRA, an Australian telecommunications company, maintains a customer billing database with 51 billion rows (yes, billion) and 4.2 terabytes of data. In order for a database to be useful and usable, it must support the desired operations, such as retrieval and storage, quickly. Because databases cannot typically be maintained entirely in memory, b-trees are often used to index the data and to provide fast access. For example, searching an unindexed and unsorted database containing n key values will have a worst case running time of O(n); if the same data is indexed with a b-tree, the same search operation will run in O(log n). To perform a search for a single key on a set of one million keys (1,000,000), a linear search will require at most 1,000,000 comparisons. If the same data is indexed with a b-tree of minimum degree 10, at most 114 comparisons will be required in the worst case. Clearly, indexing large amounts of data can significantly improve search performance. Although other balanced tree structures can be used, a b-tree also optimizes costly disk accesses that are of concern when dealing with large data sets.
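The figure of 114 comparisons can be reproduced as follows. The sketch assumes a linear scan within each node, so a search compares against at most 2t – 1 keys in each of the at most h + 1 nodes on the path, where h is the worst-case height for n keys. The function name is made up for this example.

```python
def worst_case_comparisons(n: int, t: int) -> int:
    """Upper bound on key comparisons for a linear-scan b-tree search."""
    h = 0
    # Worst-case height: the largest h with 2 * t**h <= n + 1.
    while 2 * t ** (h + 1) <= n + 1:
        h += 1
    nodes_on_path = h + 1
    keys_per_full_node = 2 * t - 1
    return nodes_on_path * keys_per_full_node


print(worst_case_comparisons(1_000_000, 10))  # 114
```

For one million keys and t = 10 the worst-case height is 5, giving 6 nodes of at most 19 keys each.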

Concurrent Access to B-Trees

Databases typically run in multiuser environments where many users can concurrently perform operations on the database. Unfortunately, this common scenario introduces complications. For example, imagine a database storing bank account balances. Now assume that someone attempts to withdraw $40 from an account containing $60. First, the current balance is checked to ensure sufficient funds. After funds are disbursed, the balance of the account is reduced. This approach works flawlessly until concurrent transactions are considered. Suppose that another person simultaneously attempts to withdraw $30 from the same account. At the same time the account balance is checked by the first person, the account balance is also retrieved for the second person. Since neither person is requesting more funds than are currently available, both requests are satisfied for a total of $70. After the first person’s transaction, $20 should remain ($60 – $40), so the new balance is recorded as $20. Next, the account balance after the second person’s transaction, $30 ($60 – $30), is recorded, overwriting the $20 balance. Unfortunately, $70 has been disbursed, but the account balance has only been decreased by $30. Clearly, this behavior is undesirable, and special precautions must be taken.
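The interleaving in this scenario can be replayed deterministically without any threads. In the sketch below the two transactions are written out step by step in the problematic order; the function and variable names are invented for the illustration.

```python
def withdraw_unsafely(balance_seen: int, amount: int) -> int:
    """Return the new balance computed from a (possibly stale) read."""
    if amount > balance_seen:
        raise ValueError("insufficient funds")
    return balance_seen - amount


account = {"balance": 60}

# Both transactions read the balance before either one writes.
seen_by_first = account["balance"]
seen_by_second = account["balance"]

account["balance"] = withdraw_unsafely(seen_by_first, 40)    # records 20
account["balance"] = withdraw_unsafely(seen_by_second, 30)   # overwrites with 30

disbursed = 40 + 30
print(account["balance"], disbursed)   # 30 70
```

The check and the update are separate steps, so the second write is based on a balance that is already out of date.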

A b-tree suffers from similar problems in a multiuser environment. If two or more processes are manipulating the same tree, it is possible for the tree to become corrupt and result in data loss or errors.

The simplest solution is to serialize access to the data structure. In other words, if another process is using the tree, all other processes must wait. Although this is feasible in many cases, it can place an unnecessary and costly limit on performance because many operations actually can be performed concurrently without risk. Locking, introduced by Gray and refined by many others, provides a mechanism for controlling concurrent operations on data structures in order to prevent undesirable side effects and to ensure consistency. For a detailed discussion of this and other concurrency control mechanisms, please refer to the references below.
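As an illustration of the simplest approach, serialized access, the sketch below guards the bank balance from the earlier scenario with a single mutex from Python's `threading` module. This is one coarse lock, not the granular locking schemes described by Gray et al., and the `Account` class is an assumption made for the example.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount: int) -> bool:
        """Check and update the balance as one indivisible step."""
        with self._lock:
            if amount > self.balance:
                return False
            self.balance -= amount
            return True


account = Account(60)
results = {}


def attempt(name: str, amount: int) -> None:
    results[name] = account.withdraw(amount)


threads = [
    threading.Thread(target=attempt, args=("first", 40)),
    threading.Thread(target=attempt, args=("second", 30)),
]
for th in threads:
    th.start()
for th in threads:
    th.join()

# Exactly one withdrawal succeeds; which one depends on thread scheduling.
print(results, account.balance)
```

Whichever thread acquires the lock second sees the reduced balance and is refused, so the recorded balance always matches the funds actually disbursed. The same idea applied to a whole b-tree is safe but makes every operation wait, which is the performance limit finer-grained locking is meant to relax.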

Works Cited

Bayer, R., M. Schkolnick. Concurrency of Operations on B-Trees. In Readings in Database Systems (ed. Michael Stonebraker), pages 216-226, 1994.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms. MIT Press, Massachusetts, 1998.
Gray, J. N., R. A. Lorie, G. R. Putzolu, I. L. Traiger. Granularity of Locks and Degrees of Consistency in a Shared Data Base. In Readings in Database Systems (ed. Michael Stonebraker), pages 181-208, 1994.
Kung, H. T., John T. Robinson. On Optimistic Methods of Concurrency Control. In Readings in Database Systems (ed. Michael Stonebraker), pages 209-215, 1994.

    Original author: B树
    Original URL: https://blog.csdn.net/u011013137/article/details/9167893