Contents
ToggleCorrected Exercises - Indexing keys in a database
Database indexes are processed by indexing in order to find their record in a database as quickly as possible. Here is a series of exercises dealing with this problem. The course on binary search trees also addresses this problem by
Simple indexing exercise
Let us recall some definitions:
- Index = table allowing the relative address of this article (position in the file) to be associated with an article key (application data)
- Indexed access methods = way of fetching articles (application data) in a file from indexed organization (è primary index)
- Density of an index = quotient of the number of keys in the index to the number of articles in the file. Thus, a dense index has as many keys as there are articles in the file. If the file is sorted, we can use a non-dense index (eg: the largest key of the articles in each block with the relative address of the block)
- Hierarchical index = index on an index, on an index, ... to speed up the search for the key in the index
- Discriminant or non-discriminatory indexes = on discriminating or non-discriminatory data (identifying an article uniquely or not)
- Index placing = which arranges the articles in the order of the keys and restores them in the order in sequential reading of the memory
- Primary index = which is based on the key of the articles, allows them to be stored in memory (incidentally, accelerates access on key)
- Secondary index = an access accelerator, this index is non-discriminatory
Organizations indexed require the creation of a sorted index on which dichotomous searches of a discriminating or non-discriminating key value are applied. The file itself being sorted by key, the corresponding index is of non-dense primary type. This index can be seen as a table of pairs (key_val, page_adr) where key_val is the smallest key of the articles stored in the page referenced by page_adr. When the index is large, it is decomposed into a tree structure in which each node has a size less than or equal to one page. The top node of the tree is called the root of the index.
When searching for a key value in the index, an examination of the root helps determine which part of the tree to continue the search by returning to an immediately lower level node. Examining this node in turn allows the search interval to be refined. This recursive process applies until encountering a node or a leaf (lowest level node of the tree) containing the key value sought and the address of the associated page. Tree organizations are differentiated by the principle according to which the dynamicity of the hierarchical index is managed.
This principle determines the performance obtained both in terms of the number of I/Os necessary to browse the index and in terms of the occupancy rate of each node in the hierarchy.
a swinging tree is also called tree-B, B-tree Where balanced shaft. Intuitively, it is possible to construct a dynamic tree structure as follows. When an index is large, it is possible to index the file containing this index recursively until the highest level index (the root) fits on a single page . Given the dynamicity of the index, the root itself may grow and no longer fit in a page.
We then add a level to the index hierarchy. The hierarchy therefore grows from the root such that all paths from the root to the leaves have the same length. The formal definition of a B-tree is given below and is illustrated on the example of Figure 1.
A B-tree of order m is a tree such that
(i) each node contains k sorted keys, with m <= k <= 2m except the root for which k checks 1<=k <= 2m.
(ii) every non-leaf node has (k+1) children. The ith son has keys including the (i-1)th and ith keys of the father.
(iii) the tree is balanced.
Question 1
Given the formal definition of B-trees, what will be the minimum and maximum heights (number of levels) of a B-tree of order m containing n keys?
Question 2
What are the criteria for choosing the order of a B-tree if we use this structure to build an index?
Question 3
What is the point of the structure of a tree index being balanced?
Question 4
In real applications, access keys may vary in size. Deduce the storage format of the node of a B-tree. What is the use of the limits m and 2m fixed for the number k of keys per node and how should these limits be interpreted in the case of keys of variable size?
Question 5
Represent the B-tree in Figure 1 after inserting keys 37 then 36.
Question 6
Represent the B-tree in Figure 1 after removing keys 17 then 31.
Question 7
Many applications perform sorted sequential processing on their files. Is the B-tree structure well suited to this type of processing? How much I/O does a sorted sequential traversal of all keys in the B-tree shown in Figure 1 require? Propose an optimization of this structure allowing only the leaves of the tree to be read during a sorted sequential traversal. This organization is known as the B+ tree. Discuss the advantages of B+ trees over B trees.
Question 1
The height of a B-tree scales logarithmically with respect to the number of keys stored in the tree. Indeed, in a B-tree of order m, each node has between (m+1) and (2m+1) children. We can therefore store, at level i+1 of tree-B, between (m+1) and (2m+1) times more keys than at level i, depending on whether the tree is lightly or heavily occupied. The formulas for calculating the minimum and maximum height of a B-tree, depending on its order and the number of keys it contains, are therefore:
h_min » log_(2m+1) n h_max » log_(m+1) n
For example, a B-tree of order 100 can contain more than 8 million keys on just three levels.
Question 2
When searching (or writing) an index, accessing each node traversed in the tree generates an I/O. The minimum and maximum heights of the tree therefore limit the I/O cost of searching for a key in the index. However, it was shown in the previous question that the greater the order of the B-tree, the lower its height. On the other hand, we have shown that the number of comparisons generated by the search for a key in a B-tree is independent of the order of this B-tree.
It is therefore interesting to choose as large an order as possible in order to store a maximum of keys per node and minimize the height of the index. This choice must, however, take into account the maximum size of the B-tree nodes. This size is bounded by the size of a disk page to ensure that reading a node requires only one I/O.
Question 3
A balanced tree guarantees a constant access cost to any key value regardless of key distribution. It is therefore possible to limit the number of I/Os necessary for research or writing an article. This property is particularly important in databases. Indeed, when several access paths can be taken to access an article, a DBMS must be able to automatically make the optimal choice.
Question 4
A B-tree node contains a set of pairs (key, child pointer) and an additional pointer referencing the first child node (a node containing n keys has n+1 children), as shown in Figure 1. If the size of the keys is fixed, this size is stored in the index descriptor. If the size of the keys is variable, two solutions are possible:
– the keys are framed by a special character. This character, called a separator, must have a binary coding different from all the bytes likely to be contained in a key. Furthermore, reading the ith key of a node requires reading byte by byte of all the keys preceding it in this node.
– keys are preceded by a length field. This technique is the only one applicable if the binary coding of the keys covers all possible binary codes. The ith key of a node can thus be accessed by jumping from length field to length field.
Figure 1. Storage format of a B-tree node
The m and 2m terminals ensure that the filling rate of a node varies between 50% and 100% (and is therefore on average 75%). In the definition of B-trees, this filling rate k is expressed in number of keys per node. When using B-trees containing keys of variable size, the filling rate of a node must then be expressed in number of bytes rather than in number of keys.
Question 5
The insertions of keys 37 then 36 are illustrated in the Figure. Insertion of key 36 causes the sheet where it is to be inserted to burst. The rise of the middle key 37 in turn causes the parent node of this leaf to burst. The B-tree therefore grows from the top and remains balanced.
Figure 2. Insertion of keys 37 then 36
Question 6
The deletion of key 17 results in an under-occupancy of the leaf containing key 17. We choose here to merge this leaf with its right sister (leaf belonging to the same parent node). The middle key 24 goes back down into the merge result sheet. The merger of two sheets can be followed by a re-burst if the sheet resulting from the merger is over-occupied. When merging 2 nodes (resp. leaves), the choice of brother (resp. sister) with which to carry out the merger can be systematic (right brother or left brother each time) or dynamic.
A dynamic choice allows you to merge with the node (resp. leaf) which will produce the most stable possible fusion result, that is to say neither over-occupied nor under-occupied. This optimization, however, requires the reading of both brothers (resp. sisters) for a hypothetical gain.
Figure 3. Deleting keys 17 then 31
Deleting key 31 introduces a particular problem in that it does not belong to a leaf. To delete a key in a non-leaf node, replace it with the largest key in the left subtree (key 29 in the previous example) or with the smallest key in the right subtree (key 32 in the example ) whose key to delete is root. The chosen key must in turn be deleted from the sheet using the classic deletion procedure.
Question 7
The B-tree structure allows sorted sequential accesses since the keys are sorted in the tree. However, a sorted sequential traversal of the B-tree requires, after reading all the keys of a leaf (resp. of a node), to go back to the parent node to access the right sister leaf (right sibling node) to continue reading the keys in sorted order. Thus, a parent node will be accessed on disk each time one of its children is read, in order to retrieve the address of this child. As a result, a sorted sequential read of the 12 nodes and leaves of the presented B-tree requires a total of 20 I/Os.
An optimization of the B-tree structure for sorted sequential traversals is to duplicate all keys stored in non-leaf nodes at the leaf level and then chain these leaves together in sort order. The address of the first sheet can additionally be stored in the file descriptor. Thus, a sorted sequential traversal only requires reading the leaves of the tree. This type of tree is called a B+ tree [Comer79]. The B+ tree corresponding to the B-tree is presented in Figure 4.
Figure 4. Tree-B+ corresponding to tree-B in Figure 2
Multiple index exercise
In the field of databases, article searches are generally carried out on multiple criteria (example: searching for cars with power greater than 10 hp and Citroën brand). File organizations indexed with a single index favor processing the most frequently used search criterion on a file, ordering the articles on disk according to this criterion. To efficiently process multi-criteria searches, it is possible to manage several indexes for the same file.
The solution consists of using a main placing index (the data is sorted according to this index) and subsidiary indexes, called secondary indexes. A secondary index is constituted on an attribute (or several) which may or may not be discriminatory and gives for each value of the attribute the identifiers (often the relative addresses) of the articles having this value. The attribute (or group of attributes) thus indexed is called a secondary key.
During a secondary key search, access to the secondary index (a B or B+ tree) delivers the identifiers of all articles satisfying the search criterion. These articles are then retrieved one by one via their identifiers, with at most one I/O per article. Creating a secondary index introduces a significant storage cost. In addition, any update of the data file causes all secondary indexes relating to this file to be updated. The use of secondary indexes must therefore be restricted to frequent search criteria.
Question 1
We wish to use an organization based on a B+ tree technique to index the articles of a file according to several attributes, A1 being primary (placing) key and A2, A3, … Ap being secondary keys. Specify how the algorithms for inserting, updating and deleting articles in files should be modified.
Question 2
In the same context where the file is placed on A1 and indexed on the secondary keys A2, A3, … Ap, specify the algorithm for executing a criterion restriction (A1 = v1 & A2 = v2 … & Ap = vp ). How can we optimize the search for relevant articles? How to modify the algorithm if some "and" ( & ) are replaced by "or" ( | ), assuming a disjunctive normal form (and of or)?
Question 3
In the same context where the file is placed on A1 and indexed on the secondary keys A2, A3, … Ap, specify the algorithm for executing a criterion restriction (A1 = v1 & A2 = v2 … & Ap = vp & B1 = v'1 & B2 = v'2 & …), the Bi being non-indexed attributes. Given a forecast frequency of querying and updating an attribute, specify how to choose whether an attribute benefits from being indexed or not?
The B+ tree management algorithms introduced assume single-criteria sorting (i.e., taking into account a single attribute). In order to take into account several attributes in the placement method, it is enough to adapt the function Search_dichoto to multi-criteria sorting. The key to use in sorting is the concatenation of the attributes A1, A2, …, An. The comparison of 2 keys is first carried out on the attribute A1. In case of equality, we take into account the value of attribute A2 and so on. A c1 key is therefore considered superior to a c2 key if ($i / c1.Ai > c2.Ai) and (i=1 or “j
This method does not treat different attributes equally. Thus, if the chosen sorting criterion is A1, A2, ..., An, it is possible to accelerate searches based on a complete key or based on a partial key provided that this partial key preserves the first attributes of the complete key. Thus, a search for all articles such that (A1=value1 and A2=value2) is possible. On the other hand, a search for all articles such as (A2=value1 and A3=value2) without specifying a value for A1 cannot be accelerated by the B+ tree.