[IPV4]: Add LC-Trie implementation notes
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
parent
908a75c17a
commit
b2f5710265
145
Documentation/networking/fib_trie.txt
Normal file
145
Documentation/networking/fib_trie.txt
Normal file
@ -0,0 +1,145 @@
|
||||
LC-trie implementation notes.
|
||||
|
||||
Node types
|
||||
----------
|
||||
leaf
|
||||
An end node with data. This has a copy of the relevant key, along
|
||||
with 'hlist' with routing table entries sorted by prefix length.
|
||||
See struct leaf and struct leaf_info.
|
||||
|
||||
trie node or tnode
|
||||
An internal node, holding an array of child (leaf or tnode) pointers,
|
||||
indexed through a subset of the key. See Level Compression.
|
||||
|
||||
A few concepts explained
|
||||
------------------------
|
||||
Bits (tnode)
|
||||
The number of bits in the key segment used for indexing into the
|
||||
child array - the "child index". See Level Compression.
|
||||
|
||||
Pos (tnode)
|
||||
The position (in the key) of the key segment used for indexing into
|
||||
the child array. See Path Compression.
|
||||
|
||||
Path Compression / skipped bits
|
||||
Any given tnode is linked to from the child array of its parent, using
|
||||
a segment of the key specified by the parent's "pos" and "bits"
|
||||
In certain cases, this tnode's own "pos" will not be immediately
|
||||
adjacent to the parent (pos+bits), but there will be some bits
|
||||
in the key skipped over because they represent a single path with no
|
||||
deviations. These "skipped bits" constitute Path Compression.
|
||||
Note that the search algorithm will simply skip over these bits when
|
||||
searching, making it necessary to save the keys in the leaves to
|
||||
verify that they actually do match the key we are searching for.
|
||||
|
||||
Level Compression / child arrays
|
||||
the trie is kept level balanced moving, under certain conditions, the
|
||||
children of a full child (see "full_children") up one level, so that
|
||||
instead of a pure binary tree, each internal node ("tnode") may
|
||||
contain an arbitrarily large array of links to several children.
|
||||
Conversely, a tnode with a mostly empty child array (see empty_children)
|
||||
may be "halved", having some of its children moved downwards one level,
|
||||
in order to avoid ever-increasing child arrays.
|
||||
|
||||
empty_children
|
||||
the number of positions in the child array of a given tnode that are
|
||||
NULL.
|
||||
|
||||
full_children
|
||||
the number of children of a given tnode that aren't path compressed.
|
||||
(in other words, they aren't NULL or leaves and their "pos" is equal
|
||||
to this tnode's "pos"+"bits").
|
||||
|
||||
(The word "full" here is used more in the sense of "complete" than
|
||||
as the opposite of "empty", which might be a tad confusing.)
|
||||
|
||||
Comments
|
||||
---------
|
||||
|
||||
We have tried to keep the structure of the code as close to fib_hash as
|
||||
possible to allow verification and help up reviewing.
|
||||
|
||||
fib_find_node()
|
||||
A good start for understanding this code. This function implements a
|
||||
straightforward trie lookup.
|
||||
|
||||
fib_insert_node()
|
||||
Inserts a new leaf node in the trie. This is bit more complicated than
|
||||
fib_find_node(). Inserting a new node means we might have to run the
|
||||
level compression algorithm on part of the trie.
|
||||
|
||||
trie_leaf_remove()
|
||||
Looks up a key, deletes it and runs the level compression algorithm.
|
||||
|
||||
trie_rebalance()
|
||||
The key function for the dynamic trie after any change in the trie
|
||||
it is run to optimize and reorganize. Tt will walk the trie upwards
|
||||
towards the root from a given tnode, doing a resize() at each step
|
||||
to implement level compression.
|
||||
|
||||
resize()
|
||||
Analyzes a tnode and optimizes the child array size by either inflating
|
||||
or shrinking it repeatedly until it fullfills the criteria for optimal
|
||||
level compression. This part follows the original paper pretty closely
|
||||
and there may be some room for experimentation here.
|
||||
|
||||
inflate()
|
||||
Doubles the size of the child array within a tnode. Used by resize().
|
||||
|
||||
halve()
|
||||
Halves the size of the child array within a tnode - the inverse of
|
||||
inflate(). Used by resize();
|
||||
|
||||
fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
|
||||
The route manipulation functions. Should conform pretty closely to the
|
||||
corresponding functions in fib_hash.
|
||||
|
||||
fn_trie_flush()
|
||||
This walks the full trie (using nextleaf()) and searches for empty
|
||||
leaves which have to be removed.
|
||||
|
||||
fn_trie_dump()
|
||||
Dumps the routing table ordered by prefix length. This is somewhat
|
||||
slower than the corresponding fib_hash function, as we have to walk the
|
||||
entire trie for each prefix length. In comparison, fib_hash is organized
|
||||
as one "zone"/hash per prefix length.
|
||||
|
||||
Locking
|
||||
-------
|
||||
|
||||
fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
|
||||
However, the functions are somewhat separated for other possible locking
|
||||
scenarios. It might conceivably be possible to run trie_rebalance via RCU
|
||||
to avoid read_lock in the fn_trie_lookup() function.
|
||||
|
||||
Main lookup mechanism
|
||||
---------------------
|
||||
fn_trie_lookup() is the main lookup function.
|
||||
|
||||
The lookup is in its simplest form just like fib_find_node(). We descend the
|
||||
trie, key segment by key segment, until we find a leaf. check_leaf() does
|
||||
the fib_semantic_match in the leaf's sorted prefix hlist.
|
||||
|
||||
If we find a match, we are done.
|
||||
|
||||
If we don't find a match, we enter prefix matching mode. The prefix length,
|
||||
starting out at the same as the key length, is reduced one step at a time,
|
||||
and we backtrack upwards through the trie trying to find a longest matching
|
||||
prefix. The goal is always to reach a leaf and get a positive result from the
|
||||
fib_semantic_match mechanism.
|
||||
|
||||
Inside each tnode, the search for longest matching prefix consists of searching
|
||||
through the child array, chopping off (zeroing) the least significant "1" of
|
||||
the child index until we find a match or the child index consists of nothing but
|
||||
zeros.
|
||||
|
||||
At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
|
||||
chop off part of the key in order to find the longest matching prefix.
|
||||
|
||||
At this point we will repeatedly descend subtries to look for a match, and there
|
||||
are some optimizations available that can provide us with "shortcuts" to avoid
|
||||
descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
|
||||
|
||||
To alleviate any doubts about the correctness of the route selection process,
|
||||
a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
|
||||
gives userland access to fib_lookup().
|
Loading…
Reference in New Issue
Block a user