samba-mirror

mirror of https://github.com/samba-team/samba.git synced 2024-12-25 23:21:54 +03:00

662 lines

20 KiB

C

Raw Normal View History

TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`#ifndef NTDB_PRIVATE_H`
			`#define NTDB_PRIVATE_H`
			`/*`
			`Trivial Database 2: private types and prototypes`
			`Copyright (C) Rusty Russell 2010`

			`This library is free software; you can redistribute it and/or`
			`modify it under the terms of the GNU Lesser General Public`
			`License as published by the Free Software Foundation; either`
			`version 3 of the License, or (at your option) any later version.`

			`This library is distributed in the hope that it will be useful,`
			`but WITHOUT ANY WARRANTY; without even the implied warranty of`
			`MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU`
			`Lesser General Public License for more details.`

			`You should have received a copy of the GNU Lesser General Public`
			`License along with this library; if not, see <http://www.gnu.org/licenses/>.`
			`*/`

			`#include "config.h"`
			`#ifndef HAVE_CCAN`
			`#error You need ccan to build ntdb!`
			`#endif`
			`#include "ntdb.h"`
			`#include <ccan/compiler/compiler.h>`
			`#include <ccan/likely/likely.h>`
			`#include <ccan/endian/endian.h>`

			`#ifdef HAVE_LIBREPLACE`
			`#include "replace.h"`
			`#include "system/filesys.h"`
			`#include "system/time.h"`
			`#include "system/shmem.h"`
			`#include "system/select.h"`
			`#include "system/wait.h"`
			`#else`
			`#include <stdint.h>`
			`#include <stdbool.h>`
			`#include <stdlib.h>`
			`#include <stddef.h>`
			`#include <sys/time.h>`
			`#include <sys/mman.h>`
			`#include <unistd.h>`
			`#include <fcntl.h>`
			`#include <errno.h>`
			`#include <stdio.h>`
			`#include <utime.h>`
			`#include <unistd.h>`
			`#endif`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`#include <assert.h>`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`#ifndef TEST_IT`
			`#define TEST_IT(cond)`
			`#endif`

			`/* #define NTDB_TRACE 1 */`

			`#ifndef __STRING`
			`#define __STRING(x) #x`
			`#endif`

			`#ifndef __STRINGSTRING`
			`#define __STRINGSTRING(x) __STRING(x)`
			`#endif`

			`#ifndef __location__`
			`#define __location__ __FILE__ ":" __STRINGSTRING(__LINE__)`
			`#endif`

			`typedef uint64_t ntdb_len_t;`
			`typedef uint64_t ntdb_off_t;`

			`#define NTDB_MAGIC_FOOD "NTDB file\n"`
			`#define NTDB_VERSION ((uint64_t)(0x26011967 + 7))`
			`#define NTDB_USED_MAGIC ((uint64_t)0x1999)`
			`#define NTDB_HTABLE_MAGIC ((uint64_t)0x1888)`
			`#define NTDB_CHAIN_MAGIC ((uint64_t)0x1777)`
			`#define NTDB_FTABLE_MAGIC ((uint64_t)0x1666)`
			`#define NTDB_CAP_MAGIC ((uint64_t)0x1555)`
			`#define NTDB_FREE_MAGIC ((uint64_t)0xFE)`
			`#define NTDB_HASH_MAGIC (0xA1ABE11A01092008ULL)`
			`#define NTDB_RECOVERY_MAGIC (0xf53bc0e7ad124589ULL)`
			`#define NTDB_RECOVERY_INVALID_MAGIC (0x0ULL)`

			`/* Capability bits. */`
			`#define NTDB_CAP_TYPE_MASK 0x1FFFFFFFFFFFFFFFULL`
			`#define NTDB_CAP_NOCHECK 0x8000000000000000ULL`
			`#define NTDB_CAP_NOWRITE 0x4000000000000000ULL`
			`#define NTDB_CAP_NOOPEN 0x2000000000000000ULL`

			`#define NTDB_OFF_IS_ERR(off) unlikely(off >= (ntdb_off_t)(long)NTDB_ERR_LAST)`
			`#define NTDB_OFF_TO_ERR(off) ((enum NTDB_ERROR)(long)(off))`
			`#define NTDB_ERR_TO_OFF(ecode) ((ntdb_off_t)(long)(ecode))`

			`/* Packing errors into pointers and v.v. */`
			`#define NTDB_PTR_IS_ERR(ptr) \`
			`unlikely((unsigned long)(ptr) >= (unsigned long)NTDB_ERR_LAST)`
			`#define NTDB_PTR_ERR(p) ((enum NTDB_ERROR)(long)(p))`
			`#define NTDB_ERR_PTR(err) ((void *)(long)(err))`

ntdb: make sure file is always a multiple of PAGESIZE (now NTDB_PGSIZE) ntdb uses tdb's transaction code, and it has an undocumented but implicit assumption: that the transaction recovery area is always aligned to the transaction pagesize. This means that no block will overlap the recovery area. This is maintained by rounding the size of the database up, so do the same for ntdb. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:29 +04:00			`/* This doesn't really need to be pagesize, but we use it for similar`
			`* reasons. */`
ntdb: reduce transaction pagesize from 64k to 16k. The performance numbers for transaction pagesize are indeterminate: larger pagesizes means a smaller transaction array, and a better chance of having a contiguous record (more efficient for ntdb_parse_record and some internal operations inside a transaction). On the other hand, large pagesize means more I/O even if we change a few bytes. But it also controls the multiple by which we will enlarge the file, and hence the minimum db size. It's 4k for tdb1, but 16k seems reasonable in these modern times. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:30 +04:00			`#define NTDB_PGSIZE 16384`
ntdb: make sure file is always a multiple of PAGESIZE (now NTDB_PGSIZE) ntdb uses tdb's transaction code, and it has an undocumented but implicit assumption: that the transaction recovery area is always aligned to the transaction pagesize. This means that no block will overlap the recovery area. This is maintained by rounding the size of the database up, so do the same for ntdb. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:29 +04:00
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`/* Common case of returning true, false or -ve error. */`
			`typedef int ntdb_bool_err;`

			`/* Prevent others from opening the file. */`
			`#define NTDB_OPEN_LOCK 0`
			`/* Expanding file. */`
			`#define NTDB_EXPANSION_LOCK 2`
			`/* Doing a transaction. */`
			`#define NTDB_TRANSACTION_LOCK 8`
			`/* Hash chain locks. */`
			`#define NTDB_HASH_LOCK_START 64`

			`/* Extend file by least 100 times larger than needed. */`
			`#define NTDB_EXTENSION_FACTOR 100`

			`/* We steal this many upper bits, giving a maximum offset of 64 exabytes. */`
			`#define NTDB_OFF_UPPER_STEAL 8`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00
			`/* And we use the lower bit, too. */`
			`#define NTDB_OFF_CHAIN_BIT 0`

			`/* Hash table sits just after the header. */`
			`#define NTDB_HASH_OFFSET (sizeof(struct ntdb_header))`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* Additional features we understand. Currently: none. */`
			`#define NTDB_FEATURE_MASK ((uint64_t)0)`

			`/* The bit number where we store the extra hash bits. */`
			`/* Convenience mask to get actual offset. */`
			`#define NTDB_OFF_MASK \`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`(((1ULL << (64 - NTDB_OFF_UPPER_STEAL)) - 1) - (1<<NTDB_OFF_CHAIN_BIT))`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* How many buckets in a free list: see size_to_bucket(). */`
			`#define NTDB_FREE_BUCKETS (64 - NTDB_OFF_UPPER_STEAL)`

			`/* We have to be able to fit a free record here. */`
			`#define NTDB_MIN_DATA_LEN \`
			`(sizeof(struct ntdb_free_record) - sizeof(struct ntdb_used_record))`

			`/* Indicates this entry is not on an flist (can happen during coalescing) */`
			`#define NTDB_FTABLE_NONE ((1ULL << NTDB_OFF_UPPER_STEAL) - 1)`

ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`/* By default, hash is 64k bytes */`
			`#define NTDB_DEFAULT_HBITS 13`

TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`struct ntdb_used_record {`
			`/* For on-disk compatibility, we avoid bitfields:`
			`magic: 16, (highest)`
			`key_len_bits: 5,`
			`extra_padding: 32`
			`*/`
			`uint64_t magic_and_meta;`
			`/* The bottom key_len_bits2 are key length, rest is data length. /`
			`uint64_t key_and_data_len;`
			`};`

			`static inline unsigned rec_key_bits(const struct ntdb_used_record *r)`
			`{`
			`return ((r->magic_and_meta >> 43) & ((1 << 5)-1)) * 2;`
			`}`

			`static inline uint64_t rec_key_length(const struct ntdb_used_record *r)`
			`{`
			`return r->key_and_data_len & ((1ULL << rec_key_bits(r)) - 1);`
			`}`

			`static inline uint64_t rec_data_length(const struct ntdb_used_record *r)`
			`{`
			`return r->key_and_data_len >> rec_key_bits(r);`
			`}`

			`static inline uint64_t rec_extra_padding(const struct ntdb_used_record *r)`
			`{`
			`return (r->magic_and_meta >> 11) & 0xFFFFFFFF;`
			`}`

			`static inline uint16_t rec_magic(const struct ntdb_used_record *r)`
			`{`
			`return (r->magic_and_meta >> 48);`
			`}`

			`struct ntdb_free_record {`
			`uint64_t magic_and_prev; /* NTDB_OFF_UPPER_STEAL bits magic, then prev */`
			`uint64_t ftable_and_len; /* Len not counting these two fields. */`
			`/* This is why the minimum record size is 8 bytes. */`
			`uint64_t next;`
			`};`

			`static inline uint64_t frec_prev(const struct ntdb_free_record *f)`
			`{`
			`return f->magic_and_prev & ((1ULL << (64 - NTDB_OFF_UPPER_STEAL)) - 1);`
			`}`

			`static inline uint64_t frec_magic(const struct ntdb_free_record *f)`
			`{`
			`return f->magic_and_prev >> (64 - NTDB_OFF_UPPER_STEAL);`
			`}`

			`static inline uint64_t frec_len(const struct ntdb_free_record *f)`
			`{`
			`return f->ftable_and_len & ((1ULL << (64 - NTDB_OFF_UPPER_STEAL))-1);`
			`}`

			`static inline unsigned frec_ftable(const struct ntdb_free_record *f)`
			`{`
			`return f->ftable_and_len >> (64 - NTDB_OFF_UPPER_STEAL);`
			`}`

			`struct ntdb_recovery_record {`
			`uint64_t magic;`
			`/* Length of record (add this header to get total length). */`
			`uint64_t max_len;`
			`/* Length used. */`
			`uint64_t len;`
			`/* Old length of file before transaction. */`
			`uint64_t eof;`
			`};`

			`/* this is stored at the front of every database */`
			`struct ntdb_header {`
			`char magic_food[64]; /* for /etc/magic */`
			`/* FIXME: Make me 32 bit? */`
			`uint64_t version; /* version of the code */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint64_t hash_bits; /* bits for toplevel hash table. */`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`uint64_t hash_test; /* result of hashing HASH_MAGIC. */`
			`uint64_t hash_seed; /* "random" seed written at creation time. */`
			`ntdb_off_t free_table; /* (First) free table. */`
			`ntdb_off_t recovery; /* Transaction recovery area. */`

			`uint64_t features_used; /* Features all writers understand */`
			`uint64_t features_offered; /* Features offered */`

			`uint64_t seqnum; /* Sequence number for NTDB_SEQNUM */`

			`ntdb_off_t capabilities; /* Optional linked list of capabilities. */`
			`ntdb_off_t reserved[22];`

ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`/*`
			`* Hash table is next:`
			`*`
			`* struct ntdb_used_record htable_hdr;`
			`* ntdb_off_t htable[1 << hash_bits];`
			`*/`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`};`

			`struct ntdb_freetable {`
			`struct ntdb_used_record hdr;`
			`ntdb_off_t next;`
			`ntdb_off_t buckets[NTDB_FREE_BUCKETS];`
			`};`

			`struct ntdb_capability {`
			`struct ntdb_used_record hdr;`
			`ntdb_off_t type;`
			`ntdb_off_t next;`
			`/* ... */`
			`};`

			`/* Information about a particular (locked) hash entry. */`
			`struct hash_info {`
			`/* Full hash value of entry. */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint32_t h;`
			`/* Start of hash table / chain. */`
			`ntdb_off_t table;`
			`/* Number of entries in this table/chain. */`
			`ntdb_off_t table_size;`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`/* Bucket we (or an empty space) were found in. */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`ntdb_off_t bucket;`
			`/* Old value that was in that entry (if not found) */`
			`ntdb_off_t old_val;`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`};`

			`enum ntdb_lock_flags {`
			`/* WAIT == F_SETLKW, NOWAIT == F_SETLK */`
			`NTDB_LOCK_NOWAIT = 0,`
			`NTDB_LOCK_WAIT = 1,`
			`/* If set, don't log an error on failure. */`
			`NTDB_LOCK_PROBE = 2,`
			`/* If set, don't check for recovery (used by recovery code). */`
			`NTDB_LOCK_NOCHECK = 4,`
			`};`

			`struct ntdb_lock {`
			`struct ntdb_context *owner;`
			`off_t off;`
			`uint32_t count;`
			`uint32_t ltype;`
			`};`

			`/* This is only needed for ntdb_access_commit, but used everywhere to`
			`* simplify. */`
			`struct ntdb_access_hdr {`
			`struct ntdb_access_hdr *next;`
			`ntdb_off_t off;`
			`ntdb_len_t len;`
			`bool convert;`
			`};`

			`struct ntdb_file {`
			`/* How many are sharing us? */`
			`unsigned int refcnt;`

			`/* Mmap (if any), or malloc (for NTDB_INTERNAL). */`
			`void *map_ptr;`

			`/* How much space has been mapped (<= current file size) */`
			`ntdb_len_t map_size;`

			`/* The file descriptor (-1 for NTDB_INTERNAL). */`
			`int fd;`

			`/* Lock information */`
			`pid_t locker;`
			`struct ntdb_lock allrecord_lock;`
			`size_t num_lockrecs;`
			`struct ntdb_lock *lockrecs;`

			`/* Identity of this file. */`
			`dev_t device;`
			`ino_t inode;`
			`};`

			`struct ntdb_methods {`
			`enum NTDB_ERROR (tread)(struct ntdb_context , ntdb_off_t, void *,`
			`ntdb_len_t);`
			`enum NTDB_ERROR (twrite)(struct ntdb_context , ntdb_off_t, const void *,`
			`ntdb_len_t);`
			`enum NTDB_ERROR (oob)(struct ntdb_context , ntdb_off_t, ntdb_len_t, bool);`
			`enum NTDB_ERROR (expand_file)(struct ntdb_context , ntdb_len_t);`
			`void (direct)(struct ntdb_context *, ntdb_off_t, size_t, bool);`
ntdb: special accessor functions for read/write of an offset. We also split off the NTDB_CONVERT case (where the ntdb is of a different endian) into its own io function. NTDB speed: Adding 10000 records: 3894-9951(8553) ns (815528 bytes) Finding 10000 records: 1644-4294(3580) ns (815528 bytes) Missing 10000 records: 1497-4018(3303) ns (815528 bytes) Traversing 10000 records: 1585-4225(3505) ns (815528 bytes) Deleting 10000 records: 3088-8154(6927) ns (815528 bytes) Re-adding 10000 records: 3192-8308(7089) ns (815528 bytes) Appending 10000 records: 5187-13307(11365) ns (1274312 bytes) Churning 10000 records: 6772-17567(15078) ns (1274312 bytes) NTDB speed in transaction: Adding 10000 records: 1602-2404(2214) ns (815528 bytes) Finding 10000 records: 456-871(778) ns (815528 bytes) Missing 10000 records: 393-522(503) ns (815528 bytes) Traversing 10000 records: 729-1015(945) ns (815528 bytes) Deleting 10000 records: 1065-1476(1374) ns (815528 bytes) Re-adding 10000 records: 1397-1930(1819) ns (815528 bytes) Appending 10000 records: 2927-3351(3184) ns (1274312 bytes) Churning 10000 records: 3921-4697(4378) ns (1274312 bytes) smbtorture results: ntdb speed 86581-191518(175666) ops/sec Applying patch..increase-top-level.patch 2012-06-19 07:12:13 +04:00			`ntdb_off_t (read_off)(struct ntdb_context ntdb, ntdb_off_t off);`
			`enum NTDB_ERROR (write_off)(struct ntdb_context ntdb, ntdb_off_t off,`
			`ntdb_off_t val);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`};`

			`/*`
			`internal prototypes`
			`*/`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`/* Get bits from a value. */`
			`static inline uint32_t bits_from(uint64_t val, unsigned start, unsigned num)`
			`{`
			`assert(num <= 32);`
			`return (val >> start) & ((1U << num) - 1);`
			`}`


TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`/* hash.c: */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint32_t ntdb_jenkins_hash(const void *key, size_t length, uint32_t seed,`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`void *unused);`

			`enum NTDB_ERROR first_in_hash(struct ntdb_context *ntdb,`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`struct hash_info *h,`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`NTDB_DATA kbuf, size_t dlen);`

			`enum NTDB_ERROR next_in_hash(struct ntdb_context *ntdb,`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`struct hash_info *h,`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`NTDB_DATA kbuf, size_t dlen);`

			`/* Hash random memory. */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint32_t ntdb_hash(struct ntdb_context ntdb, const void ptr, size_t len);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* Find and lock a hash entry (or where it would be). */`
			`ntdb_off_t find_and_lock(struct ntdb_context *ntdb,`
			`NTDB_DATA key,`
			`int ltype,`
			`struct hash_info *h,`
ntdb: optimize ntdb_fetch. We access the key on lookup, then access the data in the caller. It makes more sense to access both at once. We also put in a likely() for the case where the hash is not chained. Before: Adding 1000 records: 3644-3724(3675) ns (129656 bytes) Finding 1000 records: 1596-1696(1622) ns (129656 bytes) Missing 1000 records: 1409-1525(1452) ns (129656 bytes) Traversing 1000 records: 1636-1747(1668) ns (129656 bytes) Deleting 1000 records: 3138-3223(3175) ns (129656 bytes) Re-adding 1000 records: 3278-3414(3329) ns (129656 bytes) Appending 1000 records: 5396-5529(5426) ns (253312 bytes) Churning 1000 records: 9451-10095(9584) ns (253312 bytes) smbtorture results (--entries=1000) ntdb speed 183881-191112(188223) ops/sec After: Adding 1000 records: 3590-3701(3640) ns (129656 bytes) Finding 1000 records: 1539-1605(1566) ns (129656 bytes) Missing 1000 records: 1398-1440(1413) ns (129656 bytes) Traversing 1000 records: 1629-2015(1710) ns (129656 bytes) Deleting 1000 records: 3118-3236(3163) ns (129656 bytes) Re-adding 1000 records: 3235-3355(3275) ns (129656 bytes) Appending 1000 records: 5335-5444(5385) ns (253312 bytes) Churning 1000 records: 9350-9955(9494) ns (253312 bytes) smbtorture results (--entries=1000) ntdb speed 180559-199981(195106) ops/sec 2012-06-19 07:13:09 +04:00			`struct ntdb_used_record *rec,`
			`const char **rkey);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`enum NTDB_ERROR replace_in_hash(struct ntdb_context *ntdb,`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`const struct hash_info *h,`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`ntdb_off_t new_off);`

ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`enum NTDB_ERROR add_to_hash(struct ntdb_context *ntdb,`
			`const struct hash_info *h,`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`ntdb_off_t new_off);`

ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`enum NTDB_ERROR delete_from_hash(struct ntdb_context *ntdb,`
			`const struct hash_info *h);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* For ntdb_check */`
			`bool is_subhash(ntdb_off_t val);`
			`enum NTDB_ERROR unknown_capability(struct ntdb_context ntdb, const char caller,`
			`ntdb_off_t type);`

			`/* free.c: */`
			`enum NTDB_ERROR ntdb_ftable_init(struct ntdb_context *ntdb);`

			`/* check.c needs these to iterate through free lists. */`
			`ntdb_off_t first_ftable(struct ntdb_context *ntdb);`
			`ntdb_off_t next_ftable(struct ntdb_context *ntdb, ntdb_off_t ftable);`

			`/* This returns space or -ve error number. */`
			`ntdb_off_t alloc(struct ntdb_context *ntdb, size_t keylen, size_t datalen,`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`unsigned magic, bool growing);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* Put this record in a free list. */`
			`enum NTDB_ERROR add_free_record(struct ntdb_context *ntdb,`
			`ntdb_off_t off, ntdb_len_t len_with_header,`
			`enum ntdb_lock_flags waitflag,`
			`bool coalesce_ok);`

			`/* Set up header for a used/ftable/htable/chain/capability record. */`
			`enum NTDB_ERROR set_header(struct ntdb_context *ntdb,`
			`struct ntdb_used_record *rec,`
			`unsigned magic, uint64_t keylen, uint64_t datalen,`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint64_t actuallen);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* Used by ntdb_check to verify. */`
			`unsigned int size_to_bucket(ntdb_len_t data_len);`
			`ntdb_off_t bucket_off(ntdb_off_t ftable_off, unsigned bucket);`

			`/* Used by ntdb_summary */`
			`ntdb_off_t dead_space(struct ntdb_context *ntdb, ntdb_off_t off);`

			`/* Adjust expansion, used by create_recovery_area */`
			`ntdb_off_t ntdb_expand_adjust(ntdb_off_t map_size, ntdb_off_t size);`

			`/* io.c: */`
			`/* Initialize ntdb->methods. */`
			`void ntdb_io_init(struct ntdb_context *ntdb);`

			`/* Convert endian of the buffer if required. */`
			`void ntdb_convert(const struct ntdb_context ntdb, void *buf, ntdb_len_t size);`

			`/* Unmap and try to map the ntdb. */`
			`void ntdb_munmap(struct ntdb_file *file);`
			`enum NTDB_ERROR ntdb_mmap(struct ntdb_context *ntdb);`

			`/* Either alloc a copy, or give direct access. Release frees or noop. */`
			`const void ntdb_access_read(struct ntdb_context ntdb,`
			`ntdb_off_t off, ntdb_len_t len, bool convert);`
			`void ntdb_access_write(struct ntdb_context ntdb,`
			`ntdb_off_t off, ntdb_len_t len, bool convert);`

			`/* Release result of ntdb_access_read/write. */`
			`void ntdb_access_release(struct ntdb_context ntdb, const void p);`
			`/* Commit result of ntdb_acces_write. */`
			`enum NTDB_ERROR ntdb_access_commit(struct ntdb_context ntdb, void p);`

			`/* Clear an ondisk area. */`
			`enum NTDB_ERROR zero_out(struct ntdb_context *ntdb, ntdb_off_t off, ntdb_len_t len);`

			`/* Return a non-zero offset between >= start < end in this array (or end). */`
			`ntdb_off_t ntdb_find_nonzero_off(struct ntdb_context *ntdb,`
			`ntdb_off_t base,`
			`uint64_t start,`
			`uint64_t end);`

			`/* Return a zero offset in this array, or num. */`
			`ntdb_off_t ntdb_find_zero_off(struct ntdb_context *ntdb, ntdb_off_t off,`
			`uint64_t num);`

			`/* Allocate and make a copy of some offset. */`
			`void ntdb_alloc_read(struct ntdb_context ntdb, ntdb_off_t offset, ntdb_len_t len);`

			`/* Writes a converted copy of a record. */`
			`enum NTDB_ERROR ntdb_write_convert(struct ntdb_context *ntdb, ntdb_off_t off,`
			`const void *rec, size_t len);`

			`/* Reads record and converts it */`
			`enum NTDB_ERROR ntdb_read_convert(struct ntdb_context *ntdb, ntdb_off_t off,`
			`void *rec, size_t len);`

			`/* Bump the seqnum (caller checks for ntdb->flags & NTDB_SEQNUM) */`
			`void ntdb_inc_seqnum(struct ntdb_context *ntdb);`

			`/* lock.c: */`
			`/* Print message because another ntdb owns a lock we want. */`
			`enum NTDB_ERROR owner_conflict(struct ntdb_context ntdb, const char call);`

			`/* If we fork, we no longer really own locks. */`
			`bool check_lock_pid(struct ntdb_context ntdb, const char call, bool log);`

ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`/* Lock/unlock a hash bucket. */`
			`enum NTDB_ERROR ntdb_lock_hash(struct ntdb_context *ntdb,`
			`unsigned int hbucket,`
			`int ltype);`
			`enum NTDB_ERROR ntdb_unlock_hash(struct ntdb_context *ntdb,`
			`unsigned int hash, int ltype);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
			`/* For closing the file. */`
			`void ntdb_lock_cleanup(struct ntdb_context *ntdb);`

			`/* Lock/unlock a particular free bucket. */`
			`enum NTDB_ERROR ntdb_lock_free_bucket(struct ntdb_context *ntdb, ntdb_off_t b_off,`
			`enum ntdb_lock_flags waitflag);`
			`void ntdb_unlock_free_bucket(struct ntdb_context *ntdb, ntdb_off_t b_off);`

			`/* Serialize transaction start. */`
			`enum NTDB_ERROR ntdb_transaction_lock(struct ntdb_context *ntdb, int ltype);`
			`void ntdb_transaction_unlock(struct ntdb_context *ntdb, int ltype);`

			`/* Do we have any hash locks (ie. via ntdb_chainlock) ? */`
			`bool ntdb_has_hash_locks(struct ntdb_context *ntdb);`

			`/* Lock entire database. */`
			`enum NTDB_ERROR ntdb_allrecord_lock(struct ntdb_context *ntdb, int ltype,`
			`enum ntdb_lock_flags flags, bool upgradable);`
			`void ntdb_allrecord_unlock(struct ntdb_context *ntdb, int ltype);`
			`enum NTDB_ERROR ntdb_allrecord_upgrade(struct ntdb_context *ntdb, off_t start);`

			`/* Serialize db open. */`
			`enum NTDB_ERROR ntdb_lock_open(struct ntdb_context *ntdb,`
			`int ltype, enum ntdb_lock_flags flags);`
			`void ntdb_unlock_open(struct ntdb_context *ntdb, int ltype);`
			`bool ntdb_has_open_lock(struct ntdb_context *ntdb);`

			`/* Serialize db expand. */`
			`enum NTDB_ERROR ntdb_lock_expand(struct ntdb_context *ntdb, int ltype);`
			`void ntdb_unlock_expand(struct ntdb_context *ntdb, int ltype);`
			`bool ntdb_has_expansion_lock(struct ntdb_context *ntdb);`

			`/* If it needs recovery, grab all the locks and do it. */`
			`enum NTDB_ERROR ntdb_lock_and_recover(struct ntdb_context *ntdb);`

			`/* Default lock and unlock functions. */`
			`int ntdb_fcntl_lock(int fd, int rw, off_t off, off_t len, bool waitflag, void *);`
			`int ntdb_fcntl_unlock(int fd, int rw, off_t off, off_t len, void *);`

			`/* transaction.c: */`
			`enum NTDB_ERROR ntdb_transaction_recover(struct ntdb_context *ntdb);`
			`ntdb_bool_err ntdb_needs_recovery(struct ntdb_context *ntdb);`

			`struct ntdb_context {`
			`/* Single list of all TDBs, to detect multiple opens. */`
			`struct ntdb_context *next;`

			`/* Filename of the database. */`
			`const char *name;`

			`/* Logging function */`
			`void (log_fn)(struct ntdb_context ntdb,`
			`enum ntdb_log_level level,`
			`enum NTDB_ERROR ecode,`
			`const char *message,`
			`void *data);`
			`void *log_data;`

			`/* Open flags passed to ntdb_open. */`
			`int open_flags;`

			`/* low level (fnctl) lock functions. */`
			`int (lock_fn)(int fd, int rw, off_t off, off_t len, bool w, void );`
			`int (unlock_fn)(int fd, int rw, off_t off, off_t len, void );`
			`void *lock_data;`

			`/* the ntdb flags passed to ntdb_open. */`
			`uint32_t flags;`

			`/* Our statistics. */`
			`struct ntdb_attribute_stats stats;`

			`/* The actual file information */`
			`struct ntdb_file *file;`

			`/* Hash function. */`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint32_t (hash_fn)(const void key, size_t len, uint32_t seed, void *);`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`void *hash_data;`
ntdb: remove hash table trees. TDB2 started with a top-level hash of 1024 entries, divided into 128 groups of 8 buckets. When a bucket filled, the 8 bucket group expanded into pointers into 8 new 64-entry hash tables. When these filled, they expanded in turn, etc. It's a nice idea to automatically expand the hash tables, but it doesn't pay off. Remove it for NTDB. 1) It only beats TDB performance when the database is huge and the TDB hashsize is small. We are about 20% slower on medium-size databases (1000 to 10000 records), worse on really small ones. 2) Since we're 64 bits, our hash tables are already twice as expensive as TDB. 3) Since our hash function is good, it means that all groups tend to fill at the same time, meaning the hash enlarges by a factor of 128 all at once, leading to a very large database at that point. 4) Our efficiency would improve if we enlarged the top level, but that makes our minimum db size even worse: it's already over 8k, and jumps to 1M after about 1000 entries! 5) Making the sub group size larger gives a shallower tree, which performs better, but makes the "hash explosion" problem worse. 6) The code is complicated, having to handle delete and reshuffling groups of hash buckets, and expansion of buckets. 7) We have to handle the case where all the records somehow end up with the same hash value, which requires special code to chain records for that case. On the other hand, it would be nice if we didn't degrade as badly as TDB does when the hash chains get long. This patch removes the hash-growing code, but instead of chaining like TDB does when a bucket fills, we point the bucket to an array of record pointers. Since each on-disk NTDB pointer contains some hash bits from the record (we steal the upper 8 bits of the offset), 99.5% of the time we don't need to load the record to determine if it matches. This makes an array of offsets much more cache-friendly than a linked list. Here are the times (in ns) for tdb_store of N records, tdb_store of N records the second time, and a fetch of all N records. I've also included the final database size and the smbtorture local.[n]tdb_speed results. Benchmark details: 1) Compiled with -O2. 2) assert() was disabled in TDB2 and NTDB. 3) The "optimize fetch" patch was applied to NTDB. 10 runs, using tmpfs (otherwise massive swapping as db hits ~30M, despite plenty of RAM). Insert Re-ins Fetch Size dbspeed (nsec) (nsec) (nsec) (Kb) (ops/sec) TDB (10000 hashsize): 100 records: 3882 3320 1609 53 203204 1000 records: 3651 3281 1571 115 218021 10000 records: 3404 3326 1595 880 202874 100000 records: 4317 3825 2097 8262 126811 1000000 records: 11568 11578 9320 77005 25046 TDB2 (1024 hashsize, expandable): 100 records: 3867 3329 1699 17 187100 1000 records: 4040 3249 1639 154 186255 10000 records: 4143 3300 1695 1226 185110 100000 records: 4481 3425 1800 17848 163483 1000000 records: 4055 3534 1878 106386 160774 NTDB (8192 hashsize) 100 records: 4259 3376 1692 82 190852 1000 records: 3640 3275 1566 130 195106 10000 records: 4337 3438 1614 773 188362 100000 records: 4750 5165 1746 9001 169197 1000000 records: 4897 5180 2341 83838 121901 Analysis: 1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%. 2) TDB starts to lose when hash chains get 10 long (fetch 10% slower than TDB2/NTDB). 3) TDB does horribly when hash chains get 100 long (fetch 4x slower than NTDB, 5x slower than TDB2, insert about 2-3x slower). 4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger than TDB1 2012-06-19 07:13:04 +04:00			`uint32_t hash_seed;`
			`/* Bits in toplevel hash table. */`
			`unsigned int hash_bits;`
TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00
ntdb: allocator attribute. This is designed to allow us to make ntdb_context (and NTDB_DATA returned from ntdb_fetch) a talloc pointer. But it can also be used for any other alternate allocator. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-19 07:12:08 +04:00			`/* Allocate and free functions. */`
			`void (alloc_fn)(const void owner, size_t len, void priv_data);`
			`void (expand_fn)(void old, size_t newlen, void priv_data);`
			`void (free_fn)(void old, void *priv_data);`
			`void *alloc_data;`

TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`/* Our open hook, if any. */`
			`enum NTDB_ERROR (openhook)(int fd, void data);`
			`void *openhook_data;`

			`/* Are we accessing directly? (debugging check). */`
			`int direct_access;`

			`/* Set if we are in a transaction. */`
			`struct ntdb_transaction *transaction;`

			`/* What free table are we using? */`
			`ntdb_off_t ftable_off;`
			`unsigned int ftable;`

			`/* IO methods: changes for transactions. */`
			`const struct ntdb_methods *io;`

			`/* Direct access information */`
			`struct ntdb_access_hdr *access;`
			`};`

			`/* ntdb.c: */`
			`enum NTDB_ERROR COLD PRINTF_FMT(4, 5)`
			`ntdb_logerr(struct ntdb_context *ntdb,`
			`enum NTDB_ERROR ecode,`
			`enum ntdb_log_level level,`
			`const char *fmt, ...);`

ntdb: inline oob check The simple "is it in range" check can be inline; complex cases can be handed through to the normal or transaction handler. NTDB speed: Adding 10000 records: 4111-9983(9149) ns (815528 bytes) Finding 10000 records: 1667-4464(3810) ns (815528 bytes) Missing 10000 records: 1511-3992(3546) ns (815528 bytes) Traversing 10000 records: 1698-4254(3724) ns (815528 bytes) Deleting 10000 records: 3608-7998(7358) ns (815528 bytes) Re-adding 10000 records: 3259-8504(7805) ns (815528 bytes) Appending 10000 records: 5393-13579(12356) ns (1274312 bytes) Churning 10000 records: 6966-17813(16136) ns (1274312 bytes) NTDB speed in transaction: Adding 10000 records: 916-2230(2004) ns (815528 bytes) Finding 10000 records: 330-866(770) ns (815528 bytes) Missing 10000 records: 196-520(471) ns (815528 bytes) Traversing 10000 records: 356-879(800) ns (815528 bytes) Deleting 10000 records: 505-1267(1108) ns (815528 bytes) Re-adding 10000 records: 658-1681(1477) ns (815528 bytes) Appending 10000 records: 1088-2827(2498) ns (1274312 bytes) Churning 10000 records: 1636-4267(3785) ns (1274312 bytes) smbtorture results: ntdb speed 85588-189430(157110) ops/sec 2012-06-19 07:12:12 +04:00			`static inline enum NTDB_ERROR ntdb_oob(struct ntdb_context *ntdb,`
			`ntdb_off_t off, ntdb_len_t len,`
			`bool probe)`
			`{`
			`if (likely(off + len >= off)`
			`&& likely(off + len <= ntdb->file->map_size)`
			`&& likely(!probe)) {`
			`return NTDB_SUCCESS;`
			`}`
			`return ntdb->io->oob(ntdb, off, len, probe);`
			`}`

ntdb: special accessor functions for read/write of an offset. We also split off the NTDB_CONVERT case (where the ntdb is of a different endian) into its own io function. NTDB speed: Adding 10000 records: 3894-9951(8553) ns (815528 bytes) Finding 10000 records: 1644-4294(3580) ns (815528 bytes) Missing 10000 records: 1497-4018(3303) ns (815528 bytes) Traversing 10000 records: 1585-4225(3505) ns (815528 bytes) Deleting 10000 records: 3088-8154(6927) ns (815528 bytes) Re-adding 10000 records: 3192-8308(7089) ns (815528 bytes) Appending 10000 records: 5187-13307(11365) ns (1274312 bytes) Churning 10000 records: 6772-17567(15078) ns (1274312 bytes) NTDB speed in transaction: Adding 10000 records: 1602-2404(2214) ns (815528 bytes) Finding 10000 records: 456-871(778) ns (815528 bytes) Missing 10000 records: 393-522(503) ns (815528 bytes) Traversing 10000 records: 729-1015(945) ns (815528 bytes) Deleting 10000 records: 1065-1476(1374) ns (815528 bytes) Re-adding 10000 records: 1397-1930(1819) ns (815528 bytes) Appending 10000 records: 2927-3351(3184) ns (1274312 bytes) Churning 10000 records: 3921-4697(4378) ns (1274312 bytes) smbtorture results: ntdb speed 86581-191518(175666) ops/sec Applying patch..increase-top-level.patch 2012-06-19 07:12:13 +04:00			`/* Convenience routine to get an offset. */`
			`static inline ntdb_off_t ntdb_read_off(struct ntdb_context *ntdb,`
			`ntdb_off_t off)`
			`{`
			`return ntdb->io->read_off(ntdb, off);`
			`}`

			`/* Write an offset at an offset. */`
			`static inline enum NTDB_ERROR ntdb_write_off(struct ntdb_context *ntdb,`
			`ntdb_off_t off,`
			`ntdb_off_t val)`
			`{`
			`return ntdb->io->write_off(ntdb, off, val);`
			`}`

TDB2: Goodbye TDB2, Hello NTDB. This renames everything from tdb2 to ntdb: importantly, we no longer use the tdb_ namespace, so you can link against both ntdb and tdb if you want to. This also enables building of standalone ntdb by the autobuild script. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> 2012-06-18 17:00:26 +04:00			`#ifdef NTDB_TRACE`
			`void ntdb_trace(struct ntdb_context ntdb, const char op);`
			`void ntdb_trace_seqnum(struct ntdb_context ntdb, uint32_t seqnum, const char op);`
			`void ntdb_trace_open(struct ntdb_context ntdb, const char op,`
			`unsigned hash_size, unsigned ntdb_flags, unsigned open_flags);`
			`void ntdb_trace_ret(struct ntdb_context ntdb, const char op, int ret);`
			`void ntdb_trace_retrec(struct ntdb_context ntdb, const char op, NTDB_DATA ret);`
			`void ntdb_trace_1rec(struct ntdb_context ntdb, const char op,`
			`NTDB_DATA rec);`
			`void ntdb_trace_1rec_ret(struct ntdb_context ntdb, const char op,`
			`NTDB_DATA rec, int ret);`
			`void ntdb_trace_1rec_retrec(struct ntdb_context ntdb, const char op,`
			`NTDB_DATA rec, NTDB_DATA ret);`
			`void ntdb_trace_2rec_flag_ret(struct ntdb_context ntdb, const char op,`
			`NTDB_DATA rec1, NTDB_DATA rec2, unsigned flag,`
			`int ret);`
			`void ntdb_trace_2rec_retrec(struct ntdb_context ntdb, const char op,`
			`NTDB_DATA rec1, NTDB_DATA rec2, NTDB_DATA ret);`
			`#else`
			`#define ntdb_trace(ntdb, op)`
			`#define ntdb_trace_seqnum(ntdb, seqnum, op)`
			`#define ntdb_trace_open(ntdb, op, hash_size, ntdb_flags, open_flags)`
			`#define ntdb_trace_ret(ntdb, op, ret)`
			`#define ntdb_trace_retrec(ntdb, op, ret)`
			`#define ntdb_trace_1rec(ntdb, op, rec)`
			`#define ntdb_trace_1rec_ret(ntdb, op, rec, ret)`
			`#define ntdb_trace_1rec_retrec(ntdb, op, rec, ret)`
			`#define ntdb_trace_2rec_flag_ret(ntdb, op, rec1, rec2, flag, ret)`
			`#define ntdb_trace_2rec_retrec(ntdb, op, rec1, rec2, ret)`
			`#endif /* !NTDB_TRACE */`

			`#endif`

662 lines 20 KiB C Raw Normal View History Unescape Escape

662 lines

20 KiB

C

Raw Normal View History