Commit Graph

50 Commits

Author SHA1 Message Date
Gleb Fotengauer-Malinovskiy
2cee1a78a7 set.c: rewrite without nested functions 2015-05-21 18:08:21 +03:00
Alexey Tourbin
f25f962fe6 set.c: fixed sentinel allocation 2012-12-24 16:24:15 +04:00
Alexey Tourbin
798ce0db28 set.c: implemented 4-byte and 8-byte steppers for rpmsetcmp main loop
Provides versions, on average, are about 34 times longer that Requires
versions.  More precisely, if we consider all rpmsetcmp calls for
"apt-shell <<<unmet" command, then sum(c1)/sum(c2)=33.88.  This means
that we can save some time and instructions by skipping intermediate
bytes - in other words, by stepping a few bytes at a time.  Of course,
after all the bytes are skipped, we must recheck a few final bytes and
possibly step back.  Also, this requires more than one sentinel for
proper boundary checking.

This change implements two such "steppers" - 4-byte stepper for c1/c2
ratio below 16 and 8-byte stepper which is used otherwise.  When
stepping back, both steppers use bisecting.  Note that replacing last
two bisecting steps with a simple loop might be actually more efficient
with respect to branch prediction and CPU's BTB.  It is very hard to
measure any user time improvement, though, even in a series of 100 runs.
The improvement is next to none, at least on older AMD CPUs.  And so I
choose to keep bisecting.

callgrind annotations for "apt-shell <<<unmet", previous commit:
2,279,520,414  PROGRAM TOTALS
646,107,201  lib/set.c:decode_base62_golomb
502,438,804  lib/set.c:rpmsetcmp
 98,243,148  sysdeps/x86_64/memcmp.S:bcmp
 93,038,752  sysdeps/x86_64/strcmp.S:__GI_strcmp

callgrind annotations for "apt-shell <<<unmet", this commit:
2,000,254,692  PROGRAM TOTALS
642,039,009  lib/set.c:decode_base62_golomb
227,036,590  lib/set.c:rpmsetcmp
 98,247,798  sysdeps/x86_64/memcmp.S:bcmp
 93,047,422  sysdeps/x86_64/strcmp.S:__GI_strcmp
2012-03-15 07:02:09 +04:00
Alexey Tourbin
d78a2cbf3d set.c: increased cache size from 160 to 256 slots, 75% hit ratio
Hit ratio for "apt-shell <<<unmet" command:
160 slots: hit=46813 miss=22862 67.2%
256 slots: hit=52238 miss=17437 75.0%

So, we've increased the cache size by a factor of 256/160=1.6 or by 60%,
and the number of misses has decreased by a factor of 22862/17437=1.31
or by 1-17437/22862=23.7%.  This is not so bad, but it looks like we're
paying more for less.  The following analysis shows that this is not
quite true, since the real memory usage has increased by a somewhat
smaller factor.

160 slots, callgrind annotations:
2,406,630,571  PROGRAM TOTALS
795,320,289  lib/set.c:decode_base62_golomb
496,682,547  lib/set.c:rpmsetcmp
 93,466,677  sysdeps/x86_64/strcmp.S:__GI_strcmp
 91,323,900  sysdeps/x86_64/memcmp.S:bcmp
 90,314,290  stdlib/msort.c:msort_with_tmp'2
 83,003,684  sysdeps/x86_64/strlen.S:__GI_strlen
 58,300,129  sysdeps/x86_64/memcpy.S:memcpy
...
inclusive:
1,458,467,003  lib/set.c:rpmsetcmp

256 slots, callgrind annotations:
2,246,961,708  PROGRAM TOTALS
634,410,352  lib/set.c:decode_base62_golomb
492,003,532  lib/set.c:rpmsetcmp
 95,643,612  sysdeps/x86_64/memcmp.S:bcmp
 93,467,414  sysdeps/x86_64/strcmp.S:__GI_strcmp
 90,314,290  stdlib/msort.c:msort_with_tmp'2
 79,217,962  sysdeps/x86_64/strlen.S:__GI_strlen
 56,509,877  sysdeps/x86_64/memcpy.S:memcpy
...
inclusive:
1,298,977,925  lib/set.c:rpmsetcmp

So the decoding routine now takes about 20% fewer instructions, and
inclusive rpmsetcmp cost is reduced by about 11%.  Note, however, that
bcmp is now the third most expensive routine (due to higher hit ratio).
Since recent glibc versions provide optimized memcmp implementations, I
imply that total/inclusive improvement can be somewhat better than 11%.

As per memory usage, the question "how much the cache takes" cannot be
generally answered with a single number.  However, if we simply sum the
size of all malloc'd chunks on each rpmsetcmp invocation, using the
piece of code with a few obvious modifications elsewhere, we can obtain
the following statistics.

	if (hc == CACHE_SIZE) {
	    int total = 0;
	    for (i = 0; i < hc; i++)
	        total += ev[i]->msize;
	    printf("total %d\n", total);
	}

160 slots, memory usage:
min=1178583
max=2048701
avg=1330104
dev=94747
q25=1266647
q50=1310287
q75=1369005

256 slots, memory usage:
min=1670029
max=2674909
avg=1895076
dev=122062
q25=1828928
q50=1868214
q75=1916025

This indicates that average cache size is increased by about 42% from
1.27M to 1.81M; however, the third quartile is increased by about 40%,
and the maximum size is increased only by about 31% from 1.95M to 2.55M.
By which I conclude that extra 600K must be available even on low-memory
machines like Raspberry Pi (256M RAM).

* * *

What's a good hit ratio?

$ DepNames() { pkglist-query '[%{RequireName}\t%{RequireVersion}\n]' \
	/var/lib/apt/lists/_ALT_Sisyphus_x86%5f64_base_pkglist.classic |
		fgrep set: |cut -f1; }
$ DepNames |wc -l
34763
$ DepNames |sort -u |wc -l
2429
$ DepNames |sort |uniq -c |sort -n |awk '$1>1{print$1}' |Sum
33924
$ DepNames |sort |uniq -c |sort -n |awk '$1>1{print$1}' |wc -l
1590
$ DepNames |sort |uniq -c |sort -n |tail -256 |Sum
27079
$

We have 34763 set-versioned dependencies, which refer to 2429 sonames;
however, only 33924 dependencies refer to 1590 sonames more than once,
and the first reference is always a miss.  Thus the best possible hit
ratio (if we use at least 1590 slots) is (33924-1590)/34763=93.0%.

What happens if we use only 256 slots?  Assuming that dependencies are
processed in random order, the best strategy must spend its cache slots
on sonames with the most references.  This way we can serve (27079-256)
dependencies via cache hit, and so the best possible hit ratio for 256
slots is is 77.2%, assuming that dependencies are processed in random
order.
2012-03-09 02:42:21 +04:00
Alexey Tourbin
0af7afd2e5 set.c: precompute r mask for putbits coroutines
callgrind annotations for "apt-shell <<<unmet", previous commit:
2,424,712,279  PROGRAM TOTALS
813,389,804  lib/set.c:decode_base62_golomb
496,701,778  lib/set.c:rpmsetcmp

callgrind annotations for "apt-shell <<<unmet", this commit:
2,406,630,571  PROGRAM TOTALS
795,320,289  lib/set.c:decode_base62_golomb
496,682,547  lib/set.c:rpmsetcmp
2012-03-09 00:51:58 +04:00
Alexey Tourbin
80cec29464 set.c: improved cache_decode_set loop
I am going to consdier whether it is worthwhile to increase the cache
size.  Thus I have to ensure that the linear search won't be an obstacle
for doing so.  Particularly, its loop must be efficient in terms of both
cpu instructions and memory access patterns.

1) On behalf of memory access patterns, this change introduces two
separate arrays: hv[] with hash values and ev[] with actual cache
entries.  On x86-64, this saves 4 bytes per entry which have previously
been wasted to align cache_hdr structures.  This has some benefits on
i686 as well: for example, ev[] is not accessed on a cache miss.

2) As per instructions, the loop has two branches: the first is for
boundary checking, and the second is for matching hash condition.  Since
the boundary checking condition (cur->ent != NULL) relies on a sentinel,
the loop cannot be unrolled; it takes 6 instructions per iteration.  If
we replace the condition with explicit boundary check (hp < hv + hc),
the number of iterations becomes known upon entry to the loop, and gcc
will unroll the loop; it takes now 3 instructions per iteration, plus
some (smaller) overhead for boundary checking.

This change also removes __thread specifiers, since gcc is apparently
not very good at optimizing superfluous __tls_get_addr calls.  Also, if
we are to consider larger cache sizes, it becomes questionable whether
each thread should posess its own cache only as a means of achieving
thread safety.  Anyway, currently I'm not aware of threaded applications
which make concurrent librpm calls.

callgrind annotations for "apt-shell <<<unmet", previous commit:
2,437,446,116  PROGRAM TOTALS
820,835,411  lib/set.c:decode_base62_golomb
510,957,897  lib/set.c:rpmsetcmp
...
 23,671,760      for (cur = cache; cur->ent; cur++) {
  1,114,800  => /usr/src/debug/glibc-2.11.3-alt7/elf/dl-tls.c:__tls_get_addr (69675x)
 11,685,644  	if (hash == cur->hash) {
          .  	    ent = cur->ent;

callgrind annotations for "apt-shell <<<unmet", this commit:
2,431,849,572  PROGRAM TOTALS
820,835,411  lib/set.c:decode_base62_golomb
496,682,547  lib/set.c:rpmsetcmp
...
 10,204,175      for (hp = hv; hp < hv + hc; hp++) {
 11,685,644  	if (hash == *hp) {
    189,344  	    i = hp - hv;
    189,344  	    ent = ev[i];

Total improvement is not very impressive (6M instead of expected 14M),
mostly due to memmove complications - hv[] cannot be shifted efficiently
using 8-byte words.  However, the code now scales better.  Also, recent
glibc versions supposedly provide much improved memmove implementation.
2012-03-08 02:35:59 +04:00
Alexey Tourbin
568fe52e61 set.c: reimplemented decode_base62_golomb using Knuth's coroutines
Since the combined base62+golomb decoder is still the most expensive
routine, I have to consider very clever tricks to give it a boost.

In the routine, its "master logic" is executed on behalf of the base62
decoder: it makes bits from the string and passes them on to the "slave"
golomb routine.  The slave routine has to maintain its own state (doing
q or doing r); after the bits are processed, it returns and base62 takes
over.  When the slave routine is invoked again, it has to recover the
state and take the right path (q or r).  These seemingly cheap state
transitions can actually become relatively expensive, since the "if"
clause involves branch prediction which is not particularly accurate on
variable-length inputs.  This change demonstrates that it is possible to
get rid of the state-related instructions altogether.

Roughly, the idea is that, instead of calling putNbits(), we can invoke
"goto *putNbits", and the pointer will dispatch either to putNbitsQ or
putNbitsR label (we can do this with gcc's computed gotos).  However,
the goto will not return, and so the "putbits" guys will have to invoke
"goto getbits", and so on.  So it gets very similar to coroutines as
described in [Knuth 1997, vol. 1, p. 194].  Furthermore, one must
realize that computed gotos are not actually required: since the total
number of states is relatively small - roughly (q^r)x(reg^esc,align) -
it is possible to instantiate a few similar coroutines which pass
control directly to the right labels.

For example, the decoding is started with "get24q" coroutine - that is,
we're in the "Q" mode and we try to grab 24 bits (for the sake of the
example, I do not consider the initial align step).  If 24 bits are
obtained successfully, they are passed down to the "put24q" coroutine
which, as its name suggests, takes over in the "Q" mode immediately;
furthermore, in the "put24q" coroutine, the next call to get bits has to
be either "get24q" or "get24r" (depending on whether Q or R is processed
when no bits are left) - that is, the coroutine itself must "know" that
there is no base62 complications at this point.  The "get24r" is similar
to "get24q" except that it will invoke "put24r" instead of "put24q".  On
the other hand, consider that, in the beginning, only 12 bits have been
directly decoded (and the next 12 bits probably involve "Z").  We then
pass control to "put12q", which will in turn call either "get12q" or
"get12r" to handle irregular cases for the pending 12 bits (um, the
names "get12q" and "get12r" are a bit of a misnomer).

This change also removes another branch in golomb R->Q transition:

        r &= (1 << Mshift) - 1;
        *v++ = (q << Mshift) | r;
        q = 0;
        state = ST_VLEN;
-       if (left == 0)
-           return;
        bits >>= n - left;
        n = left;
    vlen:
        if (bits == 0) {
            q += n;
            return;
        }
        int vbits = __builtin_ffs(bits);
        ...

This first "left no bits" check is now removed and performed implicitly
by the latter "no need for bsf" check, with the result being far better
than I expected.  Perhaps it helps to understand that the condition
"left exactly 0" rarely holds, but CPU is stuck by the check.

So, Q and R processing step each now have exactly one branch (that is,
exactly one condition which completes the step).  Also, in the "put"
coroutines, I simply make a sequence of Q and R steps; this produces
a clean sequence of instructions which branches only when absolutely
necessary.

callginrd annotations for "apt-cache <<<unmet", previous commit:
2,671,717,564  PROGRAM TOTALS
1,059,874,219  lib/set.c:decode_base62_golomb
509,531,239  lib/set.c:rpmsetcmp

callginrd annotations for "apt-cache <<<unmet", this commit:
2,426,092,837  PROGRAM TOTALS
812,534,481  lib/set.c:decode_base62_golomb
509,531,239  lib/set.c:rpmsetcmp
2012-03-07 01:27:20 +04:00
Alexey Tourbin
4d55d9fad0 set.c: better estimation of encode_base62_size 2012-02-19 08:43:36 +04:00
Alexey Tourbin
17452dba48 set.c: reimplmeneted downsampling unsing merges
Most of the time, downsampling is needed for Provides versions,
which are expensive, and values are reduced by only 1 bit, which
can be implemented without sorting the values again.  Indeed,
only a merge is required.  The array v[] can be split into two
parts: the first part v1[] and the second part v2[], the latter
having values with high bit set.  After the high bit is stripped,
v2[] values are still sorted.  It suffices to merge v1[] and v2[].

Note that, however, a merge cannot be done inplace, and also we have
to support 2 or more downsampling steps.  We also want to avoid copying.
This requires careful buffer management - each version needs two
alternate buffers.

callgrind annotations for "apt-cache <<<unmet", previous commit:
2,743,058,808  PROGRAM TOTALS
1,068,102,605  lib/set.c:decode_base62_golomb
  509,186,920  lib/set.c:rpmsetcmp
  131,678,282  stdlib/msort.c:msort_with_tmp'2
   93,496,965  sysdeps/x86_64/strcmp.S:__GI_strcmp
   91,066,266  sysdeps/x86_64/memcmp.S:bcmp
   83,062,668  sysdeps/x86_64/strlen.S:__GI_strlen
   64,584,024  sysdeps/x86_64/memcpy.S:memcpy

callgrind annotations for "apt-cache <<<unmet", this commit:
2,683,295,262  PROGRAM TOTALS
1,068,102,605  lib/set.c:decode_base62_golomb
  510,261,969  lib/set.c:rpmsetcmp
   93,692,793  sysdeps/x86_64/strcmp.S:__GI_strcmp
   91,066,275  sysdeps/x86_64/memcmp.S:bcmp
   90,080,205  stdlib/msort.c:msort_with_tmp'2
   83,062,524  sysdeps/x86_64/strlen.S:__GI_strlen
   58,165,691  sysdeps/x86_64/memcpy.S:memcpy
2012-02-17 14:14:25 +04:00
Alexey Tourbin
692818eb72 set.c: combine and process 24 bits at a time
callgrind annotations for "apt-shell <<<unmet", previous commit:
2,794,697,010  PROGRAM TOTALS
1,119,563,508  lib/set.c:decode_base62_golomb
  509,186,920  lib/set.c:rpmsetcmp

callgrind annotations for "apt-shell <<<unmet", this commit:
2,743,128,315  PROGRAM TOTALS
1,068,102,605  lib/set.c:decode_base62_golomb
  509,186,920  lib/set.c:rpmsetcmp
2012-02-17 09:42:53 +04:00
Alexey Tourbin
7d414b68aa set.c: use plain array to make linear search even simpler
The only reason for using a linked list is to make LRU reordering O(1).
This change replaces the linked list with a plain array.  The inner loop
is now very tight, but reordering involves memmove(3) and is O(N), since
on average, half the array has to be shifted.  Note, however, that the
leading part of the array which is to be shifted is already there in L1
cache, and modern memmove(3) must be very efficient - I expect it to
take much fewer instructions than the loop itself.
2012-02-17 09:42:00 +04:00
Alexey Tourbin
5d0932c8a0 set.c: use contiguous memory to facilitate linear search
Recently I tried to implement another data structure similar to SVR2
buffer cache [Bach 1986], but the code got too complicated.  So I still
maintain that, for small cache sizes, linear search is okay.  Dennis
Ritchie famously argued that a linear search of a directory is efficient
because it is bounded by the size of the directory [Ibid., p. 76].
Great minds think alike (and share similar views on a linear search).

What can make the search slow, however, is not the loop per se, but
rather memory loads: on average, about 67% entries have to be loaded
(assuming 67% hit ratio), checked for entry->hash, and most probably
followed by entry->next.

With malloc'd cache entries, memory loads can be slow.  To facilitate
the search, this change introduces new structure "cache_hdr", which
has only 3 members necessary for the search.  The structures are
pre-allocated in contiguous memory block.  This must play nice with
CPU caches, resulting in fewer memory loads and faster searches.

Indeed, based on some measurements of "apt-shell <<<unmet", this change
can demonstrate about 2% overall improvement in user time.  Using more
sophisticated SVR2-like data structure further improves the result only
by about %0.5.
2012-02-11 06:44:55 +04:00
Alexey Tourbin
c3f705993b set.c: fixed off-by-one error in barrier allocation 2012-02-11 04:46:23 +04:00
Alexey Tourbin
55409f2b03 set.c: fixed assertion failure with malformed "empty set" set-string
In decode_set_init(), we explicitly prohibit empty sets:

    // no empty sets for now
    if (*str == '\0')
	return -4;

This does not validate *str character, since the decoder will check for
errors anyway.  However, this assumes that, otherwise, a non-empty set
will be decoded.  The assumption is wrong: it was actually possible to
construct an "empty set" which triggered assertion failure.

$ /usr/lib/rpm/setcmp yx00 yx00
setcmp: set.c:705: decode_delta: Assertion `c > 0' failed.
zsh: abort      /usr/lib/rpm/setcmp yx00 yx00
$

Here, the "00" part of the set-version yields a sequence of zero bits.
Since trailing zero bits are okay, golomb decoding routine basically
skips the whole sequence and returns 0.

To fix the problem, we have to observe that only up to 5 trailing zero
bits can be required to complete last base62 character, and the leading
"0" sequence occupies 6 or more bits.
2011-10-03 05:28:00 +04:00
Alexey Tourbin
771548f6ec set.c: increased cache size somewhat (128 -> 160)
Below I use 'apt-shell <<<unmet' as a baseline for measurements.

Cache performance with cache_size = 128: hit=39628 miss=22394 (64%)
Cache performance with cache_size = 160: hit=42031 miss=19991 (68%)
(11% fewer cache misses)

Cache performance with cache_size = 160 pivot_size = 1 (plain LRU):
hit=36172 miss=25850 (58%)

Total number of soname set-versions which must be decoded at least once:
miss=2173 (max 96%)

callgrind annotations, 4.0.4-alt100.27:
3,904,042,289  PROGRAM TOTALS
1,378,794,846  decode_base62_golomb
1,176,120,148  rpmsetcmp
  291,805,495  __GI_strcmp
  162,494,544  __GI_strlen
  162,222,530  msort_with_tmp'2
   56,758,517  memcpy
   53,132,375  __GI_strcpy
...

callgrind annotations, this commit (rebuilt in hasher):
2,558,482,547  PROGRAM TOTALS
987,220,089  decode_base62_golomb
468,510,579  rpmsetcmp
162,222,530  msort_with_tmp'2
 85,422,341  __GI_strcmp
 82,063,609  bcmp
 76,510,060  __GI_strlen
 63,806,309  memcpy
...

Inclusive rpmsetcmp annotation, this commit:
1,719,199,968  rpmsetcmp

Typical execution time, 4.0.4-alt100.27:
1.87s user 0.29s system 96% cpu 2.242 total

Typical execution time, this commit:
1.52s user 0.31s system 96% cpu 1.895 total

Based on user time, this constitutes about 20% speed-up.  For some
reason, the speed-up is more noticable on i586 architecture (27%).

Note that the cache should not be further increased, because of two
reasons: 1) LRU search is linear - this is fixable; 2) cache memory
cannot be reclaimed - this is unfixable.  On average, the cache now
takes 1.3M (max 2M).  For small cache sizes, linear search is okay
then (cache_decode_set costs about 20M Ir, which is less than memcmp).

An interesting question is to what extent it is worth to increase
the cache size, assuming that memory footprint is not an issue.
A plausible answer is that decode_base62_golomb should cost no
more than 1/2 of rpmsetcmp inclusive time, which is 987M Ir and
1,719M Ir respectively.  So, Ideally, the cache should be increased
up to the point where decode_base62_golomb takes about 700M Ir.

Note, however, that using midpoint insertion technique seems to
improve cache performance far more than simply increasing cache size.
2011-06-18 22:54:51 +04:00
Alexey Tourbin
d98cab549d set.c: more redesign to avoid extra copying and strlen
This partially reverts what's been introduced with previous commit.
Realize that strlen() must be *only* called when allocating space
for v[].  There is no reason to call strlen() for every Provides
string, since most of them are decoded via the cache hit.

Note, however, that now I have to use the following trick:

        memcmp(str, cur->str, cur->len + 1) == 0

I rely on the fact this works as expected even when str is shorter than
cur->len.  Namely, memcmp must start from lower addresses and stop at
the first difference (i.e. memcmp must not read past the end of str,
possibly except for a few trailing bytes on the same memory page); this
is not specified by the standard, but this is how it must work.

Also, since the cache now stores full decoded values, it is possible to
avoid copying and instead to set the pointer to internal cache memory.
Copying must be performed, however, when the set is to be downsampled.

Note that average Provides set size is around 1024, which corresponds
to base62 string length of about 2K and v[] of 4K.  Saving strlen(2K)
and memcpy(4K) on every rpmsetcmp call is indeed an improvement.

callgrind annotations for "apt-cache unmet", 4.0.4-alt100.27
1,900,016,996  PROGRAM TOTALS
694,132,522  decode_base62_golomb
583,376,772  rpmsetcmp
106,136,459  __GI_strcmp
102,581,178  __GI_strlen
 80,781,386  msort_with_tmp'2
 38,648,490  memcpy
 26,936,309  __GI_strcpy
 26,918,522  regionSwab.clone.2
 21,000,896  _int_malloc
...

callgrind annotations for "apt-cache unmet", this commit (rebuilt in hasher):
1,264,977,497  PROGRAM TOTALS
533,131,492  decode_base62_golomb
230,706,690  rpmsetcmp
 80,781,386  msort_with_tmp'2
 60,541,804  __GI_strlen
 42,518,368  memcpy
 39,865,182  bcmp
 26,918,522  regionSwab.clone.2
 21,841,085  _int_malloc
...
2011-06-16 00:49:41 +04:00
Alexey Tourbin
91d560c35c set.c: redesigned decode API to avoid extra strlen/cmp/cpy calls
Now that string functions are expensive, the API is redesigned so that
strlen is called only once, in rpmsetcmp.  The length is then passed as
an argument down to decoding functions.  With the length argument, it is
now possible to replace strcmp with memcmp and strcpy with memcpy.
2011-06-14 00:43:33 +04:00
Alexey Tourbin
4d6a444af4 set.c: minor cleanup and English fixes
"Effectively avoided" means something like "prakticheski avoided"
in Russian.  Multiple escapse are not avoided "prakticheski", though;
they are avoided altogether and "in principle".  The right word does
not come to mind.
2011-06-14 00:00:54 +04:00
Alexey Tourbin
68df596fd7 set.c: removed support for caching short deltas, shrinked cache
Now that decode_base62_golomb is much cheaper, the question is:
is it still worth to store short deltas, as opposed to storing
full values at the expense of shrinking the cache?

callgrind annotations for previous commit:
1,526,256,208  PROGRAM TOTALS
470,195,400  decode_base62_golomb
434,006,244  rpmsetcmp
106,137,949  __GI_strcmp
102,459,314  __GI_strlen
...

callgrind annotations for this commit:
1,427,199,731  PROGRAM TOTALS
533,131,492  decode_base62_golomb
231,592,751  rpmsetcmp
103,476,056  __GI_strlen
102,008,203  __GI_strcmp
...

So, decode_base62_golomb now takes more cycles, but the overall price
goes down.  This is because, when caching short deltas, two additional
stages should be performed: 1) short deltas must be copied into unsigned
v[] array; 2) decode_delta must be invoked to recover hash values.  Both
stages iterate on per-value basis and both are seemingly fast.  However,
they are not that fast when both of them are replaced with bare memcpy,
which uses xmm registers or something like this.
2011-06-10 23:58:43 +04:00
Alexey Tourbin
3ff35a310c set.c: improved rpmsetcmp main loop performance
The loop is logically impeccable, but its main condition
(v1 < v1end && v2 < v2end) is somewhat redundant: in two
of the three cases, only one pointer gets advanced.  To
save instructions, the conditions are now handled within
the cases.  The loop is now a while (1) loop, a disguised
form of goto.

Also not that, when comparing Requires against Provides,
the Requires is usually sparse:

P: a b c d e f g h i j k l ...
R: a   c         h   j     ...

This means that a nested loop which skips intermediate Provides
elements towards the next Requires element may improve performance.

	while (v1 < v1end && *v1 < *v2)
	    v1++;

However, note that the first condition (v1 < v1end) is also somewhat
redundant.  This kind of boundary checking can be partially omitted if
the loop gets unrolled.  There is a better technique, however, called
the barrier: *v1end must contain the biggest element possible, so that
the trailing *v1 is never smaller than any of *v2.  The nested loop is
then becomes as simple as

	while (*v1 < *v2)
	    v1++;

callgrind annotations, 4.0.4-alt100.27:
1,899,657,916  PROGRAM TOTALS
694,132,522  decode_base62_golomb
583,376,772  rpmsetcmp
106,225,572  __GI_strcmp
102,459,314  __GI_strlen
...

callgrind annotations, this commit (rebuilt in hasher):
1,526,256,208  PROGRAM TOTALS
470,195,400  decode_base62_golomb
434,006,244  rpmsetcmp
106,137,949  __GI_strcmp
102,459,314  __GI_strlen
...

Note that rpmsetcmp also absorbs cache_decode_set and decode_delta;
the loop is now about twice as faster.
2011-06-10 15:12:33 +04:00
Alexey Tourbin
2651bb3246 set.c: unindented rpmsetcmp 2011-06-10 10:50:05 +04:00
Alexey Tourbin
0cfbd8401f set.c: use __builtin_ffs to count vlen bits 2011-06-08 10:29:02 +04:00
Alexey Tourbin
57e25bb189 set.c: implemented two-bytes-at-a-time base62 decoding
callgrind annotations, 4.0.4-alt100.27:
1,899,576,194  PROGRAM TOTALS
694,132,522  decode_base62_golomb
583,376,772  rpmsetcmp
106,136,459  __GI_strcmp
102,459,362  __GI_strlen
...

callgrind annotations, this commit (built in hasher):
1,691,904,239  PROGRAM TOTALS
583,395,352  rpmsetcmp
486,433,168  decode_base62_golomb
106,122,657  __GI_strcmp
102,458,654  __GI_strlen
2011-06-07 10:49:48 +04:00
Alexey Tourbin
238e421ad3 set.c: use long subscript for table lookup, to avoid extra movslq instructions 2011-05-25 08:20:06 +04:00
Alexey Tourbin
97ff0102cd set.c: improved base62_decode table lookup
The whole point of using a table is not only that comparisons
like (c >= 'a' && c <= 'z') can be eliminated; but also that conditional
branches (the "ands" and "ifs") should be eliminated as well.

The existing code, however, uses separate branches to check e.g. for the
end of string; to check for an error; and to check for the (num6b < 61)
common case.  With this change, the table is restructured so that the
common case will be handled with only a single instruction.
2011-05-25 08:18:40 +04:00
Alexey Tourbin
fad9df878b set.c: tweak LRU first-time insertion policy
Pushing new elements to the front tends to assign extra weight to that
elements, at the expense of other elements that are already in the cache.
The idea is then to try first-time insertion somewhere in the middle.
Further attempts suggest that the "pivot" should be closer to the end.

Cache performance for "apt-shell <<<unmet", previous commit:
hit=62375 miss=17252

Cache performance for "apt-shell <<<unmet", this commit:
hit=65085 miss=14542
2011-01-07 06:45:38 +03:00
Alexey Tourbin
42db4e0a9d set.c: final touches 2011-01-03 09:24:15 +03:00
Alexey Tourbin
73adeec07e set.c: cache short deltas
After base62+golomb decoding, most deltas are under 65536 (in Provides
versions, average delta should be around 1536).  So the whole version
can be stored using short deltas, effectively halving memory footprint.
However, this seems to be somewhat slower: per-delta copying and
decode_golomb must be invoked to recover hash values.  On the other
hand, this allows to increase cache size (128 -> 192).  But note that,
with larger cache sizes, LRU linear search will take longer.  So this is
a compromise - and apparently a favourable one.
2011-01-03 09:04:50 +03:00
Alexey Tourbin
b674d2bc31 set.c: tweak delta routines and maskv 2011-01-03 08:21:13 +03:00
Alexey Tourbin
ebe77a46cd set.c: avoid extra strcmp calls in the caching code 2011-01-03 08:21:12 +03:00
Alexey Tourbin
46f3968205 set.c: optimize array access in rpmsetcmp
callgrind results for "apt-cache unmet", 4.0.4-alt100.6:
2,198,298,537  PROGRAM TOTALS
1,115,738,267  lib/set.c:decode_set
  484,035,006  lib/set.c:rpmsetcmp
  143,078,002  ???:strcmp
   79,477,321  ???:strlen
   61,780,572  ???:0x0000000000033080'2
   54,466,947  ???:memcpy
   31,161,399  ???:strcpy
   24,438,336  ???:pkgCache::DepIterator::AllTargets()

callgrind results for "apt-cache unmet", this commit:
1,755,431,664  PROGRAM TOTALS
764,189,271  lib/set.c:decode_base62_golomb
404,493,494  lib/set.c:rpmsetcmp
143,076,968  ???:strcmp
 70,833,953  ???:strlen
 61,780,572  ???:0x0000000000033080'2
 54,466,947  ???:memcpy
 31,161,399  ???:strcpy
 24,438,336  ???:pkgCache::DepIterator::AllTargets()
2011-01-03 08:21:10 +03:00
Alexey Tourbin
d1ad36aef1 set.c: implemented combined base62+golomb decoding routine 2011-01-03 08:20:55 +03:00
Alexey Tourbin
bb21779d02 set.c: facilitate base62 decoding with table lookup 2011-01-03 02:06:07 +03:00
Alexey Tourbin
61866ee15a set.c: minor base62 tweaks 2011-01-03 02:05:11 +03:00
Alexey Tourbin
a2f26af268 set.c: reverted Kirill's changes 2011-01-02 06:39:32 +03:00
Kirill A. Shutemov
4b05315a84 set.c: optimize decode_golomb()
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
4d9409cb9c set.c: optimize putbits()
Use bit operations instead of cycles.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
6a9ce451bc set.c: use packed bitmap for bit vector
Currently, set.c uses array of chars to store bit vector. Each char
stores one bit: 0 or 1.

Let's use packed bitmap instead. It creates room for optimizations.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
6afa2793f8 set.c: cleanup self-tests
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
82a064020b set.c: use function-like syntax for sizeof.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
d7ece41758 set.c: do not mix declarations and code
Let's move variable declarations to the begin of blocks.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:58:43 +00:00
Kirill A. Shutemov
82da3b0f49 set.c: slightly reformat code to increase its readability
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:57:43 +00:00
Kirill A. Shutemov
5c39279236 set.c: fixup self-test functions declaration
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:56:19 +00:00
Kirill A. Shutemov
4226027fbb set.c: get rid of nested functions
Nested function in C is GCC extension.
Let's try to follow ANSI C syntax.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:56:19 +00:00
Kirill A. Shutemov
75a4506d75 set.c, set.h: get rid of C++-style comments
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
2010-12-07 16:56:19 +00:00
Alexey Tourbin
fc5b5a5f7d set.c: implemented LRU caching (2x speed-up, 1M footprint) 2010-12-04 13:11:45 +03:00
Alexey Tourbin
2463074998 set.c: corrected set_free() return type 2010-10-12 03:14:10 +04:00
Alexey Tourbin
50a7eb6602 set.c: minor fixes, no changes 2010-09-19 09:56:00 +04:00
Alexey Tourbin
ca83cfb381 set.c (rpmsetcmp): set2 err fix 2010-09-13 18:54:45 +04:00
Alexey Tourbin
a6d2902271 set.c: implemented base62, golomb, and set-string routines 2010-09-11 01:34:05 +04:00