1
0
mirror of https://github.com/samba-team/samba.git synced 2025-01-21 18:04:06 +03:00

89 Commits

Author SHA1 Message Date
Jo Sutton
72ac0ec850 lib:compression: Update my name
Signed-off-by: Jo Sutton <josutton@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2024-02-16 02:41:36 +00:00
Andreas Schneider
9621a3d7a6 Use python.h from libreplace
BUG: https://bugzilla.samba.org/show_bug.cgi?id=15513

Signed-off-by: Andreas Schneider <asn@samba.org>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2023-11-20 15:37:33 +00:00
Joseph Sutton
03ca8c25d0 lib:compression: Correctly fix sign extension of long matches (CID 1517275)
Commit 6b4d94c9877ec59081b9da946c00fa2647cad928 was a previous attempt
to fix this issue.

Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2023-10-13 02:18:30 +00:00
Joseph Sutton
e961783add lib:compression: Fix building with FORTIFY_SOURCE=2
Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2023-10-01 22:45:38 +00:00
Joseph Sutton
1c35195ff7 lib:compression: Fix code spelling
Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2023-09-11 02:42:41 +00:00
Andreas Schneider
3d409c16ee lib:compression: Fix code spelling
Best reviewed with: `git show --word-diff`.

Signed-off-by: Andreas Schneider <asn@samba.org>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2023-04-03 03:56:35 +00:00
Andrew Bartlett
e37f20fb36 lib/compression: Fix documentation of lzxpress_huffman_compress()
The "inconvenience function" takes one type, and converts it to another
but the documentation was not updated.

Signed-off-by: Andrew Bartlett <abartlet@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
2023-03-31 01:48:30 +00:00
Andrew Bartlett
0ab5552c8c lib/compression: Add helper function lzxpress_huffman_max_compressed_size()
This allows the calculation of the worst case to be shared with callers.

Signed-off-by: Andrew Bartlett <abartlet@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
2023-03-31 01:48:30 +00:00
Joseph Sutton
ae6e76c082 lib/compression: Fix length check
Put the division on the correct side of the inequality.

Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2023-01-10 20:22:32 +00:00
Douglas Bagnall
41249302a3 lib/compression: add simple python bindings
There are four functions, allowing compression and decompression in
the two formats we support so far. The functions will accept bytes or
unicode strings which are treated as utf-8.

The LZ77+Huffman decompression algorithm requires an exact target
length to decompress, so this is mandatory.

The plain decompression algorithm does not need an exact length, but
you can provide one to help it know how much space to allocate. As
currently written, you can provide a short length and it will often
succeed in decompressing to a different shorter string.

These bindings are intended to make ad-hoc investigation easier, not
for production use. This is reflected in the guesses about output size
that plain_decompress() makes if you don't supply one -- either they
are stupidly wasteful or ridiculously insufficient, depending on
whether or not you were trying to decompress a 20MB string.

>>> a = '12345678'
>>> import compression
>>> b = compression.huffman_compress(a)
>>> b
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00  #....
>>> len(b)
262
>>> c = compression.huffman_decompress(b, len(a))
>>> c
b'12345678'                                   # note, c is bytes, a is str
>>> a
'12345678'
>>> d = compression.plain_compress(a)
>>> d
b'\xff\xff\xff\x0012345678'
>>> compression.plain_decompress(d)           # no size specified, guesses
b'12345678'
>>> compression.plain_decompress(d,5)
b'12345'
>>> compression.plain_decompress(d,0)         # 0 for auto
b'12345678'
>>> compression.plain_decompress(d,1)
b'1'
>>> compression.plain_decompress(a,444)
Traceback (most recent call last):
   compression.CompressionError: unable to decompress data into a buffer of 444 bytes.
>>> compression.plain_decompress(b,444)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 #...

That last one decompresses the Huffman compressed file with the plain
compressor; pretty much any string is valid for plain decompression.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-22 19:50:33 +00:00
Douglas Bagnall
44a44005a6 compression/huffman: debug function bails upon disaster (CID 1517261)
We shouldn't get a node with a zero code, and there's probably nothing
to do but stop.

   CID 1517261 (#1-2 of 2): Bad bit shift operation
   (BAD_SHIFT)11. negative_shift: In expression j >> offset - k,
   shifting by a negative amount has undefined behavior. The shift
   amount, offset - k, is -3.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>

Autobuild-User(master): Jeremy Allison <jra@samba.org>
Autobuild-Date(master): Mon Dec 19 23:29:04 UTC 2022 on sn-devel-184
2022-12-19 23:29:04 +00:00
Douglas Bagnall
628f14c149 compression/huffman: double check distance in matches (CID 1517278)
Because we just wrote the intermediate representation to have no zero
distances, we can be sure it doesn't, but Coverity doesn't know. If
distance is zero, `bitlen_nonzero_16(distance)` would be bad.

   CID 1517278 (#1 of 1): Bad bit shift operation
   (BAD_SHIFT)41. large_shift: In expression 1 << code_dist, left
   shifting by more than 31 bits has undefined behavior. The shift
   amount, code_dist, is 65535.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
6b4d94c987 compression: fix sign extension of long matches (CID 1517275)
Very long matches would be written instead as very very long matches.

We can't in fact hit this because we have a MAX_MATCH_LENGTH defined
as 64M, but if we could, it might make certain 2GB+ strings impossible
to compress.

  CID 1517275 (#1 of 1): Unintended sign extension
  (SIGN_EXTENSION)sign_extension: Suspicious implicit sign extension:
  intermediate[i + 2UL] with type uint16_t (16 bits, unsigned) is
  promoted in intermediate[i + 2UL] << 16 to type int (32 bits, signed),
  then sign-extended to type unsigned long (64 bits, unsigned). If
  intermediate[i + 2UL] << 16 is greater than 0x7FFFFFFF, the upper bits
  of the result will all be 1.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
baba440ffa compression tests: avoid div by zero in failure (CID 1517297)
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
b99e0e9301 compression/tests: calm the static analysts (CID: numerous)
None of our test vectors are 18446744073709551615 bytes long, which
means we can know an `expected_length == returned_length` check will
catch the case where the compression function returns -1 for error. We
know that, but Coverity doesn't.

It's the same thing over and over again, in two different patterns:

>>>     CID 1517301:  Memory - corruptions  (OVERRUN)
>>>     Calling "memcmp" with "original.data" and "original.length" is
suspicious because of the very large index, 18446744073709551615. The index
may be due to a negative parameter being interpreted as unsigned.
393     	if (original.length != decomp_written ||
394     	    memcmp(decompressed.data,
395     		   original.data,
396     		   original.length) != 0) {
397     		debug_message("\033[1;31mgot %zd, expected %zu\033[0m\n",
398     			      decomp_written,

*** CID 1517299:  Memory - corruptions  (OVERRUN)
/lib/compression/tests/test_lzxpress_plain.c: 296 in
test_lzxpress_plain_decompress_more_compressed_files()
290     		debug_start_timer();
291     		written = lzxpress_decompress(p.compressed.data,
292     					      p.compressed.length,
293     					      dest,
294     					      p.decompressed.length);
295     		debug_end_timer("decompress", p.decompressed.length);
>>>     CID 1517299:  Memory - corruptions  (OVERRUN)
>>>     Calling "memcmp" with "p.decompressed.data" and
"p.decompressed.length" is suspicious because of the very large index,
18446744073709551615. The index may be due to a negative parameter being
interpreted as unsigned.
296     		if (written == p.decompressed.length &&
297     		    memcmp(dest, p.decompressed.data, p.decompressed.length)
== 0) {
298     			debug_message("\033[1;32mdecompressed %s!

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
d6a67908e1 compression/huffman: check again for invalid codes (CID 1517302)
We know that code is non-zero, because it comes from the combination of
the intermediate representation and the symbol tables that were generated
at the same time. But Coverity doesn't know that, and it thinks we could
be doing undefined things in the subsequent shift.

    CID 1517302:  Integer handling issues  (BAD_SHIFT)
    In expression "1 << code_bit_len", shifting by a negative amount has

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
27af27f901 compression/huffman: tighten bit_len checks (fix SUSE -O3 build)
The struct write_context bit_len attribute is always between 0 and 31,
but if the next patches are applied without this, SUSE GCC -O3 will
worry thusly:

 ../../lib/compression/lzxpress_huffman.c: In function
  ‘lzxpress_huffman_compress’:
 ../../lib/compression/lzxpress_huffman.c:953:5: error: assuming signed
  overflow does not occur when simplifying conditional to constant
  [-Werror=strict-overflow]
   if (wc->bit_len > 16) {
         ^
         cc1: all warnings being treated as errors

Inspection tell us that the invariant holds. Nevertheless, we can
safely use an unsigned type and insist that over- or under- flow is
bad.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Douglas Bagnall
6f77b376d4 compression/huffman: avoid semi-defined behaviour in decompress
We had

               output[output_pos - distance];

where output_pos and distance are size_t and distance can be greater
than output_pos (because it refers to a place in the previous block).

The underflow is defined, leading to a big number, and when
sizeof(size_t) == sizeof(*uint8_t) the subsequent overflow works as
expected. But if size_t is smaller than a pointer, bad things will
happen.

This was found by OSSFuzz with
'UBSAN_OPTIONS=print_stacktrace=1:silence_unsigned_overflow=1'.

Credit to OSSFuzz.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-12-19 22:32:35 +00:00
Anoop C S
0c2146eb00 lib/compression: Include missing stat header file
<sys/stat.h> was missing from compression library tests which resulted
in the following compile time error:

../../lib/compression/tests/test_lzx_huffman.c: In function
                                                   ‘datablob_from_file’:
../../lib/compression/tests/test_lzx_huffman.c:383:21: error:
                                         storage size of ‘s’ isn’t known
  383 |         struct stat s;
      |                     ^
../../lib/compression/tests/test_lzx_huffman.c:389:15: warning:
    implicit declaration of function ‘fstat’ [-Wimplicit-function-declaration]
  389 |         ret = fstat(fileno(fh), &s);
      |               ^~~~~

Signed-off-by: Anoop C S <anoopcs@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Volker Lendecke <vl@samba.org>

Autobuild-User(master): Volker Lendecke <vl@samba.org>
Autobuild-Date(master): Tue Dec  6 11:39:16 UTC 2022 on sn-devel-184
2022-12-06 11:39:16 +00:00
Andreas Schneider
a451fa5ef9 lib:compression: Initialize variables
lib/compression/tests/test_lzx_huffman.c: In function ‘test_lzxpress_huffman_overlong_matches’:
lib/compression/tests/test_lzx_huffman.c:1013:35: error: ‘j’ may be used uninitialized [-Werror=maybe-uninitialized]
 1013 |         assert_int_equal(score, i * j);
      |                                   ^
lib/compression/tests/test_lzx_huffman.c:979:19: note: ‘j’ was declared here
  979 |         size_t i, j;
      |                   ^
lib/compression/tests/test_lzx_huffman.c: In function ‘test_lzxpress_huffman_overlong_matches_abc’:
lib/compression/tests/test_lzx_huffman.c:1059:39: error: ‘k’ may be used uninitialized [-Werror=maybe-uninitialized]
 1059 |         assert_int_equal(score, i * j * k);
      |                                       ^
lib/compression/tests/test_lzx_huffman.c:1020:22: note: ‘k’ was declared here
 1020 |         size_t i, j, k;
      |                      ^
lib/compression/tests/test_lzx_huffman.c:1059:35: error: ‘j’ may be used uninitialized [-Werror=maybe-uninitialized]
 1059 |         assert_int_equal(score, i * j * k);
      |                                   ^
lib/compression/tests/test_lzx_huffman.c:1020:19: note: ‘j’ was declared here
 1020 |         size_t i, j, k;
      |                   ^

Signed-off-by: Andreas Schneider <asn@samba.org>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>

Autobuild-User(master): Andreas Schneider <asn@cryptomilk.org>
Autobuild-Date(master): Sun Dec  4 09:12:30 UTC 2022 on sn-devel-184
2022-12-04 09:12:30 +00:00
Douglas Bagnall
d9c192546f lib/compression/lzxpress: fix our slow compression
This uses the same hash table method as lzxpress_huffman, though the
code can't be directly reused as the sizes of the offsets is
different, and there is not a block processing step here.

This will worsen the compression ratio compared to the exhaustive
search we previously used, though we still perform better than
Windows. To put numbers on it, the test files used to compress to 0.91
of Windows' compression size, and now they compress to 0.96.

On the other hand this is many orders of magnitude faster. It is
difficult to say exactly how much faster -- while the testsuite time
has only improved 200-fold (from 7 minutes to 2 seconds), most of the
remaining 2 seconds is used in data generation and management, not
compression. OSSFuzz consistently finds new vectors that time out
after a minute; on these we'll see nearly an order of magnitude of
orders of magnitude inprovement.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>

Autobuild-User(master): Joseph Sutton <jsutton@samba.org>
Autobuild-Date(master): Fri Dec  2 00:00:04 UTC 2022 on sn-devel-184
2022-12-02 00:00:04 +00:00
Douglas Bagnall
caa643e36e lib/compression/lzxpress: shift encoding into helper functions
This makes it easier to rework the encoding decision to depend on a
hash table match rather than the current exhaustive search.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:40 +00:00
Douglas Bagnall
fb35cf29a4 lib/compression/lzxpress compression: use a write context struct
This will make it possible to move encoding operations into helper
functions, which will make it easier to restructure the code to use a
hash table for faster matching.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:40 +00:00
Douglas Bagnall
e4066b2be6 lib/compression: more tests for lzxpress plain compression
These are based on (i.e. copied and pasted from) the LZ77 + Huffman
tests.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:40 +00:00
Douglas Bagnall
ce7ea07d07 testdata: move compression examples to re-use with lzxpress plain
Everything that is in testdata/compression/lzxpress-huffman/ can also
be used for lzxpress plain tests, which is something we really need.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:40 +00:00
Douglas Bagnall
9589f5282b lib/compression/lzx-plain: relax size requirements on long file
We are going to change from a slow exact match algorithm to a fast
heuristic search that will not always get the same results as the
exhaustive search.

To be precise, a million zeros will compress to 112 rather than 93 bytes.

We don't insist on an exact size, because that is not an issue here.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
c2db7fda4e lib/comression: convert test_lzxpress_plain to cmocka
Mainly so I can go

 make bin/test_lzxpress_plain && bin/test_lzxpress_plain
 valgrind bin/test_lzxpress_plain
 rr bin/test_lzxpress_plain
 rr replay

in a tight loop.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
e5f9deed0d lib/compression: add test scripts README
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
1a3d8da731 lib/compression: test util to generate fuzzing seeds
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
6a7c0ca23c lib/compression: Windows utility to generate test vectors
If compiled on Windows using Cygwin, MSYS2, or similar, this will output
compressed versions of files exactly as specified by MZ-XCA, if the
following conditions are met:

1. The file > 300 bytes.
2. The compressed file is smaller than the decompressed file.

Otherwise it returns the data unchanged. Without warning; that's just
how the API works.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
7804570a37 lib/compression: script to test 3 byte hash
Compression uses a 3 byte hash remember LZ77 matches in a 14-bit table.
This script runs the hash over all 16M combinations, then again over
all ASCII combinations, counting collisions to find hot-spots.

If you think you have a better hash, you are probably right, but you
should try it here -- alter h() -- before committing to it. This one is
literally the first one I thought of.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
dadecede54 lib/compression: helper script to make unbalanced data
Huffman tree re-quantisation and perhaps other code paths are only
triggered by pathological data like this.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
bce33816ec lib/compression: add a debug script to describe headers
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
e795985067 lib/compression/tests: add lzhuffman timer functions
With LZXHUFF_DEBUG_VERBOSE set, we measure the compression and
decompression rate relative to the decompressed size.

On reasonably long strings on my laptop, compiled with -O0, it turns
out to between 20 and 500 MB/s, both ways, depending on the complexity
of the string. Very short strings are of course dominated by overhead.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
77048aaa61 lib/compression: debug routines for lzxpress-huffman
If you need to see a Huffman tree (and sometimes you do), set
DEBUG_HUFFMAN_TREE to true at the top of lzxpress_huffman.c, and run:

  make bin/test_lzx_huffman && bin/test_lzx_huffman

Actually, that will show you hundreds of trees, and you'll be glad of
that if you are ever trying to understand this.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
955214ef6e lib/compression/lzhuff: add debug flag to skip LZ77
Encoding without LZ77 matches is valid, and it is useful for isolating
bugs.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
d4e3f0c88e lib/compression: LZ77 + Huffman compression
This compresses files as described in MS-XCA 2.2, and as decompressed
by the decompressor in the previous commit.

As with the decompressor, there are two public functions -- one that
uses a talloc context, and one that uses pre-allocated memory. The
compressor requires a tightly bound amount of auxillary memory
(>220kB) in a few different buffers, which is all gathered together in
the public struct lzxhuff_compressor_mem. An instantiated but not
initialised copy of this struct is required by the non-talloc
function; it can be used over and over again.

Our compression speed is about the same as the decompression speed
(between 20 and 500 MB/s on this laptop, depending on the data), and
our compression ratio is very similar to that of Windows.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
f86035c65b lib/compression: add LZ77 + Huffman decompression
This format is described in [MS-XCA] 2.1 and 2.2, with exegesis in
many posts on the cifs-protocol list[1].

The two public functions are:

ssize_t lzxpress_huffman_decompress(const uint8_t *input,
				    size_t input_size,
				    uint8_t *output,
				    size_t output_size);

uint8_t *lzxpress_huffman_decompress_talloc(TALLOC_CTX *mem_ctx,
					    const uint8_t *input_bytes,
					    size_t input_size,
					    size_t output_size);

In both cases the caller needs to know the *exact* decompressed size,
which is essential for decompression. The _talloc version allocates
the buffer for you, and uses the talloc context to allocate a 128k
working buffer. THe non-talloc function will allocate the working
buffer on the stack.

This compression format gives better compression for messages of
several kilobytes than the "plain" LXZPRESS compression, but is
probably a bit slower to decompress and is certainly worse for very
short messages, having a fixed 256 byte overhead for the first Huffman
table.

Experiments show decompression rates between 20 and 500 MB per second,
depending on the compression ratio and data size, on an i5-1135G7 with
no compiler optimisations.

This compression format is used in AD claims and in SMB, but that
doesn't happen with this commit.

I will not try to describe LZ77 or Huffman encoding here. Don't expect
an answer in MS-XCA either; instead read the code and/or Wikipedia.

[1] Much of that starts here:

https://lists.samba.org/archive/cifs-protocol/2022-October/

but there's more earlier, particularly in June/July 2020, when
Aurélien Aptel was working on an implementation that ended up in
Wireshark.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Pair-programmed-with: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Douglas Bagnall
f6cda06dfb lib/compression: move lzxpress_plain test into tests/
We are going to add more tests for lib/compression, and they can't all
be called "testsuite.c".

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
2022-12-01 22:56:39 +00:00
Volker Lendecke
605d646935 lib: Fix the 32-bit build
Signed-off-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Jeremy Allison <jra@samba.org>
2022-07-23 23:29:38 +00:00
Douglas Bagnall
637e7cbdba lzxpress: compress shortcut if we've reached maximum length
A simple degenerate case for our compressor has been a large number of
repeated bytes that will match the maximum length (~64k) at all 8192
search positions, 8191 of which searches are in vain because the
matches are not of greater length than the first one.

Here we recognise the inevitable and reduce runtime proportionately.

Credit to OSS-Fuzz.

REF: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=47428

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>

Autobuild-User(master): Douglas Bagnall <dbagnall@samba.org>
Autobuild-Date(master): Tue May 17 23:11:21 UTC 2022 on sn-devel-184
2022-05-17 23:11:21 +00:00
Douglas Bagnall
04309bc682 lzxpress/test: time performance of long boring sequences
We get *very* slow when long runs of the bytes are the same. On this
laptop the test takes 18s; with the next commit it will be 0.006s.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-17 22:13:35 +00:00
Douglas Bagnall
505d2879fa compression:tests: align test names with functions
You'll thank me if you're ever debugging these and wondering why
'lzxpress4' calls 'lzxpress2' (or is it the other way round?).

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Douglas Bagnall
05c760165b compression: add a few comments, including MS-XCA pointers.
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Douglas Bagnall
383a7cfed9 compression: remove always false constant comparison
We set `uncompressed_pos = 0;` unconditionally, just ~10 lines up.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Douglas Bagnall
e36cb10b16 compression: lzxpress decompress empty string as empty string
This mirrors the behaviour of lzxpress_compress, which "encodes" an
empty string as an empty string.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Douglas Bagnall
1ca4449294 compression: fix lzxpress decompress with trailing flags
Every so often, lzxpress adds a 32-bit block of indicator flags to
help decode the next clump of 32 code words. A naive compressor (such
as we have) might do this at the very end for flags that aren't
actually used because there are no more bytes to decompress. If that
happens we need to stop processing, or we'll come to worse outcome at
the next CHECK_INPUT_BYTES.

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Douglas Bagnall
d8a90d2a8f compression:tests: test lzxpress in some edge cases
Empty strings and trailing flag blocks.

(found with Honggfuzz and a round-trip fuzzer that aborts if the
strings differ).

Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
2022-05-12 02:22:35 +00:00
Joseph Sutton
075df819cc compression: Move maximum length calculation out of inner loop
This makes the code clearer.

Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
2022-05-12 02:22:35 +00:00
Joseph Sutton
877f007f32 compression: Use correct values for max len and offset
Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
2022-05-12 02:22:35 +00:00