IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
There are four functions, allowing compression and decompression in
the two formats we support so far. The functions will accept bytes or
unicode strings which are treated as utf-8.
The LZ77+Huffman decompression algorithm requires an exact target
length to decompress, so this is mandatory.
The plain decompression algorithm does not need an exact length, but
you can provide one to help it know how much space to allocate. As
currently written, you can provide a short length and it will often
succeed in decompressing to a different shorter string.
These bindings are intended to make ad-hoc investigation easier, not
for production use. This is reflected in the guesses about output size
that plain_decompress() makes if you don't supply one -- either they
are stupidly wasteful or ridiculously insufficient, depending on
whether or not you were trying to decompress a 20MB string.
>>> a = '12345678'
>>> import compression
>>> b = compression.huffman_compress(a)
>>> b
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 #....
>>> len(b)
262
>>> c = compression.huffman_decompress(b, len(a))
>>> c
b'12345678' # note, c is bytes, a is str
>>> a
'12345678'
>>> d = compression.plain_compress(a)
>>> d
b'\xff\xff\xff\x0012345678'
>>> compression.plain_decompress(d) # no size specified, guesses
b'12345678'
>>> compression.plain_decompress(d,5)
b'12345'
>>> compression.plain_decompress(d,0) # 0 for auto
b'12345678'
>>> compression.plain_decompress(d,1)
b'1'
>>> compression.plain_decompress(a,444)
Traceback (most recent call last):
compression.CompressionError: unable to decompress data into a buffer of 444 bytes.
>>> compression.plain_decompress(b,444)
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 #...
That last one decompresses the Huffman compressed file with the plain
compressor; pretty much any string is valid for plain decompression.
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Jeremy Allison <jra@samba.org>
Mainly so I can go
make bin/test_lzxpress_plain && bin/test_lzxpress_plain
valgrind bin/test_lzxpress_plain
rr bin/test_lzxpress_plain
rr replay
in a tight loop.
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>
This format is described in [MS-XCA] 2.1 and 2.2, with exegesis in
many posts on the cifs-protocol list[1].
The two public functions are:
ssize_t lzxpress_huffman_decompress(const uint8_t *input,
size_t input_size,
uint8_t *output,
size_t output_size);
uint8_t *lzxpress_huffman_decompress_talloc(TALLOC_CTX *mem_ctx,
const uint8_t *input_bytes,
size_t input_size,
size_t output_size);
In both cases the caller needs to know the *exact* decompressed size,
which is essential for decompression. The _talloc version allocates
the buffer for you, and uses the talloc context to allocate a 128k
working buffer. THe non-talloc function will allocate the working
buffer on the stack.
This compression format gives better compression for messages of
several kilobytes than the "plain" LXZPRESS compression, but is
probably a bit slower to decompress and is certainly worse for very
short messages, having a fixed 256 byte overhead for the first Huffman
table.
Experiments show decompression rates between 20 and 500 MB per second,
depending on the compression ratio and data size, on an i5-1135G7 with
no compiler optimisations.
This compression format is used in AD claims and in SMB, but that
doesn't happen with this commit.
I will not try to describe LZ77 or Huffman encoding here. Don't expect
an answer in MS-XCA either; instead read the code and/or Wikipedia.
[1] Much of that starts here:
https://lists.samba.org/archive/cifs-protocol/2022-October/
but there's more earlier, particularly in June/July 2020, when
Aurélien Aptel was working on an implementation that ended up in
Wireshark.
Signed-off-by: Douglas Bagnall <douglas.bagnall@catalyst.net.nz>
Pair-programmed-with: Joseph Sutton <josephsutton@catalyst.net.nz>
Reviewed-by: Joseph Sutton <josephsutton@catalyst.net.nz>