Code available here: https://github.com/boegholm/CompressionDetectingStream
If you have a stream of GZip-compressed data, you can wrap the stream object in a decompressing GZipStream with a single line. With compressedStream being a stream of gzip-compressed data, the content stream gives access to the uncompressed content:
using Stream content = new GZipStream(compressedStream, CompressionMode.Decompress);
But how do you decide whether to wrap a stream in a decompressing stream? If you don’t know the actual content of a stream, you can inspect the file signature for magic values. If the first two bytes are 0x1F 0x8B, it is probably a gzip stream and can be decompressed using the GZipStream class (from Wikipedia’s list of file signatures).
But after you read a portion of a stream to check the file signature, you have to rewind the stream before handing it to GZipStream, and you have to repeat this process for every format you support. As not all Stream implementations are seekable, a rewind may result in: System.NotSupportedException: Stream does not support seeking.
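As a minimal sketch of the signature check described above, the helper below tests whether a buffer starts with the two gzip magic bytes. The class and method names (GzipSignature, LooksLikeGzip) are assumptions made for illustration and are not taken from the linked repository.

```csharp
using System;

// Hypothetical helper; the names are illustrative only.
public static class GzipSignature
{
    // Returns true when the buffer starts with the gzip magic bytes 0x1F 0x8B.
    // A match only suggests gzip data; it does not validate the whole stream.
    public static bool LooksLikeGzip(ReadOnlySpan<byte> header) =>
        header.Length >= 2 && header[0] == 0x1F && header[1] == 0x8B;
}
```

A byte array converts implicitly to ReadOnlySpan<byte>, so the helper can be called directly with whatever buffer was read from the start of the stream.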
Streams without seek support
It is not unusual for streams to not support seeking, for example NetworkStream, ChunkedEncodingReadStream (from System.Net.Http.HttpConnection) and the ConsoleStream family. Take another example: a tar-file processing program. Invoked as below, it processes a tar file consisting of Program.cs. With the first command the input is a plain tar file, while with the second the input stream is gzip-compressed.
tar cf - Program.cs | dotnet run
tar czf - Program.cs | dotnet run
In the example above, the input stream implementation is not seekable and cannot be rewound. You could of course add a parameter telling the program that it is reading gzipped tar instead of plain tar. This, however, is rather verbose, and in the long run you would probably want format autodetection, preferably seamless.
This is relevant for programs like the one above, especially if reopening the stream is not an option. There are different approaches, such as pipelines. This post explores another approach: yet another stream implementation that automatically determines the encoding and chooses a suitable decoder.
Implementation
We introduce two new Stream implementations. The first is a simple abstract wrapper Stream that forwards all calls to the wrapped stream through virtual implementations.
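The forwarding wrapper might look roughly like the sketch below. It is not the code from the repository: ForwardingStream and PassthroughStream are assumed names, and only the SourceStream property is mentioned in this post. Every Stream member simply delegates to SourceStream, so a subclass only has to decide which stream that is.

```csharp
using System;
using System.IO;

// Hypothetical base class: forwards every member to SourceStream.
public abstract class ForwardingStream : Stream
{
    // Subclasses decide which stream receives the forwarded calls.
    protected abstract Stream SourceStream { get; }

    public override bool CanRead => SourceStream.CanRead;
    public override bool CanSeek => SourceStream.CanSeek;
    public override bool CanWrite => SourceStream.CanWrite;
    public override long Length => SourceStream.Length;
    public override long Position
    {
        get => SourceStream.Position;
        set => SourceStream.Position = value;
    }

    public override void Flush() => SourceStream.Flush();
    public override int Read(byte[] buffer, int offset, int count) =>
        SourceStream.Read(buffer, offset, count);
    public override long Seek(long offset, SeekOrigin origin) =>
        SourceStream.Seek(offset, origin);
    public override void SetLength(long value) => SourceStream.SetLength(value);
    public override void Write(byte[] buffer, int offset, int count) =>
        SourceStream.Write(buffer, offset, count);

    protected override void Dispose(bool disposing)
    {
        if (disposing) SourceStream.Dispose(); // the wrapper owns the wrapped stream
        base.Dispose(disposing);
    }
}

// Smallest possible subclass, included only to show how the base is used.
public sealed class PassthroughStream : ForwardingStream
{
    public PassthroughStream(Stream source) => SourceStream = source;
    protected override Stream SourceStream { get; }
}
```

Because the members stay virtual, a later wrapper can override just Read and inherit the rest.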
This saves us quite some code later. Next, we define another stream wrapper, allowing us to prefix any stream with a byte buffer.
Here, we only care about the implementation of the Read method, as this is our primary concern. In the end, we probably want to properly implement stream position.
In this implementation, each read copies bytes from the prefix and then replaces the prefix with a slice of its unread remainder, until the prefix has been fully read. Once the prefix is exhausted, we read from the wrapped stream.
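The read logic just described could be sketched as follows. PrefixedStream is an assumed name, and to keep the block self-contained this version derives directly from Stream and is read-only; it is an illustration, not the repository's implementation.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper: serves the prefix first, then the wrapped stream.
public sealed class PrefixedStream : Stream
{
    private readonly Stream _inner;
    private ReadOnlyMemory<byte> _prefix;

    public PrefixedStream(byte[] prefix, Stream inner)
    {
        _prefix = prefix;
        _inner = inner;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Prefix exhausted: read straight from the wrapped stream.
        if (_prefix.IsEmpty) return _inner.Read(buffer, offset, count);

        // Copy as much of the prefix as fits, then keep only the unread slice.
        // Returning fewer bytes than requested is allowed by the Stream contract.
        int n = Math.Min(count, _prefix.Length);
        _prefix.Span.Slice(0, n).CopyTo(buffer.AsSpan(offset, n));
        _prefix = _prefix.Slice(n);
        return n;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

A reader of this stream sees the prefix bytes followed by the remaining content, as if the bytes consumed for inspection had never been read.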
Then, the actual MagicDetectingStream is yet another implementation of the abstract forwarding Stream.
Here we override the SourceStream getter so that, upon first get, it reads 512 bytes from the underlying stream. Using this buffer, we analyze the stream signature and construct the correct decoding stream implementation.
This could be a GZ stream-implementation; this is done using the factory pattern:
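To illustrate the idea, here is a simplified sketch of the two steps: filling a header buffer and picking a decoder from it. The names MagicDetection, ReadHeader and CreateDecoder are assumptions, not the repository's API. The replay parameter is assumed to be a stream that yields the inspected header bytes again, followed by the rest of the data, such as the prefixing wrapper described earlier.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper; names are illustrative and not taken from the repository.
public static class MagicDetection
{
    // Reads up to `size` bytes. A single Read call may return fewer bytes than
    // requested, so we loop until the buffer is full or the stream ends.
    public static byte[] ReadHeader(Stream source, int size = 512)
    {
        var buffer = new byte[size];
        int total = 0;
        while (total < size)
        {
            int read = source.Read(buffer, total, size - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        Array.Resize(ref buffer, total); // trim when the stream was shorter than `size`
        return buffer;
    }

    // Factory: chooses a decoder from the header bytes. `replay` must deliver the
    // header bytes again, followed by the remaining data.
    public static Stream CreateDecoder(byte[] header, Stream replay)
    {
        bool isGzip = header.Length >= 2 && header[0] == 0x1F && header[1] == 0x8B;
        // Further formats would add their own signature checks here.
        return isGzip ? new GZipStream(replay, CompressionMode.Decompress) : replay;
    }
}
```

Unknown signatures fall through to the replay stream unchanged, which gives the passthrough behavior for everything that is not recognized.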
In my implementation above, I only handle GZ-files, and everything else is passthrough, although Tar files get a special mention here. You guessed right, I’m also decoding tar files, although this is a story for another day. I separated the helper methods into an abstract class:
There are of course plenty of approaches to solving this problem; this example is just one. And please remember, this is just sample code.
Thanks for reading.