sotastream.utils.split module
- sotastream.utils.split.compute_md5(filepath: str)[source]
Computes the MD5 checksum of a file. Reading the file in binary mode this way is as fast as a subshell call.
- Parameters:
filepath – The file path as a string
- Returns:
The checksum as a hexdigest.
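As an illustration of the binary-read approach described above, here is a minimal sketch using only the standard library. The name `md5_of_file` and the `block_size` parameter are hypothetical and are not part of sotastream; the actual `compute_md5` may be implemented differently.

```python
import hashlib


def md5_of_file(filepath: str, block_size: int = 1 << 20) -> str:
    """Hypothetical stand-in for compute_md5: hash a file in binary blocks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as handle:
        # Read fixed-size blocks so large files never need to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    # The hexdigest is a 32-character lowercase hexadecimal string.
    return digest.hexdigest()
```

Because the digest is updated incrementally, the result does not depend on the block size chosen.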
- sotastream.utils.split.smart_open(filepath: str, mode: str = 'rt', encoding: str = 'utf-8')[source]
Convenience function for reading and writing compressed or plain text files.
- Parameters:
filepath – The file to read or write.
mode – The file mode (read, write).
encoding – The file encoding.
- Returns:
A file handle.
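One common way to build such a convenience function is to pick the opener from the file suffix. The sketch below assumes gzip compression signalled by a `.gz` extension; that assumption, and the name `open_maybe_gzip`, are illustrative only and do not describe how `smart_open` detects compression.

```python
import gzip


def open_maybe_gzip(filepath: str, mode: str = "rt", encoding: str = "utf-8"):
    """Hypothetical helper: open a plain or gzip file based on its suffix."""
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if filepath.endswith(".gz") else open
    if "b" in mode:
        # Binary modes do not accept an encoding argument.
        return opener(filepath, mode)
    if "t" not in mode:
        # gzip.open treats a bare "r"/"w" as binary, so request text explicitly.
        mode += "t"
    return opener(filepath, mode, encoding=encoding)
```

Callers can then use the same `with open_maybe_gzip(path) as handle:` pattern regardless of whether the file is compressed.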
- sotastream.utils.split.split_file_into_chunks(filepath: str, tmpdir: str = '/tmp/sotastream', split_size: int = 10000, native: bool = False, overwrite: bool = False) → Path[source]
Splits a file into compressed chunks under a directory. The location will be in a directory named by the file’s checksum, within the provided temporary directory. Results are cached, providing for quick restarting.
- Parameters:
filepath – The input file path
tmpdir – The top-level temporary directory to write to
split_size – The size of each chunk in lines
native – If True, split in Python instead of using a subshell
- Returns:
The directory where the chunks are stored, as a Path object
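The checksum-named cache directory described above can be sketched as follows. This is an assumption-laden illustration, not the library's implementation: the function name `cached_chunk_dir`, the `splitter` callback, the `.done` marker file, and the behaviour shown for `overwrite` (discard and rebuild) are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable


def cached_chunk_dir(
    filepath: str,
    tmpdir: str,
    splitter: Callable[[str, Path], None],
    overwrite: bool = False,
) -> Path:
    """Hypothetical sketch: run `splitter` once per distinct file content."""
    # Name the output directory after the content checksum, so the same
    # file always maps to the same location under tmpdir.
    checksum = hashlib.md5(Path(filepath).read_bytes()).hexdigest()
    destdir = Path(tmpdir) / checksum
    done_marker = destdir / ".done"  # hypothetical completion marker

    # Cache hit: a previous run finished, so reuse its chunks.
    if done_marker.exists() and not overwrite:
        return destdir

    # Remove partial or stale output before splitting again.
    if destdir.exists():
        shutil.rmtree(destdir)
    destdir.mkdir(parents=True)

    splitter(filepath, destdir)
    done_marker.touch()
    return destdir
```

Writing the marker only after the splitter returns means an interrupted run is never mistaken for a complete one on restart.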
- sotastream.utils.split.split_native(filepath: str, destdir: Path, split_size: int)[source]
Splits the file directly in Python by reading it. This version is slower than the subshell version.
- Parameters:
filepath – The input file path
destdir – The output directory
split_size – The size of each chunk in lines
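A pure-Python split of this kind might look like the sketch below. The function name `split_lines_to_gzip`, the `part.NNNNN.gz` naming pattern, the gzip output format, and the returned chunk count are assumptions made for illustration; they are not taken from `split_native`.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_lines_to_gzip(filepath: str, destdir: Path, split_size: int) -> int:
    """Hypothetical sketch: write consecutive line blocks to gzip chunks."""
    destdir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(filepath, "rt", encoding="utf-8") as source:
        while True:
            # Take the next split_size lines; the final chunk may be shorter.
            lines = list(islice(source, split_size))
            if not lines:
                break
            # Hypothetical naming scheme: zero-padded so chunks sort in order.
            chunk_path = destdir / f"part.{count:05d}.gz"
            with gzip.open(chunk_path, "wt", encoding="utf-8") as out:
                out.writelines(lines)
            count += 1
    return count
```

Only one chunk of lines is held in memory at a time, so memory use is bounded by `split_size` rather than by the size of the input file.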