sotastream.utils.split module

sotastream.utils.split.compute_md5(filepath: str)[source]

Computes the MD5 checksum of a file. Reading the file in binary mode this way is as fast as a subshell call.

Parameters:

filepath – The file path as a string

Returns:

The checksum as a hexdigest.
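
As a rough illustration of the binary-read approach described above, a checksum can be computed by feeding fixed-size blocks to hashlib. This is a minimal sketch, not the module's actual implementation; the name md5_of_file and the block_size parameter are hypothetical.

```python
import hashlib


def md5_of_file(filepath: str, block_size: int = 1 << 20) -> str:
    """Hypothetical stand-in for compute_md5: hash a file block by block."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file,
        # so the whole file never has to fit in memory.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The result is a 32-character hexadecimal string, which makes it convenient as a directory name.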

sotastream.utils.split.smart_open(filepath: str, mode: str = 'rt', encoding: str = 'utf-8')[source]

Convenience function for reading and writing compressed or plain text files.

Parameters:
  • filepath – The path of the file to read or write.

  • mode – The file mode for reading or writing (default: 'rt').

  • encoding – The file encoding.

Returns:

A file handle.
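
One common way to build such a helper is to dispatch on the file extension. The sketch below assumes that compression is detected from a .gz suffix, which the documentation above does not confirm; open_maybe_gzip is a hypothetical name, not part of sotastream.

```python
import gzip


def open_maybe_gzip(filepath: str, mode: str = "rt", encoding: str = "utf-8"):
    """Hypothetical sketch: pick gzip.open or open based on a .gz suffix.

    Assumes a text mode such as 'rt' or 'wt'; both gzip.open and open
    reject an encoding argument in binary mode.
    """
    if filepath.endswith(".gz"):
        return gzip.open(filepath, mode, encoding=encoding)
    return open(filepath, mode, encoding=encoding)
```

Either branch returns an object that works as a context manager, so callers can use a with statement regardless of whether the file is compressed.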

sotastream.utils.split.split_file_into_chunks(filepath: str, tmpdir: str = '/tmp/sotastream', split_size: int = 10000, native: bool = False, overwrite: bool = False) → Path[source]

Splits a file into compressed chunks under a directory. The chunks are written to a directory named after the file’s checksum, inside the provided temporary directory. Results are cached, so a restart can reuse existing chunks instead of splitting the file again.

Parameters:
  • filepath – The input file path

  • tmpdir – The top-level temporary directory to write to

  • split_size – The size of each chunk in lines

  • native – If True, use Python to split, instead of a subshell

Returns:

The directory where the chunks are stored, as a Path object
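
The checksum-named directory and the caching behaviour can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function cached_split_dir, the splitter callback, and the .done marker file are all hypothetical, and the meaning given to overwrite (force a re-split) is a guess, since the parameter is undocumented above.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_split_dir(
    filepath: str,
    tmpdir: str,
    splitter: Callable[[str, Path], None],
    overwrite: bool = False,
) -> Path:
    """Hypothetical sketch of checksum-keyed caching around a split step."""
    # Name the output directory after the content hash, so the same input
    # always maps to the same location. (Reads the whole file for brevity.)
    checksum = hashlib.md5(Path(filepath).read_bytes()).hexdigest()
    destdir = Path(tmpdir) / checksum
    done_marker = destdir / ".done"  # assumed completion marker

    # Cache hit: a previous run finished, so skip the expensive split.
    if done_marker.exists() and not overwrite:
        return destdir

    destdir.mkdir(parents=True, exist_ok=True)
    splitter(filepath, destdir)
    done_marker.touch()  # written last, so partial runs are not trusted
    return destdir
```

Writing the marker only after the split completes means an interrupted run is redone rather than silently reused.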

sotastream.utils.split.split_native(filepath: str, destdir: Path, split_size: int)[source]

Split directly in Python by reading the file. This version is slower than the subshell version.

Parameters:
  • filepath – The input file path

  • destdir – The output directory

  • split_size – The size of each chunk in lines
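
A pure-Python split by line count might look like the sketch below. The function name split_lines_native, the part.NNNNN.gz naming scheme, and the returned chunk count are illustrative assumptions rather than the behaviour of split_native itself.

```python
import gzip
from itertools import islice
from pathlib import Path


def split_lines_native(filepath: str, destdir: Path, split_size: int) -> int:
    """Hypothetical sketch: write gzip chunks of at most split_size lines."""
    destdir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(filepath, "rt", encoding="utf-8") as src:
        while True:
            # islice pulls the next split_size lines without loading the file.
            lines = list(islice(src, split_size))
            if not lines:
                break
            chunk_path = destdir / f"part.{count:05d}.gz"  # assumed naming
            with gzip.open(chunk_path, "wt", encoding="utf-8") as out:
                out.writelines(lines)
            count += 1
    return count  # number of chunks written
```

The final chunk holds whatever lines remain, so it may be shorter than split_size.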

sotastream.utils.split.split_subshell(filepath: str, destdir: Path, split_size: int)[source]

Split using a subshell (~8x faster than split_native).

Parameters:
  • filepath – The input file path

  • destdir – The output directory

  • split_size – The size of each chunk in lines
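
For comparison, the external-tool approach can be approximated by delegating to the standard split and gzip commands. This sketch assumes both tools are on the PATH and calls them through subprocess.run with argument lists rather than a literal shell; the exact command that split_subshell runs is not shown in this documentation, and split_with_coreutils is a hypothetical name.

```python
import subprocess
from pathlib import Path


def split_with_coreutils(filepath: str, destdir: Path, split_size: int) -> None:
    """Hypothetical sketch: split by lines with `split`, then gzip each chunk."""
    destdir.mkdir(parents=True, exist_ok=True)
    prefix = str(destdir / "part.")
    # `split -l N FILE PREFIX` writes PREFIXaa, PREFIXab, ... (POSIX options).
    subprocess.run(["split", "-l", str(split_size), filepath, prefix], check=True)
    # Materialize the listing first, since gzip renames files as it goes.
    for chunk in sorted(destdir.glob("part.*")):
        if chunk.suffix != ".gz":  # skip chunks compressed by an earlier run
            subprocess.run(["gzip", "-f", str(chunk)], check=True)
```

Passing check=True makes a failing external command raise CalledProcessError instead of leaving a partial result unnoticed.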