sotastream.augmentors.augmentors module

sotastream.augmentors.augmentors.Append(lines, functor)[source]
sotastream.augmentors.augmentors.Copy(lines, from_field=1, to_field=0)[source]
sotastream.augmentors.augmentors.CopySource(lines)[source]

Copy source field to target.

sotastream.augmentors.augmentors.DataSource(path: str, processChunk: typing.Callable = <function UTF8File>, ext: str = '.gz', buffer_size: int = 1000000, seed: int = 1234, shuffle: bool = True, worker_id: int = 0, num_workers: int = 1)[source]

Creates an infinibatch data source from a directory of files that all share the extension given by ext.

Parameters:
  • path – directory containing chunks

  • processChunk – function to call on each chunk

  • ext – the file extension to glob over

  • buffer_size – how many lines infinibatch loads into memory at a time

  • seed – the random seed

  • shuffle – whether to shuffle results across shards

  • worker_id – for multiprocessing, this worker’s ID (0-based)

  • num_workers – for multiprocessing, the number of workers
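
As a rough illustration of how ext, seed, shuffle, worker_id and num_workers might interact, the sketch below globs a directory for chunk files and deals them out across workers. It is not the sotastream or infinibatch implementation: list_chunks and chunks_for_worker are hypothetical helper names, and the round-robin assignment is an assumption made for the example.

```python
import glob
import os
import random
import tempfile


def list_chunks(path: str, ext: str = ".gz") -> list:
    """Return the sorted files in `path` that end in `ext` (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(path, "*" + ext)))


def chunks_for_worker(chunks, worker_id=0, num_workers=1, seed=1234, shuffle=True):
    """Assumed scheme: shuffle with a fixed seed, then assign chunks round-robin."""
    chunks = list(chunks)
    if shuffle:
        # The same seed gives the same order on every worker, so shards never overlap.
        random.Random(seed).shuffle(chunks)
    return chunks[worker_id::num_workers]


# Demo on a throwaway directory with four empty ".gz" chunks and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.gz", "b.gz", "c.gz", "d.gz", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    all_chunks = [os.path.basename(p) for p in list_chunks(tmp)]
    shard0 = chunks_for_worker(all_chunks, worker_id=0, num_workers=2)
    shard1 = chunks_for_worker(all_chunks, worker_id=1, num_workers=2)
```

With two workers, each shard holds two of the four chunks and the two shards are disjoint, which is the property a per-worker split needs.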

sotastream.augmentors.augmentors.Identity(lines)[source]
sotastream.augmentors.augmentors.JustSourceTarget(lines)[source]

Removes all fields except 0 and 1.

class sotastream.augmentors.augmentors.Mixer(iterators, probs)[source]

Bases: object
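
The page gives no description for Mixer, so the following is only a guess at the idea suggested by its signature: draw each item from one of several iterators, chosen according to probs. SimpleMixer is a hypothetical stand-in written for this example, and its seed parameter is likewise an assumption.

```python
import itertools
import random


class SimpleMixer:
    """Hypothetical stand-in: yields from one of several iterators chosen by probability."""

    def __init__(self, iterators, probs, seed=1234):
        if len(iterators) != len(probs):
            raise ValueError("need one probability per iterator")
        self.iterators = [iter(it) for it in iterators]
        self.probs = list(probs)
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        # Pick one stream according to the weights and return its next item.
        (stream,) = self.rng.choices(self.iterators, weights=self.probs, k=1)
        return next(stream)


mixer = SimpleMixer([itertools.repeat("clean"), itertools.repeat("noisy")], [0.9, 0.1])
sample = list(itertools.islice(mixer, 1000))
```

Over many draws the share of items from each stream approaches its weight, so sample is dominated by "clean" items.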

sotastream.augmentors.augmentors.Multiply(lines, n=2)[source]

Makes n copies of the underlying object.
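
One plausible reading of this description, sketched as a generator: each incoming item is emitted n times as an independent copy. The function name multiply and the use of plain lists as records are assumptions for illustration, not the library's code.

```python
import copy


def multiply(lines, n=2):
    """Yield n independent copies of each item (assumed behavior, for illustration)."""
    for line in lines:
        for _ in range(n):
            # deepcopy so that later in-place edits to one copy do not affect the others
            yield copy.deepcopy(line)


pairs = [["hello", "hallo"], ["world", "Welt"]]
tripled = list(multiply(pairs, n=3))
```

Copying matters when later stages modify records in place; otherwise all n outputs would share the same underlying object.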

sotastream.augmentors.augmentors.SPMDecoder(lines, spm_model)[source]

Runs the SPM decoder on fields 0 and 1.

sotastream.augmentors.augmentors.SPMEncoder(lines, spm_model)[source]

Runs the SPM encoder on fields 0 and 1.

sotastream.augmentors.augmentors.Tagger(lines, tag='', fields=[0])[source]
sotastream.augmentors.augmentors.ToLower(lines, fields=[0, 1], check=None)[source]

Lowercases all specified fields. If check is set to a field ID, the fields are lowercased only if the checked field can plausibly be lowercased.

sotastream.augmentors.augmentors.ToTitle(lines, fields=[0, 1], check=None)[source]

Titlecases all specified fields. If check is set to a field ID, the fields are titlecased only if the checked field can plausibly be uppercased.

sotastream.augmentors.augmentors.ToUpper(lines, fields=[0, 1], check=None)[source]

Uppercases all specified fields. If check is set to a field ID, the fields are uppercased only if the checked field can plausibly be uppercased. This guards against cases such as Chinese source text, which has no case and would otherwise lead to random target casing during inference.
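
The effect of check can be sketched as follows. This is a simplified illustration, not the library's code: records are plain lists of strings, to_upper is a hypothetical name, and the test compares the whole checked string with its uppercased form instead of sampling characters.

```python
def to_upper(lines, fields=(0, 1), check=None):
    """Uppercase `fields` of each record unless the `check` field cannot be uppercased."""
    for record in lines:
        record = list(record)
        if check is not None and record[check].upper() == record[check]:
            # Checked field is caseless (or already uppercase): leave the record alone.
            yield record
            continue
        for i in fields:
            record[i] = record[i].upper()
        yield record


data = [["hello world", "hallo welt"], ["你好世界", "Hello world"]]
checked = list(to_upper(data, check=0))
unchecked = list(to_upper(data))
```

With check=0 the second record keeps its original target casing because its source has no case; without check, its target is uppercased even though the source carries no casing signal.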

sotastream.augmentors.augmentors.UTF8File(path: str) → Iterator[str][source]

Opens a file and returns a stream of Line objects.

sotastream.augmentors.augmentors.canBeLowercased(inputString)[source]

Checks whether the input string can plausibly be lowercased, that is, whether its lowercased version differs from the original. The check randomly samples 10 characters (with repetition if needed), which should be good enough. It is meant as a quick way to tell whether a script has casing at all, rather than whether a particular string in a cased script can be lowercased, although both cases may be caught.
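
A minimal sketch of the sampling idea described above, assuming that "can be lowercased" means at least one sampled character changes under str.lower(). The name can_be_lowercased and the sample_size and rng parameters are hypothetical and not part of the documented signature.

```python
import random


def can_be_lowercased(text, sample_size=10, rng=random):
    """Return True if any of `sample_size` randomly drawn characters changes when lowercased."""
    if not text:
        return False
    # Sampling with replacement, so strings shorter than sample_size still work.
    sampled = rng.choices(text, k=sample_size)
    return any(ch.lower() != ch for ch in sampled)


print(can_be_lowercased("HELLO"))     # True: every character has a lowercase form
print(can_be_lowercased("你好世界"))   # False: the script has no case
```

Because only a sample is inspected, a mostly lowercase string with a single capital letter may return either value from run to run, which matches the "plausibly" in the description.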

sotastream.augmentors.augmentors.canBeUppercased(inputString)[source]

Checks whether the input string can plausibly be uppercased, that is, whether its uppercased version differs from the original. The check randomly samples 10 characters (with repetition if needed), which should be good enough. It is meant as a quick way to tell whether a script has casing at all, rather than whether a particular string in a cased script can be uppercased, although both cases may be caught.

sotastream.augmentors.augmentors.enumerate_files(dir: str, ext: str)[source]