sotastream.pipelines.mtdata_pipeline module

class sotastream.pipelines.mtdata_pipeline.MTDataPipeline(data_ids: List[str], mix_weights: List[float] | None = None, langs: Tuple[str, str] = None, **kwargs)[source]

Bases: Pipeline

Pipeline to mix datasets from mtdata.

To install mtdata, run pip install mtdata, or visit https://github.com/thammegowda/mtdata. To see the list of available datasets, run mtdata list -id -l <src>-<tgt>, where <src>-<tgt> is a language pair.

Example #1:

sotastream mtdata -lp en-de Statmt-news_commentary-16-deu-eng Statmt-europarl-10-deu-eng

Example #2:

sotastream mtdata -lp en-de Statmt-news_commentary-16-deu-eng Statmt-europarl-10-deu-eng --mix-weights 1 2

Example #3:

sotastream mtdata -lp en-de Statmt-news_commentary-16-deu-eng,Statmt-europarl-10-deu-eng

Example #1 mixes the two datasets with equal weights (i.e., 1:1). Example #2 mixes the two datasets in a 1:2 ratio, respectively. Example #3 concatenates the comma-separated datasets into a single dataset, so the resulting mixture weights are proportional to the number of segments in each dataset.
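The three weighting schemes above can be illustrated with a small, self-contained sketch. It does not use sotastream or mtdata; the helper names (normalize, mix) and the toy datasets are hypothetical and only show how 1:1, 1:2, and size-proportional weights translate into sampling shares.

```python
import itertools
import random


def normalize(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def mix(sources, weights, n, seed=0):
    """Draw n segments, choosing the source of each draw according to weights."""
    rng = random.Random(seed)
    # Loop over each toy dataset indefinitely so that a source never runs dry.
    streams = [itertools.cycle(src) for src in sources]
    picks = rng.choices(range(len(sources)), weights=weights, k=n)
    return [next(streams[i]) for i in picks]


# Toy stand-ins for two datasets of different sizes (hypothetical data).
news = [f"news-{i}" for i in range(100)]
europarl = [f"europarl-{i}" for i in range(300)]

equal = normalize([1, 1])       # as in Example #1: 1:1
one_to_two = normalize([1, 2])  # as in Example #2: 1:2
# As in Example #3: after concatenation, each dataset's share follows its size.
concatenated = normalize([len(news), len(europarl)])

sample = mix([news, europarl], one_to_two, n=9000)
share_europarl = sum(s.startswith("europarl") for s in sample) / len(sample)
```

With these toy sizes, concatenation gives the larger dataset three times the share of the smaller one, whereas explicit mix weights fix the ratio regardless of dataset size.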

The --langs|-lp <src>-<tgt> argument is used to enforce compatibility between the specified datasets and to ensure the correct ordering of source and target languages.

classmethod add_cli_args(parser)[source]

Add CLI arguments to the pipeline-specific subparser. These arguments are shared across all pipelines and appear after the pipeline name in the CLI. For global args that appear before the pipeline name, see sotastream.cli.add_cli_args.

classmethod get_data_sources_default_weights()[source]

A list of floats corresponding to the number of data sources and specifying the mixture weights among them. These will be provided to the argparse subcommand as the default values for the --mix-weights argument. To get the actual instantiated values, use self.mix_weights. The function is named in an overly explicit way to avoid confusion between these two sources.

classmethod get_data_sources_for_argparse()[source]

Returns a list of (name, description) pairs, one for each data source. The list is used to instantiate the argparse subcommand with named positional arguments. These are not the actual instantiated data paths; for that, each class has … The function name is quite verbose in order to minimize confusion.

Returns:

List[Tuple]: List of (name, description)
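As a rough illustration of how two such classmethods could feed an argparse subcommand, consider the sketch below. ToyPipeline, build_parser, the subcommand name, and the returned values are all hypothetical; how sotastream itself wires these methods into its CLI is not shown in this page.

```python
import argparse
from typing import List, Tuple


class ToyPipeline:
    """Hypothetical stand-in for a pipeline class; not the real sotastream API."""

    @classmethod
    def get_data_sources_for_argparse(cls) -> List[Tuple[str, str]]:
        # (name, description) pairs, one per data source.
        return [
            ("parallel_data", "Path to parallel data"),
            ("backtrans_data", "Path to back-translated data"),
        ]

    @classmethod
    def get_data_sources_default_weights(cls) -> List[float]:
        # One default weight per data source.
        return [0.5, 0.5]


def build_parser(pipeline_cls) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="toy")
    subparsers = parser.add_subparsers(dest="pipeline", required=True)
    sub = subparsers.add_parser("toy_pipeline")
    # Each data source becomes a named positional argument.
    for name, description in pipeline_cls.get_data_sources_for_argparse():
        sub.add_argument(name, help=description)
    # The class-level defaults seed --mix-weights; the user may override them.
    sub.add_argument(
        "--mix-weights",
        type=float,
        nargs="+",
        default=pipeline_cls.get_data_sources_default_weights(),
    )
    return parser


parser = build_parser(ToyPipeline)
defaults = parser.parse_args(["toy_pipeline", "a.tsv", "b.tsv"])
custom = parser.parse_args(["toy_pipeline", "a.tsv", "b.tsv", "--mix-weights", "1", "2"])
```

In this sketch, the parsed mix_weights holds the class defaults unless the flag is given, which mirrors the distinction drawn above between default weights and the instantiated values.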

sotastream.pipelines.mtdata_pipeline.MTDataSource(dids: str | List[str], langs=None, progress_bar=False) → Iterator[Line][source]

MTData dataset iterator.

Parameters:
  • dids – either a single dataset ID or a list of dataset IDs. IDs are of the form Group-name-version-lang1-lang2, e.g. “Statmt-news_commentary-16-deu-eng”

  • langs – source-target language order, e.g. “deu-eng”

  • progress_bar – whether to show a progress bar

Returns:

Line objects
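The dataset ID format and the langs ordering described above can be illustrated without installing mtdata. The sketch below is not mtdata's own parser; parse_dataset_id and needs_swap are hypothetical helpers, and they assume that no component of the ID contains a hyphen.

```python
from typing import NamedTuple


class ParsedId(NamedTuple):
    group: str
    name: str
    version: str
    lang1: str
    lang2: str


def parse_dataset_id(did: str) -> ParsedId:
    """Split an ID of the form Group-name-version-lang1-lang2 into its parts."""
    parts = did.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated parts, got {len(parts)}: {did!r}")
    return ParsedId(*parts)


def needs_swap(did: str, langs: str) -> bool:
    """Return True if the dataset's language order is the reverse of langs."""
    parsed = parse_dataset_id(did)
    src, tgt = langs.split("-")
    if (parsed.lang1, parsed.lang2) == (src, tgt):
        return False
    if (parsed.lang1, parsed.lang2) == (tgt, src):
        return True
    raise ValueError(f"{did!r} is not compatible with language pair {langs!r}")


parsed = parse_dataset_id("Statmt-news_commentary-16-deu-eng")
same_order = needs_swap("Statmt-news_commentary-16-deu-eng", "deu-eng")
reversed_order = needs_swap("Statmt-news_commentary-16-deu-eng", "eng-deu")
```

A check of this kind is one way a langs argument could both reject incompatible datasets and decide whether the two sides of each segment need to be swapped.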