sotastream.filters.filters module

sotastream.filters.filters.BitextFilter(lines, end_range=2)

Keeps only the first end_range fields of each line and removes any fields after them.

Parameters:
  • lines – The stream of input lines

  • end_range – One more than the 0-based index of the last field to keep.
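
The truncation described above can be illustrated with a minimal sketch. It does not use the library itself: rows are modeled as plain lists of strings, and `keep_leading_fields` is a hypothetical stand-in name, so the real line objects and internals may differ.

```python
from typing import Iterable, Iterator, List


def keep_leading_fields(rows: Iterable[List[str]], end_range: int = 2) -> Iterator[List[str]]:
    """Yield each row truncated to its first `end_range` fields.

    Hypothetical stand-in for the behavior described above; rows are
    plain lists of strings rather than the library's own line objects.
    """
    for row in rows:
        # Slicing keeps fields 0 .. end_range - 1 and drops the rest.
        yield row[:end_range]


rows = [
    ["Hello", "Hallo", "doc1"],  # third field is dropped
    ["Goodbye", "Tschüss"],      # already two fields, unchanged
]
truncated = list(keep_leading_fields(rows, end_range=2))
print(truncated)
```

With the default end_range=2, only the source and target fields of a bitext line survive, which is presumably where the filter gets its name.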

sotastream.filters.filters.MatchFilter(lines, pattern='[\\=\\+\\#\\@\\^\\~\\<\\>]', fields=[0, 1], invert=False)
sotastream.filters.filters.RegexFilter(lines, pattern, fields=[0, 1], invert=False)

Removes a line if the pattern is found in one or more fields.
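
As a rough illustration of this rule, the sketch below drops a row when the pattern is found in any of the checked fields. It is not the library's code: `drop_matching_rows` is a hypothetical name, rows are plain lists of strings, and the meaning of invert (flipping the decision so that only matching rows are kept) is an assumption, since the parameter is not documented here.

```python
import re
from typing import Iterable, Iterator, List, Sequence


def drop_matching_rows(
    rows: Iterable[List[str]],
    pattern: str,
    fields: Sequence[int] = (0, 1),
    invert: bool = False,
) -> Iterator[List[str]]:
    """Drop rows where `pattern` is found in at least one checked field.

    Hypothetical stand-in, not the library implementation. ASSUMPTION:
    invert=True flips the decision and keeps only the matching rows.
    """
    regex = re.compile(pattern)
    for row in rows:
        # re.search finds the pattern anywhere in the field, not just at the start.
        found = any(regex.search(row[i]) for i in fields if i < len(row))
        if found == invert:
            yield row


pairs = [
    ["A clean sentence.", "Ein sauberer Satz."],
    ["2 + 2 = 4", "zwei plus zwei"],   # '+' and '=' in field 0
    ["Mail me", "user@example.com"],   # '@' in field 1
]
symbols = r"[=+#@^~<>]"  # same character set as the MatchFilter default pattern
kept = list(drop_matching_rows(pairs, symbols))
flipped = list(drop_matching_rows(pairs, symbols, invert=True))
print(kept)
print(flipped)
```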

sotastream.filters.filters.SkipBlanks(lines, fields=[0, 1])

Skips lines in which any of the requested fields is blank. Also zeroes out the third field, if present, to reset the docid. This matters when training document models, where a blank field can teach the model to drop or add sentences.

Parameters:
  • lines – The data stream

  • fields – Fields to check for blankness
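
The skipping step alone can be sketched as follows. This is a simplified stand-in under stated assumptions: `skip_blank_rows` is a hypothetical name, rows are plain lists of strings, a field counts as blank if it is empty or whitespace-only, and a missing field is treated as blank. The docid reset on the third field is deliberately not modeled, because its exact behavior is not specified above.

```python
from typing import Iterable, Iterator, List, Sequence


def skip_blank_rows(
    rows: Iterable[List[str]],
    fields: Sequence[int] = (0, 1),
) -> Iterator[List[str]]:
    """Yield only rows whose checked fields are all non-blank.

    Hypothetical stand-in, not the library implementation. ASSUMPTIONS:
    whitespace-only and missing fields count as blank; the docid reset
    described above is not modeled here.
    """
    for row in rows:
        if any(i >= len(row) or not row[i].strip() for i in fields):
            continue  # at least one requested field is blank
        yield row


pairs = [
    ["Hello", "Hallo"],
    ["", "Leer"],                   # blank source field
    ["Source only", "   "],         # whitespace-only target field
    ["Goodbye", "Tschüss", "doc7"],
]
kept = list(skip_blank_rows(pairs))
print(kept)
```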