sotastream.filters.filters module
- sotastream.filters.filters.BitextFilter(lines, end_range=2)[source]
Keeps only the fields before end_range and removes the rest.
- Parameters:
lines – the stream of input lines
end_range – One higher than the last 0-index field number that should be included.
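The slicing described by end_range can be illustrated with a minimal sketch. The `Line` class and the `keep_leading_fields` function below are hypothetical stand-ins written for this example, not the library's own code; the sketch assumes each record exposes its columns as a list of strings.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Line:
    """Hypothetical stand-in for a stream record: a list of string fields."""
    fields: List[str] = field(default_factory=list)


def keep_leading_fields(lines: Iterable[Line], end_range: int = 2) -> Iterator[Line]:
    """Yield each line with only fields[0:end_range] retained."""
    for line in lines:
        # end_range is exclusive: with end_range=2, fields 0 and 1 survive.
        line.fields = line.fields[:end_range]
        yield line


stream = [Line(["Hallo Welt", "Hello world", "doc1"])]
result = [line.fields for line in keep_leading_fields(stream)]
print(result)
```

With the default end_range=2, a three-field record is reduced to its source and target fields.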
- sotastream.filters.filters.MatchFilter(lines, pattern='[\\=\\+\\#\\@\\^\\~\\<\\>]', fields=[0, 1], invert=False)[source]
- sotastream.filters.filters.RegexFilter(lines, pattern, fields=[0, 1], invert=False)[source]
Removes a line if the pattern is found in one or more fields.
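A rough sketch of this behaviour, under stated assumptions: rows are plain lists of strings rather than the library's own line objects, `regex_filter_sketch` is a hypothetical name, and `invert=True` is assumed to flip the decision so that only matching lines are kept. The source does not describe invert, so treat that part as a guess.

```python
import re
from typing import Iterable, Iterator, List, Sequence


def regex_filter_sketch(
    lines: Iterable[List[str]],
    pattern: str,
    fields: Sequence[int] = (0, 1),
    invert: bool = False,
) -> Iterator[List[str]]:
    """Drop rows where the pattern occurs in at least one requested field."""
    regex = re.compile(pattern)
    for row in lines:
        # Fields beyond the end of a short row are ignored.
        found = any(regex.search(row[i]) for i in fields if i < len(row))
        # Assumption: invert flips the decision (keep only matching rows).
        if found != invert:
            continue
        yield row


rows = [["a = b", "c"], ["clean", "text"]]
kept = list(regex_filter_sketch(rows, r"[=+#@^~<>]"))
inverted = list(regex_filter_sketch(rows, r"[=+#@^~<>]", invert=True))
print(kept, inverted)
```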
- sotastream.filters.filters.SkipBlanks(lines, fields=[0, 1])[source]
Skips lines that are blank in any of the requested fields. Also zeroes out the third field if present (to reset docid). This is important for training document models, where a blank field can teach the model to drop or add sentences.
- Parameters:
lines – The data stream
fields – fields to check for blankness
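The blank check can be sketched as follows. This is an illustration only: `skip_blanks_sketch` is a hypothetical name, rows are assumed to be plain lists of strings, and "blank" is assumed to mean empty or whitespace-only. The docid reset on the third field is deliberately left out, because the description above does not say exactly which lines it applies to.

```python
from typing import Iterable, Iterator, List, Sequence


def skip_blanks_sketch(
    lines: Iterable[List[str]],
    fields: Sequence[int] = (0, 1),
) -> Iterator[List[str]]:
    """Yield only rows whose requested fields are all non-blank."""
    for row in lines:
        # Assumption: a missing field counts as blank, as does whitespace.
        if any(i >= len(row) or not row[i].strip() for i in fields):
            continue
        yield row


rows = [
    ["Guten Morgen", "Good morning", "doc1"],
    ["", "Orphan target", "doc1"],
    ["Quelle ohne Ziel", "   ", "doc1"],
]
non_blank = list(skip_blanks_sketch(rows))
print(non_blank)
```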