Ylilauta VRT files

Path langsim.modules.input.ylilauta
Executable yes

Input reader for the Ylilauta corpus.

Based on the standard VRT text collection module, with a small amount of special processing added for Ylilauta.

See also

pimlico.modules.input.text_annotations.vrt_text:
Reading text from VRT files.

This is an input module. It takes no pipeline inputs and is used to read data into the pipeline.

Inputs

No inputs

Outputs

Name Type(s)
corpus YlilautaOutputType

Options

Name: files (required)
Type: comma-separated list of (line range-limited) file paths
Description: Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional. You can specify a line range for a file by adding ‘:X-Y’ to the end of its path, where X is the first line and Y the last to be included. Either X or Y may be left empty. Line numbers are 1-indexed.

Name: exclude
Type: comma-separated list of strings
Description: A list of files to exclude. Specified in the same way as files, except without line ranges. This allows you to specify a glob in files and then exclude individual files from it. Globs may be used here too.

Name: encoding_errors
Type: string
Description: What to do when invalid characters are encountered while decoding the input (e.g. illegal utf-8 chars). One of ‘strict’ (default), ‘ignore’ or ‘replace’. See Python’s str.decode() for details.

Name: encoding
Type: string
Description: Encoding to assume for input files. Default: utf8.
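
The syntax of the files option (an optional ‘?’ prefix and a trailing ‘:X-Y’ line range) can be illustrated with a small parser. This is a minimal sketch written for this page, not the module's actual implementation: the names FileSpec, parse_file_spec and parse_files_option are hypothetical, the example paths are invented, and glob expansion is left out.

```python
import re
from typing import List, NamedTuple, Optional


class FileSpec(NamedTuple):
    """One entry of the 'files' option (hypothetical helper type)."""
    path: str                  # path or glob, without '?' prefix or line range
    optional: bool             # True if the entry started with '?'
    first_line: Optional[int]  # 1-indexed, inclusive; None means from the start
    last_line: Optional[int]   # 1-indexed, inclusive; None means to the end


# Matches a trailing ':X-Y' where X and/or Y may be left empty
_RANGE_RE = re.compile(r":(\d*)-(\d*)$")


def parse_file_spec(spec: str) -> FileSpec:
    """Split a single entry into path, optional flag and line range."""
    spec = spec.strip()
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]

    first = last = None
    match = _RANGE_RE.search(spec)
    if match:
        # Remove the range suffix from the path and convert the bounds
        spec = spec[:match.start()]
        first = int(match.group(1)) if match.group(1) else None
        last = int(match.group(2)) if match.group(2) else None
    return FileSpec(spec, optional, first, last)


def parse_files_option(value: str) -> List[FileSpec]:
    """Parse the whole comma-separated option value."""
    return [parse_file_spec(part) for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    example = "/data/ylilauta/*.vrt, ?/data/extra.vrt:10-200, /data/big.vrt:1000-"
    for entry in parse_files_option(example):
        print(entry)
```

In this sketch a missing bound is represented as None, so ‘:1000-’ means "from line 1000 to the end of the file" and ‘:-50’ means "the first 50 lines".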