Ylilauta VRT files
Path | langsim.modules.input.ylilauta |
Executable | yes |
Input reader for the Ylilauta corpus.
Based on the standard VRT text collection module, with a small amount of special processing added for Ylilauta.
See also
pimlico.modules.input.text_annotations.vrt_text: Reading text from VRT files.
This is an input module. It takes no pipeline inputs and is used to read in data.
Inputs
No inputs
Outputs
Name | Type(s) |
---|---|
corpus | YlilautaOutputType |
Options
Name | Description | Type |
---|---|---|
files | (required) Comma-separated list of absolute paths to files to include in the collection. Paths may include globs. Place a ‘?’ at the start of a filename to indicate that it’s optional. You can specify a line range for the file by adding ‘:X-Y’ to the end of the path, where X is the first line and Y the last to be included. Either X or Y may be left empty. (Line numbers are 1-indexed.) | comma-separated list of (line range-limited) file paths |
exclude | A list of files to exclude. Specified in the same way as files (except without line ranges). This lets you specify a glob in files and then exclude individual files from it (globs may be used here too). | comma-separated list of strings |
encoding_errors | What to do when the input contains characters that are invalid in the given encoding (e.g. illegal UTF-8 byte sequences). One of ‘strict’ (default), ‘ignore’ or ‘replace’. See Python’s str.decode() for details. | string |
encoding | Encoding to assume for input files. Default: utf8 | string |
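
The syntax accepted by files and the effect of encoding_errors can be illustrated with a short Python sketch. This is not the module's own code: `FileSpec` and `parse_file_spec` are hypothetical names introduced here, and the sketch assumes the ‘?’ marker prefixes the whole path entry and that the ‘:X-Y’ suffix uses only digits. Glob expansion is left out. The last lines show how Python's three decoding error modes treat Latin-1 bytes read as UTF-8 (in Python 3 the call is bytes.decode()).

```python
import re
from typing import NamedTuple, Optional


class FileSpec(NamedTuple):
    """Hypothetical container for one parsed entry of the `files` option."""
    path: str                  # path (possibly a glob), without '?' or line range
    optional: bool             # True if the entry started with '?'
    first_line: Optional[int]  # 1-indexed first line, None = start of file
    last_line: Optional[int]   # 1-indexed last line (inclusive), None = end of file


# Trailing ":X-Y", where either X or Y may be left empty
_LINE_RANGE = re.compile(r":(\d*)-(\d*)$")


def parse_file_spec(spec: str) -> FileSpec:
    """Illustrative parser only; an assumption about the syntax, not Pimlico's code."""
    spec = spec.strip()
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]
    first = last = None
    match = _LINE_RANGE.search(spec)
    if match:
        spec = spec[:match.start()]
        first = int(match.group(1)) if match.group(1) else None
        last = int(match.group(2)) if match.group(2) else None
    return FileSpec(spec, optional, first, last)


# A comma-separated `files` value with an optional entry and two line ranges
# (the paths are made up for the example)
value = "/data/ylilauta/*.vrt:100-, ?/data/extra.vrt, /data/sample.vrt:1-5000"
specs = [parse_file_spec(entry) for entry in value.split(",")]

# Latin-1 encoded bytes are not valid UTF-8, so the error mode matters
raw = "päivä".encode("latin-1")
ignored = raw.decode("utf8", errors="ignore")    # invalid bytes are dropped
replaced = raw.decode("utf8", errors="replace")  # invalid bytes become U+FFFD
try:
    raw.decode("utf8", errors="strict")          # the default mode: raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With ‘strict’, a single bad byte stops reading, which is the safest choice when the encoding option is expected to be correct. ‘replace’ keeps a visible marker where data was lost, whereas ‘ignore’ drops it silently.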