Ylilauta VRT files
~~~~~~~~~~~~~~~~~~

.. py:module:: langsim.modules.input.ylilauta

+------------+--------------------------------+
| Path       | langsim.modules.input.ylilauta |
+------------+--------------------------------+
| Executable | yes                            |
+------------+--------------------------------+

Input reader for the Ylilauta corpus. Based on the standard VRT text collection module,
with a small amount of special processing added for Ylilauta.

.. seealso::

   :mod:`pimlico.modules.input.text_annotations.vrt_text`: Reading text from VRT files.

This is an input module. It takes no pipeline inputs and is used to read in data.

Inputs
======

No inputs

Outputs
=======

+--------+------------------------------------------------------------------+
| Name   | Type(s)                                                          |
+========+==================================================================+
| corpus | :class:`~langsim.modules.input.ylilauta.info.YlilautaOutputType` |
+--------+------------------------------------------------------------------+

Options
=======

+-----------------+------------------------------------------------------------------------+---------------------------------+
| Name            | Description                                                            | Type                            |
+=================+========================================================================+=================================+
| files           | (required) Comma-separated list of absolute paths to files to include  | comma-separated list of (line   |
|                 | in the collection. Paths may include globs. Place a '?' at the start   | range-limited) file paths       |
|                 | of a filename to indicate that it's optional. You can specify a line   |                                 |
|                 | range for the file by adding ':X-Y' to the end of the path, where X    |                                 |
|                 | is the first line and Y the last to be included. Either X or Y may     |                                 |
|                 | be left empty. (Line numbers are 1-indexed.)                           |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| exclude         | A list of files to exclude. Specified in the same way as `files`       | comma-separated list of strings |
|                 | (except without line ranges). This allows you to specify a glob in     |                                 |
|                 | `files` and then exclude individual files from it (you can use globs   |                                 |
|                 | here too).                                                             |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| encoding_errors | What to do in the case of invalid characters in the input while        | string                          |
|                 | decoding (e.g. illegal utf-8 chars). Select 'strict' (default),        |                                 |
|                 | 'ignore' or 'replace'. See Python's str.decode() for details.          |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| encoding        | Encoding to assume for input files. Default: utf8                      | string                          |
+-----------------+------------------------------------------------------------------------+---------------------------------+
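
The ``files`` syntax and the ``encoding_errors`` modes described above can be illustrated with a small standalone sketch. This is not the module's own parsing code: ``FileSpec``, ``parse_file_spec``, ``parse_files_option`` and ``decode_with`` are hypothetical names used only for illustration, and the sketch assumes that the '?' marker precedes the whole path and that the line range is always the final ``:X-Y`` suffix. The decoding helper uses ``bytes.decode()``, the Python 3 counterpart of the ``str.decode()`` call mentioned in the table.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical helpers for illustration only; NOT the module's implementation.
# Assumption: a line range is a trailing ':X-Y' where X and/or Y may be empty.
_RANGE_RE = re.compile(r":(\d*)-(\d*)$")


class FileSpec(NamedTuple):
    path: str                  # path or glob, without '?' prefix or range suffix
    optional: bool             # True if the entry started with '?'
    first_line: Optional[int]  # 1-indexed, inclusive; None means "from the start"
    last_line: Optional[int]   # 1-indexed, inclusive; None means "to the end"


def parse_file_spec(spec: str) -> FileSpec:
    """Split one entry of the `files` option into its parts."""
    spec = spec.strip()
    # Assumption: the '?' marker comes before the whole path.
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]
    first = last = None
    match = _RANGE_RE.search(spec)
    if match:
        first = int(match.group(1)) if match.group(1) else None
        last = int(match.group(2)) if match.group(2) else None
        spec = spec[:match.start()]
    return FileSpec(spec, optional, first, last)


def parse_files_option(value: str) -> List[FileSpec]:
    """Parse a comma-separated `files` value, skipping empty entries."""
    return [parse_file_spec(part) for part in value.split(",") if part.strip()]


def decode_with(raw: bytes, encoding: str = "utf8", errors: str = "strict") -> str:
    """Decode bytes using one of the modes: 'strict', 'ignore' or 'replace'."""
    return raw.decode(encoding, errors)


if __name__ == "__main__":
    for entry in parse_files_option("/data/a.vrt:10-200,?/data/b.vrt,/data/c*.vrt:-50"):
        print(entry)
    # b"\xff" is not valid UTF-8, so 'strict' would raise UnicodeDecodeError here.
    print(repr(decode_with(b"caf\xc3\xa9 \xff", errors="replace")))
```

In this sketch, a path with no range suffix yields ``None`` for both bounds, and an open-ended range such as ``:-50`` leaves only the first bound unset. Expanding globs and applying ``exclude`` would be separate steps (for example with Python's ``glob`` module) and are left out here.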