Europarl corpus reader
~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: langsim.modules.input.europarl

+------------+--------------------------------+
| Path       | langsim.modules.input.europarl |
+------------+--------------------------------+
| Executable | no                             |
+------------+--------------------------------+

This is an input module. It takes no pipeline inputs and is used to read in data.

Inputs
======

No inputs

Outputs
=======

+--------+----------------------------------------------------------+
| Name   | Type(s)                                                  |
+========+==========================================================+
| corpus | :class:`~langsim.modules.input.europarl.info.OutputType` |
+--------+----------------------------------------------------------+

Options
=======

+-----------------+------------------------------------------------------------------------+---------------------------------+
| Name            | Description                                                            | Type                            |
+=================+========================================================================+=================================+
| files           | (required) Comma-separated list of absolute paths to files to include  | comma-separated list of (line   |
|                 | in the collection. Paths may include globs. Place a '?' at the start   | range-limited) file paths       |
|                 | of a filename to indicate that it's optional. You can specify a line   |                                 |
|                 | range for the file by adding ':X-Y' to the end of the path, where X    |                                 |
|                 | is the first line and Y the last to be included. Either X or Y may     |                                 |
|                 | be left empty. (Line numbers are 1-indexed.)                           |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| exclude         | A list of files to exclude. Specified in the same way as ``files``     | comma-separated list of strings |
|                 | (except without line ranges). This allows you to specify a glob in     |                                 |
|                 | ``files`` and then exclude individual files from it (you can use       |                                 |
|                 | globs here too).                                                       |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| encoding_errors | What to do when invalid characters are encountered while decoding the  | string                          |
|                 | input (e.g. illegal UTF-8 byte sequences). One of 'strict' (default),  |                                 |
|                 | 'ignore' or 'replace'. See the ``errors`` argument of Python's         |                                 |
|                 | ``bytes.decode()`` (``str.decode()`` in Python 2) for details.         |                                 |
+-----------------+------------------------------------------------------------------------+---------------------------------+
| encoding        | Encoding to assume for input files. Default: utf8.                     | string                          |
+-----------------+------------------------------------------------------------------------+---------------------------------+
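
To illustrate the ``files`` syntax described in the table above, the following is a minimal sketch of how a single entry might be split into its parts. It is not the module's own parser: ``FileSpec``, ``parse_file_spec`` and ``parse_files_option`` are hypothetical names, and the sketch assumes that the '?' marker is the first character of an entry and that a line range is always the final ':X-Y' suffix. Glob expansion and the ``exclude`` option are left out.

```python
import re
from typing import List, NamedTuple, Optional


class FileSpec(NamedTuple):
    """Hypothetical container for one parsed entry of the ``files`` option."""
    path: str              # path or glob, without the '?' marker or line range
    optional: bool         # True if the entry started with '?'
    start: Optional[int]   # first line to include (1-indexed); None = from the top
    end: Optional[int]     # last line to include (1-indexed); None = to the end


# Assumed shape of the range suffix: ':X-Y', where X and Y may each be empty.
_RANGE_RE = re.compile(r"^(?P<path>.*):(?P<start>\d*)-(?P<end>\d*)$")


def parse_file_spec(spec: str) -> FileSpec:
    """Split one entry into path, optional flag and line range."""
    spec = spec.strip()

    # Assumption: the '?' marker comes first in the entry.
    optional = spec.startswith("?")
    if optional:
        spec = spec[1:]

    start = end = None
    match = _RANGE_RE.match(spec)
    if match:
        spec = match.group("path")
        start = int(match.group("start")) if match.group("start") else None
        end = int(match.group("end")) if match.group("end") else None

    return FileSpec(spec, optional, start, end)


def parse_files_option(value: str) -> List[FileSpec]:
    """Parse a whole comma-separated ``files`` value, skipping empty entries."""
    return [parse_file_spec(item) for item in value.split(",") if item.strip()]
```

Under these assumptions, ``parse_files_option("/data/ep/en.txt:1-1000, ?/data/ep/extra-*.txt")`` yields one required entry limited to lines 1 to 1000 and one optional glob with no line limits. Splitting on commas means that a path containing a comma cannot be expressed in this sketch.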