Est Ref normalization¶
| Path | langsim.modules.input.est_ref_normalize |
| Executable | yes |
Special normalization routine for Estonian Reference Corpus.
Splits up sentences into separate lines. This is easy to do, since the corpus puts a double space between sentences. There are also double spaces in other places, so we only split on double spaces after punctuation. Other double spaces are removed.
We also lower-case the whole corpus.
Inputs¶
| Name | Type(s) |
|---|---|
| corpus | TarredCorpus<TextDocumentType> |
Outputs¶
| Name | Type(s) |
|---|---|
| corpus | RawTextDocumentTypeTarredCorpus |
Options¶
| Name | Description | Type |
|---|---|---|
| forum | Set to T for processing the forum data, which is slightly different to the newspaper data | bool |