Est Ref normalization
Path | langsim.modules.input.est_ref_normalize |
Executable | yes |
Special normalization routine for the Estonian Reference Corpus.
Splits the text so that each sentence is on its own line. This is easy to do, since the corpus puts a double space between sentences. Double spaces also occur in other places, so we split only on double spaces that follow punctuation. All other double spaces are removed.
We also lower-case the whole corpus.
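The steps described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `normalize`, the set of sentence-final punctuation marks, and the choice to collapse the remaining double spaces to a single space are all assumptions made for the example.

```python
import re

# Assumption: which characters count as sentence-final punctuation.
# The real module may use a different set.
SENTENCE_END = ".!?"

# Two or more spaces immediately preceded by sentence-final punctuation.
_SPLIT_RE = re.compile(r"(?<=[" + re.escape(SENTENCE_END) + r"]) {2,}")
# Any remaining run of two or more spaces.
_MULTI_SPACE_RE = re.compile(r" {2,}")


def normalize(text):
    """Hypothetical helper: lower-case, split sentences, clean up spaces."""
    # Lower-case the whole text
    text = text.lower()
    # Split only where a double space follows punctuation
    sentences = _SPLIT_RE.split(text)
    # Assumption: other double spaces are collapsed to a single space
    sentences = [_MULTI_SPACE_RE.sub(" ", s).strip() for s in sentences]
    # One sentence per line, skipping any empty pieces
    return "\n".join(s for s in sentences if s)


if __name__ == "__main__":
    print(normalize("Tere hommikust.  Kuidas  läheb?  Hästi!"))
```

Using a lookbehind keeps the punctuation attached to the sentence it ends, so only the double space itself is consumed by the split.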
Inputs
Name | Type(s) |
---|---|
corpus | TarredCorpus<TextDocumentType> |
Outputs
Name | Type(s) |
---|---|
corpus | RawTextDocumentTypeTarredCorpus |
Options
Name | Description | Type |
---|---|---|
forum | Set to T when processing the forum data, which differs slightly from the newspaper data | bool |