Est Ref normalization

Path langsim.modules.input.est_ref_normalize
Executable yes

Special normalization routine for the Estonian Reference Corpus.

Splits the text so that each sentence is on its own line. This is easy to do, since the corpus puts a double space between sentences. Double spaces also occur in other places, so we split only on double spaces that follow punctuation. All other double spaces are removed.

We also lower-case the whole corpus.
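
The normalization described above can be illustrated with a minimal sketch. This is not the module's actual code: the function name `normalize_est_ref` is hypothetical, the set of punctuation marks treated as sentence-final (`.`, `!`, `?`) is an assumption, and "removing" other double spaces is assumed to mean collapsing them to a single space. The special handling enabled by the `forum` option is not covered.

```python
import re

# Assumption: a sentence boundary is a run of two or more spaces that
# directly follows one of these punctuation marks.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])  +")
# Any other run of two or more spaces.
EXTRA_SPACES = re.compile(r"  +")


def normalize_est_ref(text):
    """Return text with one lower-cased sentence per line (hypothetical helper)."""
    lines = []
    for sentence in SENTENCE_BREAK.split(text):
        # Assumption: remaining double spaces are collapsed to a single space.
        sentence = EXTRA_SPACES.sub(" ", sentence).strip()
        if sentence:
            lines.append(sentence.lower())
    return "\n".join(lines)


if __name__ == "__main__":
    print(normalize_est_ref("Tere  tulemast.  Kas  see on raamat?  Jah!"))
    # tere tulemast.
    # kas see on raamat?
    # jah!
```

Splitting first and collapsing afterwards keeps the two kinds of double space apart: only those following punctuation produce a line break, and the rest never do.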

Inputs

Name Type(s)
corpus TarredCorpus<TextDocumentType>

Outputs

Name Type(s)
corpus RawTextDocumentTypeTarredCorpus

Options

Name Description Type
forum Set to T for processing the forum data, which is slightly different to the newspaper data bool