Corrupt text

Path langsim.modules.fake_language.corrupt
Executable yes

Introduce random noise into a corpus.

The input corpus is expected to be character-level encoded integer indexed text. (You could also run it on word-level encoded data, but the results might be odd.)

Produces a new corpus with a new character vocabulary, which might not be identical to the input vocabulary, depending on the options. E.g. some characters might be removed or added.

If a token called ‘OOV’ is found in the vocabulary, it will never be subject to a mapping or mapped to.

Types of noise, with corresponding parameters:

  • Random character substitutions: randomly sample a given proportion of characters and choose a character at random from the unigram distribution of the input corpus to replace each with

    • char_subst_prop: proportion of characters (tokens) to sample for substitution. Use 0 to disable this corruption
  • Systematic character mapping: perform a systematic substitution throughout the corpus of a particular character A (randomly chosen from input vocab) for another B (randomly chosen from output vocab). This means that the resulting Bs are indistinguishable from those that were Bs in the input. A is removed from the output vocab, since it is never used now. When multiple mappings are chosen, it is not checked that they have different Bs.

    A number of characters is chosen using frequencies so that the expected proportion of tokens affected is at least the given parameter. Since the resulting expected proportion of tokens may be higher due to the sampling of characters, the actual expected proportion is output among the corruption parameters as actual_char_subst_prop.

    • char_map_prop: proportion of characters (types) in input vocab to apply a mapping to. Use 0 to disable this corruption
  • Split characters: choose a set of characters. For each A invent a new character B and map half of its occurrences to B, leaving half as they were. Each of these results in adding a brand new unicode character to the output vocab

    As with char_map_prop, a number of characters is chosen using frequencies so that the expected proportion of tokens affected is at least the given parameter. Since the resulting expected proportion of tokens may be higher due to the sampling of characters, the actual expected proportion is output among the corruption parameters as actual_char_split_prop.

    • char_split_prop: proportion of characters (types) to apply this splitting to

Inputs

Name Type(s)
corpus TarredCorpus<IntegerListsDocumentType>
vocab Dictionary
frequencies NumpyArray

Outputs

Name Type(s)
corpus IntegerListsDocumentTypeTarredCorpus
vocab Dictionary
mappings NamedFile()
close_pairs NamedFile()
corruption_params NamedFile()

Options

Name Description Type
char_map_prop Proportion of character types in input vocab to apply a random mapping to another character to. Default: 0 float
char_split_prop Proportion of character types in input vocab to apply splitting to. Default: 0 float
char_subst_prop Proportion of characters to sample for random substitution. Default: 0 float