Learned embedding analysis

Path        langsim.modules.local_lm.embed_anal
Executable  yes

A collection of analyses put together to produce material for a paper.

To simplify things, we currently assume that there are exactly two languages (each with its own vocabulary and corpus). This could be generalized later, but the restriction keeps the code much simpler, and the analysis is only needed for the paper.


Inputs

Name         Type(s)
model        NeuralSixgramKerasModel
vocabs       list of Dictionary
frequencies  list of NumpyArray


Outputs

Name      Type(s)
analysis  NamedFile()
pairs     NamedFile()


Options

Name            Type                             Description
oov             string                           If given, look for this special token in each vocabulary, which represents OOVs. These are not filtered out, even if they are rare.
lang_names      comma-separated list of strings  (required) Comma-separated list of language IDs to use in output.
min_token_prop  float                            Minimum frequency, as a proportion of tokens, that a character in the vocabulary must have to be shown in the charts.
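
The way the three options interact can be illustrated with a minimal sketch. This is not the module's actual code: the function names `parse_lang_names` and `select_shown_chars` are hypothetical, the vocabulary and frequencies are modelled as plain Python lists rather than Dictionary and NumpyArray inputs, and the comparison against min_token_prop is assumed to be inclusive (>=).

```python
def parse_lang_names(value):
    """Split a comma-separated option value into a list of language IDs.

    Hypothetical helper. The two-language check reflects the assumption
    stated in the module description, not a confirmed implementation detail.
    """
    names = [name.strip() for name in value.split(",") if name.strip()]
    if len(names) != 2:
        raise ValueError("expected exactly two language IDs, got %d" % len(names))
    return names


def select_shown_chars(vocab, counts, min_token_prop=0.0, oov=None):
    """Return the characters frequent enough to be shown in the charts.

    vocab  -- list of characters (stand-in for one language's vocabulary)
    counts -- list of token counts, aligned with vocab
    """
    total = float(sum(counts))
    shown = []
    for char, count in zip(vocab, counts):
        if oov is not None and char == oov:
            # The OOV token is never filtered out, however rare it is
            shown.append(char)
        elif total > 0 and count / total >= min_token_prop:
            # Assumed inclusive threshold on the proportion of tokens
            shown.append(char)
    return shown


if __name__ == "__main__":
    print(parse_lang_names("fi, et"))
    print(select_shown_chars(["a", "b", "OOV", "z"], [60, 38, 1, 1],
                             min_token_prop=0.05, oov="OOV"))
```

In this example the rare character "z" (1% of tokens) falls below the 5% threshold and is dropped, while "OOV" is kept despite being equally rare.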