Neural sixgram samples prep

Path: langsim.modules.local_lm.neural_sixgram_samples
Executable: yes

Prepare positive samples as training data for the neural sixgram model.

Rather than shuffling and preparing samples on the fly during training, which takes considerable time, we do this once beforehand and simply iterate over the result at training time.

The output is then used by neural_sixgram2 to train the Xsym model.
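The pre-shuffling mentioned above is buffer-based (its size is set by the shuffle_window option documented below). A minimal sketch of that idea, assuming samples arrive as a plain iterable; the function name and signature are illustrative, not the module's actual API:

```python
import random


def buffer_shuffle(samples, shuffle_window, seed=None):
    """Approximately shuffle a stream of samples using a fixed-size buffer.

    Samples are read into a buffer of size shuffle_window and emitted in
    random order from it, so a larger window gives a better shuffle at the
    cost of memory and preparation time.
    """
    rng = random.Random(seed)
    buf = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= shuffle_window:
            # Pick a random buffered sample, swap it to the end and emit it
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # Flush whatever remains in the buffer, fully shuffled
    rng.shuffle(buf)
    yield from buf
```

The output is a permutation of the input stream: every sample is emitted exactly once, just in an order that is increasingly random as the window grows.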

Inputs

Name         Type(s)
vocabs       list of Dictionary
corpora      list of TarredCorpus<IntegerListsDocumentType>
frequencies  list of NumpyArray

Outputs

Name     Type(s)
samples  NeuralSixgramTrainingData

Options

cross_sentences (bool)
    By default, the sliding window passed over the corpus stops at the end of a sentence (or whatever sequence division is in the input data) and starts again at the start of the next. If set, all sequences within a document are instead joined into one long sequence and the sliding window is passed over that.

oov (string)
    If given, use this special token in each vocabulary to represent OOVs. Otherwise, they are represented by an index added at the end of each vocabulary's indices.

shuffle_window (int)
    We simulate shuffling the data by reading samples into a buffer and taking them randomly from there. This is the size of that buffer. A higher number shuffles more, but makes data preparation slower.

corpus_offset (int)
    To avoid training on parallel data, in the case where the input corpora happen to be parallel, jump forward in the second corpus by this number of utterances, putting the skipped utterances at the end instead. Default: 10,000 utterances.
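The effect of cross_sentences can be sketched as follows, assuming a document is represented as a list of sentences, each a list of integer token IDs; the function name and data representation are illustrative, not the module's real code:

```python
from itertools import chain


def sixgram_windows(document, cross_sentences=False, n=6):
    """Yield n-gram windows from a document (a list of sentences,
    each a list of token IDs).

    By default the sliding window stops at sentence boundaries and
    restarts at the next sentence. With cross_sentences=True, all
    sentences are first joined into one long sequence and the window
    is passed over that, so windows may span sentence boundaries.
    """
    if cross_sentences:
        sequences = [list(chain.from_iterable(document))]
    else:
        sequences = document
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            yield tuple(seq[i:i + n])
```

For a two-sentence document of 7 and 6 tokens, the default setting yields 2 + 1 = 3 windows, while cross_sentences yields 13 - 6 + 1 = 8, including windows that straddle the boundary.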