Neural sixgram (Xsym) trainer, v1

Path langsim.modules.local_lm.neural_sixgram
Executable yes

A special kind of six-gram model that combines 1-3 characters on the left with 1-3 characters on the right to learn unigram, bigram and trigram representations.

This is one of the most successful representation learning methods among those implemented here. It’s also very robust across language pairs and dataset sizes, so it’s the model that I’ve opted to use in subsequent work that builds on the learned representations.
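To make the six-gram scheme concrete, here is a minimal sketch of how a split point in a character sequence could yield pairs of a left context (1-3 characters) and a right context (1-3 characters). The function name `context_pairs` is hypothetical and not part of the module; the actual trainer works on integer-encoded corpora and samples context lengths rather than enumerating them all.

```python
def context_pairs(text, position, max_len=3):
    """Return all (left, right) n-gram pairs around a split point.

    `left` ends just before `position` and `right` starts at `position`;
    each side has between 1 and `max_len` characters.
    Illustrative only: this is not the module's own sampling code.
    """
    pairs = []
    for left_len in range(1, max_len + 1):
        for right_len in range(1, max_len + 1):
            start = position - left_len
            end = position + right_len
            # Skip windows that would run off either end of the sequence
            if start < 0 or end > len(text):
                continue
            pairs.append((text[start:position], text[position:end]))
    return pairs


# Split "abcdef" between "c" and "d"
pairs = context_pairs("abcdef", 3)
```

With both sides at their maximum length of three characters, the pair spans six characters in total, which is where the name six-gram comes from.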


Inputs

Name Type(s)
vocabs list of Dictionary
corpora list of TarredCorpus<IntegerListsDocumentType>
frequencies list of NumpyArray


Outputs

Name Type(s)
model KerasModelBuilderClass


Options

Name Description Type
embedding_size Number of dimensions in the hidden representation. Default: 200 int
plot_freq Output plots to the output directory while training is in progress. This slows down training if it’s done very often. Specify how many batches to wait between each plot. Fewer means you get a finer grained picture of the training process, more means training goes faster. 0 (default) turns off plotting int
context_weights Coefficients that specify the relative frequencies with which each of the different lengths of contexts (1, 2 and 3) will be used in training examples. For each sample, a pair of context lengths is selected at random. Six coefficients specify the weights given to (1,1), (1,2), (1,3), (2,2), (2,3) and (3,3). The opposite orderings have the same probability. By default, they are uniformly sampled (‘1,1,1,1,1,1’), but you may adjust their relative frequencies to put more weight on some lengths than others. The first 6 values are the starting weights. After that, you may specify sets of 7 values: num_epochs, weight1, weight2, …. The weights at any point will transition smoothly (linearly) from the previous 6-tuple to the next, arriving at the epoch number given (epochs are counted from 0, so 1 means the end of the first epoch, i.e. the start of epoch 1). You may use float epoch numbers, e.g. 0.5 comma-separated list of floats (parsed by the context_weights function)
composition2_layers Number and size of layers to use to combine pairs of characters, given as a list of integers. The final layer must be the same size as the embeddings, so is not included in this list comma-separated list of ints
epochs Max number of training epochs. Default: 5 int
predictor_layers Number and size of layers to use to take a pair of vectors and say whether they belong beside each other. Given as a list of integers. Doesn’t include the final projection to a single score comma-separated list of ints
limit_training Limit training to this many batches. Default: no limit int
l2_reg L2 regularization to apply to all layers’ weights. Default: 0. float
unit_norm If true, enforce a unit norm constraint on the learned embeddings. Default: false bool
word_internal Only train model on word-internal sequences. Word boundaries will be included, but no sequences spanning over word boundaries bool
dropout Dropout to apply to embeddings during training. Default: 0.3 float
oov If given, use this special token in each vocabulary to represent OOVs. Otherwise, they are represented by an index added at the end of each vocabulary’s indices string
word_boundary If using word_internal, use this character (which must be in the vocabulary) to split words. Default: space unicode string
composition3_layers Number and size of layers to use to combine triples of characters, given as a list of integers. The final layer must be the same size as the embeddings, so is not included in this list comma-separated list of ints
store_all Store updated representations from every epoch, even if the validation loss goes up. The default behaviour is to only store the parameters with best validation loss, but for these purposes we probably want to set this to T most of the time. (Defaults to F for backwards compatibility) bool
composition_dropout Dropout to apply to composed representation during training. Default: same as dropout float
batch Training batch size. Default: 100 int
sim_freq How often (in batches) to compute the similarity of overlapping phonemes between the languages. -1 (default) means never, 0 means once at the start of each epoch int
corpus_offset To avoid training on parallel data, in the case where the input corpora happen to be parallel, jump forward in the second corpus by this number of utterances, putting the skipped utterances at the end instead. Default: 10k utterances int
cross_sentences By default, the sliding window passed over the corpus stops at the end of a sentence (or whatever sequence division is in the input data) and starts again at the start of the next. Instead, join all sequences within a document into one long sequence and pass the sliding window over that bool
validation Number of samples to hold out as a validation set for training. Simply taken from the start of the corpus. Rounded to the nearest number of batches int
embedding_activation Activation function to apply to the learned embeddings before they’re used, and also to every projection into the embedding space (the final layers of compositions). By default, ‘linear’ is used, i.e. normal embeddings with no activation and a linear layer at the end of the composition functions. Choose any Keras named activation string
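
The context_weights schedule described above can be sketched as follows. This is only an illustration of the documented format (6 starting weights, then groups of 7 values), not the module's own parser. The names `parse_schedule`, `weights_at` and `sample_lengths` are hypothetical. It also assumes that the starting weights apply at epoch 0, that weights stay constant after the last epoch given, and that the statement about opposite orderings means a sampled pair is flipped with probability 0.5.

```python
import random

# The six context-length pairs, in the order given in the option description
LENGTH_PAIRS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]


def parse_schedule(spec):
    """Parse e.g. '1,1,1,1,1,1,2,0,0,0,0,0,1' into [(epoch, [w1..w6]), ...]."""
    values = [float(v) for v in spec.split(",")]
    if len(values) < 6 or (len(values) - 6) % 7 != 0:
        raise ValueError("expected 6 starting weights, then groups of 7 values")
    # Assumption: the starting weights apply from epoch 0
    schedule = [(0.0, values[:6])]
    for i in range(6, len(values), 7):
        schedule.append((values[i], values[i + 1:i + 7]))
    return schedule


def weights_at(schedule, epoch):
    """Linearly interpolate the six weights at a (possibly fractional) epoch."""
    for (e0, w0), (e1, w1) in zip(schedule, schedule[1:]):
        if e0 <= epoch < e1:
            t = (epoch - e0) / (e1 - e0)
            return [a + t * (b - a) for a, b in zip(w0, w1)]
    # Assumption: outside the scheduled range the nearest weights are held constant
    return schedule[-1][1] if epoch >= schedule[-1][0] else schedule[0][1]


def sample_lengths(weights, rng=random):
    """Draw a (left, right) pair of context lengths for one training sample."""
    left, right = rng.choices(LENGTH_PAIRS, weights=weights, k=1)[0]
    # One reading of "opposite orderings have the same probability"
    if rng.random() < 0.5:
        left, right = right, left
    return left, right


# Start uniform, then move all weight onto (3,3) by the end of the second epoch
schedule = parse_schedule("1,1,1,1,1,1,2,0,0,0,0,0,1")
halfway = weights_at(schedule, 1.0)
```

In this example, `halfway` holds the weights at the end of the first epoch, halfway through the transition from the uniform starting weights to the final ones.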
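
The effect of corpus_offset and the batch rounding mentioned under validation can be illustrated with a small sketch. The helper names `rotate_corpus` and `validation_batches` are hypothetical, and the real module reads corpora as streams rather than in-memory lists; wrapping the offset around for short corpora is also an assumption.

```python
def rotate_corpus(utterances, offset):
    """Start `offset` utterances in and move the skipped ones to the end."""
    if not utterances:
        return []
    # Assumption: wrap around if the corpus is shorter than the offset
    offset %= len(utterances)
    return utterances[offset:] + utterances[:offset]


def validation_batches(num_samples, batch_size):
    """Round a requested validation set size to a whole number of batches."""
    return int(round(num_samples / batch_size))


# Two parallel corpora: sentence i of one is a translation of sentence i of the other
first_corpus = ["a0", "a1", "a2", "a3", "a4"]
second_corpus = ["b0", "b1", "b2", "b3", "b4"]

# After rotation, aligned sentences are no longer seen at the same point in training
shifted = rotate_corpus(second_corpus, 2)
```

Here `first_corpus[0]` is paired with `shifted[0]`, which is `"b2"` rather than its translation `"b0"`, so the model does not train on aligned sentence pairs.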