Neural sixgram (Xsym) trainer, v1

Path:        langsim.modules.local_lm.neural_sixgram
Executable:  yes

A special kind of six-gram model that combines 1-3 characters on the left with 1-3 characters on the right to learn unigram, bigram and trigram representations.

This is one of the most successful representation learning methods among those here. It’s also very robust across language pairs and different sizes of dataset. It’s therefore the model that I’ve opted to use in subsequent work that uses the learned representations.
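
To make the training scheme concrete, here is a minimal sketch, in plain Python, of how a single training pair might be drawn from a sequence of character IDs. It is not the module’s own code and the names are invented for illustration; the weighting of length pairs corresponds to the context_weights option documented below:

    import random

    # Relative weights for the context-length pairs (1,1), (1,2), (1,3), (2,2),
    # (2,3) and (3,3), mirroring the context_weights option (uniform by default)
    LENGTH_PAIRS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
    WEIGHTS = [1, 1, 1, 1, 1, 1]

    def sample_training_pair(sequence):
        """Draw one (left n-gram, right n-gram) pair from a sequence of character IDs.

        Assumes the sequence is at least 6 characters long. Both orderings of the
        chosen length pair are equally likely, e.g. bigram-left/trigram-right is
        as probable as trigram-left/bigram-right.
        """
        left_len, right_len = random.choices(LENGTH_PAIRS, weights=WEIGHTS, k=1)[0]
        if random.random() < 0.5:
            left_len, right_len = right_len, left_len
        # Pick a split point with enough characters on either side
        split = random.randint(left_len, len(sequence) - right_len)
        left = tuple(sequence[split - left_len:split])
        right = tuple(sequence[split:split + right_len])
        return left, right

Each such pair of adjacent n-grams is the kind of “belong beside each other” example that the predictor described under predictor_layers below is trained to score.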

Inputs

Name         Type(s)
vocabs       list of Dictionary
corpora      list of TarredCorpus<IntegerListsDocumentType>
frequencies  list of NumpyArray

Outputs

Name   Type(s)
model  KerasModelBuilderClass
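
For orientation, a module of this type would be wired into a Pimlico pipeline config file along roughly the following lines. This is only an illustrative sketch: the section name and the modules supplying the inputs (vocab_*, chars_*, freqs_*) are hypothetical, and the exact input-naming syntax should be checked against the Pimlico version in use:

    [xsym_sixgram]
    type=langsim.modules.local_lm.neural_sixgram
    input_vocabs=vocab_lang1,vocab_lang2
    input_corpora=chars_lang1,chars_lang2
    input_frequencies=freqs_lang1,freqs_lang2
    embedding_size=200
    epochs=5
    store_all=T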

Options

embedding_size (int)
    Number of dimensions in the hidden representation. Default: 200

plot_freq (int)
    Output plots to the output directory while training is in progress. This slows down training if it’s done very often. Specify how many batches to wait between each plot. Fewer means you get a finer-grained picture of the training process; more means training goes faster. 0 (default) turns off plotting

context_weights (comma-separated list of floats)
    Coefficients that specify the relative frequencies with which each of the different lengths of context (1, 2 and 3) will be used in training examples. For each sample, a pair of context lengths is selected at random. Six coefficients specify the weights given to (1,1), (1,2), (1,3), (2,2), (2,3) and (3,3); the opposite orderings have the same probability. By default, all pairs are sampled uniformly (‘1,1,1,1,1,1’), but you may adjust their relative frequencies to put more weight on some lengths than others. The first 6 values are the starting weights. After that, you may specify sets of 7 values: num_epochs, weight1, weight2, …. The weights at any point transition smoothly (linearly) from the previous 6-tuple to the next, arriving at the epoch number given (i.e. 1 = start of epoch 1 / end of the first epoch). You may use float epoch numbers, e.g. 0.5. A worked sketch of this interpolation is given after this table

composition2_layers (comma-separated list of ints)
    Number and size of layers used to combine pairs of characters, given as a list of integers. The final layer must be the same size as the embeddings, so is not included in this list

epochs (int)
    Max number of training epochs. Default: 5

predictor_layers (comma-separated list of ints)
    Number and size of layers used to take a pair of vectors and say whether they belong beside each other, given as a list of integers. Doesn’t include the final projection to a single score

limit_training (int)
    Limit training to this many batches. Default: no limit

l2_reg (float)
    L2 regularization to apply to all layers’ weights. Default: 0.

unit_norm (bool)
    If true, enforce a unit-norm constraint on the learned embeddings. Default: false

word_internal (bool)
    Only train the model on word-internal sequences. Word boundaries will be included, but no sequence will span a word boundary

dropout (float)
    Dropout to apply to embeddings during training. Default: 0.3

oov (string)
    If given, use this special token in each vocabulary to represent OOVs. Otherwise, they are represented by an index added at the end of each vocabulary’s indices

word_boundary (unicode string)
    If using word_internal, use this character (which must be in the vocabulary) to split words. Default: space

composition3_layers (comma-separated list of ints)
    Number and size of layers used to combine triples of characters, given as a list of integers. The final layer must be the same size as the embeddings, so is not included in this list

store_all (bool)
    Store updated representations from every epoch, even if the validation loss goes up. The default behaviour is to store only the parameters with the best validation loss, but for these purposes we probably want to set this to T most of the time. (Defaults to F for backwards compatibility)

composition_dropout (float)
    Dropout to apply to the composed representation during training. Default: same as dropout

batch (int)
    Training batch size. Default: 100

sim_freq (int)
    How often (in batches) to compute the similarity of overlapping phonemes between the languages. -1 (default) means never; 0 means once at the start of each epoch

corpus_offset (int)
    To avoid training on parallel data, in the case where the input corpora happen to be parallel, jump forward in the second corpus by this number of utterances, putting the skipped utterances at the end instead. Default: 10k utterances

cross_sentences (bool)
    By default, the sliding window passed over the corpus stops at the end of a sentence (or whatever sequence division is in the input data) and starts again at the start of the next. If this is set, all sequences within a document are instead joined into one long sequence and the sliding window is passed over that

validation (int)
    Number of samples to hold out as a validation set for training. Simply taken from the start of the corpus. Rounded to the nearest number of batches

embedding_activation (string)
    Activation function to apply to the learned embeddings before they’re used, and also to every projection into the embedding space (the final layers of compositions). By default, ‘linear’ is used, i.e. normal embeddings with no activation and a linear layer at the end of the composition functions. Choose any Keras named activation
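
As a worked illustration of the context_weights schedule referred to in the table above, the following sketch linearly interpolates between successive 6-tuples of weights. It is not the module’s own parsing code; the function name and the schedule representation are invented for illustration:

    def context_weights_at(start_weights, checkpoints, epoch):
        """Interpolate the six context-length weights at a (possibly fractional) epoch.

        start_weights are the first six values of the option; checkpoints is a list
        of (target_epoch, six_weights) pairs built from each subsequent group of
        seven values.
        """
        prev_epoch, prev_weights = 0.0, start_weights
        for target_epoch, target_weights in checkpoints:
            if epoch <= target_epoch:
                # Linear transition from the previous 6-tuple to this one
                t = (epoch - prev_epoch) / (target_epoch - prev_epoch)
                return [p + t * (w - p) for p, w in zip(prev_weights, target_weights)]
            prev_epoch, prev_weights = target_epoch, target_weights
        # Past the last checkpoint the weights simply stay at their final values
        return list(prev_weights)

    # Start uniform, then by the end of the first epoch (epoch 1.0) put all the
    # weight on the (1,1) context-length pair
    start = [1, 1, 1, 1, 1, 1]
    checkpoints = [(1.0, [1, 0, 0, 0, 0, 0])]
    print(context_weights_at(start, checkpoints, 0.5))  # [1.0, 0.5, 0.5, 0.5, 0.5, 0.5]
    print(context_weights_at(start, checkpoints, 2.0))  # [1, 0, 0, 0, 0, 0]

Under such a schedule, sampling starts out uniform over all six length pairs and, by the end of the first epoch, draws only unigram-unigram pairs.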