Neural sixgram (Xsym) trainer, v1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. py:module:: langsim.modules.local_lm.neural_sixgram

+------------+-----------------------------------------+
| Path       | langsim.modules.local_lm.neural_sixgram |
+------------+-----------------------------------------+
| Executable | yes                                      |
+------------+-----------------------------------------+

A special kind of six-gram model that combines 1-3 characters on the left with 1-3 characters
on the right to learn unigram, bigram and trigram representations.

This is one of the most successful representation learning methods among those here. It's also
very robust across language pairs and different sizes of dataset. It's therefore the model that
I've opted to use in subsequent work that uses the learned representations.
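
To give a concrete picture of how the module is used, here is a minimal sketch of a pipeline
config section for it, assuming the usual Pimlico convention of connecting inputs with
``input_<name>``. The section name and the names of the earlier modules supplying the inputs
are placeholders, the option values shown are simply the documented defaults, and the layer
sizes are purely illustrative; the inputs and options themselves are documented in the tables
below.

.. code-block:: ini

    [xsym]
    type=langsim.modules.local_lm.neural_sixgram
    # One vocabulary, corpus and frequency array per language, produced by
    # earlier modules in the pipeline (module names here are placeholders)
    input_vocabs=vocab_lang1,vocab_lang2
    input_corpora=corpus_lang1,corpus_lang2
    input_frequencies=freqs_lang1,freqs_lang2
    # Documented defaults, spelled out for clarity
    embedding_size=200
    epochs=5
    batch=100
    dropout=0.3
    # Layer sizes are given as comma-separated lists of ints (illustrative values)
    predictor_layers=100,100
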

Inputs
======

+-------------+--------------------------------------+
| Name        | Type(s)                              |
+=============+======================================+
| vocabs      | :class:`list` of :class:`Dictionary` |
+-------------+--------------------------------------+
| corpora     | :class:`list` of TarredCorpus        |
+-------------+--------------------------------------+
| frequencies | :class:`list` of :class:`NumpyArray` |
+-------------+--------------------------------------+

Outputs
=======

+-------+----------------------------------------------------------+
| Name  | Type(s)                                                  |
+=======+==========================================================+
| model | :class:`~pimlico.datatypes.keras.KerasModelBuilderClass` |
+-------+----------------------------------------------------------+

Options
=======

.. list-table::
   :header-rows: 1
   :widths: 20 60 20

   * - Name
     - Description
     - Type
   * - embedding_size
     - Number of dimensions in the hidden representation. Default: 200
     - int
   * - plot_freq
     - Output plots to the output directory while training is in progress. This slows down
       training if it's done very often. Specify how many batches to wait between each plot:
       fewer batches between plots gives a finer-grained picture of the training process, more
       makes training go faster. 0 (default) turns off plotting
     - int
   * - context_weights
     - Coefficients that specify the relative frequencies with which each of the different
       context lengths (1, 2 and 3) will be used in training examples. For each sample, a pair
       of context lengths is selected at random. Six coefficients specify the weights given to
       (1,1), (1,2), (1,3), (2,2), (2,3) and (3,3); the opposite orderings have the same
       probability. By default, all pairs are sampled uniformly ('1,1,1,1,1,1'), but you may
       adjust the relative frequencies to put more weight on some lengths than others. The
       first 6 values are the starting weights. After that, you may specify sets of 7 values:
       num_epochs, weight1, weight2, .... The weights at any point transition smoothly
       (linearly) from the previous 6-tuple to the next, arriving at the epoch number given
       (i.e. 1 = start of epoch 1 / end of the first epoch). You may use float epoch numbers,
       e.g. 0.5. See the example following this table
     -
   * - composition2_layers
     - Number and size of layers to use to combine pairs of characters, given as a list of
       integers. The final layer must be the same size as the embeddings, so is not included
       in this list
     - comma-separated list of ints
   * - epochs
     - Max number of training epochs. Default: 5
     - int
   * - predictor_layers
     - Number and size of layers to use to take a pair of vectors and say whether they belong
       beside each other. Given as a list of integers. Doesn't include the final projection to
       a single score
     - comma-separated list of ints
   * - limit_training
     - Limit training to this many batches. Default: no limit
     - int
   * - l2_reg
     - L2 regularization to apply to all layers' weights. Default: 0.
     - float
   * - unit_norm
     - If true, enforce a unit-norm constraint on the learned embeddings. Default: false
     - bool
   * - word_internal
     - Only train the model on word-internal sequences. Word boundaries will be included, but
       no sequences spanning word boundaries
     - bool
   * - dropout
     - Dropout to apply to embeddings during training. Default: 0.3
     - float
   * - oov
     - If given, use this special token in each vocabulary to represent OOVs. Otherwise, they
       are represented by an index added at the end of each vocabulary's indices
     - string
   * - word_boundary
     - If using word_internal, use this character (which must be in the vocabulary) to split
       words. Default: space
     -
   * - composition3_layers
     - Number and size of layers to use to combine triples of characters, given as a list of
       integers. The final layer must be the same size as the embeddings, so is not included
       in this list
     - comma-separated list of ints
   * - store_all
     - Store updated representations from every epoch, even if the validation loss goes up.
       The default behaviour is to store only the parameters with the best validation loss,
       but for these purposes you will probably want to set this to true most of the time.
       (Defaults to false for backwards compatibility)
     - bool
   * - composition_dropout
     - Dropout to apply to the composed representation during training. Default: same as
       dropout
     - float
   * - batch
     - Training batch size. Default: 100
     - int
   * - sim_freq
     - How often (in batches) to compute the similarity of overlapping phonemes between the
       languages. -1 (default) means never; 0 means once at the start of each epoch
     - int
   * - corpus_offset
     - To avoid training on parallel data, in the case where the input corpora happen to be
       parallel, jump forward in the second corpus by this number of utterances, putting the
       skipped utterances at the end instead. Default: 10k utterances
     - int
   * - cross_sentences
     - By default, the sliding window passed over the corpus stops at the end of a sentence
       (or whatever sequence division is in the input data) and starts again at the start of
       the next. Instead, join all sequences within a document into one long sequence and
       pass the sliding window over that
     - bool
   * - validation
     - Number of samples to hold out as a validation set for training. Simply taken from the
       start of the corpus. Rounded to the nearest number of batches
     - int
   * - embedding_activation
     - Activation function to apply to the learned embeddings before they're used, and also
       to every projection into the embedding space (the final layers of compositions). By
       default, 'linear' is used, i.e. normal embeddings with no activation and a linear
       layer at the end of the composition functions. Choose any Keras named activation
     - string
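
To illustrate the format of the ``context_weights`` schedule described in the table above, here
is a purely illustrative (not recommended) setting. The first six values are the starting
weights; the following group of seven gives a target epoch number and the six weights to arrive
at by that point, reached by linear interpolation:

.. code-block:: ini

    # Sample only (1,1) context pairs at first, then move linearly towards
    # uniform sampling over all six length pairs, arriving at epoch number 2
    context_weights=1,0,0,0,0,0,2,1,1,1,1,1,1

Read this way, unigram-unigram pairs dominate early training, with the weights for the longer
context pairs increasing linearly until all six length pairs are equally likely once epoch
number 2 is reached.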