Codebase for SCiL 2019 paper

Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data
Mark Granroth-Wilding and Hannu Toivonen (2019)

This codebase contains the code used to prepare the data and train the models for the paper. The code is released on GitHub.

For more information about the paper, including downloadable pre-trained embeddings, see here.

The codebase uses Pimlico. Pimlico pipeline config files can be found in the pipelines directory, and most of the code consists of Pimlico modules (documented here).

The code has been cleaned up for release, which involved removing a lot of old code from various experiments carried out over a number of years. Hopefully, I’ve not removed anything important, but get in touch with Mark if something seems to be missing.

In the paper, the model is called Xsym. In this code, it is called neural_sixgram.

Getting started

To start using the code, see Pimlico’s guide for initializing Pimlico with someone else’s code.

In short (the full sequence is repeated as a shell session after this list)…

  • Download the codebase and extract it.
  • Download the bootstrap.py script from Pimlico to the root directory of the codebase.
  • In the root directory, run: python bootstrap.py pipelines/char_embed_corpora.conf
  • Check that the setup has worked:
    • cd pipelines
    • ./char_embed_corpora.conf status
    • Pimlico should do some initial setup and then show a long list of modules in the pipeline
  • Delete bootstrap.py
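
Putting those steps together, after downloading and extracting the codebase and fetching bootstrap.py from Pimlico into its root directory:

    # Run from the root directory of the codebase
    python bootstrap.py pipelines/char_embed_corpora.conf

    # Check that the setup has worked: Pimlico should do some initial
    # setup and then show a long list of modules in the pipeline
    cd pipelines
    ./char_embed_corpora.conf status

    # Once everything is working, bootstrap.py is no longer needed
    cd ..
    rm bootstrap.py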

Pipelines

There are two pipelines, covering the main model training and the language corruption experiments described in the paper.

In addition, if you want to reproduce everything we did, you'll need to preprocess the data for the low-resource Uralic languages to clean it up. That preprocessing is implemented and documented in a separate codebase, which also uses Pimlico.

char_embed_corpora

Main model training pipeline.

This pipeline loads a variety of corpora and trains Xsym on them, producing all the models described in the paper. To train on these corpora, you'll need to download them (see Corpora below) and update the paths in the [vars] section to point to their locations.
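
To give an idea of what that involves, the [vars] section is a plain block of key=value paths near the top of the config. The variable names below are made up for illustration: edit the ones actually defined in char_embed_corpora.conf.

    [vars]
    # Hypothetical variable names: use those defined in the real config
    ylilauta_path=/data/corpora/ylilauta
    estonian_ref_path=/data/corpora/estonian_reference
    danish_wiki_path=/data/corpora/dawiki
    europarl_path=/data/corpora/europarl

The rest of the pipeline refers to these values via config variable substitution (%(varname)s), so the paths only need updating in one place.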

There are two slightly different implementations of the training code, found in the Pimlico modules neural_sixgram and neural_sixgram2. If you’re training the model yourself, you should use the more recent and more efficient neural_sixgram2.
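
Which implementation a pipeline uses is simply a matter of the module type named in the relevant config section. Purely as a hypothetical sketch (the real section names, module paths and inputs are in the pipeline config files):

    # Hypothetical section: see char_embed_corpora.conf for the real one
    [train_xsym]
    type=langsim.modules.neural_sixgram2
    # ...inputs and training options as defined by the module's interface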

The pipeline also includes training on some language pairs not reported in the paper.

char_embed_corrupt

Language corruption experiments to test Xsym’s robustness to different types of noise.

This pipeline implements the language corruption experiments reported in the paper. It takes real language data (Finnish forum posts), applies random corruptions to it, and trains Xsym on pairs of uncorrupted and corrupted text.
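
The corruption types and noise levels are defined in the pipeline; purely as an illustration of the general idea, and not the pipeline's actual implementation, a simple random character-substitution corruption could look like this in Python:

    import random

    def corrupt(text, alphabet, sub_prob=0.1, rng=None):
        """Illustrative sketch only: randomly replace characters within a
        given alphabet to simulate a 'corrupted' language."""
        rng = rng or random.Random(0)
        out = []
        for ch in text:
            if ch in alphabet and rng.random() < sub_prob:
                # Note: a character may occasionally be replaced by itself
                out.append(rng.choice(alphabet))
            else:
                out.append(ch)
        return "".join(out)

    # The model is then trained on pairs of clean and corrupted text
    clean = "esimerkki suomenkielisestä tekstistä"
    noisy = corrupt(clean, alphabet="abcdefghijklmnopqrstuvwxyzäö")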

Corpora

Ylilauta

Finnish forum posts.

Estonian Reference Corpus

Corpus of written Estonian from a variety of sources. Here we use just two subsets: tasakaalus_ajalehed and foorumid_lausestatud.

Danish Wikipedia dump

Text dump of Danish Wikipedia.

Europarl

The Europarl corpus of transcripts from the European Parliament.

Download the full source release. We use the Swedish, Spanish and Portuguese parts.

Multilingual Resource Collection of the University of Helsinki Language Corpus Server (UHLCS)

The data used to be available from the UHLCS homepage, but is now distributed through CSC. You'll need to request access to the specific language datasets used.

The data you get is messy, in inconsistent formats and encodings. See the separately distributed preprocessing code (mentioned above) for how to get it into a usable textual form, which is what the pipeline here uses.

Documentation