Just as we’ve recently witnessed increasingly broad use cases from our industrial customers, numerous academic groups are finding success using SigOpt to optimize and improve the accuracy of natural language models. Recently, a team of researchers from Luleå University of Technology in Sweden optimized a Transformer model’s performance on named entity recognition in Swedish. With little public training data available for Swedish natural language tasks, it took the team six straight days to organize, clean, and prepare the analogy test data.
Read on to find out how they executed their research, and how they used SigOpt to achieve better results:
How would you summarize your research and for whom is it most useful?
The research shows that embeddings from relatively smaller corpora can outperform ones from far larger corpora, given the right hyper-parameter combination, and presents an analogy test set for evaluating Swedish embeddings. The work uses both intrinsic and extrinsic tests (the latter via named entity recognition, or NER) to evaluate word2vec and subword embeddings in English and Swedish with the Transformer architecture. It also demonstrates, qualitatively, their performance during NER. This work is most useful for natural language processing (NLP) researchers in various sub-fields, and particularly for Swedish researchers who need a simple and quick intrinsic evaluation of word embeddings.
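An intrinsic analogy test evaluates embeddings with vector arithmetic: for a relation like man : woman :: king : ?, the vector closest to king − man + woman should be queen (the 3CosAdd method). Below is a minimal sketch with toy 2-D vectors; the vocabulary, values, and function names are illustrative, not taken from the Swedish test set:

```python
import numpy as np

# Toy embeddings chosen so the classic analogy holds exactly.
vocab = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([2.0, 0.0]),
    "queen": np.array([2.0, 1.0]),
    "fish":  np.array([0.0, 3.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c):
    """Return the word most similar to b - a + c (3CosAdd),
    excluding the three query words themselves."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(solve_analogy("man", "woman", "king"))  # -> queen
```

A real test set scores an embedding by the fraction of such analogy questions it answers correctly.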
How did you decide on the Transformer Encoder model architecture for use in your research?
The Transformer Encoder architecture was chosen based on its acclaimed performance in recent work by its creators and other researchers.
How did you employ SigOpt? Specifically, which parameters did you tune? Did you consider tuning other parameters, and why or why not?
The template code provided by SigOpt was adapted for the research. Three hyper-parameters were tuned within an observation budget of 45: the network optimizer (Adam or RMSProp), the number of Transformer layers (6–12), and the number of attention heads (2–6). We could have tuned additional hyper-parameters, but doing so would have required a larger observation budget and therefore more time.
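The search space described above can be sketched with SigOpt’s Python client. This is a minimal sketch, not the team’s actual code: the experiment name and the `train_and_evaluate` callback are illustrative placeholders, and the import is deferred inside the function so the configuration can be read without the client installed.

```python
# Search space mirroring the three tuned hyper-parameters:
# optimizer (Adam vs. RMSProp), Transformer layers (6-12), attention heads (2-6).
EXPERIMENT_META = dict(
    name="Swedish NER Transformer tuning (illustrative)",
    parameters=[
        dict(name="optimizer", type="categorical",
             categorical_values=["Adam", "RMSProp"]),
        dict(name="num_layers", type="int", bounds=dict(min=6, max=12)),
        dict(name="num_heads", type="int", bounds=dict(min=2, max=6)),
    ],
    observation_budget=45,
)

def run_optimization(client_token, train_and_evaluate):
    """Suggest/observe loop: train_and_evaluate(assignments) must train a
    model with the suggested hyper-parameters and return a metric value."""
    from sigopt import Connection  # deferred so the sketch reads standalone

    conn = Connection(client_token=client_token)
    experiment = conn.experiments().create(**EXPERIMENT_META)
    for _ in range(experiment.observation_budget):
        suggestion = conn.experiments(experiment.id).suggestions().create()
        value = train_and_evaluate(suggestion.assignments)
        conn.experiments(experiment.id).observations().create(
            suggestion=suggestion.id, value=value)
    return experiment.id
```

Each loop iteration asks SigOpt for a new hyper-parameter suggestion and reports back the observed metric, which the optimizer uses to pick the next suggestion.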
Why did you choose SigOpt over evolutionary search, grid search, or open source options?
We chose SigOpt on the recommendation of an existing user who found it a very useful tool. Besides, grid search takes considerably more time and effort.
Where do you feel that your research succeeded in its goal versus where did it fail to achieve the desired ends?
The research succeeded in many areas: demonstrating that optimal embeddings for NLP can be obtained from relatively smaller corpora given the right hyper-parameter combination, providing the first analogy test set for Swedish word embeddings, conducting the first evaluation of the Swedish embeddings released by Facebook researchers (Grave et al., 2018), and demonstrating qualitatively the Transformer training process in NER. Despite these achievements, there were limitations: embedding trainings were conducted as single runs, and only one downstream task (NER) was used, because of the scarcity of publicly available Swedish corpora.
Is there anything specific to the Swedish language that requires a different approach to tuning the Transformer model or any other Natural Language models?
For now, we don’t see anything specific to the Swedish language that would require tuning the Transformer differently, but our results confirmed that subword (shallow character n-gram model) embeddings perform better for Swedish, a morphologically rich language.
Where do you plan to continue this research in the future? Is there more room to grow with question-answering problems, or do you plan to tackle other aspects of Natural Language?
Future work will involve language model (LM) pre-training, such as with BERT, ALBERT, or some of the newer state-of-the-art models. There’s more room to grow performance on question-answering, NER, and many other NLP tasks, including for low-resourced languages like Yoruba (an African language currently being worked on).
LTU has one of the largest compute clusters in Northern Europe, and Tosin says he’s excited to train larger models on a new GPU-backed Kubernetes cluster, a huge infrastructural investment for the university.
If you’re interested in trying out SigOpt to tune and optimize natural language models, you might enjoy Meghana Ravikumar’s blog series on tuning BERT, or you can use SigOpt for free to more efficiently solve your research problems in academia. You can also sign up for Experiment Management here.