November 2, 2020

How to get started with ML Reproducibility Challenge 2020

Ananya Harsh Jha

Disclaimer: all authors are members of the PyTorch Lightning team

Learn how you can help mitigate the deep learning reproducibility crisis and sharpen your skills at the same time, with the help of the PyTorch Lightning Bolts research toolbox.

Inspired by v1 @ NeurIPS 2018

What is reproducibility and why should you care?

The ability to reproduce experimental results is the foundation of every scientific domain. However, a lack of reproducibility has plagued machine learning, a field that in many cases relies heavily on stochastic optimization without guarantees of convergence. In deep learning, where the key to reproducibility often lies in the tiniest of details, many authors fail to mention the crucial parameter or training procedure that led them to their state-of-the-art results.

Reproducibility is important not just for identifying new areas of research, but also for making methods more explainable, which is crucial when we try to use such algorithms to replace human decision-making. Standardizing submissions for reproducibility does not necessarily mean replicating the exact set of results published in the main paper, but rather giving other researchers the guidelines they need to reach the same conclusions on their own task and compute budget.

The Reproducibility Challenge

To mitigate this issue, after the initial Reproducibility in Machine Learning workshop at ICML 2017, Dr. Joelle Pineau and her colleagues started the first version of the Reproducibility Challenge at ICLR 2018. The main goal of this challenge was to encourage people to reproduce results from ICLR 2018 submissions, which were readily available on OpenReview. This was followed by a v2 of the challenge at ICLR 2019 and a v3 at NeurIPS 2019, where the accepted papers were also made available via OpenReview.

In the paper `Improving Reproducibility in Machine Learning Research`, Pineau et al. identify the main causes of the reproducibility gap in machine learning.

Dr. Pineau has also released the reproducibility checklist:

The reproducibility checklist, v2.0 @ NeurIPS 2020

The purpose of this checklist is to serve as a guide for authors and reviewers on the expected standards of reproducibility for results submitted to these conferences.

ML Reproducibility Challenge 2020

This year, the ML Reproducibility Challenge expanded its scope to cover 7 top AI conferences in 2020 across machine learning, natural language processing, and computer vision: NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, and ECCV. The challenge is open to everyone: all you need to do is select and claim a published paper from the list and attempt to reproduce its central claims.

The objective is to assess if the conclusions reached in the original paper are reproducible; for many papers replicating the presented results exactly isn’t possible, so the focus of this challenge is to follow the process described in the paper and attempt to reach the same conclusions.

— Jesse Dodge, The Reproducibility Challenge as an Educational Tool

Reproducibility with Lightning

Some of the obstacles to reproducing results depend on how research scientists organize their projects; for the rest, you can use PyTorch Lightning to help close the gap.

The creators and core contributors of PyTorch Lightning have been advocates for reproducibility in machine learning and deep learning research. In fact, v3 of the Reproducibility Challenge at NeurIPS 2019 officially recommended using PyTorch Lightning for submissions to the challenge.

The main philosophy of Lightning is decoupling engineering from research, thus making the code more readable. Our team at Lightning strives to offer a standard for writing deep learning repositories in a way that makes it much easier for anyone to know what your code is doing, and where the interesting pieces for research are.

To make research more reproducible, we created PyTorch Lightning Bolts, our toolbox of state-of-the-art models, DataModules, and model components. The idea of Bolts is to let you start your project on top of pre-built components and iterate quickly on your research instead of worrying about setting up the project or trying to reproduce previously published results.


For example, if you are working on improving the standard ImageGPT, just subclass the existing implementation and start your awesome new research:

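Below is a minimal sketch of what that might look like. It assumes the `ImageGPT` class in `pl_bolts.models.vision` exposes its GPT backbone as `self.gpt` and its loss as `self.criterion`; attribute names and module paths can differ between Bolts versions:

```python
from pl_bolts.models.vision import ImageGPT


class VideoGPT(ImageGPT):
    # Subclass the pre-built ImageGPT and override only the piece you
    # want to research -- here, the training step.
    def training_step(self, batch, batch_idx):
        x, _ = batch
        # flatten pixels into a token sequence (the real implementation
        # quantizes/tokenizes the input before feeding it to the GPT)
        x = x.view(x.size(0), -1).long()

        logits = self.gpt(x)  # reuse the existing GPT backbone

        # ------------------------------------------------------------
        # your new research idea goes here: a new loss, an auxiliary
        # objective, a different sampling scheme, ...
        # ------------------------------------------------------------
        loss = self.criterion(logits.view(-1, logits.size(-1)), x.view(-1))

        self.log("train_loss", loss)
        return loss
```

Everything else, from the optimizer configuration to the data loading, is inherited from the existing implementation, so you only write the part that is actually new.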

1. Data

If your work involves some of the standard datasets used for research, utilize the available LightningDataModules, and use seed values to specify the exact split on which you ran your experiments!

A DataModule encapsulates the five steps involved in data processing in PyTorch:

  1. Download / tokenize / process.
  2. Clean and (maybe) save to disk.
  3. Load inside Dataset.
  4. Apply transforms (rotate, tokenize, etc…).
  5. Wrap inside a DataLoader.
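
As a sketch, a custom LightningDataModule that follows these five steps might look like this (using MNIST as a stand-in dataset and a 55,000/5,000 train/validation split chosen for illustration):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "./data", batch_size: int = 32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()  # 4. transforms to apply

    def prepare_data(self):
        # 1. download / tokenize / process (runs once, on a single process)
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # 2. + 3. clean the data and load it inside Dataset objects
        mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
        self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
        self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        # 5. wrap everything inside DataLoaders
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size)
```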

This class can then be shared and used anywhere:

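For example, here is a sketch of how a ready-made DataModule from Bolts plugs into a Trainer. `MyModel` stands in for any LightningModule, and the exact constructor arguments of `CIFAR10DataModule` may differ between Bolts versions:

```python
import pytorch_lightning as pl
from pl_bolts.datamodules import CIFAR10DataModule

# fix the global seed so splits, weight init and shuffling are reproducible
pl.seed_everything(42)

# the DataModule owns downloading, splitting and the DataLoaders
dm = CIFAR10DataModule(data_dir="./data", val_split=5000)

model = MyModel()  # any LightningModule (hypothetical here)

trainer = pl.Trainer(max_epochs=10, deterministic=True)
trainer.fit(model, datamodule=dm)
```

With the seed fixed, anyone running this script gets the same train/validation partition and the same initialization that you used.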

In Bolts you can find DataModule implementations for standard research datasets such as MNIST, FashionMNIST, CIFAR-10, STL-10, and ImageNet.

2. Model checkpointing

Lightning offers automatic checkpointing so you can resume training at any point. When you create and save your models with PyTorch Lightning, we automatically save the hyper-parameters defined within the LightningModule. The checkpoint also includes the optimizer and LR scheduler states, callbacks, and anything else required to perfectly reconstruct the results of the experiment you just ran to post a new state of the art!
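For instance, here is a minimal sketch with a hypothetical LitClassifier: calling `self.save_hyperparameters()` in `__init__` records the constructor arguments inside every checkpoint, and `load_from_checkpoint` rebuilds the model with them:

```python
import torch
from torch import nn
from torch.nn import functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, input_dim: int = 784, hidden_dim: int = 128, learning_rate: float = 1e-3):
        super().__init__()
        # stores input_dim, hidden_dim and learning_rate in self.hparams
        # and in every checkpoint Lightning writes
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 10)
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x.view(x.size(0), -1)), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


# later, on any machine: rebuild the exact same model, hyper-parameters included
model = LitClassifier.load_from_checkpoint("path/to/checkpoint.ckpt")
```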

Our model checkpoints contain more than just the state_dict!

Read more on checkpointing.

3. Pre-trained weights and experiment logs

With Bolts, we provide a collection of pre-trained weights along with the logs from the experiments used to achieve each result. We provide verified results so you have a tested starting point for the papers you wish to reproduce, instead of spending time trying to replicate a claim from a paper on your own. We are striving to add more model checkpoints and replicable results to Bolts in the coming months.
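As an illustration, loading one of these pre-trained models is only a couple of lines. This is a sketch: the checkpoint URL below is a placeholder, so use the real weights link listed in the Bolts docs for the result you want:

```python
from pl_bolts.models.self_supervised import SimCLR

# placeholder URL -- replace with the weights link from the Bolts docs
weight_path = "https://.../simclr-cifar10.ckpt"

# load_from_checkpoint also restores the hyper-parameters stored in the file
simclr = SimCLR.load_from_checkpoint(weight_path, strict=False)
simclr.freeze()  # use the pre-trained encoder as a frozen feature extractor
```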

Below, we show the results of a SimCLR pre-training experiment on CIFAR-10 that we replicated based on the original paper.

Table outlining the results achieved by a model present in Bolts.

The Bolts docs also include example experiment logs, which in this case represent a fine-tuning run after a self-supervised model has been pre-trained. With seeded splits inside the DataModules, anyone can replicate the same results that we show here!
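
As a rough sketch (the fine-tuning module, its feature dimension, and the wiring to the pre-trained encoder are all hypothetical), fixing the seed before building the DataModule pins the exact split used for the fine-tuning run:

```python
import pytorch_lightning as pl
import torch
from torch import nn
from torch.nn import functional as F
from pl_bolts.datamodules import CIFAR10DataModule


class FineTuner(pl.LightningModule):
    # a frozen pre-trained backbone plus a small linear classifier on top
    def __init__(self, backbone: nn.Module, in_features: int = 2048, num_classes: int = 10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep the backbone frozen
            p.requires_grad = False
        self.classifier = nn.Linear(in_features, num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        with torch.no_grad():
            feats = self.backbone(x)
        return F.cross_entropy(self.classifier(feats), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.classifier.parameters(), lr=1e-3)


pl.seed_everything(1234)                    # same seed => same split
dm = CIFAR10DataModule(data_dir="./data", val_split=5000)
finetuner = FineTuner(backbone=simclr)      # e.g. the SimCLR encoder loaded above
pl.Trainer(max_epochs=10, deterministic=True).fit(finetuner, datamodule=dm)
```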


The docs also list the exact hyper-parameters with which our results were generated.


Good luck!

Participating in the reproducibility challenge is a great way to deepen your knowledge of deep learning and to contribute to the wider scientific community. We invite you all to take part, and to consider contributing your model to Bolts to increase its visibility and have it tested against our robust testing suite. If you'd like to ask questions, give feedback, or find people to collaborate with, check out our Slack channel.

Happy coding!

Co-authors of this article: Ananya Harsh Jha and Eden Afek