Classifying the PLAsTiCC dataset

The Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC) was released in 2018 to spur the development of photometric classification methods for upcoming data from LSST. The authors of the challenge produced a realistic dataset simulating three years of LSST observations along with follow-up spectroscopic observations. The resulting dataset contains light curves for 3,492,888 astronomical objects, but only 7,846 of these objects have spectroscopic follow-up to confirm their types. The authors released this blinded dataset through the Kaggle platform and challenged the community to develop new methods for photometric classification.

An early version of avocado won this challenge, achieving the best weighted log-loss score of the 1,094 classifiers submitted. In this document, we show how to reproduce avocado’s classifications for the PLAsTiCC dataset.
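For reference, the challenge's weighted log-loss metric averages the log loss within each class and then combines the classes using per-class weights. A minimal sketch in Python (the weights, class labels, and probabilities below are illustrative placeholders, not the official challenge values):

```python
import numpy as np

def weighted_log_loss(y_true, y_prob, weights):
    """Weighted multi-class log loss of the PLAsTiCC form.

    y_true:  (n_objects,) integer class indices
    y_prob:  (n_objects, n_classes) predicted probabilities
    weights: (n_classes,) per-class weights (placeholders here)
    """
    y_prob = np.clip(y_prob, 1e-15, 1 - 1e-15)  # avoid log(0)
    loss = 0.0
    for m, w in enumerate(weights):
        mask = y_true == m
        if mask.any():
            # Average log loss over the objects belonging to class m.
            loss += w * -np.log(y_prob[mask, m]).mean()
    return loss / np.sum(weights)

# Toy example: two classes with equal weights.
y_true = np.array([0, 0, 1])
y_prob = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
print(weighted_log_loss(y_true, y_prob, weights=np.ones(2)))
```

Averaging within each class first means that rare classes are not drowned out by common ones, which is why the metric rewards classifiers that perform well across all object types.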

All time estimates are for running on a machine with the following specifications:

  • CentOS 6.5
  • Intel(R) Xeon(R) CPU E3-1270
  • 32 GB RAM

Setup

Installing avocado

First, install avocado and all of its dependencies following the installation instructions. This will install a set of scripts that can be used to interact with avocado datasets.

Setting up a working directory

Create and move to a new working directory for avocado. All of the datasets, classifiers and predictions will be stored in this directory. Every script listed after this should be run from the base of this working directory. For example:

mkdir ~/plasticc
cd ~/plasticc

Downloading the PLAsTiCC dataset

A script is included in avocado to download the PLAsTiCC dataset from Zenodo. This will download the dataset and preprocess it into the format used internally by avocado. Running this script takes ~30 minutes.

avocado_download_plasticc

Augmenting the PLAsTiCC dataset

The avocado_augment script is included to augment datasets. To generate an augmented dataset with the name “plasticc_augment”, run the following command in the working directory:

avocado_augment plasticc_train plasticc_augment

This will take several hours to run. Optionally, if an SGE grid system is available, the augmentation can be split across several jobs with the following command:

avocado_augment_submit plasticc_train plasticc_augment --num_jobs 100 --qsub_arguments '-q all.q'

This will split the augmentation procedure into 100 jobs and submit them to the queue ‘all.q’. Job files and output will be stored in the jobs directory of the working directory. Modify --qsub_arguments as appropriate for your system; similar scripts can be created for other job systems.

Featurizing the datasets

Featurizing datasets is a slow process and takes ~100 hours for the full PLAsTiCC dataset. It is highly recommended to run these jobs in parallel.

To featurize sequentially:

avocado_featurize plasticc_train
avocado_featurize plasticc_test --num_chunks 500
avocado_featurize plasticc_augment

To submit featurize jobs to an SGE queue:

avocado_featurize_submit plasticc_train --qsub_arguments '-q all.q'
avocado_featurize_submit plasticc_test --qsub_arguments '-q all.q' --num_jobs 500
avocado_featurize_submit plasticc_augment --qsub_arguments '-q all.q'

Training the classifier

Several different classifiers can be trained using the same augmented dataset. To train a standard classifier with flat weights named “flat_weight”, run:

avocado_train_classifier plasticc_augment flat_weight

This will take approximately 30 minutes.

Generating predictions

To generate predictions for the full dataset with our “flat_weight” classifier, run:

avocado_predict plasticc_test flat_weight

This will take approximately 1 hour to run.

(optional) Converting predictions to the Kaggle format

The predictions generated by avocado are saved in an HDF5 file by default. They can be converted to the CSV format used by Kaggle with the following command:

avocado_convert_kaggle plasticc_test flat_weight
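If you want to inspect or build such a table yourself, the Kaggle submission format is a CSV with one row per object and one probability column per class. A minimal sketch with pandas (the object IDs, class numbers, and output filename here are hypothetical, and the exact column names produced by avocado may differ):

```python
import pandas as pd

# Hypothetical predictions table: per-object class probabilities of the kind
# avocado_predict produces. Class numbers and values are illustrative only.
predictions = pd.DataFrame(
    {"class_42": [0.7, 0.1], "class_90": [0.3, 0.9]},
    index=pd.Index([13, 14], name="object_id"),
)

# Kaggle expects object_id as the first column, followed by one probability
# column per class; writing the indexed DataFrame to CSV yields that layout.
predictions.to_csv("flat_weight_submission.csv")

print(open("flat_weight_submission.csv").read())
```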

(optional) Training a redshift-weighted classifier

As shown in Boone (2019), a redshift-weighted classifier can be used to generate predictions that are independent of the redshift distribution and rates in the training sample. This is especially important for augmented datasets where the exact form of augmentation will otherwise leak into the classification. To train and generate predictions with a redshift-weighted classifier, run the following commands:

avocado_train_classifier plasticc_augment redshift_weight --object_weighting redshift
avocado_predict plasticc_test redshift_weight
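The intuition behind redshift weighting can be shown with a toy calculation: if each object is weighted inversely to how often its redshift bin appears in the training sample, every redshift bin contributes equally to the training loss, so the classifier cannot exploit the sample's redshift distribution. This sketch is purely illustrative and is not avocado's exact weighting scheme; the redshifts and bin edges are made up:

```python
import numpy as np

# Toy training-sample redshifts (illustrative values only).
redshifts = np.array([0.1, 0.12, 0.15, 0.5, 0.55, 1.2])

# Assign each object to a redshift bin, then weight it by the inverse of
# its bin's population so every bin carries equal total weight.
bins = np.array([0.0, 0.3, 0.8, 2.0])
bin_indices = np.digitize(redshifts, bins) - 1
counts = np.bincount(bin_indices, minlength=len(bins) - 1)
weights = 1.0 / counts[bin_indices]

print(weights)  # the 3 low-redshift objects each get weight 1/3, and so on
```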

(optional) Training classifiers on biased samples

In Boone (2019), we illustrate the bias of a classically trained classifier when the redshift distributions of the training samples are modified. To reproduce these results, run the following commands:

avocado_train_classifier plasticc_augment flat_weight_bias_high --simulate_plasticc_bias high_redshift
avocado_train_classifier plasticc_augment flat_weight_bias_low --simulate_plasticc_bias low_redshift
avocado_train_classifier plasticc_augment redshift_weight_bias_high --object_weighting redshift --simulate_plasticc_bias high_redshift
avocado_train_classifier plasticc_augment redshift_weight_bias_low --object_weighting redshift --simulate_plasticc_bias low_redshift

avocado_predict plasticc_test flat_weight_bias_high
avocado_predict plasticc_test flat_weight_bias_low
avocado_predict plasticc_test redshift_weight_bias_high
avocado_predict plasticc_test redshift_weight_bias_low

(optional) Reproducing the figures in Boone (2019)

A Jupyter notebook that was used to produce all of the figures in Boone (2019) is included with avocado. It can be found on GitHub. To run this notebook, copy it into the working directory after completing all of the previous steps in this document, and open it using Jupyter. Note that the augmentation procedure is not deterministic, so the results will vary slightly between runs. The plots of augmented light curves will need to be adjusted to select objects in the new augmented sample.