# AstroCLIP
Multimodal contrastive pretraining for astronomical data

<a href="https://arxiv.org/abs/2310.03024" style='vertical-align:middle; display:inline;'><img
  src="https://img.shields.io/badge/astro--ph.IM-arXiv%3A2310.03024-B31B1B.svg" class="plain" style="height:25px;" /></a>

The goal of this project is to demonstrate that contrastive pre-training between two different astronomical data modalities (multi-band imaging and optical spectra) yields a meaningful embedding space that captures physical information about galaxies and is shared between both modalities.

## Results

We encourage you to take a look at our [NeurIPS 2023 AI4Science submission](https://arxiv.org/abs/2310.03024) (still under review) for a longer-form description of our results, but here are the main takeaways:
 - Both the image and spectrum encoders extract meaningful physical information from the input data.
 - The embeddings of images and spectra are well aligned, allowing us to retrieve the spectrum that corresponds to a given image, and vice versa, as sketched below.
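To make this retrieval idea concrete, here is a minimal sketch of cross-modal retrieval by cosine similarity. The tensors below are random placeholders standing in for precomputed, L2-normalized embeddings; they are not outputs of the actual models.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: (N, D), L2-normalized along the feature axis.
image_emb = F.normalize(torch.randn(1000, 128), dim=-1)
spectrum_emb = F.normalize(torch.randn(1000, 128), dim=-1)

# Cosine similarity between one query image and every spectrum.
query = image_emb[0]            # (D,)
sims = spectrum_emb @ query     # (N,)

# Indices of the 5 best-matching spectra for this image.
top5 = torch.topk(sims, k=5).indices
print(top5)
```
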
The notebook used to generate the plots in the paper can be found [here](notebooks/PaperPlots.ipynb).

Below is a visualization of the learned embeddings, obtained by taking the first two PCA components of the spectrum and image embeddings. As one can see, images and spectra recover similar main factors of variation.
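
For reference, a minimal sketch of how such a projection can be computed with scikit-learn (scikit-learn is not a stated dependency of this repo, and the embedding array is a hypothetical placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for precomputed embeddings, shape (N, D).
embeddings = np.random.randn(1000, 128)

# Project onto the first two principal components.
coords = PCA(n_components=2).fit_transform(embeddings)
print(coords.shape)  # (1000, 2)
```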
Visualizing the structure of the latent space by UMAP dimensionality reduction further highlights some of its information content. Below is an example of a UMAP of the spectrum embeddings:
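
A minimal sketch of such a projection, assuming the `umap-learn` package and a hypothetical array of precomputed spectrum embeddings:

```python
import numpy as np
import umap  # provided by the umap-learn package

# Placeholder for precomputed spectrum embeddings, shape (N, D).
spectrum_emb = np.random.randn(1000, 128)

# Reduce to 2D for visualization.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(spectrum_emb)
print(coords.shape)  # (1000, 2)
```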
## Products: Datasets and Trained Models

### Dataset

As part of this project, we compile and make available a combined dataset of DESI Legacy Survey g,r,z images and DESI Early Data Release spectra. These images are a subset of the [ssl-legacysurvey](https://github.com/georgestein/ssl-legacysurvey) sample compiled by @georgestein from the Legacy Survey DR9. Scripts used to match these datasets are available [here](scripts/cross_match_data.py).
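
For intuition, here is a minimal sketch of positional cross-matching with astropy; this is an illustration with made-up coordinates, not the repository's actual script (see `scripts/cross_match_data.py` for that):

```python
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# Placeholder RA/Dec (degrees) for the two catalogs.
img = SkyCoord(ra=np.array([150.1000, 150.2]) * u.deg,
               dec=np.array([2.1000, 2.2]) * u.deg)
spec = SkyCoord(ra=np.array([150.1001, 150.9]) * u.deg,
                dec=np.array([2.1001, 2.9]) * u.deg)

# For every image position, find the nearest spectrum on the sky.
idx, sep2d, _ = img.match_to_catalog_sky(spec)

# Keep only matches closer than 1 arcsecond.
good = sep2d < 1 * u.arcsec
print(idx[good], sep2d[good].to(u.arcsec))
```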

For convenience, we provide a Hugging Face Datasets loading script which will automatically download the data needed and prepare the dataset on your computer.

```python
from datasets import load_dataset

# This downloads about 60 GB of data
dset = load_dataset('astroclip/datasets/legacy_survey.py')
```

For an example of getting started with this dataset, e.g. to simply predict redshift from the spectra, you can take a look at this [notebook](notebooks/dev/ConvolutionalPrototyping.ipynb).
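
As a first step before any modeling, a hedged sketch of inspecting the dataset; the split name and field names such as `'redshift'` are assumptions here, so verify them against the loading script:

```python
from datasets import load_dataset

# Assumed split name; see the loading script for what is actually defined.
dset = load_dataset('astroclip/datasets/legacy_survey.py', split='train')

# Print the schema and one example; field names such as 'redshift'
# are assumptions -- verify them against dset.features.
print(dset.features)
example = dset[0]
print({k: type(v) for k, v in example.items()})
```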

### Training scripts and model weights

**[Coming soon]**

## Requirements

This repo should only have basic PyTorch and Hugging Face requirements. The following should install all that is needed (when run from this repository):

```bash
pip install .
```