---
layout: post
title: "The Multimodal Universe: 100TB of Astronomical Scientific Data"
authors: The Multimodal Universe Collaboration; Eirini Angeloudi, Jeroen Audenaert, Micah Bowles, Benjamin M. Boyd, David Chemaly, Brian Cherinka, Ioana Ciucă, Miles Cranmer, Aaron Do, Matthew Grayling, Erin E. Hayes, Tom Hehir, Shirley Ho, Marc Huertas-Company, Kartheik G. Iyer, Maja Jablonska, Francois Lanusse, Henry W. Leung, Kaisey Mandel, Juan Rafael Martínez-Galarza, Peter Melchior, Lucas Meyer, Liam H. Parker, Helen Qu, Jeff Shen, Michael J. Smith, Connor Stone, Mike Walmsley, John F. Wu
shorttitle: "Multimodal Universe"
date: 2024-12-03 11:00
smallimage: astroclip_update.jpeg
image: astroclip_update.jpeg
blurb: 100TB of cross-matched, standardized astronomy data that brings together images, spectra, and time-series data from leading surveys to accelerate machine learning breakthroughs.
shortblurb: 100TB of cross-matched, standardized astronomy data that brings together images, spectra, and time-series data from leading surveys to accelerate machine learning breakthroughs.
splashimage: /images/blog/astroclip_update.jpeg
link: https://openreview.net/forum?id=EWm9zR5Qy1#discussion
github_link: https://github.com/MultimodalUniverse/MultimodalUniverse
permalink: /blog/multimodaluniverse/
---

Astronomy has always been a data-rich science, but in recent years the sheer volume and complexity of that data have skyrocketed. Today, many researchers turn to machine learning (ML) to handle tasks involving imaging, spectra, and time-series measurements of millions of astrophysical phenomena. However, many of the astronomical surveys in use today store data in specialized formats, making integration extremely time-consuming: researchers can spend weeks or months on data engineering alone.

That’s why we’re excited to have partnered with the **Multimodal Universe** collaboration to introduce a new, large-scale curated collection of standardized data designed to accelerate ML research in astronomy. If you’ve ever dreamed of a unified resource that seamlessly ties together images, spectra, time series, and more from multiple surveys, we built the Multimodal Universe with you in mind.

<p align="center">
<img src="/images/blog/mmu_dset_examples.jpg" alt="Examples of data in the MMU dataset" width="95%" style="mix-blend-mode: darken;">
</p>

---

## Why “Multimodal”?

**Multimodal data** refers to data that comes in multiple formats, or “modalities,” for a given object. For example, an image of a galaxy is a two-dimensional array of pixel intensities, a spectrum encodes brightness at different wavelengths, and a time series captures how the brightness of a source evolves over time. Each of these modalities offers a unique window into the physics of the source under study, which is why pairing them in a single dataset is particularly powerful.
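
Concretely, each modality is just an array with a very different shape and meaning. A minimal sketch of one object seen through all three (the shapes and field names here are illustrative placeholders, not the dataset's actual schema):

```python
import numpy as np

# One hypothetical galaxy, represented in three modalities.
# Shapes are illustrative only, not the Multimodal Universe's actual formats.
galaxy = {
    "image": np.zeros((152, 152, 3)),    # 2-D pixel intensities in 3 bands
    "spectrum": np.zeros((7781, 2)),     # (wavelength, flux) pairs
    "light_curve": np.zeros((300, 3)),   # (time, flux, flux_error) samples
}

# Because each modality is just an array, all three can be batched
# and fed to an ML model with the same tooling.
for name, array in galaxy.items():
    print(f"{name}: shape={array.shape}")
```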

---

## What’s in the Multimodal Universe?

We’ve combined publicly available data from **major astronomical surveys** into one consistently cross-matched framework, summarized in the table below. Images, spectra, hyperspectral data cubes, time-series data… they’re all in here! Each dataset has been carefully pre-processed, documented, and aligned to play nicely with one another right out of the box.

Up-to-date instructions on how to download the data, plus details about cross-matching and referencing the original sources, can be found on the [Multimodal Universe GitHub](https://github.com/MultimodalUniverse/MultimodalUniverse/).
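
Because every survey is mapped into one shared record layout, downstream code can treat them all uniformly. A pure-Python sketch of that idea (the records and field names below are hypothetical stand-ins, not the project's real schema; see the GitHub README for the actual interface):

```python
# Hypothetical standardized records: every survey emits the same keys,
# so a single loop handles data from any of them.
records = [
    {"object_id": "A1", "survey": "hypothetical_survey_1", "modality": "image"},
    {"object_id": "A1", "survey": "hypothetical_survey_2", "modality": "spectrum"},
    {"object_id": "B2", "survey": "hypothetical_survey_1", "modality": "image"},
]

def modalities_for(object_id, records):
    """Collect every available modality for one cross-matched source."""
    return sorted(r["modality"] for r in records if r["object_id"] == object_id)

print(modalities_for("A1", records))  # ['image', 'spectrum']
```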

---

## Key Principles and Features

By collating these diverse surveys and ensuring that each dataset aligns with the rest, the Multimodal Universe follows a few guiding principles:

1. **Multimodal Alignment**
   We provide **careful cross-matching** between surveys, so you can instantly gather all available data (images, spectra, time series, etc.) for a given source.

2. **Standardized Data Formats**
   We unify data storage and metadata standards, making it simpler to combine datasets or swap in new surveys.

3. **Comprehensive Documentation**
   We detail relevant selection effects and biases. If your research depends on subtle coverage issues or redshift limits, we’ve got you covered.

4. **Public Availability of All Scripts**
   All the code used to download, process, and collate the data is public. This ensures **transparency** and makes it easy to replicate the entire pipeline or trace the lineage of each dataset from the ground up.
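
Cross-matching itself boils down to pairing sources whose sky coordinates nearly coincide. A self-contained toy sketch of nearest-neighbour matching by angular separation (for illustration only; the collaboration's actual cross-matching scripts live on the GitHub repository):

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Spherical law of cosines, clamped for floating-point safety.
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

def cross_match(catalog_a, catalog_b, radius_deg=1.0 / 3600):
    """Pair each source in catalog_a with the nearest catalog_b source within
    radius_deg (default 1 arcsecond). O(n*m) brute force; real pipelines use
    spatial indexing to scale to millions of sources."""
    matches = []
    for i, (ra_a, dec_a) in enumerate(catalog_a):
        best = min(
            range(len(catalog_b)),
            key=lambda j: angular_separation(ra_a, dec_a, *catalog_b[j]),
        )
        if angular_separation(ra_a, dec_a, *catalog_b[best]) <= radius_deg:
            matches.append((i, best))
    return matches

# Two toy catalogs: source 0 in A coincides with source 1 in B.
cat_a = [(150.000000, 2.200000), (10.0, -45.0)]
cat_b = [(200.0, 30.0), (150.000100, 2.200050)]
print(cross_match(cat_a, cat_b))  # [(0, 1)]
```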

---

## A Catalyst for Machine Learning in Astronomy

We’ve provided a suite of **benchmarks** in the paper that highlight key scenarios in which this dataset shines. For instance, we replicate the **AstroCLIP** project [1] by combining Legacy Survey images with DESI spectra in just a few lines of code, whereas the original work required a substantial data-engineering effort.
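
Once both modalities live in one cross-matched framework, building image-spectrum training pairs reduces to a join on a shared identifier. A toy sketch of that join (the dictionaries and keys are hypothetical stand-ins for the real Legacy Survey and DESI datasets):

```python
# Hypothetical stand-ins for two cross-matched datasets; in practice these
# would hold image and spectrum arrays keyed by a shared object identifier.
images = {"obj1": "image_1", "obj2": "image_2", "obj3": "image_3"}
spectra = {"obj2": "spectrum_2", "obj3": "spectrum_3", "obj4": "spectrum_4"}

# Contrastive training (as in AstroCLIP) needs paired modalities,
# so keep only the objects observed in both datasets.
pairs = [
    (oid, images[oid], spectra[oid])
    for oid in sorted(images.keys() & spectra.keys())
]
print([oid for oid, _, _ in pairs])  # ['obj2', 'obj3']
```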

Even better, because the underlying data framework is unified, **pipelines** developed for one survey or modality can be **directly transferred** to others. This paves the way for large-scale ML models that draw from multiple instruments and data formats simultaneously.

Finally, challenges like **distribution shifts**, **uncertainty quantification**, and **model calibration** are crucial in scientific ML. The Multimodal Universe’s breadth and diversity of data naturally test the limits of ML model generalizability.

---

## Where to Find It and What’s Next

We host the Multimodal Universe dataset in full at the Flatiron Institute, with the first official release corresponding to the data listed in the table. However, this is an **ongoing project** and will be regularly updated:

- **New surveys** and **instruments** will be incorporated as they release public data.
- **Infrastructure improvements** will support better data discovery and access patterns.
- **Cross-matched catalogs** will be refined to seamlessly link complementary observations.

We envision this living dataset as a **central hub** for ML-driven astronomy, drastically cutting down on the data-engineering overhead that has historically slowed progress.

---

## Getting Started

1. **Visit the Landing Page**
   Head to the [Multimodal Universe GitHub](https://github.com/MultimodalUniverse/MultimodalUniverse/) for the latest version, plus scripts for data retrieval and usage.

2. **Grab Your Citations**
   We provide a simple script that automatically **generates BibTeX citations** and acknowledgements for the specific surveys you use. That’s one less thing to worry about when you publish results!

3. **Contribute Your Data**
   Check out our [contribution guide](https://github.com/MultimodalUniverse/MultimodalUniverse/) if you have data (observations, simulations, or curated samples) you’d like to include.
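
As a rough illustration of what such citation tooling does, here is a hedged sketch of collecting BibTeX entries for the surveys you used (the survey names and entries are hypothetical; the project's actual script on GitHub derives them from the datasets you loaded):

```python
# Hypothetical mapping from survey name to its BibTeX entry.
SURVEY_BIBTEX = {
    "hypothetical_survey": (
        "@misc{hypothetical_survey,\n"
        "  title = {A Hypothetical Survey},\n"
        "  year = {2024},\n"
        "}"
    ),
}

def collect_citations(surveys_used):
    """Return one BibTeX block covering every distinct survey in surveys_used."""
    return "\n\n".join(SURVEY_BIBTEX[s] for s in sorted(set(surveys_used)))

# Duplicates are collapsed, so each survey is cited exactly once.
bib = collect_citations(["hypothetical_survey", "hypothetical_survey"])
print(bib.splitlines()[0])  # @misc{hypothetical_survey,
```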
93+
94+
Whether you’re building a classifier to find elusive supernovae or training a generative model to imagine galaxies we haven’t observed yet, the Multimodal Universe is here to jumpstart your ML x astronomy research. We can’t wait to see what you discover!
95+
96+
*-- Liam Parker*
97+
98+
---
99+
100+
## References
101+
102+
1. Parker, Liam, et al. "AstroCLIP: a cross-modal foundation model for galaxies." Monthly Notices of the Royal Astronomical Society 531.4 (2024): 4990-5011.