Galaxy Explorer · Madhu Siddharth Suthagar

Overview

Galaxy Explorer is a cloud-native big-data platform that turns one of the largest scientific catalogues in the world — the European Space Agency's Gaia DR3 star catalogue — into a live, interactive 3D map of the solar neighbourhood. It lets anyone fly through 557 million scientifically validated stars in the browser, scrub the epoch to watch the sky shear over millennia, and run astronomical analytics — stellar density fields, a Hertzsprung–Russell diagram, and unsupervised open-cluster discovery — on demand.

The engineering story is a data-reduction and query-cost story. I started from a raw firehose of roughly 1.8 billion sources, 157 columns and 750+ GB, and engineered it down to a clean, physically meaningful 557 M-star, 13-column serving table — then re-architected the warehouse so that the heavy scan happens once and every interaction afterwards is a cheap, pruned read. The same preprocessing pipeline feeds a second application, a scientifically accurate Night Sky Viewer planetarium, covered at the end of this report. I designed and built the entire system solo, from raw astrometry to a deployed Cloud Run container.

Problem & Goal

Gaia DR3 is a gift and a problem at the same time. It contains positions, parallaxes, proper motions and multi-band photometry for ~1.8 billion stars — but at 750+ GB across 157 columns it is far too large to open in memory, and far too expensive to query naively. Two distinct hard problems follow.

The data-engineering problem: reduce a 1.8-billion-row, 157-column raw catalogue into a clean, trustworthy, physically meaningful dataset — on a single machine — without ever loading it whole.
The serving-cost problem: drive interactive 3D exploration and on-the-fly analytics over hundreds of millions of stars from a browser, at low latency and near-zero marginal cost. The naive cloud approach — point a warehouse query at the whole catalogue every time a user moves a slider — full-scans tens of gigabytes per interaction, which is both slow and a real, recurring dollar cost.

The goal was a system where rigour and cost-awareness are designed in from the first byte: a reproducible preprocessing pipeline that shrinks and structures the data, and a warehouse layer that pays the big scan once and then prunes every subsequent query down to only the sky and distance the user is actually looking at.

The Dataset — Gaia DR3

Gaia is an ESA space observatory that has measured the positions, distances (via parallax), motions and colours of nearly two billion stars. Its third data release, Gaia DR3, is the backbone of modern Milky Way science. I worked from the full gaia_source catalogue: ~1.8 billion rows, 157 columns of astrometry and photometry, 750+ GB on disk.

One detail in the catalogue is worth calling out because the whole spatial-indexing strategy hangs on it: Gaia's 64-bit source_id encodes each star's position on the sky. The most significant bits are the star's HEALPix pixel at level 12, so source_id / 2^35 recovers the level-12 sky pixel with zero spatial computation, and integer-dividing again rolls that up the HEALPix hierarchy to coarser levels. That free, baked-in spatial key becomes the partition and clustering column for everything downstream.

Stage	Sources	Columns	Footprint
Raw Gaia DR3 `gaia_source`	~1.8 B	157	750+ GB
Quality-filtered + feature-engineered (GCS Parquet → external BigQuery table)	557 M	~20	~77 GB
Native serving table — clustered & slimmed	557 M	13	pruned per query

Data Engineering Pipeline

The reduction from 1.8 billion raw sources to a 557 M-star serving catalogue runs as a multi-stage Polars pipeline. Polars' lazy, multi-threaded, larger-than-memory execution let me process the catalogue one Parquet shard at a time on a single machine — never holding more than a partition in memory — while staying in a columnar format end to end.

Column projection — 157 → ~12

The first and cheapest win is throwing away what I don't need. Of 157 columns, the explorer needs roughly a dozen — positions (ra, dec), parallax, proper motions (pmra, pmdec), three-band photometry and source_id. Projecting to those columns and re-encoding as Zstandard-compressed Parquet collapses the per-row footprint before any heavy work begins, so every later stage reads a fraction of the original I/O.

Scientific quality filtering — 1.8 B → 557 M

Raw Gaia rows include blended sources, spurious astrometric solutions and contaminated photometry that would poison a 3D map. I apply the community-standard Gaia DR3 quality cuts as a single vectorised Polars mask, keeping only high-confidence stars:

Astrometric reliability: RUWE < 1.4 (the standard renormalised-unit-weight-error threshold), astrometric_excess_noise < 1, astrometric_sigma5d_max ≤ 2, at least 15 good along-scan observations and at most 5 bad ones, and a valid 5-/6-parameter solution.
Photometric integrity: a flux-over-error floor, caps on contaminated and blended BP/RP transits, and a quadratic colour-excess mask (phot_bp_rp_excess_factor < 1.3 + 0.06·(BP−RP)²) that rejects sources whose colours are corrupted by a nearby neighbour.
Cleanliness: drop duplicated sources, non-single-star solutions, and anything flagged as a quasar or galaxy candidate.

The mask retains roughly 31% of the catalogue — 557 million stars whose positions, distances and colours can be trusted. This is the step that makes everything downstream scientifically defensible rather than just visually plausible.

Spatial indexing with HEALPix

Rather than compute a spatial index, I derive it for free from source_id: healpix_12 = source_id / 2^35 gives the level-12 pixel, and healpix_2 = healpix_12 / 4^10 rolls it up to HEALPix level 2 — 192 equal-area sky sectors. HEALPix (Hierarchical Equal-Area isoLatitude Pixelisation) tiles the celestial sphere into nested, equal-area cells, so a coarse level-2 sector cleanly contains all of its finer children. I physically partition the Parquet by healpix_2, so any region-of-sky query reads only the relevant shards.

HEALPix tessellation of the celestial sphere into equal-area pixels — Figure 1: HEALPix tiles the sky into nested, equal-area pixels. Gaia's source_id encodes each star's level-12 pixel, which I roll up to 192 level-2 sectors used as the partition / cluster key.

Kinematic feature engineering

For 3D rendering and time projection I convert each star from celestial coordinates into a Cartesian state. A unit direction vector (x, y, z) comes from ra/dec; the physical distance comes from the parallax (distance_pc = 1000 / parallax in milliarcseconds, then distance_ly = distance_pc × 3.26156); and a transverse-velocity vector (vx, vy, vz) is derived from the proper motions pmra and pmdec. Stars with unreliable, near-zero parallax (< 0.2 mas) are parked at a 10,000-ly sentinel and excluded from analytics, so noisy distances never reach the map.

Distance binning & aligned partitioning

Finally I bucket each star into a 50-light-year shell (distance_bin = floor(distance_ly / 50) × 50) and lay the data out by both keys — sky sector (healpix_2) and distance shell (distance_bin). This is a deliberate choice: these are the exact two columns BigQuery later clusters on, so the on-disk physical layout and the warehouse's pruning keys are aligned. A "show me this patch of sky between 200 and 400 ly" query touches a small, contiguous slice instead of the whole galaxy.

BigQuery: Sub-Second Queries at Low Cost

The processed catalogue lands in Google Cloud Storage as partitioned Parquet, exposed to BigQuery as an external table over ~77 GB / 557 M rows. This is where the project's central engineering problem lives — and the part I'm most proud of.

The clustered native table

External BigQuery tables cannot be clustered, so every query against them full-scans all 77 GB. The original design also used ORDER BY RAND() to sample stars — meaning each time a user moved the distance slider, BigQuery scanned the entire 77 GB catalogue at roughly $0.38 per interaction (on-demand pricing of $5/TB). That is unacceptable for an interactive app.

The fix is a one-time materialisation into a native, clustered, column-slimmed table. I pay the full 77 GB scan exactly once to build it; from then on, every interactive query prunes by healpix_2 and distance_bin and reads only the blocks in view:

CREATE OR REPLACE TABLE gaia_ly.stars_native
CLUSTER BY healpix_2, distance_bin AS
SELECT source_id, healpix_2, distance_bin, distance_ly,
       x, y, z, vx, vy, vz,
       phot_g_mean_mag, bp_rp, has_rvs
FROM   gaia_ly.stars            -- external table over ~77 GB of Parquet in GCS
WHERE  distance_ly > 0;

The interactive /stars query then drops ORDER BY RAND() entirely and relies on cluster pruning plus a LIMIT, so a sector-and-shell request reads a small fraction of the table instead of all 77 GB. Every BigQuery job in the build is also wrapped in a free dry-run cost estimator that prints bytes-scanned and a dollar figure before anything executes — cost is a first-class part of the workflow, not an afterthought.

Workload	Naive approach	Engineered approach
Interactive star fetch	external table, `ORDER BY RAND()` → full 77 GB scan (~$0.38) every interaction	native clustered table pruned by `(healpix_2, distance_bin)` + `LIMIT` → a small fraction of 77 GB
Stellar density map	`GROUP BY` over 77 GB, on demand	precomputed `density_voxels` → kilobytes scanned
HR / colour–magnitude diagram	`GROUP BY` over 77 GB, on demand	precomputed `hr_bins` → kilobytes scanned

Precomputed aggregate tables

Analytics that don't need individual stars are pushed into tiny precomputed tables, so the app reads kilobytes instead of re-scanning the catalogue:

density_voxels — one row per (sky sector × distance shell): COUNT(*), mean magnitude and colour, and a flux-weighted centroid (GROUP BY healpix_2, distance_bin). This is the 3D stellar-density field the "Structure" page renders — a whole galaxy's worth of structure in a few thousand rows.
hr_bins — a 2D histogram of colour (BP−RP) against absolute magnitude, computed in-warehouse via absmag = g − 5·log10(distance_pc) + 5 and binned to a 0.05 × 0.2 grid. It powers a live Hertzsprung–Russell diagram in which the main sequence and giant branch emerge straight from the data.

The core result: because external tables can't be clustered, the original design re-scanned all 77 GB — about $0.38 — on every interaction. Materialising a native table CLUSTER BY healpix_2, distance_bin turns each interaction into a pruned scan of just the sky sectors and distance shells in view, and pushing the density map and HR diagram into precomputed aggregate tables drops those workloads from 77 GB scans to kilobytes.

Stellar Families — HDBSCAN in 6D

The "Families" page finds open clusters and co-moving groups — stars born together that still drift through the galaxy as a pack — with no labels, straight from the data. I restrict to the solar neighbourhood within 600 light-years, where Gaia distances are most reliable and the classic clusters live, and run HDBSCAN over the full 6D phase space: position (x, y, z) and space velocity (vx, vy, vz).

The key insight is that a cluster is defined less by where its stars are than by how they move together: members share a space velocity to within ~1 km/s while the surrounding field spreads over ~40 km/s. So I standardise position and velocity separately, then up-weight velocity by 3.5× so shared motion dominates the clustering. HDBSCAN runs with cluster_selection_method="leaf", which returns the many small, dense knots that are real clusters rather than collapsing them into one giant root blob.

A post-processing pass keeps only genuinely compact groups — rejecting any "cluster" whose positional RMS exceeds 60 ly or that contains more than 8,000 members as diffuse field — and then auto-labels recovered groups by matching their centroids against a reference list of famous clusters (Hyades, Pleiades / M45, Praesepe / M44, Coma Berenices, the Ursa Major moving group, α Persei), with each known cluster naming only its single nearest match. The output is two native tables — a per-group cluster_catalog and the member cluster_stars — served to a clickable 3D catalogue in the frontend.

Real-Time Kinematic Projection

Because every star carries a distance and a transverse-velocity vector, the explorer can project the sky to any epoch — letting you scrub forward or backward through time and watch the solar neighbourhood shear. Each star's position at year t is a first-order, constant-velocity propagation from Gaia's reference epoch (J2016.0):

position(t) = direction × distance + velocity × Δt, Δt = t − 2016

The transverse velocity is converted from proper motion in milliarcseconds-per-year into light-years-per-year by scaling with each star's distance, and the has_rvs flag is carried through to mark the subset with measured radial velocities. This propagation is done server-side in the API, so the browser receives ready-to-render points and the projection scales to tens of thousands of stars without bogging down the client.

Full-Stack System Design

Backend — cloud-native API

The backend is a containerised FastAPI service exposing four endpoints — /stars, /density, /hr, /clusters — each mapping to one of the engineered tables. Queries are built with parameterised ScalarQueryParameters (SQL-injection-safe), and the star sampler uses a density-adaptive LIMIT — more points for wide distance ranges, fewer per sector — to keep payloads bounded. Crucially, the service degrades gracefully: if BigQuery or a precomputed table is unavailable, each endpoint falls back to synthetic mock data, so the UI always renders and local development needs no GCP credentials at all.

Frontend — GPU-accelerated rendering

The frontend is a React + React Three Fiber (Three.js / WebGL) app built with Vite. Tens of thousands of stars are drawn as a single optimised BufferGeometry particle system, holding a smooth 60 FPS, colour-coded by magnitude, with hover inspection. Two pages sit on the same data layer: Structure (the 3D density field plus the live HR diagram) and Families (the HDBSCAN clusters in 3D with a clickable catalogue).

Galaxy Explorer query interface — Figure 2a: Parameterised query interface — choose a distance range, sky sector and epoch.

GPU-rendered 3D star field spanning ~17,000 light-years — Figure 2b: GPU-rendered 3D star field, the solar neighbourhood out to ~17,000 ly.

Deployment

Production is a single Docker container on Google Cloud Run: the built React bundle is served by FastAPI from the same origin as the API, so there is no CORS hop and one stateless service scales horizontally to zero. The same image runs locally with a MOCK_DATA flag for credential-free frontend development.

Engineering Challenges Solved

1.8 billion rows on one machine: the raw catalogue can't be loaded whole. Solved with a shard-at-a-time Polars pipeline in columnar, Zstandard-compressed Parquet — memory stays flat regardless of catalogue size.
External tables can't be clustered: a 77 GB full scan (~$0.38) on every interaction is a non-starter. Solved by materialising a native table CLUSTER BY healpix_2, distance_bin so interactive queries prune to the blocks in view.
Re-aggregating the galaxy per request: density maps and HR diagrams don't need raw stars. Solved by precomputing density_voxels and hr_bins, turning 77 GB GROUP BYs into kilobyte reads.
Zero-cost spatial indexing: instead of an expensive spatial join, I exploit the HEALPix pixel baked into Gaia's source_id and align the on-disk partitioning with the warehouse cluster keys.
Clusters vs. the field: naive density clustering returns one giant blob. Solved with 6D phase-space features, velocity up-weighting, HDBSCAN leaf selection, and a compactness post-filter that rejects the diffuse field.
Resilient serving: a missing table shouldn't break the app. Solved with per-endpoint mock fallbacks, so the UI renders and the project is demoable with no cloud at all.

Night Sky Viewer

The second application built on the same Gaia pipeline is a scientifically accurate planetarium. From your latitude, longitude and a draggable time slider, it renders the real sky as a dome (stereographic) projection — true star colours, constellations, the Sun, Moon and naked-eye planets, the Milky Way band, and day/night twilight shading. Its star field is the naked-eye-visible slice of the processed Gaia catalogue (magnitude ≤ 6.5), cross-matched with the HYG database for proper names and Bayer designations and d3-celestial for constellation lines and the Milky Way outline.

The astronomy is real, not decorative. A dedicated skyview package converts equatorial RA/Dec to horizontal altitude/azimuth with vectorised NumPy via local sidereal time, applies a Bennett atmospheric-refraction term near the horizon, and propagates each star's position from the J2016.0 epoch using its Gaia proper motion. Sun, Moon and planet positions, the Moon's illuminated phase and the twilight state come from astropy with bundled ephemerides and IERS tables, so the deployed app needs no network at runtime. Star colours are physical: a colour index is mapped to an effective temperature (Ballesteros relation), then to an sRGB blackbody tint — hot stars blue-white, Sun-like stars yellow, cool stars orange-red.

Launch Night Sky Viewer 🚀

Night Sky Viewer controls — Figure 3a: Location, time and magnitude controls.

Night Sky Viewer dome projection output — Figure 3b: The dome projection — constellations, planets and the Milky Way for a given place and time.

A magnitude slider (6.5 for dark rural skies down to ~2 for bright urban skies) lets you simulate light pollution, city presets and one-tap GPS set your location, and a time-lapse "play" mode advances the clock so you can watch the whole sky rotate.

Tech Stack

Data engineering: Polars (larger-than-memory, shard-at-a-time processing), NumPy, Zstandard-compressed Parquet, HEALPix spatial indexing.
Warehouse: Google BigQuery (external tables, native clustered tables, precomputed aggregates, dry-run cost estimation), Google Cloud Storage data lake.
Machine learning: HDBSCAN 6D phase-space clustering, scikit-learn (StandardScaler).
Backend: FastAPI, Uvicorn, Pydantic, the BigQuery Python client with parameterised queries.
Frontend: React, React Three Fiber, Three.js (WebGL), Vite, Axios.
Planetarium: Streamlit, astropy (ephemerides, sidereal time, refraction), NumPy; HYG and d3-celestial overlays; SIMBAD / Hipparcos / Tycho-2 cross-matching for names.
Infrastructure: Docker, Google Cloud Run (single-container deploy), graceful mock-data fallback for credential-free local dev.

Outcomes & Impact

Engineered a 750+ GB, ~1.8-billion-source, 157-column raw Gaia DR3 catalogue down to a clean, physically meaningful 557 M-star, 13-column serving table — entirely on a single machine.
Cut interactive query cost from a ~77 GB / ~$0.38 full scan on every interaction to cluster-pruned reads, and pushed density and HR analytics into precomputed tables scanned in kilobytes.
Made 557 M+ stars explorable in 3D at 60 FPS in the browser, with server-side epoch projection for travelling through time.
Recovered known open clusters and moving groups — Hyades, Pleiades, Praesepe, Coma Berenices and more — unsupervised, via 6D HDBSCAN in position + velocity.
Shipped the whole system as a single Cloud Run container with resilient mock-data fallbacks, plus a second app — a scientifically accurate Night Sky planetarium — on the same pipeline.

Conclusion

Galaxy Explorer is the project I point to when I want to show that I can take a genuinely large scientific dataset and engineer it all the way to a fast, cheap, interactive product. It pairs the data engineer's instincts — shrink the data, structure it around how it will be queried, and make the warehouse pay the big scan only once — with the discipline to keep every number scientifically defensible and the cost of every query visible. From 750 GB of raw astrometry to a 3D galaxy you can fly through in a browser, it's the clearest demonstration of how I think about big data: not as something you scan, but as something you design.