NeurIPS 2026 Competition

Draft · Under review

The Virtual Embryo Challenge

Generative modeling of mouse embryogenesis across space, scale, and time — under genetic perturbation.

~1M cells · 11 time points · 3 tasks · 2 tracks

The gap

Embryogenesis is fundamental — and largely unmodelled

A single fertilised cell becomes a complete organism through spatiotemporally coordinated gene regulation, cell-fate transitions, tissue morphogenesis, and organ formation. Disruptions cause congenital defects, which still affect 1 in 33 newborns and remain a leading cause of infant mortality.

Large embryo atlases and spatial-transcriptomics datasets give us snapshots, but they don't reveal how cell states transition, how local molecular changes propagate to tissue- and organ-level phenotypes, or how development responds to perturbation.

The Virtual Embryo Challenge establishes a standardised benchmark for predictive embryogenesis: a curated dataset, an evaluation pipeline, baseline models, and three tasks that jointly stress spatial context, multiscale reasoning, temporal dynamics, and perturbation response.

Tasks

Three tasks, one shared atlas

Each task uses staged train / validation / hidden-test splits over the same whole-embryo + heart-focused resource. Hidden labels are never released; final rankings reflect generalisation to held-out stages, embryos, and genotypes.

Task 1

Temporal gene-expression distribution prediction

Forecast the gene-expression distribution at unseen future stages from earlier single-cell data.

Splits: Train E7.75 · E8.5 · E9.5 → val E10.5 → test E12.5

Why hard: Models must capture developmental trends rather than interpolate between adjacent observed stages.
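The Task 1 split can be written down as a small stage-based partition. The following is a minimal sketch, assuming each cell carries a stage label such as "E8.5"; the names `TASK1_SPLITS`, `stage_to_days`, `split_by_stage`, and `is_forward_in_time` are hypothetical and do not come from the starter kit.

```python
from __future__ import annotations

# Hypothetical encoding of the Task 1 split; names are illustrative only.
TASK1_SPLITS = {
    "train": ["E7.75", "E8.5", "E9.5"],
    "val": ["E10.5"],
    "test": ["E12.5"],
}


def stage_to_days(stage: str) -> float:
    """Convert a stage label such as 'E8.5' to embryonic days (8.5)."""
    if not stage.startswith("E"):
        raise ValueError(f"unexpected stage label: {stage!r}")
    return float(stage[1:])


def split_by_stage(cell_stages: list[str]) -> dict[str, list[int]]:
    """Return cell indices per split; stages outside the split are ignored."""
    lookup = {s: name for name, stages in TASK1_SPLITS.items() for s in stages}
    out: dict[str, list[int]] = {name: [] for name in TASK1_SPLITS}
    for i, stage in enumerate(cell_stages):
        name = lookup.get(stage)
        if name is not None:
            out[name].append(i)
    return out


def is_forward_in_time(splits: dict[str, list[str]]) -> bool:
    """True when val stages follow all train stages and test stages follow all val stages."""
    latest_train = max(stage_to_days(s) for s in splits["train"])
    val_days = [stage_to_days(s) for s in splits["val"]]
    test_days = [stage_to_days(s) for s in splits["test"]]
    return min(val_days) > latest_train and min(test_days) > max(val_days)
```

The `is_forward_in_time` check makes the difficulty explicit: both evaluation stages lie beyond the latest training stage, so a model cannot score well by interpolating between observed stages.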

Task 2

Spatial-temporal multiscale prediction

Predict expression + cell-type composition + 3D spatial organization jointly across stages from 4D MERFISH.

Splits: Future: → val E10.5 → test E12.5 · Intermediate: → val E7.5 → test E8.5

Why hard: Matching global expression is not enough; models must also recover where cell states sit in space.

Task 3

Mutant perturbation prediction

Predict mutant gene expression + 3D spatial organization under unseen genetic knock-outs.

Splits: Train β-catenin (E8.75) → val Gata4 (E8.75) → test Mab21l2 (E9.5)

Why hard: Models must generalise from wild-type development plus one observed perturbation to held-out genetic contexts across collection time points.

Two parallel tracks

Human-designed vs agent-designed, scored side by side

Both tracks address the same three tasks and are scored on the same metrics and hidden test sets. Prizes are awarded separately so the leaderboards directly contrast human-designed and agent-designed approaches to predictive embryogenesis modelling.

Track 1: Human Team

Conventional ML-competition workflow.

Methods designed and supervised by human participants. Algorithm/model design → submission → evaluation. Standard NeurIPS competition track.

Track 2: Agent Team

Coding agents / LLM-driven recursive systems.

Methods produced by coding agents or LLM-based evolutionary algorithms that iteratively select, mutate, and evaluate candidate solutions without human supervision. The agent system + complete evolutionary trace must be shared before prize evaluation.
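The select, mutate, evaluate loop described above can be illustrated with a toy example. This is a minimal sketch and not a required agent design: the candidate (a single float), the `score` function, and the trace record format are all assumptions made for illustration. It shows one way every generation could be logged so that a complete trace exists to share.

```python
import random


def score(candidate: float) -> float:
    """Hypothetical fitness: higher is better, with its peak at candidate == 3.0."""
    return -(candidate - 3.0) ** 2


def evolve(generations: int = 30, population: int = 8, seed: int = 0):
    """Toy select-mutate-evaluate loop that records one trace entry per generation."""
    rng = random.Random(seed)
    pool = [rng.uniform(-10.0, 10.0) for _ in range(population)]
    trace = []
    for gen in range(generations):
        # Evaluate and select: rank the pool and keep the better half.
        ranked = sorted(pool, key=score, reverse=True)
        parents = ranked[: population // 2]
        # Mutate: each surviving parent yields one perturbed child.
        children = [p + rng.gauss(0.0, 0.5) for p in parents]
        pool = parents + children
        # Log the generation so the whole search can be replayed or audited.
        trace.append(
            {"generation": gen, "best": ranked[0], "best_score": score(ranked[0])}
        )
    best = max(pool, key=score)
    return best, trace
```

Because the parents are carried over unchanged, the best score in the trace never decreases from one generation to the next. A real agent system would replace the float with code or model configurations and `score` with validation metrics.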

Data

Multimodal whole-embryo perturbation resource

~1 million cells across 11 developmental time points, spanning early gastrulation through cardiac progenitor emergence, heart-tube formation, looping, and later morphogenesis.

Single-cell

Single-cell multi-omics

Whole-embryo per-cell RNA + chromatin accessibility (Multiome) across staged embryos from E6.75 to E12.5.

Spatial

3D MERFISH

Coronal sections decoded into per-cell 3D positions plus measured RNA across the same developmental window — the 4D atlas powering Task 2.

Annotation

Cell-type + anatomical labels

Per-cell cell-type, tissue-domain, and anatomical-region calls plus morphology-derived features (when available).

Perturbations

Conditional knock-outs at E8.75 / E9.5

β-catenin and Gata4 (E8.75) plus Mab21l2 (E9.5) — three cardiac developmental regulators with paired wild-type controls, profiled by 3D MERFISH.

Evaluation

Three metrics, automatic scoring, hidden labels

Scores are computed on held-out embryos after schema validation (gene order, cell-type vocabulary, coordinate convention, missing-value policy). Sub-scores per task; an overall composite for ranking.
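The four schema checks named above could look roughly like the following. This is a sketch under stated assumptions, since the actual submission schema has not been released: the function name `validate_submission`, its arguments, the (x, y, z) coordinate layout, and the "no NaN or infinite values" policy are all hypothetical.

```python
import numpy as np


def validate_submission(genes, cell_types, coords, expression, ref_genes, vocabulary):
    """Return a list of schema problems; an empty list means the checks passed."""
    problems = []
    n_cells = len(cell_types)

    # Gene order: names must match the reference exactly, in the same order.
    if list(genes) != list(ref_genes):
        problems.append("gene order differs from the reference")

    # Cell-type vocabulary: every label must belong to the allowed set.
    unknown = sorted(set(cell_types) - set(vocabulary))
    if unknown:
        problems.append(f"unknown cell types: {unknown}")

    # Coordinate convention (assumed): one finite (x, y, z) row per cell.
    coords = np.asarray(coords, dtype=float)
    if coords.shape != (n_cells, 3):
        problems.append("coordinates must have shape (n_cells, 3)")
    elif not np.isfinite(coords).all():
        problems.append("coordinates contain NaN or infinite values")

    # Missing-value policy (assumed): expression is dense and fully finite.
    expression = np.asarray(expression, dtype=float)
    if expression.shape != (n_cells, len(ref_genes)):
        problems.append("expression must have shape (n_cells, n_genes)")
    elif not np.isfinite(expression).all():
        problems.append("expression contains NaN or infinite values")

    return problems
```

Returning every problem at once, rather than raising on the first, lets a participant fix a rejected file in a single pass.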

Gene-expression accuracy

Primarily pseudobulk Pearson correlation; complemented by MSE/MAE, gene-wise marginal-variance agreement, distributional distances (MMD, energy, sliced Wasserstein, FID-like), and DEG-overlap with the preceding stage. Bootstrap CIs across embryos.
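The primary metric and its confidence interval can be sketched as follows. This is illustrative only and not the official scorer. It assumes that predicted cells are averaged into one pseudobulk vector, that each held-out embryo is averaged into its own pseudobulk vector, and that a percentile bootstrap resamples the per-embryo correlations; the function names are hypothetical.

```python
import numpy as np


def pseudobulk(expr, embryo_ids):
    """Average a cells x genes matrix within each embryo."""
    expr = np.asarray(expr, dtype=float)
    ids = np.asarray(embryo_ids)
    return {e: expr[ids == e].mean(axis=0) for e in np.unique(ids)}


def pearson(a, b):
    """Pearson correlation of two gene vectors (NaN if either is constant)."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else float("nan")


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-embryo scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def score_against_embryos(pred_expr, obs_expr, obs_embryo_ids):
    """Mean pseudobulk Pearson r across embryos, with a bootstrap interval."""
    pred_bulk = np.asarray(pred_expr, dtype=float).mean(axis=0)
    obs_bulk = pseudobulk(obs_expr, obs_embryo_ids)
    per_embryo = np.array([pearson(pred_bulk, v) for v in obs_bulk.values()])
    return float(per_embryo.mean()), bootstrap_ci(per_embryo)
```

Resampling embryos rather than cells treats the embryo as the unit of replication, which matches the "across embryos" wording above.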

Cell-type composition accuracy

A frozen cell-type probe classifier, trained by the organizers and locked before evaluation, labels predicted and observed cells; the resulting cell-type proportions are compared via base-2 Jensen-Shannon divergence at global, regional, and per-condition levels.
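For reference, the base-2 Jensen-Shannon divergence between two proportion vectors can be computed as below. This is a minimal sketch, assuming the labels have already been produced by the probe classifier; the helper names `proportions` and `js_divergence_base2` are hypothetical.

```python
from collections import Counter

import numpy as np


def proportions(labels, vocabulary):
    """Cell-type proportions in a fixed vocabulary order."""
    counts = Counter(labels)
    vec = np.array([counts.get(c, 0) for c in vocabulary], dtype=float)
    total = vec.sum()
    if total == 0:
        raise ValueError("no labels from the vocabulary were found")
    return vec / total


def js_divergence_base2(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence is 0 for identical compositions and 1 for compositions with no cell type in common. Note that `scipy.spatial.distance.jensenshannon` returns the square root of this quantity (the Jensen-Shannon distance), so the two are not interchangeable.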

Spatial organization accuracy

Fused Gromov-Wasserstein distance combining expression similarity with spatial-structure preservation. Penalises predictions that get the marginals right but the geometry wrong. Augmented with neighborhood MMD / energy / sliced Wasserstein.
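To make the two ingredients concrete, the sketch below evaluates the fused Gromov-Wasserstein objective for a given coupling between predicted and observed cells. It is not the official metric: the FGW distance is the minimum of this objective over all valid couplings, which needs an optimal-transport solver. The squared loss, the `alpha` weighting, and the function names are assumptions for illustration.

```python
import numpy as np


def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)


def fgw_objective(feat_cost, struct_a, struct_b, coupling, alpha=0.5):
    """Evaluate (1 - alpha) * sum_ij M[i,j] T[i,j]
    + alpha * sum_ijkl (C1[i,k] - C2[j,l])**2 T[i,j] T[k,l]."""
    M = np.asarray(feat_cost, dtype=float)    # expression cost between cells
    C1 = np.asarray(struct_a, dtype=float)    # spatial distances within sample A
    C2 = np.asarray(struct_b, dtype=float)    # spatial distances within sample B
    T = np.asarray(coupling, dtype=float)     # soft matching of A cells to B cells
    p, q = T.sum(axis=1), T.sum(axis=0)       # marginals of the coupling
    feature_term = float((M * T).sum())
    # Expanded square: C1^2 and C2^2 terms depend only on the marginals.
    structure_term = float(
        p @ (C1 ** 2) @ p + q @ (C2 ** 2) @ q - 2.0 * ((C1 @ T @ C2.T) * T).sum()
    )
    return (1.0 - alpha) * feature_term + alpha * structure_term
```

The feature term depends only on which expression profiles are matched, while the structure term compares distances within each sample. A prediction that reproduces the expression marginals but scrambles positions can therefore score well on the first term and poorly on the second.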

Timeline

Three phases · launch → development → final

2026-06-30: Site + submission portal + eval platform live
2026-07-20: Starter kit released; website opens to participants
2026-07-30: P1 · Test phase begins (workflow + leaderboard validation)
2026-08-15: P2 · Development phase begins; validation dataset released
2026-10-25: P3 · Final test phase begins (new held-out dataset)
2026-11-02: Final submissions due; official evaluation starts
2026-11-18: Winners announced at NeurIPS

Prizes & support

$104K total from the Laude Institute Moonshots Seed Grant

$54K

Winner prizes

Per track ($27K × 2): one $8K first prize, two $5K second prizes, three $3K third prizes. Tracks are scored on the same hidden tests but awarded separately.

$30K

Travel awards

15–20 grants for early-career researchers to attend the NeurIPS workshop.

$20K

Outreach & education

Website, starter-kit repo, tutorials, reproducible walkthroughs, baseline documentation, participant communication channels.

Evaluation runs on the Stanford Sherlock and GenBio GPU clusters (NVIDIA H100 80 GB and H200).

Get started

Explore the data that powers it

The challenge is grounded in the same atlas you can browse on this site: 3-D spatial-transcriptomics specimens by Theiler stage, a whole single-cell time-lapse from gastrula to birth, and the EMA anatomical references. Use them now to understand the modality coverage and stage spacing before the starter kit drops.

Browse the atlas by modality · Open the single-cell UMAP · Spatial @ TS15 (E9.5)

FAQ

Common questions

More detailed answers — on submission format, data schema (gene order, cell-type vocabulary, coordinate convention, missing-value policy), and worked metric examples — land alongside the starter kit.

Who can participate?: Anyone — academic, industry, independent — who can submit a prediction file conforming to the required format. Each person joins one team; each team works on one track.
What does the starter kit include?: Baseline implementations for each task, data-loading utilities, evaluation scripts, example submission files, documentation, and reproducibility walkthroughs. Released on 2026-07-20, ahead of P1.
How is cheating prevented?: All ground-truth labels are hidden. Submission counts are limited in the final phase to reduce leaderboard probing. Duplicate registrations, unauthorized data use, code sharing across teams, or falsified results are grounds for disqualification.
Can the Agent Team track use any framework?: Yes — coding agents, recursive LLM systems, evolutionary search. To be eligible for prizes, the agent system and full evolutionary trace must be shared before prize evaluation.
Is the data really released?: Public training data + validation targets are released for method development. Hidden test ground truth (expression, cell-type composition, spatial organisation, perturbation outcomes) is withheld until competition close.
Will the test data overlap with public atlases?: Hidden splits are defined by developmental stage, perturbation condition, or combinations thereof — distinct from public splits. Models must learn generalisable developmental dynamics, not memorise.

Organisers

Hosted by the Qiu Lab at Stanford University, in collaboration with researchers at Harvard, UC San Diego, MBZUAI, CMU, and GenBio. The full organising committee and contributor roles will be listed alongside the starter-kit release. The challenge is supported by the Laude Institute Moonshots Seed Grant.

Contact & stay in the loop

Reach the organisers at the address below for questions about the competition, data, prize logistics, or partnerships. The public submission portal, GitHub repository, and discussion forum go live ~2 weeks before P1.

[email protected] · GitHub · Discussion (coming soon)

This page summarises the NeurIPS 2026 competition proposal currently under review. Dates, datasets, prize amounts, and exact metric formulations are subject to change between proposal acceptance and launch.