Learned Routing for Distributed Diffusion Models - ON-1209

Project type: Research
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences, Mathematics
Company: Bagel Labs
Project Length: 6 months to 1 year
Preferred start date: 07/06/2026
Language requirement: English
Location(s): Toronto, ON, Canada
No. of positions: 1
Desired education level: PhD
Open to applicants registered at an institution outside of Canada: No

About the company: 

Bagel Labs is a Toronto-based machine learning research company focused on large-scale generative modeling for computer vision. The company trains diffusion and flow-matching systems for image, video, and world-model generation, with a research program centered on reducing the compute cost of frontier-scale model training. A core technical direction is the development of distributed model architectures that compose many independently trained expert models at inference time, in place of monolithic centralized training runs. The company’s methods are designed to lower the energy and capital cost of training large generative systems and to broaden access to frontier-quality models for academic and industrial researchers. The team brings together former academic researchers and senior research engineers with backgrounds in generative modeling, distributed training, and large-scale ML systems. Bagel Labs operates research and engineering teams in Toronto and partners with academic groups in Canada and abroad on publishable research. The company maintains active collaborations with university supervisors and contributes to the open research community through papers, code, and benchmarks. Mitacs Accelerate would support a graduate-level research collaboration that strengthens both the company’s scientific output and the academic training of the participating intern.

Describe the project.: 

Bagel Labs develops decentralized diffusion models (DDMs): flat ensembles of independently trained generative experts that can be pretrained without gradient, parameter, or activation synchronization and composed at inference time. Recent company research, including the open-weight Paris model and an accompanying interpretability study, finds that inference-time expert routing is the dominant lever on sample quality in DDMs. Sparse routing that aligns experts to the current denoising state substantially outperforms numerically stable full-ensemble routing, identifying expert-data alignment, rather than trajectory stability, as the governing principle of high-quality decentralized generation.
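
For concreteness, the contrast between the two routing regimes might look like the minimal sketch below, assuming each expert is a frozen denoiser callable and that alignment is measured by the distance from a state embedding to per-expert training-data centroids. All names (experts, centroids, embed) are illustrative placeholders, not the Paris implementation.

```python
import torch

def full_ensemble_step(experts, x_t, t):
    # Numerically stable baseline: average the predictions of every
    # expert in the ensemble at this denoising step.
    preds = torch.stack([expert(x_t, t) for expert in experts])
    return preds.mean(dim=0)

def top1_routed_step(experts, centroids, embed, x_t, t):
    # Alignment-based sparse routing: embed the current denoising state,
    # pick the expert whose training-data centroid is nearest, and run
    # only that expert for this step.
    z = embed(x_t, t)                           # (d,) state embedding
    dists = torch.cdist(z[None], centroids)[0]  # (K,) distance per expert
    return experts[int(dists.argmin())](x_t, t)
```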

This project is a forward-looking research program on routing methods for distributed diffusion ensembles. The central research question is how alignment-aware routers can be designed, trained, and analyzed so that ensembles of independently trained experts approach monolithic-model quality. The work targets a publishable contribution to the literature on mixture-of-experts diffusion, decentralized training, and inference-time composition. The intern will treat the publicly released Paris weights as a frozen black-box baseline and conduct standalone analysis on independent compute and openly licensed data.
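
One hypothetical instantiation of alignment-aware router training, sketched under the assumption that a small gating network is supervised by which frozen expert attains the lowest denoising error per example; the Router class, names, and shapes below are placeholders for illustration, not a method from the company's papers.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    # Small gating MLP mapping a denoising-state embedding to logits
    # over the frozen experts.
    def __init__(self, dim, n_experts, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, n_experts)
        )

    def forward(self, z):  # z: (batch, dim) state embeddings
        return self.net(z)

def router_train_step(router, opt, z, per_expert_err):
    # per_expert_err: (batch, n_experts) denoising error of each frozen
    # expert on each example; the lowest-error expert is the target.
    target = per_expert_err.argmin(dim=1)
    loss = nn.functional.cross_entropy(router(z), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```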

Specific tasks: (1) implementing alignment-aware router-training algorithms for distributed diffusion experts; (2) running controlled ablations across cluster-distance, expert-confidence, and learned routing policies; (3) characterizing tradeoffs across top-1, top-k, full-ensemble, and timestep-adaptive routing (sketched below); (4) studying the effect of data-partitioning strategy on downstream sample quality; and (5) building a reproducible benchmark on open models and datasets. The four-month milestone is a research prototype with figures and next-step recommendations; the eight-month milestone is a publishable research package: paper draft, reusable code, and benchmark.
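
The routing policies named in task (3) can all be expressed as per-expert mixture weights, as in the sketch below; the alignment scores, k, and t_switch threshold are assumptions made for illustration, not Paris internals.

```python
import torch

def routing_weights(policy, scores, t, k=2, t_switch=0.5):
    # Per-expert mixture weights for one denoising step, given
    # hypothetical alignment scores (higher = better aligned).
    n = scores.numel()
    if policy == "full":
        return torch.full((n,), 1.0 / n)
    if policy == "top1":
        w = torch.zeros(n)
        w[scores.argmax()] = 1.0
        return w
    if policy == "topk":
        w = torch.zeros(n)
        idx = scores.topk(k).indices
        w[idx] = torch.softmax(scores[idx], dim=0)
        return w
    if policy == "timestep_adaptive":
        # Assumed schedule: broad mixing early in denoising (high t),
        # sparse routing once image structure has emerged.
        return routing_weights("full" if t > t_switch else "top1", scores, t)
    raise ValueError(f"unknown policy: {policy}")
```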

Outputs are selectively published: generic methods, evaluation code, and benchmarks are intended for academic dissemination; internal benchmarks and proprietary artifacts remain confidential.

Required expertise/skills: 

This is a research-track project requiring a graduate student with strong fundamentals in deep generative modeling and the ability to drive an open-ended scientific question to a publishable result. Required: research experience in diffusion or flow-matching generative models; fluency in PyTorch and distributed multi-GPU training; experience designing controlled ablations and evaluating image generation systems with FID, LPIPS, IS, and CLIP score; familiarity with transformer-based generative architectures (DiT-style models), mixture-of-experts systems, and routing or gating networks.

Preferred: working knowledge of probability-flow ODE solvers (Euler, Heun, DPM-Solver) and trajectory diagnostics (Jacobian spectral norm, sensitivity analysis); experience with multi-node training on H100, H200, or A100 GPUs under Slurm or comparable schedulers; familiarity with Hugging Face Diffusers, Accelerate, DeepSpeed, or FSDP, and with experiment tracking in Weights & Biases. Experience with embedding-based clustering (DINOv2, k-means) and large-scale image datasets is an asset.

The candidate must produce reproducible code, write clear technical reports, and contribute substantively to a publishable academic paper. Because this is an industry–academic collaboration, the candidate is expected to observe data and IP boundaries and to communicate regularly with both academic and industrial supervisors. Familiarity with Trainium, Neuron, or TPU is an asset but not required.