We are looking for a Staff Software Engineer to design and build the systems that power how our organization collects, processes, and manages large multimodal datasets.
These systems ingest multimodal data from highly diverse sources, including web crawling, APIs, external providers, large document corpora, and robotics or sensor streams, and transform it into structured datasets used by our ML research and evaluation teams.
As a senior member of the team, you will help define the architecture of this platform as it evolves from prototype systems into production data infrastructure. The platform directly feeds into model capability development.
You will:
- Develop distributed ingestion and processing pipelines that acquire multimodal datasets from diverse sources and run across multi-node infrastructure
- Parse, filter, transform, and deduplicate multimodal datasets
- Use AI models for data augmentation, including synthetic data generation workflows
- Design platforms that support reproducibility and traceability across evolving datasets
- Build tools that allow technical staff to manage, explore, and iterate on large multimodal datasets
Our Stack
- Languages: Python, Rust
- Infrastructure: AWS, Azure, Nebius
- Storage: S3, NFS
You'll be a great fit if...
- You have 5+ years of experience building backend systems or large-scale data pipelines
- You have built distributed systems that ingest, process, or manage very large datasets
- You are comfortable working with messy, high-volume, time-aligned multimodal data such as video, images, text, robotics signals, or sensor datasets
- You are comfortable building pipelines that integrate machine learning systems such as LLMs or vision models into data processing workflows
- You have a track record of technical leadership, owning large systems, and mentoring other engineers