Staff Software Engineer, Data Systems

We are looking for a Staff Software Engineer to design and build the systems that power how our organization collects, processes, and manages large multimodal datasets.

These systems ingest multimodal data from highly diverse sources, including web crawling, APIs, external providers, large document corpora, and robotics or sensor streams, and transform it into structured datasets used by our ML research and evaluation teams.

As a senior member of the team, you will help define the architecture of this platform as it evolves from prototype systems into production data infrastructure. The platform directly feeds into model capability development.

You will:

Develop distributed ingestion and processing pipelines that acquire multimodal datasets from diverse sources and run across multi-node infrastructure
Parse, filter, transform, and deduplicate multimodal datasets
Use AI models for data augmentation, including synthetic data generation workflows
Design platforms that support reproducibility and traceability across evolving datasets
Build tools that allow technical staff to manage, explore, and iterate on large multimodal datasets

Our Stack

Languages: Python, Rust
Infrastructure: AWS, Azure, Nebius
Storage: S3, NFS

You'll be a great fit if...

You have 5+ years of experience building backend systems or large-scale data pipelines
You have built distributed systems that ingest, process, or manage very large datasets
You are comfortable working with messy, high-volume, time-aligned multimodal data such as video, images, text, robotics signals, or sensor datasets
You are comfortable building pipelines that integrate machine learning systems such as LLMs or vision models into data processing workflows
You have a track record of technical leadership, owning large systems, and mentoring other engineers