Senior ML Platform Engineer - Sovereign AI Engineering
Description

Every nation has data. Few can protect it. Fewer still can act on it. Dream is the sovereign AI and national cyber-defense company for governments. We help nations secure their most critical systems, connect fragmented information at a national scale, and turn their most sensitive data into decisions, all fully sovereign.

This is more than a job. It's a Dream job, where you'll work at a global scale alongside some of the best AI researchers, cyber operators, and government experts in the world. We're building AI that nations own and control, deployed where almost no one else can operate: ingesting and structuring complex data, and driving practical actions that can literally impact the lives of billions of people around the world. This role helps make that real.

The Dream Job

It starts with you - an engineer driven to build the ML platform that turns research into reliable, production-grade intelligence. You care about reproducibility, low-friction experimentation, and infrastructure that earns the trust of the scientists and researchers who depend on it daily.

You'll architect and ship Dream's ML platform - training pipelines, model serving, feature stores, experiment tracking, and compute orchestration - turning models into production capabilities across cloud and on-prem, including air-gapped deployments. A significant part of the platform supports large language models, with unique challenges across training, evaluation, and inference in mission-critical environments.

If you want to make a meaningful impact, join Dream's mission and build the ML platform that drives Sovereign AI products - this role is for you.

The Dream-Maker Responsibilities

Build and operate ML training infrastructure - distributed training pipelines, compute scheduling, and reproducible experiment workflows that data scientists rely on daily.
Own model serving and inference systems - packaging, deployment, autoscaling, A/B testing, canary rollouts, and latency/cost optimization for production models.
Run feature stores, model registries, and dataset versioning - enabling self-serve feature engineering, model lineage, and reproducible experiments across teams.
Build experiment tracking and evaluation infrastructure - automated evals, comparison dashboards, drift detection, and monitoring that give teams visibility into model behavior and performance.
Build and maintain production pipelines for training, fine-tuning workflows, and serving domain models - owning reliability, reproducibility, and scale.
Build and maintain the monitoring and observability layer - model performance tracking, data and prediction drift detection, data quality validation, and alerting.
Improve performance and cost across the ML stack - training throughput, inference latency, batch vs. real-time tradeoffs, and compute cost management.
Ship shared tooling - libraries, templates, CI/CD for models, IaC, and runbooks - while collaborating across Data Platform, AI, Data Science, Engineering, and DevOps.
Own architecture, documentation, and operations end-to-end.

The Dream Skill Set

5+ years in software engineering, with 2+ years focused on ML infrastructure, MLOps, or data-intensive systems.
Engineering craft - Strong Python, distributed systems design, testing, secure coding, API design, CI/CD discipline, and production ownership.
ML platform & serving - Model serving frameworks (e.g., Triton, TorchServe, vLLM, Ray Serve); model packaging, deployment pipelines, and inference optimization.
Training infrastructure - Distributed training pipelines (e.g., with PyTorch or JAX); experiment orchestration and reproducibility.
ML lifecycle tooling - Feature stores, model registries, experiment tracking (e.g., MLflow, Weights & Biases); dataset versioning and lineage.
Data pipelines - Building training and inference data pipelines; familiarity with tools like Spark, Airflow/Dagster, and streaming ingestion.
Comfortable with AI coding tools like Cursor, Claude Code, or Copilot.

Nice to Have:

Experience operating in constrained environments - on-premise, private cloud, or air-gapped deployments.
Hands-on experience with simulation environments, synthetic data generation, or reinforcement learning workflows.
Platform & infra - Kubernetes, AWS, Terraform or similar IaC, CI/CD, observability, incident response.
Hands-on data science or applied ML experience.

Never Stop Dreaming...

If you think this role doesn't fully match your skills but you are eager to grow and break glass ceilings, we'd love to hear from you!