Our team needs a Cloud Architect to lead the design and scaling of the infrastructure behind a compute-intensive, real-time collaborative data science platform. This is a hands-on, cross-functional role, ideal for someone who thrives in fast-paced environments and enjoys building robust systems that support diverse use cases across engineering, research, and real-time computation.
Key Responsibilities:
1) Develop a cloud-agnostic compute orchestration layer supporting GCP, AWS, Alibaba Cloud, and on-prem clusters
2) Architect secure, scalable systems that enable user-isolated containers and short-lived compute environments
3) Build and maintain a metadata system to track sessions, resource usage, and user environments
4) Design an intelligent auto-scaler that optimizes performance and cost based on platform usage
5) Enhance observability with logging, tracing, and incident response integration
Ideal Candidate:
1) Experienced in designing systems that prioritize performance, reliability, and developer usability
2) Skilled at balancing uptime, cost-efficiency, and scalability with practical trade-offs
3) Passionate about building foundational infrastructure tools for other engineers
4) Comfortable owning infrastructure end-to-end: provisioning clusters, writing orchestration logic, and debugging production issues
5) T-shaped in skill set, with deep cloud expertise and strong generalist engineering capabilities
Required Skills & Tools:
1) Kubernetes: Experience deploying and managing workloads, ideally on GKE
2) Node.js + TypeScript: For backend services and orchestration layers
3) Terraform or Pulumi: Managing infrastructure as code in version-controlled environments
4) Docker: Containerization and secure environment isolation
5) Multi-cloud deployment: Background in GCP, Alibaba Cloud, and on-prem environments is a plus
6) Monitoring/Tracing: Familiarity with tools like Prometheus, Grafana, and OpenTelemetry