Role Overview

As a Staff AI Engineer at MontyCloud, you will design, build, and operate production-grade agentic AI systems powering intelligent cloud operations at scale. This role focuses on developing scalable AI agents, orchestration pipelines, and cloud-native AI infrastructure while contributing to engineering standards, reliability, and operational excellence across the platform. You will work at the intersection of AI, cloud infrastructure, and autonomous operations to deliver systems that are reliable, observable, secure, and production-ready.

Key Responsibilities

  • Engineering & Delivery
    • Architect, build, and operate production-grade AI agents and multi-agent systems for cloud management use cases
    • Design and own AI inference and orchestration pipelines optimized for scalability, latency, reliability, and cost efficiency
    • Build safety and reliability guardrails for autonomous AI systems operating on live cloud infrastructure
    • Develop human-in-the-loop workflows, rollback strategies, scoped permissions, and audit mechanisms for AI-driven operations
    • Collaborate with platform, infrastructure, and data engineering teams to embed AI-native automation into cloud management workflows
  • Standards & Technical Quality
    • Implement observability and monitoring for agentic systems, including agent tracing, MCP interaction auditing, output quality monitoring, and cost governance
    • Contribute to engineering standards for agentic design patterns and architectures, MCP server and tool design, prompt engineering, RAG and Graph-RAG pipelines, LLMOps practices, and foundation model integrations
    • Conduct rigorous technical reviews of AI architectures, systems, and features to improve engineering quality and reliability
    • Document technical decisions, trade-offs, and implementation patterns clearly for broader engineering adoption
  • Innovation & Opportunity Identification
    • Identify opportunities where agentic AI can improve product capabilities or operational efficiency
    • Build proof-of-concepts and prototypes to validate technical feasibility and scalability
    • Evaluate emerging AI technologies, LLMs, multi-modal models, and agentic frameworks for adoption suitability
    • Stay current with advancements in agentic AI, orchestration frameworks, and production AI engineering practices

Desired Skills and Requirements

Must Have

  • Agentic AI & Multi-Agent Systems
    • Production-grade agentic AI system design and development
    • Agentic AI system design & architecture - Multi-agent architectures and orchestration, agent-to-agent communication, agent memory and planning strategies, tool integration and MCP server design
    • Agent orchestration frameworks - LangGraph, Strands Agents, CrewAI, AutoGen, or equivalent agentic AI frameworks
  • LLMOps & AI Platform Engineering
    • AI Governance & Lifecycle Management - Prompt versioning and governance, evaluation frameworks, regression detection
    • AI Observability & Monitoring - Output quality monitoring, agent tracing and observability
    • AI Cost Management - Cost governance for high-scale AI workloads
  • Cloud & Infrastructure
    • Cloud AI Platforms & Services - AWS cloud ecosystem, Amazon Bedrock, Amazon Bedrock AgentCore
    • Cloud-Native Infrastructure & Deployment - Cloud-native AI deployments, Kubernetes, Docker
    • Infrastructure as Code (IaC) - Terraform
  • Foundation Models & AI Integrations
    • Foundation model API integration - OpenAI, Anthropic, Amazon Bedrock, Azure OpenAI, Hugging Face
    • MCP and AI tool integration architecture
  • RAG & Knowledge Systems
    • Retrieval-Augmented Generation (RAG) and Graph-RAG architectures
    • Embedding strategies, retrieval and reranking systems
    • Knowledge graph integrations
  • Technical Leadership and Communication
    • Cross-functional technical collaboration and engineering ownership
    • Technical communication and documentation
    • Architecture reviews and technical decision-making
    • Proactive problem identification and solution development

Good to Have

  • Domain Experience
    • AI systems for cloud operations and infrastructure automation
    • Developer tooling platforms
  • AI Deployment & Optimization
    • Serverless AI deployment patterns
    • AI inference cost optimization
  • Advanced AI Techniques Exposure
    • Model fine-tuning and RLHF
    • Advanced model evaluation techniques
    • Multi-modal AI systems
  • Industry & Community Exposure
    • Experience working in AI-first or cloud-native product companies
    • Open-source contributions, technical blogs, conference talks, or published research in AI/agentic systems

Experience

  • 8+ years of overall software engineering experience
  • Hands-on experience building and operating applied AI systems in production environments
  • Experience designing and deploying agentic AI systems and orchestration pipelines
  • Proven ability to lead complex technical implementations across engineering teams
  • Experience deploying and operating AI workloads in cloud-native environments
  • Strong engineering judgment across scalability, reliability, security, and operational excellence

Education

  • Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Engineering, or a related technical discipline
  • Equivalent practical experience in advanced AI system design and distributed cloud platforms may also be considered