Role Overview:
We are looking for a Staff Cloud Solutions Engineer to serve as the deep technical authority within our Cloud SME team. This is a hands-on individual contributor role focused on owning cloud operations capability domains, building reusable automation frameworks, designing AI-powered CloudOps solutions, and defining how cloud operations are engineered at scale across multi-account environments.
Your work will directly shape the platform capabilities used by MSP partners and their customers globally. You will own capability domains end-to-end, lead advanced R&D with a strong emphasis on generative AI and agentic workflows, and set the technical bar for the team's output.
The Staff Cloud Solutions Engineer carries the same seniority and organizational impact as the Architecture track at this level.
Non-Negotiables:
- Must have hands-on experience managing production-grade workloads across 150+ AWS accounts with $250K+/month cloud spend.
- Must have performed 8+ customer assessments such as AWS Foundational Technical Reviews (FTRs) or Well-Architected Framework Reviews (WAFRs) with documented optimization outcomes.
- Must have designed and built reusable automation frameworks for cloud operations (cost, security, compliance, governance) that were adopted across multiple teams or customers.
- Must have contributed to internal tooling or platform capabilities beyond one-off solution delivery.
- Must demonstrate a pattern of identifying repetitive operational problems and automating them into reusable solutions.
- Must have hands-on experience designing and implementing AI-powered solutions using generative AI services (Amazon Bedrock, Bedrock Agents, or equivalent), including agentic workflows and AI orchestration patterns.
Key Responsibilities:
Capability Domain Ownership:
- Own and evolve cloud operations capability domains, driving both platform capabilities and internal best practices across areas such as cost optimization, security automation, governance, compliance, and AI-driven operations.
- Define how your capability domain works end-to-end: what gets automated, how it scales, and what the platform delivers to partners and customers.
Cloud Operations & Automation at Scale:
- Design and build reusable automation frameworks on top of cloud provider services that reduce undifferentiated operational work across large multi-account environments.
- Own delivery of technically complex capabilities spanning multiple cloud services, requiring deep expertise in security, cost, compliance, or governance.
- Automate operations using Python, CloudFormation, Systems Manager, and other infrastructure-as-code tooling for proactive scaling, cost-triggered remediations, and security auto-remediation.
- Implement measurable cost optimizations across large cloud footprints through rightsizing, commitment coverage analysis, idle resource elimination, and anomaly detection.
- Define and enforce security baselines, encryption standards, and least-privilege access patterns through automated audits and guardrails.
AI-Powered CloudOps & Agentic Workflows:
- Design and build AI-powered cloud operations capabilities using Amazon Bedrock, including agentic workflows with Bedrock Agents, custom orchestrators for complex task automation, and Model Context Protocol (MCP) integrations.
- Define how AI and generative AI capabilities are applied within your capability domains to automate decision-making, anomaly detection, remediation, and operational intelligence.
- Build and validate agentic AI patterns that can operate across multi-account cloud environments at scale.
- Evaluate new foundation models, AI services, and orchestration frameworks for applicability to cloud operations automation.
R&D & Technical Leadership:
- Evaluate emerging cloud provider services, build advanced proof-of-concept implementations, and determine which capabilities should be productized.
- Own the lifecycle from identifying cloud operations problems through building validated solutions to translating them into product-ready specifications for the engineering team.
- Set the technical bar for the team. Define best practices, review architectural decisions, and ensure consistency across deliverables.
Cross-Team Collaboration:
- Work directly with Product Management to deliver feature specifications grounded in real cloud environment behavior.
- Collaborate with Platform Engineering to translate validated R&D patterns into production-grade implementations.
- Partner with Site Reliability Engineering on infrastructure governance, incident response, and cloud spend optimization.
- Contribute to thought leadership through technical blogs and documentation.
What Success Looks Like:
- You identify repetitive cloud operations problems and automate them into reusable capabilities that reach thousands of cloud accounts.
- You design AI-powered CloudOps solutions and agentic workflows that transform how the platform delivers value to partners.
- You translate deep cloud expertise into platform features, not one-off deliverables.
- You proactively explore new cloud provider services and AI capabilities, turning them into practical, validated approaches before the rest of the organization asks for them.
- Other team members and the platform itself depend on the frameworks and patterns you build.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience in cloud operations and engineering, with a deep focus on AWS services.
- Proven track record of building reusable automation and tooling beyond individual engagement delivery.
- Strong proficiency in AWS services such as EC2, S3, RDS, Lambda, Organizations, Control Tower, Security Hub, Config, Cost Explorer, and Bedrock.
- Expert-level scripting in Python or Bash. Experience with CloudFormation, Terraform, or CDK.
- Hands-on experience with Amazon Bedrock, Bedrock Agents, agentic AI workflows, and generative AI application design.
- Deep understanding of multi-account cloud architecture, IAM design, and governance at scale.
- Strong analytical and problem-solving skills with the ability to work independently on ambiguous, complex problems.
- Excellent written and verbal communication skills.
- AWS Certified Solutions Architect (Professional) required. AWS Certified AI Practitioner or Machine Learning Specialty preferred. Additional certifications in security, FinOps, or advanced networking preferred.
Preferred Skills:
- Experience with FinOps platforms such as CloudHealth, Apptio Cloudability, or equivalent.
- Experience with Model Context Protocol (MCP) servers, AI orchestration frameworks, and custom agent design.
- Familiarity with DevOps practices, CI/CD pipelines, and chaos engineering.
- Experience contributing to platform tooling or open-source cloud operations projects.
- Knowledge of compliance frameworks: SOC 2, HIPAA, PCI DSS, CIS Benchmarks.
Why Join MontyCloud?
- Own capability domains whose impact reaches a global MSP partner network and thousands of cloud accounts.
- Design and build AI-powered CloudOps solutions and agentic workflows that shape how the industry approaches cloud operations.
- Build reusable frameworks and automation that other engineers and the platform depend on.
- Work directly with Product, Engineering, and SRE teams on platform capabilities.
- Operate as a technical authority in a team where depth and automation are valued over hierarchy and people management.
- Enjoy a flexible, hybrid work culture that supports work-life balance.