AI Engineer · LLM Systems · Platform & SRE
Building agent systems that hold up in production.
I design multi-agent orchestration (LangGraph supervisor–worker), multi-provider routing behind a Portkey gateway, and hybrid retrieval across vectors and a knowledge graph with reranking, then make quality a number, not a feeling, with golden-dataset evals and Langfuse tracing. All of it sits on a platform/SRE foundation hardened in fintech at 1,500+ TPS.
— Architecting elegance in chaos.




Selected Architectures
Schematics of how I design AI and platform systems — drawn, not screenshotted.
Agentic SRE Operations
The infra & SRE operation, run by agents: a supervisor routes incidents to triage, remediation, and FinOps specialists through MCP tools, with a human in the loop.
LangGraph Multi-Agent
A StateGraph supervisor routes through conditional edges to Planner, Worker, and Tool nodes — each updating shared state and cycling back until the task is done.
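As a rough illustration of that loop, here is a minimal plain-Python sketch. It deliberately does not use the LangGraph API; the `State` fields, the `planner` and `worker` behaviour, and the routing rule in `supervisor` are all hypothetical stand-ins, and the Tool node is left out for brevity.

```python
from typing import Callable, Dict, List, TypedDict


class State(TypedDict):
    task: str
    plan: List[str]     # steps still to do
    results: List[str]  # outputs of finished steps
    trace: List[str]    # nodes visited, in order


def planner(state: State) -> State:
    # Hypothetical planner: split the task into comma-separated steps.
    steps = [s.strip() for s in state["task"].split(",") if s.strip()]
    return {**state, "plan": steps, "trace": state["trace"] + ["planner"]}


def worker(state: State) -> State:
    # Hypothetical worker: "execute" the next step by upper-casing it.
    step, rest = state["plan"][0], state["plan"][1:]
    return {
        **state,
        "plan": rest,
        "results": state["results"] + [step.upper()],
        "trace": state["trace"] + ["worker"],
    }


def supervisor(state: State) -> str:
    # Plays the role of a conditional edge: choose the next node from shared state.
    if not state["plan"] and not state["results"]:
        return "planner"
    if state["plan"]:
        return "worker"
    return "END"


NODES: Dict[str, Callable[[State], State]] = {"planner": planner, "worker": worker}


def run(task: str, max_steps: int = 20) -> State:
    state: State = {"task": task, "plan": [], "results": [], "trace": []}
    for _ in range(max_steps):  # hard cap so a bad routing rule cannot loop forever
        next_node = supervisor(state)
        if next_node == "END":
            break
        state = NODES[next_node](state)
    return state


final = run("fetch logs, summarize")
```

Each node returns a new state rather than mutating the old one, which keeps the supervisor's decision a pure function of what has happened so far.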
Multi-Provider LLM Gateway
A cheap classifier routes to a Portkey-style gateway with a fallback chain across Claude, GPT, and Gemini — balancing quality, latency, and cost.
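The routing-plus-fallback idea can be sketched in a few lines of plain Python. Nothing here calls Portkey or any real provider: the `classify` heuristic, the model names, and the `flaky` and `echo` stubs are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


def classify(prompt: str) -> str:
    # Stand-in for a cheap classifier: long or code-like prompts count as "hard".
    return "hard" if len(prompt) > 200 or "def " in prompt else "easy"


def complete_with_fallback(prompt: str, chain: List[Provider]) -> Tuple[str, str]:
    # Try each provider in order; return the first successful (name, completion).
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would retry only on transient errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky(prompt: str) -> str:
    # Hypothetical provider that always times out, to exercise the fallback path.
    raise TimeoutError("upstream timeout")


def echo(prompt: str) -> str:
    # Hypothetical provider that always succeeds.
    return f"echo: {prompt}"


# Hypothetical chains: a cheaper primary for easy prompts, a larger model for hard ones.
CHAINS: Dict[str, List[Provider]] = {
    "easy": [("small-model", flaky), ("backup-model", echo)],
    "hard": [("large-model", echo)],
}


def route(prompt: str) -> Tuple[str, str]:
    return complete_with_fallback(prompt, CHAINS[classify(prompt)])
```

Returning the provider name alongside the completion makes it straightforward to attribute latency and cost per provider downstream.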
RAG · Retrieval Stack
Hybrid grounding across Qdrant vectors, a Neo4j knowledge graph, and Typesense full-text search, with results reranked by Cohere before generating a cited, trustworthy answer.
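One simple way to merge ranked lists from several retrievers before a reranker sees them is reciprocal rank fusion (RRF). The sketch below is an illustration of that merge step only: RRF is a swapped-in technique, not necessarily what this stack uses, and it is not a substitute for a learned reranker such as Cohere's. The document ids and the `k = 60` constant are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    # Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from three retrievers.
vector_hits = ["a", "b", "c"]
graph_hits = ["b", "d"]
fulltext_hits = ["b", "a"]

fused = rrf([vector_hits, graph_hits, fulltext_hits])
```

Documents that several retrievers agree on rise to the top, so the reranker spends its budget on a shorter, higher-recall candidate list.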
LangChain LCEL Pipeline
Composable retrieval-augmented chains — prompt | model | parser — where a retriever injects grounded context and the parser returns typed, structured output.
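The pipe composition can be mimicked in plain Python by overloading `|`. This is a toy stand-in, not the LangChain API: the `Step` class, the in-memory `DOCS` store, and `fake_model` are all hypothetical, and the "model" simply echoes the retrieved context back as JSON.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class Step:
    """Wraps a function so steps can be chained with `|`."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b runs a first, then feeds its output to b.
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)


@dataclass
class Answer:
    text: str
    sources: List[str]


DOCS: Dict[str, str] = {"refunds": "Refunds settle in 3 days."}  # hypothetical corpus


def retrieve(question: str) -> Dict[str, Any]:
    # Toy retriever: keep documents whose key appears in the question.
    hits = {k: v for k, v in DOCS.items() if k in question.lower()}
    return {"question": question, "context": hits}


def fake_model(prompt_text: str) -> str:
    # Stand-in for an LLM: reads the context line and answers in JSON.
    ctx = json.loads(prompt_text.splitlines()[0][len("Context: "):])
    return json.dumps({"text": " ".join(ctx.values()), "sources": sorted(ctx)})


prompt = Step(lambda d: f"Context: {json.dumps(d['context'])}\nQuestion: {d['question']}")
parser = Step(lambda s: Answer(**json.loads(s)))  # typed, structured output

chain = Step(retrieve) | prompt | Step(fake_model) | parser
answer = chain.invoke("How long do refunds take?")
```

Because every stage is a plain function of its input, any one of them can be swapped or unit-tested in isolation.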
Event-Driven Payments
The platform foundation: PCI-DSS microservices over an event bus (SNS/SQS/Kafka) feeding a ledger and real-time reconciliation at 1,500+ TPS.
Things I've built
AI Delivery Pipeline
Agentic automation of the software lifecycle — Jira → AI code-gen → tests → PR with human review. ~40% less manual overhead.
Cloud Cost Optimizer
FinOps engine that ingests AWS/Azure billing, detects idle resources via rules, and generates safe decommission plans. Built as an MCP server with Claude Agent SDK sub-agents.
AI Engineer Lab
A runnable, line-by-line reference for production LLM systems: routing, agents, RAG, evals — real APIs, graceful degradation, interview-grade notes.
Payment Infrastructure
Event-driven microservices processing 1,500+ TPS with PCI-DSS compliance, scaled up from a 100 TPS monolith, with a 99.95% success rate at 450 ms latency.
The road here
Head of Infrastructure & SRE — Agentic Operations · Deuna
- Run the infra & SRE operation through AI agents: incident triage, remediation, and FinOps via multi-agent orchestration + MCP.
- Built an agentic pipeline automating the delivery lifecycle (Jira → code-gen → tests → PR), cutting manual overhead ~40%.
- Integrated multiple LLM providers (Claude, OpenAI, Gemini) with structured output, retries, and fallback; optimized token usage and cost.
- Extended SLO/SLI and Datadog observability to AI behavior, latency, and cost.
Senior DevOps / Platform Engineer · Housecall Pro
- Led enterprise platform transformation for a SaaS product serving millions of users.
- Architected multi-cloud solutions (AWS/Azure) with global RDS replication.
- Partnered with C-level leadership to reduce annual cloud spend by 20%.
What I bring
Agentic AI & Orchestration
Multi-agent systems (supervisor + specialist), tool & function calling, MCP servers, classifier routing.
RAG & Retrieval
Embeddings, vector search (Qdrant), reranking (Cohere), knowledge graphs (Neo4j), grounded answers.
LLM Evals & Observability
Golden datasets, prompt regression tests, hallucination & safety checks, Langfuse and Datadog tracing.
Cloud & Platform Foundation
AWS, Kubernetes, Terraform, GitOps, SRE — the resilient base beneath the AI.
Technical insights
The Productivity Paradox of AI
Are we entering a productivity boom that elevates our capabilities, or a bubble that erodes our engineering muscle memory?