DevOps Lab

GitOps on AWS EKS

GitOps / platform architecture portfolio — hero, overview, architecture, VPC ingress diagram.

Live Production Environment — AWS EKS

Enterprise GitOps
on AWS EKS

Production-Grade Multi-Cluster Kubernetes with Advanced Auto-Scaling, GitOps Automation & Enterprise Observability

FluxCD Operator · Terraform · Calico CNI · Karpenter · KEDA · Go 1.23 · Grafana Operator · Kyverno · GitHub Actions · Kustomize · EFS RWX

Built by Duy · gitops.duyne.me

2 EKS Clusters
6 Applications
366+ Passing Tests
100% GitOps Automated
RWX Shared Storage (EFS)

📋 Project Overview

A complete, production-ready GitOps implementation demonstrating modern cloud-native practices on AWS. This project showcases enterprise-grade infrastructure automation, multi-cluster management, and comprehensive CI/CD pipelines.

Live Production Environment: This is not a toy project — it's a fully operational, production-grade infrastructure running on AWS with real applications, advanced auto-scaling (Karpenter + KEDA), Calico CNI for network policies, monitoring, and security controls.
Calico CNI

Advanced networking with network policies and BGP routing for zero-trust pod-to-pod communication.

Karpenter + KEDA

Fast node provisioning in seconds. Event-driven pod autoscaling with custom metrics. HPA: 3–11 replicas.

Grafana Cloud

Enterprise observability with Alloy agents, Prometheus metrics, Loki logs, and real-time dashboards.

Kyverno

Policy-as-code for Kubernetes. Security baseline enforcement written in YAML, not Rego.

External Secrets

Zero secrets in Git. AWS Secrets Manager synced to Kubernetes via ESO with IRSA authentication.
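A minimal sketch of that sync, assuming a ClusterSecretStore named aws-secrets-manager and an illustrative Secrets Manager path (names are not taken from the repo):

example-externalsecret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: go-api-db            # hypothetical name
  namespace: go-api
spec:
  refreshInterval: 1h         # re-sync from AWS Secrets Manager every hour
  secretStoreRef:
    name: aws-secrets-manager # ClusterSecretStore backed by IRSA
    kind: ClusterSecretStore
  target:
    name: go-api-db           # Kubernetes Secret that ESO creates/updates
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/go-api/db   # hypothetical Secrets Manager secret name
        property: password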

EFS RWX Storage

ReadWriteMany shared storage across multiple pods with init container pattern and HPA compatibility.
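A sketch of the RWX storage setup using the EFS CSI driver's access-point provisioning mode; the fileSystemId, names, and size are placeholders, not values from the repo:

efs-rwx.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS Access Points
  fileSystemId: fs-0123456789abcdef0  # placeholder EFS filesystem ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes: ["ReadWriteMany"]      # RWX: every replica (including HPA scale-outs) mounts the same volume
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi                    # EFS is elastic, but the field is required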

🏗️ Architecture

PROD Cluster — us-east-2 · MANUAL SYNC ✓
  • 3 managed nodes + Karpenter SPOT (1 per AZ)
  • 3 NAT gateways (AZ-a, AZ-b, AZ-c — full HA)
  • Manual sync — approval required
  • HTTPS/TLS via ACM certificates
  • DNS: *.gitops.duyne.me via Route 53
🎯 Key Architectural Decisions
  • Separate Clusters: Complete blast radius isolation between DEV and PROD
  • PROD 3-AZ Deployment: Public + private subnets in us-east-2a/b/c — survives a full AZ outage
  • 3 NAT Gateways (PROD): 1 per AZ — private subnet egress stays within its AZ
  • TopologySpreadConstraints: Pods distributed across 3 AZs — maxSkew: 1 per zone
  • Pod Disruption Budgets: minAvailable: 2 on all PROD apps — safe during rolling deploys (sketch after this list)
  • GitOps-First: All changes flow through Git — no manual kubectl in PROD
  • Infrastructure as Code: 100% Terraform-managed infrastructure
  • 9 VPC Endpoints: Private subnet connectivity — STS, EC2, EKS, EFS, SSM, ELB… one ENI per AZ
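A sketch of the TopologySpreadConstraints + PDB combination referenced above; the app name, labels, and image tag are illustrative:

topology-spread-pdb.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-api
spec:
  replicas: 3
  selector:
    matchLabels: { app: go-api }
  template:
    metadata:
      labels: { app: go-api }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                  # at most 1 pod difference between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: go-api }
      containers:
        - name: go-api
          image: duyne/go-api:1.0.0                   # placeholder image tag
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: go-api
spec:
  minAvailable: 2                                     # matches the PROD rule above
  selector:
    matchLabels: { app: go-api }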

🌐 VPC & Networking Architecture

🗺️ PROD VPC — full 3-AZ topology (us-east-2)
Diagram summary — PROD VPC 10.0.0.0/16 · us-east-2 · 3 AZs · DNS hostnames enabled · EKS API private-only · /16 = 65,536 IPs:
  • Internet → Internet Gateway (attached to the VPC); users and GitHub Actions OIDC enter here.
  • Per AZ (a/b/c): a public subnet 10.0.1–3.0/24 (map_public_ip: true, 251 usable IPs) hosts the ALB (:443 HTTPS, cross-AZ, targets node port 8080 via alb-sg) and a NAT GW with its own EIP; route table: 0.0.0.0/0 → IGW plus an S3 prefix-list route to the S3 Gateway endpoint; tag kubernetes.io/role/elb=1.
  • Per AZ: a private subnet 10.0.11–13.0/24 (no public IPs) runs the t3.medium workers (AL2023, max 17 pods, Karpenter SPOT) hosting go-api/demo/api-app, flux, kyverno, karpenter, and ESO; SG node-sg, IRSA via the eks.amazonaws.com SA annotation; route table: 0.0.0.0/0 → that AZ's NAT GW; tag kubernetes.io/role/internal-elb=1; plus an EFS mount target 10.0.1x.x:2049 (efs-sg, NFS 2049, elasticfilesystem VPC endpoint).
  • Spare subnets 10.0.21–23.0/24 are reserved (count = 0 in Terraform) for future tiers: DB/RDS/Redis, a second node group, internal LB tier, analytics, or staging-in-prod.
  • VPC interface endpoints (one ENI per private subnet/AZ): sts, ec2, eks, elasticfilesystem, secretsmanager, ssm, ssmmessages, ec2messages, elasticloadbalancing; plus the free S3 Gateway. vpce-sg allows 443 from 10.0.11/12/13.0/24; private DNS resolves *.amazonaws.com to private IPs; Karpenter runs with AWS_ISOLATED_VPC=true.
  • Flows: HTTPS ingress ALB → pod, NAT egress, pod → VPC endpoints. Dashed (spare) subnets = count=0 in Terraform, reserved CIDR only.

Internet → IGW → public subnets (ALB, NAT) → private worker subnets; VPC interface endpoints serve AWS APIs since the cluster is private.

🔌 9 Interface Endpoints + 1 Gateway — why each one is needed

A private cluster means the worker nodes have no internet route. Every AWS API call must go through a VPC endpoint. One missing endpoint → the dependent service fails in hard-to-debug ways.

VPC interface endpoints and the S3 Gateway:
Endpoint | Used for | If missing
.sts | IRSA token exchange | All AWS SDK calls fail
.ec2 | Karpenter provisioning | Cluster cannot scale
.eks | kubectl / kubelet registration | Nodes cannot join
.elasticfilesystem | EFS RWX volumes | Pods stuck in ContainerCreating
.secretsmanager | ESO secret sync | ESO crashes
.ssm | SSM Session Manager | No node shell
.ssmmessages | SSM data channel | Sessions hang
.ec2messages | EC2 ↔ SSM relay | SSM commands fail
.elasticloadbalancing | ALB Controller API | Ingress stays pending
🗄️ S3 Gateway | ECR layer pulls, AL2023 yum | FREE — route-table entry, no ENI
☸️ EKS control plane (AWS managed) — private endpoint only: kubectl → the .eks VPC endpoint → control-plane ENIs in each private subnet (AZ-a/b/c); the kubelet registers via the private endpoint; no public access.
🔒 Security & Terraform: vpce-sg allows inbound 443 from 10.0.11.0/24 + 10.0.12.0/24 + 10.0.13.0/24; private DNS enabled → *.amazonaws.com resolves to private IPs. Terraform: for_each = toset(["sts","ec2","eks","elasticfilesystem","secretsmanager","ssm","ssmmessages","ec2messages","elasticloadbalancing"]).
Karpenter + private VPC: AWS_ISOLATED_VPC=true and the cluster CIDR must be set manually. Without the ec2 + sts endpoints, Karpenter starts but cannot provision nodes.
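A minimal sketch of one way to wire that setting in — a Kustomize strategic-merge patch injecting the env var named above; the Deployment/namespace/container names assume the chart defaults and are not confirmed from the repo:

karpenter-isolated-vpc-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter
  namespace: karpenter
spec:
  template:
    spec:
      containers:
        - name: controller
          env:
            - name: AWS_ISOLATED_VPC   # tells Karpenter not to reach public endpoints the private VPC can't resolve
              value: "true"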
🔒 Security Groups — Least Privilege Design
Security Group | Inbound | Outbound | Attached to
alb-sg | 443 from 0.0.0.0/0 (internet); 80 redirect → 443 | 8080 → node-sg only | ALB (managed by LBC)
node-sg | 8080 from alb-sg; all from node-sg (pod-to-pod); 443 from control-plane-sg | All (NAT GW / VPC EP) | EC2 worker nodes
efs-sg | 2049 (NFS) from node-sg only | Deny all | EFS Mount Targets
vpce-sg | 443 from node subnet CIDRs (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24) | Deny all | Interface VPC Endpoints
eks-control-plane-sg | 443 from node-sg (kubelet → API server) | All to node-sg | EKS managed (auto-created)

🎯 Interview Deep Dive — Networking & Design

❓ Q1 — NAT Gateway vs Internet Gateway
Internet Gateway (IGW)
  • Attaches to the VPC — allows two-way traffic between the internet and resources with a public IP.
  • EC2 in a public subnet + Elastic IP + route 0.0.0.0/0 → IGW = reachable from the internet.
  • The IGW itself does not NAT — it is just a gateway; the EC2 OS keeps its private IP.
  • Used for: public ALB/NLB, bastion hosts, public-facing EC2.
NAT Gateway
  • Sits in a public subnet — lets private subnets reach the internet one way (outbound only).
  • Private EC2/pods have no public IP → route 0.0.0.0/0 → NAT GW → IGW → internet.
  • The NAT GW performs SNAT: it rewrites the private source IP to the NAT GW's Elastic IP.
  • The internet cannot initiate connections to private resources through a NAT GW.
  • Used for: EKS pods pulling images, Lambda updates, RDS patching — nothing exposed externally.
Diagram summary — IGW vs NAT: Internet Gateway = two-way (public subnet; EC2 with an Elastic IP, route 0.0.0.0/0 → IGW, bidirectional). NAT Gateway = outbound only (private-subnet EKS pod/EC2 with no public IP, route 0.0.0.0/0 → NAT GW, SNAT private IP → NAT EIP).
Short interview answer: the IGW attaches to the VPC and enables two-way internet for public subnets. The NAT GW sits in a public subnet and lets private subnets reach out one way. They work together: private subnet → route 0.0.0.0/0 → NAT GW (in the public subnet), then NAT GW → IGW → internet — without an IGW, a NAT GW cannot function. Cost: NAT GW ~$32/month + $0.045/GB. PROD: 3 NAT GWs (1 per AZ).
Common follow-up: why does PROD need 3 NAT GWs? One per AZ, for HA and to avoid the cross-AZ data transfer fee (~$0.01/GB). DEV uses a single NAT GW to save ~$32/month.
❓ Q2 — PrivateLink vs Transit Gateway — when to use which?
Diagram summary — PrivateLink vs Transit Gateway:
  • AWS PrivateLink — 1 service, N consumers: consumer VPCs A/B/C (even with overlapping CIDRs or in other accounts) reach a provider VPC through interface VPC endpoints backed by an NLB + Endpoint Service (e.g. an internal API). Use when exposing one service to many VPCs/accounts, when CIDR overlap is acceptable (only the service port is exposed), or for a SaaS-provider model. Avoid when you need many services talking both ways or a full pod-to-pod mesh. This project uses VPC endpoints for S3, ECR, and STS to cut NAT GW cost.
  • Transit Gateway — hub-and-spoke full routing: VPC dev 10.1.0.0/16, VPC staging 10.2.0.0/16, VPC prod 10.0.0.0/16, and an on-prem VPN attach to the TGW with route tables. Use when N VPCs must talk to each other, for multi-account hub-and-spoke, shared VPN/Direct Connect, or centralized egress/inspection. Avoid when you only need to expose a single service (use PrivateLink) or when VPC CIDRs overlap. Cost: $0.05/attachment-hour + $0.02/GB ≈ $36/month per VPC attachment.
One-line answer: PrivateLink = expose exactly one service, one direction, and consumer CIDRs don't have to line up. Transit Gateway = full hub-and-spoke routing across many VPCs/on-prem — requires route-table design and non-overlapping CIDRs.
❓ Q3 — Ship logs from RDS & RabbitMQ to Elasticsearch
RDS PostgreSQL → Elasticsearch
  • Enable RDS logs: log_destination=csvlog, log_statement=all, log_min_duration_statement=1000 (slow queries).
  • RDS exports to CloudWatch Logs (parameter group: shared_preload_libraries=pgaudit if auditing is required).
  • Option A (managed): CloudWatch Logs → Kinesis Firehose → Elasticsearch Service (optionally with a Lambda transform).
  • Option B (self-hosted): Logstash/Filebeat inside the VPC → subscribe to the CloudWatch Logs stream → push to ES.
RabbitMQ → Elasticsearch
  • RabbitMQ exposes metrics via the Management Plugin HTTP API (:15672) and a Prometheus endpoint (:15692).
  • Logs: /var/log/rabbitmq/*.log — Filebeat sidecar (K8s) or an agent on EC2.
  • Filebeat: input.type: log → output.elasticsearch.hosts, or through a Logstash pipeline.
  • Amazon MQ (managed RabbitMQ): logs go automatically to CloudWatch Logs → same pipeline as RDS.
terraform / filebeat
# RDS: slow query + audit logs
resource "aws_db_parameter_group" "pg" {
  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # ms
  }
  parameter {
    name  = "shared_preload_libraries"
    value = "pgaudit"
  }
}
resource "aws_db_instance" "this" {
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"] # auto push to CW Logs
}

# CW Logs → Kinesis Firehose → ES
resource "aws_cloudwatch_log_subscription_filter" "rds" {
  log_group_name  = "/aws/rds/instance/prod-db/postgresql"
  filter_pattern  = ""
  destination_arn = aws_kinesis_firehose_delivery_stream.es.arn
}

# Filebeat (K8s sidecar) — example settings in filebeat.yml
# output.elasticsearch.hosts: ["https://es.prod.internal:9200"]
# index: "rabbitmq-logs-%{+yyyy.MM.dd}"
Log paths — RDS: RDS Postgres → CloudWatch Logs group → Kinesis Firehose → Elasticsearch (managed, no agent on RDS). RabbitMQ: RabbitMQ pod → Filebeat sidecar → Logstash → Elasticsearch (or Filebeat's direct output.elasticsearch).
❓ Q4 — From the internet to a private service in K8s + handling attacks

Standard flow: Internet → Route 53 → ALB (public subnet) → AWS Load Balancer Controller (Ingress) → Kubernetes Service (ClusterIP) → Pod. The ALB terminates TLS and checks auth before forwarding into the cluster.

Happy path to a private Kubernetes service: 🌍 Client (HTTPS) → Route 53 (api.domain.com) → ALB (TLS termination, WAF + auth check) → Ingress (aws-load-balancer-controller, IngressClass: alb) → Service (ClusterIP, port 8080) → go-api pod (private subnet).
Key Ingress annotations: kubernetes.io/ingress.class: alb · alb.ingress.kubernetes.io/scheme: internet-facing · alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:... · alb.ingress.kubernetes.io/target-type: ip (direct to pod IP) · alb.ingress.kubernetes.io/waf-acl-id: ... (WAF attach).
Common follow-up: what kinds of attacks, and how do you handle them?
Attack type | Layer | Mitigation | Detection
DDoS / flood | ALB layer | AWS Shield Standard (free) blocks L3/L4. WAF rate-based rule >1000 req/5min/IP → block. ALB connection limit per target. | CloudWatch RequestCount spike → alarm → SNS
SQLi / XSS | Application layer via ALB | AWS WAF managed rules: AWSManagedRulesCommonRuleSet, AWSManagedRulesSQLiRuleSet. Attach the Web ACL to the ALB. | WAF sampled requests → CloudWatch → alert
Bypass Ingress → pod directly | VPC/SG layer | Pod security group only allows traffic from the ALB SG. NetworkPolicy (Calico): deny ingress except from the ingress/controller namespace. | VPC Flow Logs → CloudWatch Insights
Path traversal / API abuse | Ingress routing | Strict Ingress path rules. Kyverno enforces non-root. Minimal RBAC for the ServiceAccount. | App logs → metric filter
Credential stuffing | ALB → App | WAF Bot Control. ALB Cognito/OIDC auth. Per-IP rate limit on /login. | WAF logs → Athena
TLS downgrade / expired cert | ALB TLS policy | ssl-policy: ELBSecurityPolicy-TLS13-1-2-2021-06. ACM auto-renew. HSTS via a response rule. | ACM expiry alert 45 days in advance
Short answer: Layer 1 — AWS Shield automatically blocks L3/L4 DDoS. Layer 2 — WAF Web ACL on the ALB: SQLi/XSS + rate limiting. Layer 3 — Security Groups: pods only accept traffic from the ALB SG. Layer 4 — NetworkPolicy: pod-to-pod isolation. Layer 5 — IRSA/Pod Identity: the pod has no AWS permissions to abuse if the container is escaped.
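As a sketch of Layer 4, a Kubernetes NetworkPolicy (enforced by Calico) that only admits traffic on the app port from the public-subnet CIDRs where the ALB ENIs live; the CIDRs follow the PROD VPC layout above, while the namespace and labels are illustrative:

networkpolicy-go-api.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: go-api-ingress
  namespace: go-api
spec:
  podSelector:
    matchLabels: { app: go-api }
  policyTypes: ["Ingress"]          # everything not matched below is denied
  ingress:
    - from:
        - ipBlock: { cidr: 10.0.1.0/24 }   # ALB ENIs live in the public subnets
        - ipBlock: { cidr: 10.0.2.0/24 }
        - ipBlock: { cidr: 10.0.3.0/24 }
      ports:
        - protocol: TCP
          port: 8080                       # the app port the ALB targets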
📋 Quick Reference — "When to use what"
Scenario | Use | Avoid
Private subnet needs to pull images from ECR | VPC Endpoint (ecr.dkr) + NAT fallback | Pulling public images over NAT unnecessarily
3+ VPCs need connectivity | Transit Gateway hub-and-spoke | N² mesh VPC peering
Expose an internal API to another account with CIDR overlap | PrivateLink (NLB + Endpoint Service) | Peering (requires disjoint CIDRs)
EKS pod → S3 without the internet | S3 Gateway Endpoint (free) + route | NAT egress for S3
Multi-account + on-prem sharing Direct Connect | TGW + Direct Connect Gateway | A separate DX per account

🏛️ AWS Landing Zone & Multi-Account Architecture

Why Multi-Account? — a single account doesn't scale
❌ Single Account (like the current project)
  • DEV + PROD in the same account → one bad IAM policy → prod data gets deleted.
  • Shared service quotas: one team exhausts the EC2 limit, another team can't deploy.
  • Blast radius: one compromised IAM role → the entire infrastructure is affected.
  • Fuzzy cost allocation: no way to tell which team spends what.
  • Compliance: PCI/HIPAA workloads share an account with dev sandboxes.
✅ What Multi-Account solves
  • Blast radius isolation: the account boundary is the strongest security boundary in AWS.
  • Per-account service quotas: prod is unaffected by dev experiments.
  • Cost clarity: cost per account → tag by team/product.
  • Least privilege: developers have no access to the prod account.
  • SCPs: the prod account is denied everything outside a whitelist, even for the root user.
Diagram — Single Account vs Multi-Account Blast Radius
Diagram summary — AWS Landing Zone, blast radius comparison:
  • ❌ Single account (shared DEV+PROD): DEV VPC 10.1.0.0/16 (EKS dev, RDS dev) and PROD VPC 10.0.0.0/16 (EKS prod, RDS prod) share one IAM namespace, service quotas, CloudTrail, AWS Config, Cost Explorer, S3 buckets, and KMS keys. Blast radius = the whole account; one compromised IAM role can reach the PROD VPC.
  • ✅ Multi-account (AWS Organizations): a DEV account (e.g. 111122223333 — where Duy's project lives today) and a PROD account (e.g. 444455556666) each get their own IAM namespace, own service quotas, and their own SCPs (dev-only actions vs deny delete*, MFA required). The account boundary limits the blast radius to the DEV account; a compromised DEV IAM role cannot reach the PROD account.
Diagram — OU Hierarchy & Core Accounts (AWS Control Tower)
Diagram summary — OU hierarchy and core accounts:
  • Root (AWS Organizations) → Management OU (the payer account — no workloads!), Security OU (created automatically by Control Tower: Audit + Log Archive), Infrastructure OU (shared services: Network + Shared Services), Workloads OU (app team accounts with Prod / Non-Prod sub-OUs), Sandbox OU (experiments, maximum SCP restrictions).
  • 📋 Audit account: Security Hub admin, GuardDuty admin, Config aggregator, Inspector central. 🪵 Log Archive account: S3 buckets for CloudTrail (all accounts), VPC Flow Logs, Config history, ALB access logs.
  • 🌐 Network account: Transit Gateway, central VPC endpoints, DNS Resolver, egress VPC. 🛠️ Shared Services account: private ECR, Secrets Manager, SSM Parameter Store, artifact registry.
  • Workloads: Prod sub-OU (strict SCPs — Prod account, EKS prod, 10.0.0.0/16); Non-Prod sub-OU (Dev account 10.1.0.0/16 — Duy's project; Staging 10.2.0.0/16, prod-like).
  • SCPs apply Root → OU → Account (additive deny; they never grant, and they constrain even the root user). Prod SCP: deny iam:CreateUser · deny cloudtrail:StopLogging · deny ec2:TerminateInstances without MFA · region lock.
  • AWS Control Tower auto-creates the Security OU, manages guardrails, and provides Account Factory (a vending machine for new accounts with baseline config). Use Customizations for Control Tower (CfCT) or Landing Zone Accelerator (LZA) for enterprise. 📍 Current state of this project: single account (no landing zone yet).

🔄 GitOps Workflow

Engineer pushes to main
GitHub Actions — CI
Push Docker Hub
Update Kustomize overlay
Flux source-controller polls Git (30s)
kustomize-controller reconciles PROD
Manual gate → PROD Kustomization
PROD gate: the PROD Kustomization normally keeps spec.suspend: true by default; the promote-prod workflow only runs after an Environment approval — Flux then reconciles automatically.
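A sketch of what that gate looks like as a Flux Kustomization; the path and source names are illustrative, not taken from the repo:

prod-kustomization.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-prod
  namespace: flux-system
spec:
  suspend: true                 # PROD gate: paused until promote-prod flips this after approval
  interval: 10m
  path: ./overlays/prod
  prune: true
  sourceRef:
    kind: GitRepository
    name: gitops-deploy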
Flux Operator — GitOps on Autopilot
Bootstrap CLI (legacy)
  • The flux bootstrap CLI generates manifests → pushes them to Git; hard to capture fully as IaC.
  • Hard to tune requests/limits per controller.
FluxInstance / Operator
  • Flux Operator + FluxInstance CRD — declarative, Terraform/kubectl friendly.
  • Upgrade by changing spec.distribution.version; the operator reconciles the controllers.
  • Kustomize patches for concurrency and the in-namespace flux-system NetworkPolicy.
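An abridged FluxInstance sketch showing the shape described above; the version, registry, and concurrency patch are illustrative values, not the repo's:

flux-instance.yaml
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"                 # upgrades = change this field; the operator reconciles the controllers
    registry: ghcr.io/fluxcd
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
    - notification-controller
  kustomize:
    patches:
      - target: { kind: Deployment, name: kustomize-controller }
        patch: |
          - op: add
            path: /spec/template/spec/containers/0/args/-
            value: --concurrent=10   # example concurrency tuning mentioned above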

🔐 GitHub Actions ↔ AWS OIDC

Static keys
  • Long-lived IAM user keys in GitHub Secrets — one leak means the access is compromised for good.
  • Manual rotation; hard to scope least privilege per workflow.
OIDC federation
  • GitHub mints a short-TTL JWT; STS AssumeRoleWithWebIdentity → temporary credentials.
  • The trust policy restricts repo/branch/environment; CloudTrail gives a full audit trail.

Full trust policy, workflow YAML, and the complete diagram: GitHub Actions OIDC → AWS.
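A minimal workflow sketch of the OIDC exchange; the role ARN, account ID, and workflow name are placeholders:

.github/workflows/push-image.yml
name: push-image
on: { push: { branches: [main] } }
permissions:
  id-token: write      # allow the job to request a GitHub OIDC token
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/github-actions-ci   # trust policy limits repo/branch
          aws-region: us-east-2
      - run: aws sts get-caller-identity   # temporary credentials via AssumeRoleWithWebIdentity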

⚡ Advanced Auto-Scaling

Multi-layer strategy: Karpenter (nodes, 30–60s), KEDA (event-driven pods), HPA (CPU/mem, 3–11 replicas). The classic Cluster Autoscaler typically takes 3–5 minutes to bring up a new node.

Karpenter v1.4.2
Fast node provisioning, SPOT, bin-packing, AL2023 AMI family.
30–60 sec / node
KEDA 2.18.x
Scales on CloudWatch, Prometheus, Kafka; scale-to-zero when needed.
Custom metrics
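A sketch of an event-driven KEDA ScaledObject for one of the apps, assuming a Prometheus-compatible in-cluster endpoint; the server address, query, and threshold are illustrative:

keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: go-api
  namespace: go-api
spec:
  scaleTargetRef:
    name: go-api                   # Deployment to scale
  minReplicaCount: 3
  maxReplicaCount: 11
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://vmselect.monitoring:8481/select/0/prometheus   # hypothetical in-cluster endpoint
        query: sum(rate(http_requests_total{app="go-api"}[2m]))
        threshold: "100"           # add replicas above ~100 req/s per replica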
HPA
CPU > 70% / Memory > 80% — runs alongside KEDA on the metrics API.
3 → 11 replicas
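The equivalent HPA, mirroring the 3–11 replica and 70%/80% targets quoted above; the Deployment name is illustrative:

hpa-go-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: go-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-api
  minReplicas: 3
  maxReplicas: 11
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Resource
      resource:
        name: memory
        target: { type: Utilization, averageUtilization: 80 }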
📋 Component version notes
Component | Old | New | Reason
AWS EKS | 1.31 | 1.34 | Standard support, containerd 2.x, DRA / storage improvements
Karpenter | v1.0.5 | v1.4.2 | Stable v1 API; AL2023 AMI
KEDA | 2.16.1 | 2.18.x | Supported release line
📊 Sample HPA status (PROD)
Application | Replicas | Min/Max | CPU | Memory
go-app-prod | 3 | 3 / 10 | 0% | 34%
demo-app-prod | 3 | 3 / 9 | 1% | 62%
api-app-prod | 3 | 3 / 11 | 1% | 34%

🔒 Security & Compliance

Account layer (SCP, CloudTrail) → VPC (SG, NACL, flow logs) → cluster (Kyverno, NetworkPolicy, IRSA) → workload (non-root, read-only rootfs). The HTML wiki has an additional defense-grid diagram; its semantics are preserved in the checklist and the stack cards below.

🛠️ Technology Stack

☁️ Cloud Platform
  • AWS EKS (Kubernetes 1.34)
  • VPC with public/private subnets
  • Application Load Balancers
  • AWS Secrets Manager
  • KMS for encryption
  • EFS for shared storage (RWX)
  • 9 VPC Endpoints
🔄 GitOps & CI/CD
  • FluxCD Operator (auto-reconcile, health checks)
  • GitHub Actions Reusable Workflows
  • Kustomize for config mgmt
  • Multi-platform Docker builds
  • Automated testing pipeline
  • FluxInstance + ResourceSet pattern
⚡ Auto-Scaling
  • Karpenter v1.4.2 — node scaling
  • KEDA 2.18.x — event-driven
  • HPA — 3–11 replicas
  • Fast provisioning (seconds)
  • Cost-optimized SPOT nodes
  • Multi-AZ distribution
📊 Observability
  • Grafana Cloud (SaaS)
  • Grafana Alloy agents
  • Prometheus metrics
  • Loki log aggregation
  • Kube State Metrics
  • Node Exporter
🔒 Security & Policy
  • Kyverno — policy engine
  • External Secrets Operator
  • AWS Secrets Manager
  • Calico network policies
  • IRSA for AWS access
  • Non-root containers
💻 Applications
  • Go REST API
  • Node.js Express apps
  • React frontends
  • REST APIs
  • Health check endpoints
  • Spring Security (Basic Auth)

🏗️ Infrastructure as Code

Terraform Module | What It Creates
VPC Module | Public/private subnets, NAT gateways (3, 1 per AZ), 9 VPC endpoints, route tables, IGW
EKS Module | Managed Kubernetes cluster, managed node groups (2× t3.medium), OIDC provider for IRSA
Karpenter Module | IAM roles, EC2NodeClass, NodePool, aws-auth ConfigMap entry
ALB Controller Module | IAM role with IRSA, Helm release, IngressClass
External Secrets Module | IAM role with IRSA, Helm release, ClusterSecretStore
🏷️ AWS Resource Tagging Convention — Cost Allocation & Module Attribution

Every Terraform module follows a consistent tagging pattern: tags derive from basename(abspath(path.module)) — cost breakdown per team/product, and the terraform-module tag updates itself when a folder is renamed.

modules/_common/labels.tf
# modules/_common/labels.tf — shared by every module
locals {
  default_tags = {
    (basename(abspath(path.module))) = var.name
    "terraform-module"               = basename(abspath(path.module))
    "product"                        = var.product
    "team"                           = local.resolved_team
    "environment"                    = var.environment
    "managed-by"                     = "terraform"
  }
  merged_tags = merge(local.default_tags, var.tags)
}

variable "name"        { type = string }
variable "product"     { type = string }
variable "environment" { type = string }

variable "tags" {
  type    = map(string)
  default = {}
}

variable "team" {
  type    = string
  default = ""
}

variable "squad" {
  type        = string
  default     = ""
  description = "Deprecated: use team instead"
}

locals {
  # coalesce skips null and empty strings: team wins, then the legacy squad, then "unknown"
  resolved_team = coalesce(var.team, var.squad, "unknown")
}
modules/eks/main.tf (pattern)
# modules/eks/main.tf — consuming the pattern
module "eks" {
  source      = "../modules/eks"
  name        = "prod-cluster"
  product     = "platform"
  team        = "infra"
  environment = "prod"
  tags = {
    "cost-center" = "engineering"
    "criticality" = "high"
  }
}

resource "aws_eks_cluster" "this" {
  name = var.name
  tags = local.merged_tags
}

🔑 IRSA vs EKS Pod Identity

The core problem — why does a Pod need IAM?

Pods in EKS need to call AWS APIs (S3, Secrets Manager, DynamoDB…). Without a dedicated mechanism, pods fall back to the node IAM role — every pod on the node shares the same permissions, violating least privilege. IRSA and EKS Pod Identity both solve this, via two different mechanisms.

Timeline: IRSA (2019): OIDC provider + STS AssumeRoleWithWebIdentity. EKS Pod Identity (Nov 2023): namespace/SA → role mapping in the control plane; a simple trust policy with Service pods.eks.amazonaws.com — the recommended option in the AWS docs.

The HTML wiki has the full IRSA vs EKS Pod Identity comparison diagram (the big flow). The React page keeps the decision table and side-by-side Terraform — enough of the semantics for an interview.

⚔️ IRSA vs Pod Identity — Decision Table
Criterion | IRSA | Pod Identity
Setup complexity | High — OIDC + hardcoded trust | Low — one association command
IAM Trust Policy | AssumeRoleWithWebIdentity + OIDC URL + sub | AssumeRole + TagSession + Service pods.eks.amazonaws.com
Multi-cluster | N clusters = N OIDC providers = N trust policies | One role reused across clusters
Cross-account | Native via the trust policy | Needs extra steps, more complex
Cluster rotation | OIDC URL changes → update trust | Association re-binds, less IAM churn
Fargate | Yes | Not yet (as of 2024)
Addon | Built-in webhook | eks-pod-identity-agent
Project uses it for | ESO, Loki, cross-account | go-api, VMAgent, new workloads
Terraform — IRSA
irsa.tf
# IRSA — OIDC provider + trust scoped to the service-account sub
resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url               = aws_eks_cluster.prod.identity[0].oidc[0].issuer
}

data "aws_iam_policy_document" "irsa_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${aws_iam_openid_connect_provider.eks.url}:sub"
      values   = ["system:serviceaccount:go-api:go-api-sa"]
    }
  }
}
Terraform — Pod Identity
pod-identity.tf
# Pod Identity — generic trust policy
data "aws_iam_policy_document" "pod_identity_assume" {
  statement {
    actions = ["sts:AssumeRole", "sts:TagSession"]
    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }
  }
}

resource "aws_eks_addon" "pod_identity" {
  cluster_name = aws_eks_cluster.prod.name
  addon_name   = "eks-pod-identity-agent"
}

resource "aws_eks_pod_identity_association" "go_api" {
  cluster_name    = aws_eks_cluster.prod.name
  namespace       = "go-api"
  service_account = "go-api-sa"
  role_arn        = aws_iam_role.go_api.arn
}
🛣️ Migration — the project uses both
IRSA — kept for
  • External Secrets Operator — Secrets Manager
  • Loki S3 backend, Tempo traces bucket
  • Cross-account (staging → prod S3)
  • Fargate-based workloads
Pod Identity
  • go-api, api-app — S3 read, SSM
  • VMAgent — CloudWatch, S3 queue
  • Karpenter, ALB Ingress Controller
  • New workloads — the default
Rule of thumb: Pod Identity for new same-account workloads. IRSA for cross-account or Fargate. Long term: migrate off IRSA once the cross-account dependencies are gone.

📊 Grafana Cloud → Self-Hosted Operator

Why migrate off Grafana Cloud?
Grafana Cloud
  • Free tier: 10k series, 50GB logs/traces
  • Alloy pushes data out of the VPC over HTTPS
  • Limited retention and dashboard version control
  • At scale the price climbs quickly; data leaves the VPC (compliance)
Self-hosted + Operator
  • Grafana, Prometheus, Loki, Tempo in-cluster — data never leaves the VPC
  • Dashboards/datasources/alerts = CRDs → GitOps (Flux)
  • Retention under your control (Loki compactor, Prometheus TSDB)
  • Cost is mostly EBS/S3; tradeoff: you operate HA, upgrades, and backups

The original wiki has a ~920×480 Grafana Operator diagram (Git → Flux → Operator → Grafana + Prometheus/Loki/Tempo/Alloy). This page keeps the prose, table, and Helm snippet; see the HTML wiki for the pixel-level diagram.

⚖️ Grafana Cloud vs Self-Hosted
Criterion | Cloud | Self-hosted
Cost @ ~50k series | ~$99/mo | ~$20/mo (EBS+S3)
Dashboard GitOps | Click-ops UI | GrafanaDashboard CRD
Data locality | Leaves the VPC | In-VPC
Setup time | ~5 minutes | ~2h + ops
Alerts | UI | GrafanaAlertRule CRD
🗺️ Migration path (abridged)
  • Deploy the Grafana Operator + a Grafana instance (HelmRelease/Flux)
  • Export dashboards from Cloud → GrafanaDashboard CRDs
  • kube-prometheus-stack (with the bundled Grafana disabled)
  • Loki + Alloy → in-cluster Loki replaces Cloud
  • Tempo + OTLP in the apps
  • GrafanaDatasource CRDs → Prometheus/Loki/Tempo
  • GrafanaAlertRule CRDs → Git
  • Cutover: stop Alloy pushing to Cloud
helm / CRD sketch
# Grafana Operator + instance (abridged)
helmRelease: grafana-operator
chart: grafana-operator/grafana-operator
---
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: prod-grafana
  labels: { dashboards: prod }
spec:
  deployment: { spec: { replicas: 2 } }
  config:
    server: { root_url: https://grafana.gitops.duyne.me }
---
# kube-prometheus-stack: grafana.enabled: false
# loki: S3 storage for chunks via IRSA
The Loki/Tempo S3 backends need IRSA: annotate the service account with an IAM role that allows s3:PutObject on the chunks/traces buckets.
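A sketch of the GrafanaDashboard CRD step from the migration path, matching the dashboards: prod label on the prod-grafana instance above; the dashboard JSON is just a stub:

grafana-dashboard.yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: go-api-overview
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels: { dashboards: prod }   # binds to the Grafana instance labelled above
  json: >
    { "title": "go-api overview", "panels": [], "schemaVersion": 39 }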

🚀 Vector — Log pipeline

Vector vs Alloy vs Fluent Bit vs Logstash

The project uses Grafana Alloy (metrics + logs + traces). Vector (Rust) is a high-throughput, low-footprint alternative; both are unified agents — replacing a separate Promtail + Fluent Bit + OTel collector lineup.

The wiki has a ~960×400 Vector pipeline diagram (sources → VRL transforms → sinks). Below is the full comparison table.

Criterion | Vector | Alloy | Fluent Bit | Logstash
Language | Rust | Go | C — very light | JVM — heavy
Memory (idle) | ~10–30 MB | ~50–80 MB | ~1–5 MB | ~300–500 MB
Throughput | ~10M ev/s (benchmark) | ~3–5M ev/s | ~1–2M ev/s | ~500K ev/s
Unified agent | metrics + logs + traces | OTel-native + Loki/Tempo | Logs | Logs
Transforms | VRL, type-safe | River + otelcol | Limited Lua | Grok/plugins
K8s metadata | kubernetes_logs source | Built-in discovery | K8s filter | Via Filebeat
Grafana stack | Neutral | First-class | OK | ELK-oriented
Use when | High throughput, complex VRL, multi-sink fan-out | Already on the Grafana stack, comfortable with River | Edge / simple log forwarding only | Legacy Grok / ELK
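For comparison with the Alloy setup, a minimal Vector sketch of the sources → VRL → sinks pipeline mentioned above; the Loki endpoint and labels are illustrative, not taken from any real deployment:

vector.yaml
sources:
  k8s_logs:
    type: kubernetes_logs            # tails pod logs and attaches K8s metadata
transforms:
  parse_app:
    type: remap                      # VRL transform
    inputs: [k8s_logs]
    source: |
      .app = .kubernetes.pod_labels.app ?? "unknown"
      .message = string!(.message)
sinks:
  loki:
    type: loki
    inputs: [parse_app]
    endpoint: http://loki.monitoring:3100   # hypothetical in-cluster Loki
    encoding:
      codec: json
    labels:
      app: "{{ app }}"
      namespace: "{{ kubernetes.pod_namespace }}"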

📈 VictoriaMetrics — Cluster → Operator

Why migrate to vm-operator?
Before — manual Helm
  • Helm per cluster → staging/prod drift
  • vminsert/vmstorage/vmselect configs diverge over time
  • Scrape/alert rules in ConfigMaps — hard to review, weak GitOps
  • Ops overhead ~8h/month
After — Operator
  • VMCluster CRD — manages the vminsert/storage/select lifecycle
  • VMAgent + VMServiceScrape/VMPodScrape — sharding, remote_write
  • VMRule CRD — validated + GitOps
  • Upgrades: change the image tag → the operator rolls it out
  • ResourceSetInputProvider → multi-environment parity
  • Ops ~45 minutes/month (wiki estimate)
vm-operator closely mirrors the Prometheus Operator (ServiceMonitor → VMServiceScrape, PrometheusRule → VMRule) — migrating from kube-prometheus-stack is fairly straightforward.
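A VMServiceScrape sketch — the VMServiceScrape counterpart of a ServiceMonitor; the service labels, port name, and interval are illustrative:

monitoring/vmservicescrape.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: go-api
  namespace: go-api
spec:
  selector:
    matchLabels: { app: go-api }   # matches the Service exposing /metrics
  endpoints:
    - port: http                   # Service port name
      path: /metrics
      interval: 30s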

The wiki has a ~960×560 architecture SVG (GitOps → vm-operator → VMCluster/VMAgent/VMAlert/Grafana). Below: the reasons to migrate plus sample YAML.

VMCluster CRD (abridged)
monitoring/vmcluster.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vm-prod
  namespace: monitoring
spec:
  retentionPeriod: "3"
  replicationFactor: 2
  vminsert:
    replicaCount: 2
    image: { tag: v1.101.0-cluster }
  vmstorage:
    replicaCount: 3
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources: { requests: { storage: 100Gi } }
  vmselect:
    replicaCount: 2
    extraArgs:
      dedup.minScrapeInterval: 1m

Flux ResourceSet + ResourceSetInputProvider: one VMCluster/VMAgent template for every environment; only the defaultValues (replicas, storage, scrape_interval) change — avoiding config drift.
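A rough sketch of that ResourceSet idea — one template rendered with per-environment inputs. Treat the field names and templating delimiters as approximate and check the Flux Operator API reference before using:

resourceset-vmcluster.yaml
apiVersion: fluxcd.controlplane.io/v1
kind: ResourceSet
metadata:
  name: vmcluster-envs
  namespace: monitoring
spec:
  inputs:                          # one entry per environment
    - env: staging
      replicas: 1
      storage: 50Gi
    - env: prod
      replicas: 2
      storage: 100Gi
  resources:                       # template instantiated once per input
    - apiVersion: operator.victoriametrics.com/v1beta1
      kind: VMCluster
      metadata:
        name: vm-<< inputs.env >>
        namespace: monitoring
      spec:
        replicationFactor: << inputs.replicas >>
        vmstorage:
          storage:
            volumeClaimTemplate:
              spec:
                resources: { requests: { storage: << inputs.storage >> } }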

🛡️ Kyverno — Admission control

What is Kyverno?

Kyverno ("to govern") là policy engine trong cluster — dynamic admission controller giữa kubectl/CI và etcd.

RBAC controls who can do what — it does not control whether a manifest deploys as root, without limits, or with a :latest image. Kyverno fills that gap.

A request flows through Authn/RBAC → mutating webhook (Kyverno) → validating webhook (Kyverno) → etcd.

📋 Four policy types
Validate

Rule violation → Enforce blocks it, Audit logs it.

  • non-root
  • limits set
  • no :latest
  • correct namespace/env
Mutate

Auto-patch: seccomp, env label, default limits.

  • inject seccomp RuntimeDefault
  • imagePullPolicy Always
Generate

New namespace → NetworkPolicy/Quota/RoleBinding.

  • multi-tenant guardrails
VerifyImages

Cosign/Sigstore — only signed images are admitted.

  • supply chain
  • SOC2/PCI provenance
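A sketch of a Mutate policy like the seccomp injection described above; the policy and rule names are illustrative, not the project's:

mutate-seccomp.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-seccomp
spec:
  rules:
    - name: add-seccomp-runtimedefault
      match:
        any:
          - resources:
              kinds: [Pod]
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              +(seccompProfile):          # "+()" anchor: only added when the pod doesn't set one
                type: RuntimeDefault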
⚖️ Audit vs Enforce
Aspect | Audit | Enforce
On violation | Allowed through + PolicyReport | 403 block
Cluster impact | Workloads keep running | Deploy fails
Use when | Early phase / brownfield | After policies are tested
Project | Currently Audit — violations logged, not yet fully blocking | Enforce — on the roadmap
🥊 Kyverno vs Gatekeeper (abridged)
Feature | Gatekeeper | Kyverno
Mutation | Beta (Assign) | GA, YAML patch
Generation | Not native | Yes (NetworkPolicy…)
Image verify | Extension | Built-in Cosign
Policy language | Rego | Native YAML
PolicyReport | No | Self-service CRD
Exception TTL | Custom | PolicyException
Multi-platform | OPA everywhere | K8s only
Why Kyverno: YAML policies live in Git; kyverno test runs in CI; Cosign is built in; PolicyException supports a TTL. Gatekeeper makes sense if you have already invested in OPA for Terraform/APIs.
📄 Active policies (project)
Policy | Type | Category | Purpose
detect-mixed-environments | Validate | Isolation | Warn when dev/prod are mixed
enforce-environment-separation | Validate | Isolation | Block cross-environment placement
kyverno-block-dev-in-prod | Validate | Isolation | Dev workloads cannot land in prod namespaces
kyverno-block-prod-in-dev | Validate | Isolation | Prod workloads cannot land in dev namespaces
require-resource-quotas | Validate | Cost | Quotas for dev namespaces
warn-anti-patterns | Validate | Best practice | :latest, missing limits, auto-sync to prod
kyverno-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: kyverno-block-dev-in-prod
spec:
  validationFailureAction: Audit
  rules:
    - name: block-dev-label-in-prod
      match:
        resources:
          namespaceSelector:
            matchLabels: { env: prod }
      validate:
        deny:
          conditions:
            - key: "{{ request.object.metadata.labels.env || '' }}"
              operator: Equals
              value: dev

📦 Repository structure

gitops-infra
Infrastructure as Code

Multi-environment Terraform modules: VPC, EKS, Karpenter, ALB Controller, ESO, observability addons, Kyverno. Remote state in S3 + DynamoDB.

Terraform · AWS · Helm · S3 State
gitops-deploy
Kubernetes Manifests

Kustomize base + DEV/PROD overlays, Flux Kustomization, FluxInstance, ResourceSet (monitoring, apps).

Kustomize · Flux Operator · Kyverno · Calico
go-api
Go REST API

Go REST API: router, middleware, zerolog, tests, multi-arch Docker, /metrics Prometheus.

Go · Prometheus · Docker
gitops-demo-app
Node.js Express

Express demo: multi-stage Docker, non-root user, HEALTHCHECK, GitOps patterns.

Node.js · Express · Docker
Enterprise GitOps on AWS EKS · React port · DevOps Lab