
How to Set Up ArgoCD for Production in AWS
A step-by-step guide to implementing GitOps with ArgoCD on AWS EKS, from installing ArgoCD with Terraform to managing multiple Kubernetes environments with automated freeze/unfreeze workflows.
Prerequisites
Before starting, make sure you have the following ready:
- An AWS EKS cluster up and running (provisioned via Terraform or the AWS console)
- kubectl installed and configured to connect to your EKS cluster
- Helm 3 installed
- Terraform >= 1.0.0 installed
- A GitHub repository to use as your GitOps repo
- Basic familiarity with Kubernetes resources and YAML manifests
Goals
By the end of this guide, you will:
- Install ArgoCD on EKS using Terraform and Helm
- Set up the App of Apps pattern to manage multiple environments
- Configure ArgoCD Image Updater to watch ECR for new image tags
- Automate environment freezing and production pushes with a Python script
- Understand the full weekly release workflow from dev to production
Why ArgoCD?
ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. The core idea is simple: your Git repository is the single source of truth for what should be running in your cluster. ArgoCD watches the repo and keeps the cluster in sync.
This gives you a few important things:
- Automated deployment: ArgoCD continuously reconciles what's in Git with what's in the cluster, so changes get applied automatically when you merge a PR
- Rollbacks: every change is a Git commit, so rolling back is just reverting a commit
- Audit trail: Git history becomes your deployment history
- Monitoring: ArgoCD's UI shows sync status, health checks, and drift detection out of the box
Installing ArgoCD on EKS with Terraform
Instead of running helm install manually, let's declare the ArgoCD installation in Terraform. This way the installation itself is version-controlled and reproducible.
First, set up the Helm provider. This tells Terraform how to talk to your cluster:
provider.tf
    provider "helm" {
      kubernetes {
        config_path = "~/.kube/config"
      }
    }
    terraform {
      required_version = ">= 1.0.0"
      required_providers {
        helm = {
          source  = "hashicorp/helm"
          version = "= 2.5.1"
        }
      }
    }
Note: If you're running Terraform in a CI pipeline, you'll want to use host, cluster_ca_certificate, and token instead of config_path to avoid depending on a local kubeconfig file.
Now define the Helm release resource for ArgoCD:
argocd.tf
    resource "helm_release" "argocd" {
      name             = "argocd"
      repository       = "https://argoproj.github.io/argo-helm"
      chart            = "argo-cd"
      namespace        = "argocd"
      create_namespace = true
      version          = "3.35.4"
      values           = [file("values/argocd.yaml")]
    }
The values parameter points to a custom values file where you configure ArgoCD's behavior. Here's a minimal starting point:
values/argocd.yaml
    global:
      image:
        tag: "v2.6.6"
    dex:
      enabled: false
    server:
      extraArgs:
        - --insecure
A few notes on these values:
- Dex is disabled because in this setup we're not using SSO. If you need OIDC or LDAP authentication, you'll want to enable it and configure your provider.
- The --insecure flag disables TLS on the ArgoCD server itself. This is fine if you're terminating TLS at your ingress or load balancer, which is the typical setup on EKS with an ALB. Don't use this if the server is directly exposed.
Run terraform apply and ArgoCD will be installed in the argocd namespace.
Accessing the ArgoCD Web UI
To access the ArgoCD dashboard, port-forward the server service:
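    kubectl port-forward svc/argocd-server -n argocd 8080:443
Then open http://localhost:8080 in your browser. Because the server runs with --insecure, it serves plain HTTP; use https://localhost:8080 only if you removed that flag.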
The default username is admin. To get the initial password:
    kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
Tip: Change this password immediately after first login, or better yet, configure SSO and disable the admin account entirely for production use.
The 5-Environment Setup
Most companies run something like this:
- Sandbox: isolated, just for experimentation, nothing breaks anything real
- Development: devs integrate their code with other microservices here
- QA/Testing: dedicated testers work here, separately from devs; the split in responsibility is intentional
- Staging/Pre-prod: should mirror production as closely as possible, data size and all
- Production: real clients, least privilege access, treat carefully
Not everyone needs all five. If you're trying to cut costs, you can collapse staging into dev with a freeze/unfreeze approach (more on this below).
The Weekly Release Trick (Freeze/Unfreeze Cycle)
If you deploy to production every Wednesday for example, here's a workflow that makes sense:
- Devs deploy freely to dev environment up until Tuesday
- Tuesday: freeze dev environment (stop all new version deployments)
- QA tests everything, automated and manual
- Take a snapshot of the tested versions
- Wednesday: production push via PR
- After successful push: unfreeze dev so devs can keep going
The freeze is done by adding an ignore annotation to ArgoCD Application resources. Doing this manually for 100 microservices is a nightmare, so you automate it with a script.
What the GitOps Script Does
There are 3 main actions:
Pause (freeze)
- Creates a new branch in the remote GitOps repo
- Adds ignore annotations to all Application resources in the target environment
- Opens a PR for the team to review and merge to actually freeze
Push (production push)
- Takes source env (dev), collects all the currently deployed image tags
- Updates those versions in the production environment files
- Opens a PR for the production push
Resume (unfreeze)
- Removes the ignore annotations
- Opens a PR to revert back to continuous delivery
All actions go through PRs, never direct commits to main. That's the point of GitOps.
Infrastructure Stack Used
- EKS on AWS (Terraform to provision)
- ArgoCD for continuous delivery (installed via Terraform + Helm)
- ArgoCD Image Updater to watch ECR for new image tags
- ECR as private container registry
- EKS Pod Identities for IAM auth (preferred over IRSA with OIDC)
- GitHub as the GitOps repo (deploy key with read/write)
- Python script to automate freeze/push/unfreeze
App of Apps Pattern
Instead of applying each ArgoCD Application resource manually, you create one parent Application that points to an entire environment folder. ArgoCD recursively applies everything under it.
    # Parent app points to the whole env folder
    spec:
      source:
        path: envs/dev  # All app manifests live here
      syncPolicy:
        automated:
          selfHeal: true
          prune: true
This is how you bootstrap dev and prod environments with a single kubectl apply. The parent app watches the folder, and any new Application YAML you add to that folder gets picked up automatically.
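To make the parent app concrete, here is a minimal sketch that builds a complete parent Application manifest as a Python dict and prints it as JSON. The app name, repository URL, branch, project, and destination are placeholders that are not part of the original setup. The directory.recurse setting is an assumption added here: with one subfolder per service, ArgoCD's directory source only descends into subfolders when recursion is enabled.

```python
import json


def parent_application(env: str, repo_url: str) -> dict:
    """Build a parent (App of Apps) Application manifest for one environment."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Hypothetical naming scheme: "<env>-apps" in the argocd namespace.
        "metadata": {"name": f"{env}-apps", "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumption: the built-in default project
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",  # assumption: main is the tracked branch
                "path": f"envs/{env}",
                # Assumption: recurse so per-service subfolders are picked up.
                "directory": {"recurse": True},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            "syncPolicy": {"automated": {"selfHeal": True, "prune": True}},
        },
    }


if __name__ == "__main__":
    # Placeholder repository URL; replace with your GitOps repo.
    manifest = parent_application("dev", "git@github.com:example-org/gitops.git")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output can be piped to kubectl apply -f - to bootstrap an environment.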
Development Environment: Continuous Delivery Annotations
For the dev environment, Image Updater annotations on each Application resource tell it what to watch:
    annotations:
      argocd-image-updater.argoproj.io/image-list: payments=<account>.dkr.ecr.<region>.amazonaws.com/payments
      argocd-image-updater.argoproj.io/payments.update-strategy: semver
      argocd-image-updater.argoproj.io/write-back-method: git
The write-back-method: git is important when using app of apps. Image Updater commits new image tags back to the repo, then ArgoCD picks up the change and syncs.
Note: The default image check interval is 2 minutes. Don't lower it: too many ECR API calls can get throttled.
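With many services, these annotations are easier to generate than to hand-write. The following is a rough sketch of a helper that produces the three annotations for one service. The function name is hypothetical, and the account ID and region in the usage example are made-up placeholders.

```python
PREFIX = "argocd-image-updater.argoproj.io"


def image_updater_annotations(alias: str, image: str, strategy: str = "semver") -> dict:
    """Return the Image Updater annotations for a single image alias."""
    return {
        f"{PREFIX}/image-list": f"{alias}={image}",
        f"{PREFIX}/{alias}.update-strategy": strategy,
        # Git write-back, as required by the app of apps setup in this guide.
        f"{PREFIX}/write-back-method": "git",
    }


if __name__ == "__main__":
    # Placeholder account ID and region, for illustration only.
    image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/payments"
    for key, value in image_updater_annotations("payments", image).items():
        print(f"{key}: {value}")
```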
Freezing with the Ignore Annotation
To pause a service, Image Updater respects this:
    annotations:
      argocd-image-updater.argoproj.io/payments.ignore-tags: "*"
The * is a match pattern that covers every tag, so Image Updater finds nothing to deploy. The script adds this annotation to all apps in the target environment in one go.
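As a minimal sketch of the freeze and its reversal, the two functions below work on an Application manifest that has already been parsed into a dict (for example with a YAML library). The function names are hypothetical. The annotation suffix is assumed to be ignore-tags, the key listed in the Image Updater documentation; adjust the constant if your version differs.

```python
import copy

PREFIX = "argocd-image-updater.argoproj.io"
IGNORE_SUFFIX = "ignore-tags"  # assumption: key name from the Image Updater docs


def ignore_key(alias: str) -> str:
    """Full annotation key that freezes one image alias."""
    return f"{PREFIX}/{alias}.{IGNORE_SUFFIX}"


def pause(app: dict, alias: str) -> dict:
    """Return a copy of the manifest with the ignore annotation added."""
    frozen = copy.deepcopy(app)
    metadata = frozen.setdefault("metadata", {})
    annotations = metadata.get("annotations") or {}
    annotations[ignore_key(alias)] = "*"
    metadata["annotations"] = annotations
    return frozen


def resume(app: dict, alias: str) -> dict:
    """Return a copy of the manifest with the ignore annotation removed."""
    live = copy.deepcopy(app)
    annotations = (live.get("metadata") or {}).get("annotations") or {}
    # pop with a default so resuming an app that was never paused is a no-op.
    annotations.pop(ignore_key(alias), None)
    return live
```

Both functions return a modified copy instead of mutating their input, which keeps the original manifest available for building the PR diff.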
IAM / ECR Auth for Image Updater
I needed to set up a few things for Image Updater to pull from private ECR:
- Create an IAM role with read-only ECR policy
- Associate it with the Image Updater's Kubernetes service account via Pod Identities
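- Mount a small shell script inside Image Updater that fetches a temporary token:
    aws ecr get-authorization-token --region <region> --output text --query 'authorizationData[].authorizationToken' | base64 -d
The --output and --query flags reduce the JSON response to the bare token, so base64 -d receives only the encoded string. Image Updater runs this script whenever the token expires. Simple but it works.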
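For context on what the decode step yields: an ECR authorization token is the base64 encoding of a username and password joined by a colon, with AWS as the username. A small illustrative sketch, using a made-up token in place of a real one:

```python
import base64


def split_ecr_token(authorization_token: str) -> tuple[str, str]:
    """Decode an ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # Split on the first colon only, in case the password contains colons.
    username, password = decoded.split(":", 1)
    return username, password


if __name__ == "__main__":
    # Fake token built locally for illustration; a real one comes from the AWS CLI.
    fake_token = base64.b64encode(b"AWS:example-password").decode("ascii")
    user, _password = split_ecr_token(fake_token)
    print(user)
```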
Folder Structure in the GitOps Repo
    envs/
      dev/
        payments/
          application.yaml
        users/
          application.yaml
      prod/
        payments/
          application.yaml
        users/
          application.yaml
    helm-charts/
      payments/
      users/
The Python script iterates over subfolders inside each environment. The folder names don't matter: as long as there's one folder per service, the script picks them all up.
Image Updater writes new image tags under the helm-charts/ folder (not directly in the env folder), using a file like argocd-source.yaml per chart.
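The folder iteration can be sketched as a pure function over a list of repository paths, such as one returned by a Git tree listing. This is a simplified stand-in for the script's logic, not the script itself; the function name is hypothetical and the layout is assumed to be exactly envs/<env>/<service>/application.yaml.

```python
from pathlib import PurePosixPath


def application_files(paths: list[str], env: str) -> dict[str, str]:
    """Map each service name to its application.yaml path under envs/<env>/."""
    found = {}
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if (
            len(parts) == 4
            and parts[0] == "envs"
            and parts[1] == env
            and parts[3] == "application.yaml"
        ):
            found[parts[2]] = raw  # parts[2] is the service folder name
    return found


if __name__ == "__main__":
    repo_paths = [
        "envs/dev/payments/application.yaml",
        "envs/dev/users/application.yaml",
        "envs/prod/payments/application.yaml",
        "helm-charts/payments/Chart.yaml",
    ]
    print(application_files(repo_paths, "dev"))
```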
The Python Script: Key Parts
The script authenticates with GitHub via a fine-grained personal access token (permissions needed: contents + pull requests). It uses the PyGithub package to interact with remote files without cloning locally.
Pause function: adds the ignore annotation to each app's YAML, commits to a new branch, opens a PR.
Resume function: removes the annotation using dict.pop(), commits, opens a PR.
Collect versions: reads argocd-source.yaml from each Helm chart folder to grab the currently deployed image tag per service. Returns a dict like {"payments": "2.5.0", "users": "0.5.0"}.
Production push: iterates over all prod app YAMLs, replaces the image tag in Helm params with the version pulled from the frozen dev environment, commits, opens a PR.
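The collect and push steps might look roughly like the sketch below. It works on already-parsed YAML (plain dicts) and leaves out the GitHub calls. Two things are assumptions not confirmed by the text above: that the image tag lives in a Helm parameter named image.tag, and that the parameters sit under helm.parameters in the source file and under spec.source.helm.parameters in the Application. All function names are hypothetical.

```python
import copy

TAG_PARAM = "image.tag"  # assumption: Helm parameter that carries the image tag


def tag_from_source(source: dict):
    """Return the image tag recorded in one parsed source file, or None."""
    for param in (source.get("helm") or {}).get("parameters") or []:
        if param.get("name") == TAG_PARAM:
            return str(param["value"])
    return None


def collect_versions(sources: dict) -> dict:
    """Map service name -> deployed tag, skipping services without a tag."""
    versions = {}
    for service, source in sources.items():
        tag = tag_from_source(source)
        if tag is not None:
            versions[service] = tag
    return versions


def set_prod_tag(app: dict, tag: str) -> dict:
    """Return a copy of a prod Application with its image tag replaced."""
    updated = copy.deepcopy(app)
    helm = updated["spec"]["source"].setdefault("helm", {})
    params = helm.setdefault("parameters", [])
    for param in params:
        if param.get("name") == TAG_PARAM:
            param["value"] = tag
            break
    else:
        # No existing tag parameter: add one rather than fail.
        params.append({"name": TAG_PARAM, "value": tag})
    return updated


if __name__ == "__main__":
    dev_sources = {
        "payments": {"helm": {"parameters": [{"name": "image.tag", "value": "2.5.0"}]}},
        "users": {"helm": {"parameters": [{"name": "image.tag", "value": "0.5.0"}]}},
    }
    print(collect_versions(dev_sources))
```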
Branch naming convention used:
- pause-dev-2026-01-12
- resume-dev-2026-01-12
- prod-push-2026-01-12
Image Constraints Per Service
You can set semantic version constraints on which tags Image Updater will deploy automatically. For example, the users service only auto-deploys patch/minor updates (not major), since it has API consumers:
    argocd-image-updater.argoproj.io/users.allow-tags: regexp:^0\.[0-9]+\.[0-9]+$
The payments service in this example has no constraint and takes any new tag.
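To check which tags a pattern like this lets through before committing it, you can try it locally. The sketch below uses Python's re module; Image Updater itself evaluates the expression with Go's regexp engine, but for a simple anchored pattern like this one the two should agree.

```python
import re

# Same expression as in the allow-tags annotation, without the "regexp:" prefix.
USERS_ALLOW_TAGS = re.compile(r"^0\.[0-9]+\.[0-9]+$")


def is_allowed(tag: str) -> bool:
    """True if the tag matches the allow-tags pattern for the users service."""
    return USERS_ALLOW_TAGS.match(tag) is not None


if __name__ == "__main__":
    for tag in ["0.5.0", "0.12.3", "1.0.0", "v0.5.0", "0.5.0-rc1"]:
        print(tag, is_allowed(tag))
```

Only plain 0.x.y tags pass; a major bump, a v prefix, or a pre-release suffix is filtered out.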
Production Environment: No Automated Deployments
Production apps have no Image Updater annotations. Versions are hardcoded in the Application YAML and only changed via the production push PR. This is intentional. You almost never want automated deploys to prod.
Things to Double-Check If Something Breaks
- If Image Updater gets permission denied errors on ECR, restart the pod first, then check Pod Identity config and IAM roles
- If ArgoCD can't clone the GitOps repo, check that the deploy key URL matches exactly between the Kubernetes secret and the Application resource
- Image Updater uses a 2-minute poll cycle, ArgoCD sync is roughly 3 minutes, so after merging a PR expect up to 5-6 minutes before you see the change in the cluster
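- write-back-method: git is mandatory with app of apps. The default write-back method (argocd) changes the Application resource directly through the ArgoCD API instead of committing to Git, which conflicts with the parent app managing that resource from the repo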
Rough Timeline When Things Work
| Event | Wait time |
|---|---|
| Merge freeze PR | ~3 min for ArgoCD to apply |
| Push new ECR image | ~2 min for Image Updater to detect |
| Merge prod push PR | ~3-4 min for ArgoCD to roll out |
| Unfreeze + new images in ECR | ~5-6 min total end-to-end |
Conclusion
This setup covers the full lifecycle: installing ArgoCD on EKS with Terraform, configuring Image Updater for continuous delivery to dev, and automating the freeze/push/unfreeze workflow for production releases. The Python script is the piece I'll probably need to adapt the most depending on whether we're using GitHub or something else like GitLab or Bitbucket. The core logic stays the same, just swap out the API calls.