July 29, 2025

How to Set Up ArgoCD for Production in AWS

A step-by-step guide to implementing GitOps with ArgoCD on AWS EKS, from installing ArgoCD with Terraform to managing multiple Kubernetes environments with automated freeze/unfreeze workflows.

Prerequisites

Before starting, make sure you have the following ready:

An AWS EKS cluster up and running (provisioned via Terraform or the AWS console)
kubectl installed and configured to connect to your EKS cluster
Helm 3 installed
Terraform >= 1.0.0 installed
A GitHub repository to use as your GitOps repo
Basic familiarity with Kubernetes resources and YAML manifests

Goals

By the end of this guide, you will:

Install ArgoCD on EKS using Terraform and Helm
Set up the App of Apps pattern to manage multiple environments
Configure ArgoCD Image Updater to watch ECR for new image tags
Automate environment freezing and production pushes with a Python script
Understand the full weekly release workflow from dev to production

Why ArgoCD?

ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. The core idea is simple: your Git repository is the single source of truth for what should be running in your cluster. ArgoCD watches the repo and keeps the cluster in sync.

This gives you a few important things:

Automated deployment: ArgoCD continuously reconciles what's in Git with what's in the cluster, so changes get applied automatically when you merge a PR
Rollbacks: every change is a Git commit, so rolling back is just reverting a commit
Audit trail: Git history becomes your deployment history
Monitoring: ArgoCD's UI shows sync status, health checks, and drift detection out of the box

Installing ArgoCD on EKS with Terraform

Instead of running helm install manually, let's declare the ArgoCD installation in Terraform. This way the installation itself is version-controlled and reproducible.

First, set up the Helm provider. This tells Terraform how to talk to your cluster:

provider.tf

provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"
  }
}

terraform {
  required_version = ">= 1.0.0"

  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "= 2.5.1"
    }
  }
}

Note: If you're running Terraform in a CI pipeline, you'll want to use host, cluster_ca_certificate, and token instead of config_path to avoid depending on a local kubeconfig file.

Now define the Helm release resource for ArgoCD:

argocd.tf

resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  namespace        = "argocd"
  create_namespace = true
  version          = "3.35.4"
  values           = [file("values/argocd.yaml")]
}

The values parameter points to a custom values file where you configure ArgoCD's behavior. Here's a minimal starting point:

values/argocd.yaml

global:
  image:
    tag: "v2.6.6"

dex:
  enabled: false

server:
  extraArgs:
    - --insecure

A few notes on these values:

Dex is disabled because in this setup we're not using SSO. If you need OIDC or LDAP authentication, you'll want to enable it and configure your provider.
The --insecure flag disables TLS on the ArgoCD server itself. This is fine if you're terminating TLS at your ingress or load balancer, which is the typical setup on EKS with an ALB. Don't use this if the server is directly exposed.

Run terraform init followed by terraform apply, and ArgoCD will be installed in the argocd namespace.

Accessing the ArgoCD Web UI

To access the ArgoCD dashboard, port-forward the server service:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Then open http://localhost:8080 in your browser. Because the server runs with --insecure, it serves plain HTTP, so an https:// URL won't load here.

The default username is admin. To get the initial password:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Tip: Change this password immediately after first login, or better yet, configure SSO and disable the admin account entirely for production use.

The 5-Environment Setup

Many teams run something like this:

Sandbox: isolated and purely for experimentation; nothing here can break anything real
Development: devs integrate their code with other microservices here
QA/Testing: dedicated testers work here, separately from devs; this split in responsibility is intentional
Staging/Pre-prod: should mirror production as closely as possible, data size included
Production: real clients and least-privilege access; treat it carefully

Not everyone needs all five. If you're trying to cut costs, you can collapse staging into dev with a freeze/unfreeze approach (more on this below).

The Weekly Release Trick (Freeze/Unfreeze Cycle)

If you deploy to production every Wednesday, for example, here's a workflow that works well:

Devs deploy freely to the dev environment up until Tuesday
Tuesday: freeze the dev environment (stop all new version deployments)
QA tests everything, automated and manual
Take a snapshot of the tested versions
Wednesday: production push via PR
After a successful push: unfreeze dev so devs can keep going

The freeze works by adding an Image Updater ignore annotation to each ArgoCD Application resource. Doing this by hand for 100 microservices is a nightmare, so you automate it with a script.

What the GitOps Script Does

There are 3 main actions:

Pause (freeze)

Creates a new branch in the remote GitOps repo
Adds ignore annotations to all Application resources in the target environment
Opens a PR; the freeze takes effect only once the team reviews and merges it

Push (production push)

Reads the source environment (dev) and collects all currently deployed image tags
Updates those versions in the production environment files
Opens a PR for the production push

Resume (unfreeze)

Removes the ignore annotations
Opens a PR to return to continuous delivery

All actions go through PRs, never direct commits to main. That's the point of GitOps.

Infrastructure Stack Used

EKS on AWS (Terraform to provision)
ArgoCD for continuous delivery (installed via Terraform + Helm)
ArgoCD Image Updater to watch ECR for new image tags
ECR as private container registry
EKS Pod Identity for IAM auth (preferred over IRSA with OIDC)
GitHub as the GitOps repo (deploy key with read/write)
Python script to automate freeze/push/unfreeze

App of Apps Pattern

Instead of applying each ArgoCD Application resource manually, you create one parent Application that points to an entire environment folder. Because each service lives in its own subfolder, enable directory recursion so ArgoCD applies every manifest under that folder.

# Parent app points to the whole env folder
spec:
  source:
    path: envs/dev  # All app manifests live here
    directory:
      recurse: true  # pick up envs/dev/<service>/application.yaml
  syncPolicy:
    automated:
      selfHeal: true
      prune: true

This is how you bootstrap the dev and prod environments with one kubectl apply each (one parent app per environment). The parent app watches its folder, and any new Application YAML you add there gets picked up automatically.

Development Environment: Continuous Delivery Annotations

For the dev environment, Image Updater annotations on each Application resource tell it what to watch:

annotations:
  argocd-image-updater.argoproj.io/image-list: payments=<account>.dkr.ecr.<region>.amazonaws.com/payments
  argocd-image-updater.argoproj.io/payments.update-strategy: semver
  argocd-image-updater.argoproj.io/write-back-method: git

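The write-back-method: git is important when using app of apps. With the default argocd write-back method, Image Updater patches the Application resource directly in the cluster, and the parent app (with selfHeal enabled) reverts that change to what's in Git. With git write-back, Image Updater commits new image tags back to the repo, and ArgoCD then picks up the change and syncs.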

Note: The default image check interval is 2 minutes. Don't lower it: too many ECR API calls can get you throttled.

Freezing with the Ignore Annotation

To pause a service, add an ignore-tags annotation, which Image Updater respects:

annotations:
  argocd-image-updater.argoproj.io/payments.ignore-tags: "*"

The * means ignore every tag. The script adds this annotation to all apps in the target environment in one go.
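To build these keys for every app, the script needs each app's image aliases, which it can read from the image-list annotation. A minimal sketch, assuming the Application YAML has already been parsed into a Python dict (for example with PyYAML) and that image-list uses the alias=repository form shown earlier; the helper names are hypothetical, not the author's code. It writes the key Image Updater documents for this, <alias>.ignore-tags:

```python
PREFIX = "argocd-image-updater.argoproj.io/"


def image_aliases(annotations: dict) -> list[str]:
    """Return aliases from an image-list value like 'payments=repo, users=repo2:0.x'."""
    raw = annotations.get(PREFIX + "image-list", "")
    aliases = []
    for entry in raw.split(","):
        entry = entry.strip()
        # Per-image annotations need an alias, so entries without "alias=" are skipped.
        if "=" in entry:
            aliases.append(entry.split("=", 1)[0].strip())
    return aliases


def freeze(app: dict) -> dict:
    """Add '<alias>.ignore-tags: "*"' for every tracked image (mutates and returns app)."""
    annotations = app.setdefault("metadata", {}).setdefault("annotations", {})
    for alias in image_aliases(annotations):
        annotations[f"{PREFIX}{alias}.ignore-tags"] = "*"
    return app
```

Running freeze over every parsed application.yaml in the target environment, then committing the results to a new branch, gives you the pause PR.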

IAM / ECR Auth for Image Updater

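I needed to set up a few things so Image Updater could read tags from a private ECR registry:

Create an IAM role with a read-only ECR policy (for example, the AWS managed AmazonEC2ContainerRegistryReadOnly policy)
Associate it with Image Updater's Kubernetes service account through an EKS Pod Identity association (this requires the EKS Pod Identity Agent add-on on the cluster)
Mount a small shell script inside the Image Updater pod that fetches a temporary token: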

#!/bin/sh
aws ecr get-authorization-token --region <region> --output text --query 'authorizationData[].authorizationToken' | base64 -d

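Image Updater runs this script again whenever its cached credentials expire, which you control with the credsexpire setting in its registry configuration (ECR tokens themselves are valid for 12 hours). Simple, but it works.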
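The plain base64 -d step is enough because the authorizationToken ECR returns is the base64 encoding of AWS:<password>, which is already the username:password line Image Updater expects from an ext: credentials script. The same decoding in Python, purely as an illustration (the helper name is hypothetical and the token is a fabricated sample, not a real ECR credential):

```python
import base64


def decode_ecr_token(token: str) -> tuple[str, str]:
    """Split a base64-encoded ECR authorization token into (username, password)."""
    decoded = base64.b64decode(token).decode("utf-8")
    # ECR tokens decode to "AWS:<password>"; split only on the first colon.
    username, password = decoded.split(":", 1)
    return username, password


# Fabricated sample token: the base64 encoding of "AWS:example-password".
sample = base64.b64encode(b"AWS:example-password").decode("ascii")
print(decode_ecr_token(sample))  # ('AWS', 'example-password')
```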

Folder Structure in the GitOps Repo

envs/
  dev/
    payments/
      application.yaml
    users/
      application.yaml
  prod/
    payments/
      application.yaml
    users/
      application.yaml
helm-charts/
  payments/
  users/

The Python script iterates over the subfolders inside each environment. The folder names don't matter: as long as there's one folder per service, the script picks them all up.

Image Updater writes new image tags under the helm-charts/ folder (not directly in the env folder), using a file like argocd-source.yaml per chart.
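To show how little the per-environment discovery step needs, here is a minimal sketch that lists service folders. The author's script does this remotely through PyGithub; this version assumes a local checkout of the GitOps repo instead, and the helper name is hypothetical:

```python
from pathlib import Path


def service_folders(env_dir: Path) -> list[str]:
    """Return the names of service folders under env_dir that hold an application.yaml."""
    return sorted(
        child.name
        for child in env_dir.iterdir()
        # Only folders with an application.yaml count as services.
        if child.is_dir() and (child / "application.yaml").is_file()
    )
```

With the layout above, service_folders(Path("envs/dev")) returns ["payments", "users"]. Requiring an application.yaml means a stray folder without one is skipped instead of breaking the run.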

The Python Script: Key Parts

The script authenticates with GitHub via a fine-grained personal access token (permissions needed: Contents and Pull requests, both read and write). It uses the PyGithub package to read and commit remote files without cloning the repo locally.

Pause function: adds the ignore annotation to each app's YAML, commits to a new branch, opens a PR.

Resume function: removes the annotation using dict.pop(), commits, opens a PR.
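A minimal sketch of the resume step, assuming the Application YAML is parsed into a dict (for example with PyYAML) and that the freeze wrote Image Updater's <alias>.ignore-tags key with the value "*". The helper name is hypothetical. Collecting the keys before popping avoids mutating the dict while iterating over it, and matching on the "*" value leaves any deliberate ignore rules (such as ignoring latest) in place:

```python
PREFIX = "argocd-image-updater.argoproj.io/"


def unfreeze(app: dict) -> list[str]:
    """Pop every '<alias>.ignore-tags: "*"' annotation; return the keys removed."""
    annotations = app.get("metadata", {}).get("annotations") or {}
    removed = [
        key
        for key, value in annotations.items()
        if key.startswith(PREFIX) and key.endswith(".ignore-tags") and value == "*"
    ]
    for key in removed:
        annotations.pop(key)  # dict.pop(), as the resume function does
    return removed
```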

Collect versions: reads argocd-source.yaml from each Helm chart folder to grab the currently deployed image tag per service. Returns a dict like {"payments": "2.5.0", "users": "0.5.0"}.
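A sketch of the collection step. It assumes each chart's override file has already been fetched and parsed into a dict keyed by service, that it uses the helm.parameters layout Image Updater writes for Helm apps, and that the tag sits in a parameter named image.tag (Image Updater's default parameter name). The helper name and sample values are illustrative:

```python
def collect_versions(sources: dict[str, dict], tag_param: str = "image.tag") -> dict[str, str]:
    """Map service name -> deployed tag, from parsed override files keyed by service."""
    versions = {}
    for service, source in sources.items():
        for param in source.get("helm", {}).get("parameters", []):
            if param.get("name") == tag_param:
                versions[service] = str(param["value"])
                break
        # Services without a tag parameter are simply left out of the result.
    return versions


sources = {
    "payments": {"helm": {"parameters": [
        {"name": "image.name", "value": "<account>.dkr.ecr.<region>.amazonaws.com/payments"},
        {"name": "image.tag", "value": "2.5.0"},
    ]}},
    "users": {"helm": {"parameters": [{"name": "image.tag", "value": "0.5.0"}]}},
}
print(collect_versions(sources))  # {'payments': '2.5.0', 'users': '0.5.0'}
```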

Production push: iterates over all prod app YAMLs, replaces the image tag in Helm params with the version pulled from the frozen dev environment, commits, opens a PR.
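The production push can then be a small loop over the prod apps. A sketch assuming the prod Application YAMLs are parsed into dicts keyed by service and that each Helm source carries its tag in a parameter named image.tag; the helper name is hypothetical. It skips services with no collected version and returns the services it actually changed, which is handy for the PR description:

```python
def apply_versions(prod_apps: dict[str, dict], versions: dict[str, str],
                   tag_param: str = "image.tag") -> list[str]:
    """Set the tag parameter in each prod Application to the version frozen in dev."""
    changed = []
    for service, app in prod_apps.items():
        new_tag = versions.get(service)
        if new_tag is None:
            continue  # nothing collected for this service; keep the current prod version
        params = app["spec"]["source"].setdefault("helm", {}).setdefault("parameters", [])
        for param in params:
            if param.get("name") == tag_param:
                if param.get("value") != new_tag:
                    param["value"] = new_tag
                    changed.append(service)
                break
        else:
            # No tag parameter yet: add one, passed to Helm as a string.
            params.append({"name": tag_param, "value": new_tag, "forceString": True})
            changed.append(service)
    return changed
```

The for/else adds the parameter when a prod app doesn't define one yet; forceString tells ArgoCD to pass the value to Helm as a string rather than letting Helm infer its type.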

Branch naming convention used:

pause-dev-2026-01-12
resume-dev-2026-01-12
prod-push-2026-01-12
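Generating these names in one place keeps the three actions consistent. A minimal sketch, with a hypothetical helper name, that reproduces the convention above:

```python
from datetime import date


def branch_name(action: str, env: str, day: date) -> str:
    """Build a branch name following the pause-/resume-/prod-push- convention."""
    stamp = day.isoformat()  # YYYY-MM-DD
    if action == "push":
        return f"prod-push-{stamp}"  # the push always targets prod
    if action in ("pause", "resume"):
        return f"{action}-{env}-{stamp}"
    raise ValueError(f"unknown action: {action!r}")


print(branch_name("pause", "dev", date(2026, 1, 12)))  # pause-dev-2026-01-12
```

Because the name carries only the date, running the same action twice on one day yields the same branch name, so the script should check whether the branch already exists before creating it.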

Image Constraints Per Service

You can also limit which tags Image Updater deploys automatically for each service. For example, the users service only auto-deploys minor and patch updates within major version 0 (never a new major version), since it has API consumers:

argocd-image-updater.argoproj.io/users.allow-tags: regexp:^0\.[0-9]+\.[0-9]+$

The payments service in this example has no constraint and takes any new tag.
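To sanity-check a filter like the one on users before committing it, you can run the pattern (without the regexp: prefix) against a few tags. Image Updater evaluates it with Go's regexp package; for a simple anchored pattern like this one, Python's re behaves the same. The tag list is made up:

```python
import re

# Same pattern as the users.allow-tags annotation, minus the "regexp:" prefix.
ALLOW_USERS = re.compile(r"^0\.[0-9]+\.[0-9]+$")

candidate_tags = ["0.5.0", "0.6.1", "0.10.3", "1.0.0", "0.6.1-rc1", "latest"]

allowed = [tag for tag in candidate_tags if ALLOW_USERS.search(tag)]
print(allowed)  # ['0.5.0', '0.6.1', '0.10.3']
```

The filter pins major version 0, so when users ships 1.0.0, someone has to update the pattern deliberately, which is exactly the point.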

Production Environment: No Automated Deployments

Production apps have no Image Updater annotations. Versions are hardcoded in the Application YAML and only changed via the production push PR. This is intentional. You almost never want automated deploys to prod.

Things to Double-Check If Something Breaks

If Image Updater gets permission denied errors on ECR, restart the pod first, then check Pod Identity config and IAM roles
If ArgoCD can't clone the GitOps repo, check that the repository URL in the Kubernetes repository secret matches the repoURL in the Application resource exactly (for example, both using the same SSH form)
Image Updater polls every 2 minutes and ArgoCD syncs roughly every 3 minutes, so after merging a PR, expect up to 5-6 minutes before you see the change in the cluster
write-back-method: git is mandatory with app of apps. The default argocd write-back method patches the Application resource in the cluster, and the parent app reverts that change on its next sync

Rough Timeline When Things Work

Event	Wait time
Merge freeze PR	~3 min for ArgoCD to apply
Push new ECR image	~2 min for Image Updater to detect
Merge prod push PR	~3-4 min for ArgoCD to roll out
Unfreeze + new images in ECR	~5-6 min total end-to-end

Conclusion

This setup covers the full lifecycle: installing ArgoCD on EKS with Terraform, configuring Image Updater for continuous delivery to dev, and automating the freeze/push/unfreeze workflow for production releases. The Python script is the piece I'll probably need to adapt the most, depending on whether the GitOps repo lives on GitHub or elsewhere, such as GitLab or Bitbucket. The core logic stays the same; just swap out the API calls.