Technical Debt is a Choice: How I Finally Tore Down Our Legacy Pipeline
As the solo senior developer managing 15+ servers, I replaced a brittle GitLab-CLI setup with a modern, health-checked GitHub Actions architecture. Here is the 112-hour journey from 'black box' fear to zero-downtime confidence.
112 hours.
That's how long it took me to understand the infrastructure of our DevOps systems, untangle the issues, and then complete a server migration.
For over a year I was managing 15+ servers as a solo senior developer and it was a game of survival. When you're responsible for both infrastructure and features, you quickly learn which parts of your stack are "set and forget" and which are "ticking time bombs."
Why Migrate? The Business Case
This wasn't a vanity project; it was simple calculus:
Cost: The old server was an aging Hetzner VPS with expensive renewal rates and occasional systemic failures. The new hosting provider offered better specs for less money. More importantly, the old infrastructure required constant babysitting: every update caused downtime, and every issue required late-night debugging.
Low Manpower: I'm a single person. I can't be on-call 24/7 for a brittle system. The goal was simple: build a pipeline that either works or fails visibly, so I can sleep at night.
Technical Debt Accumulation: The stack was running deprecated Node 20, aging services with known CVEs, and a Podman setup that required manual IP management. Each month, the "easy" choice was to ignore it. The hard choice was realizing that ignoring technical debt is still a choice; it just has a compounding interest rate.
For a long time, our infrastructure was a "black box" to me. I inherited scripts I didn't write and pipelines that caused "hard downtime" during every update. When we hit the GitLab upgrade crisis, certificate issues, or security patches for services like Mattermost and Taiga, I often found myself staring at logs, asking: "What do I do now?"
I spent too many nights troubleshooting why code worked locally but crashed in production. But this week, after a 112-hour migration, that fear is gone. I stopped using AI to "patch the holes" and started using it as a librarian to teach myself the architecture of the systems I manage.
Today, our stack is on GitHub Actions and a fresh VPS. Here is how I reclaimed my sanity and redefined what "Full-Stack" means for myself.
- The GHCR Permission Trap
- Zero-Downtime & Health Checks
- Rewriting History
- Architectural Spring Cleaning
- The Podman to Docker Exodus
- The 'Two-Hop' Smuggler
- Trusting Docker DNS
- Taming the Registry Bloat
The GHCR Permission Trap
The Problem: The server kept hitting 403 Forbidden errors when pulling private images from the GitHub Container Registry. I initially tried using a Personal Access Token (PAT) to let the server pull private Docker images, but it threw errors: for Organization-scoped private packages, a personal PAT doesn't automatically grant access.
The Fix: I ditched long-lived PATs entirely and leveraged the natively scoped GITHUB_TOKEN. By explicitly adding packages: read and packages: write permissions to the GitHub Actions job, the deployment manages its own security without manual secret rotation.
```yaml
permissions:
  contents: read
  packages: read
```

This is significantly more secure than hardcoded PATs, as GitHub handles the token lifecycle automatically.
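For reference, here is a minimal sketch of how that wiring can look in the deploy job, assuming the widely used docker/login-action. The job and step names are illustrative, not taken from our actual workflow:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: read
    steps:
      # Authenticate to GHCR with the job-scoped GITHUB_TOKEN (no PAT involved)
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
```

The token expires when the job finishes, so there is nothing to rotate and nothing long-lived to leak.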
Zero-Downtime and the "Health Check" Loop
My old scripts ran a destructive workflow:
```shell
docker stop pcb-frontend-dev
docker rm pcb-frontend-dev
docker pull ghcr.io/.../pcb-frontend-dev:latest
docker run ...
```

If the registry was slow, the site stayed down. That was "hard downtime": users saw a broken page.
The "Pull-Before-Push" Pattern
I refactored this to a "Pull-Before-Push" flow:
```shell
# Pull the new image FIRST
docker compose pull

# Then bring up the new container (graceful swap)
docker compose up -d --remove-orphans
```

Docker Compose is smart enough to gracefully swap containers with millisecond downtime. No more manual docker stop before pulling.
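Compose can also track health at the container level. A healthcheck block like the sketch below (assuming the app's /health endpoint on port 4000, and that curl exists in the image) lets docker ps and dependent services see whether the app is actually up, not just started:

```yaml
services:
  pcb-frontend-dev:
    image: ghcr.io/.../pcb-frontend-dev:latest
    healthcheck:
      # Docker marks the container healthy only once /health responds
      test: ["CMD", "curl", "-sf", "http://localhost:4000/health"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
```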
The Health Check Loop
But I went one step further. A deployment isn't "done" when the container starts; it's done when the app is healthy. I added a polling loop that pings the /health endpoint for 30 seconds after deployment:
```shell
for i in {1..30}; do
  if curl -sf http://localhost:4000/health > /dev/null 2>&1; then
    echo "Health check passed"
    exit 0
  fi
  echo "Waiting for health... attempt $i/30"
  sleep 1
done
echo "Health check failed!"
docker logs pcb-frontend-dev  # Point me directly to the logs
exit 1
```

If it doesn't get a 200 OK, the build fails and alerts me immediately. No more "silent" production crashes where the container starts but the app is broken.
Rewriting History
During the migration, while reviewing the code, I made a disturbing find: another developer's .env.local file had slipped past .gitignore, and past our AI code reviewer, into the Git history a few commits back. Secrets that should never have left our local machines were sitting in the repository.
As the lead developer, I am responsible for overseeing what gets committed, even when I trust an AI code reviewer to catch everything. The responsibility ultimately lands on me. This is exactly the kind of real-life situation where an LLM can't be held responsible for human actions.
The Fix: I rewrote the repository history to purge those secrets entirely, then force-pushed the cleaned branches before making the GitHub move final. A force-push alone doesn't erase old commits; the history has to be rewritten first (tools like git filter-repo exist for exactly this).
I learned that moving to a new CI/CD platform is also a "Security Audit" opportunity. Cleaning up Git history ensures that our new "clean start" doesn't carry old security vulnerabilities with it. The migration forced me to confront what I'd been ignoring: critical security exposures.
Architectural Spring Cleaning
Migration is the best time for "Infrastructure Minimalism." The old server was cluttered with services that didn't belong in the main stack:
- A legacy Handbook (Astro site) that hadn't been updated in months and was not part of the tools our company provided
- Old Proxy Server data that was never cleaned up, originally built up for testing
Instead of moving the mess, I decoupled them. I deleted the orphans and cleaned the Caddyfile.
Bumping the Runtime
We also took the opportunity to bump our runtime to Node 24 (from Node 20, which was deprecated). Technical debt isn't just about code—it's about the runtime. Never migrate a legacy app onto a legacy runtime. Moving to the latest LTS resets your "maintenance clock" for years to come.
From System Services to Containers
In the old setup, both Nginx and Caddy ran as system services outside of Podman:
- Nginx served static files directly on the host
- Caddy handled TLS termination and reverse proxying at the system level
This meant:
- System updates would sometimes break the web server
- Configuration required `sudo` access
- No simple way to track versions or roll back
- Restarting one service risked affecting others
In the new setup, both live inside the Docker ecosystem:
```yaml
services:
  caddy:
    image: caddy:2.7-alpine
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
    networks:
      - pcb-network
    ports:
      - "80:80"
      - "443:443"
```

Now:
- Caddy and Nginx are each just another container
- Zero system-level configuration
- Versions are pinned in the image tag
- Restarting is `docker compose restart caddy`, no sudo needed
- Network is isolated but reachable by all services
The entire web stack is now portable. I can move it to any server by copying the compose.yml file.
Securing MinIO
On the old server, MinIO's ports and its dashboard were exposed publicly. Users' files lived in those buckets; this was a breach waiting to happen.
The Fix: MinIO ports are no longer exposed to the world. Only the internal Docker network can reach it. If an admin needs to access the MinIO console for any reason, we use an SSH tunnel:
```shell
# Open tunnel, access console locally
ssh -L 9001:localhost:9001 deploy@vps

# Then open in browser
# http://localhost:9001
```

No more public ports for sensitive storage. The principle is simple: if it's not needed publicly, don't expose it publicly.
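In compose terms, locking MinIO down is mostly about what you don't write. A sketch of the idea (service name, ports, and volume are illustrative, not our exact config):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    networks:
      - pcb-network
    # The S3 API (9000) has no ports mapping at all: only containers on
    # pcb-network can reach it. The console (9001) binds to loopback only,
    # so it is reachable solely through the SSH tunnel shown above.
    ports:
      - "127.0.0.1:9001:9001"
    volumes:
      - minio_data:/data
```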
The Podman to Docker Exodus
The old server was running Podman in rootless mode, and it was a constant source of pain. Here's what was breaking and how Docker solved it:
Problem 1: DNS Resolution Failures
Podman's rootless DNS struggled to resolve container names. I'd wake up to find services that worked yesterday failing today with cryptic connection errors.
How Docker fixed it: Standard Docker's bridge network has reliable native DNS. Containers resolve each other by name (postgres, redis, minio) without any manual intervention.
Problem 2: Hardcoded IPs Everywhere
Because Podman DNS was unreliable, the codebase was littered with hardcoded IP overrides:
```shell
# I was doing this monstrosity to inject a proxy IP:
sed -i 's/MINIO_ENDPOINT/minio.dannie.cc/g' .env
```

How Docker fixed it: Network aliases work reliably. I can now reference minio.dannie.cc in configs and trust it resolves correctly.
Problem 3: Inconsistent IP Assignments
Container IPs in Podman would shift after restarts. Our extra_hosts configuration needed constant maintenance:
```yaml
# This was our "solution" to Podman's chaos:
extra_hosts:
  - "minio.dannie.cc:x.x.x.2"
  - "postgres.dannie.cc:x.x.x.8"
  - "redis.dannie.cc:x.x.x.4"
```

How Docker fixed it: With Docker's native DNS, I don't need extra_hosts at all. Containers resolve by name, and the network is stable across restarts.
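Concretely, "resolve by name" means a service can point at postgres or redis directly in its connection settings. A hypothetical sketch (the image, credentials, and service names are placeholders, not our real config):

```yaml
services:
  app:
    image: ghcr.io/.../pcb-frontend-dev:latest
    environment:
      # "postgres" and "redis" are resolved by Docker's embedded DNS;
      # no IPs, no extra_hosts
      DATABASE_URL: "postgres://user:pass@postgres:5432/app"
      REDIS_URL: "redis://redis:6379"
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:16
  redis:
    image: redis:7
```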
Problem 4: Timeouts and Hangs
Under load, Podman containers would hang or time out on inter-container communication. The fix was always "restart the container": a band-aid, not a solution.
How Docker fixed it: Docker's networking stack is more mature. The bridge network handles traffic reliably, even under load.
The Lesson
Podman is a great concept (rootless containers, no daemon), but in my experience it's not ready for production workloads that need reliability. When your infrastructure depends on containers being reachable, Docker's battle-tested networking is worth the daemon overhead. Looking back, I would have questioned the choice of Podman in the first place if I'd had the grounding and knowledge back then.
The "Two-Hop" Data Smuggler
I needed to move ~500 files of production data from MinIO buckets. I tried a direct server-to-server mc mirror, but the old server sat behind Cloudflare, which threw a 502 Bad Gateway because it blocked traffic coming from the new Hostinger VPS datacenter IP.
I had to become the bridge:
- Hop 1: `rsync` from the old server to my local machine (my IP is trusted by Cloudflare)
- Hop 2: `rsync` from my machine to the new Hostinger VPS
```shell
# Step 1: Pull to local (trusted IP)
rsync -avz --progress deploy@old-server:/srv/minio/data/bucket/ ./temp-bucket/

# Step 2: Push to new server
rsync -avz --progress ./temp-bucket/ deploy@new-hostinger:/srv/minio/data/bucket/
```

It's a reminder that in DevOps, sometimes the most pragmatic solution is a manual one. Data migrations rarely go exactly as planned; network boundaries, CDNs, and WAFs will actively block server-to-server transfers. Always be prepared to use your local machine as a secure bridge.
Trusting Docker DNS
The Old Way: Hardcoded IPs
My legacy setup relied on brittle configurations:
```yaml
extra_hosts:
  - "minio.dannie.cc:x.x.x.x"
```

I was injecting a proxy IP via a sed command because Podman's rootless DNS struggled to resolve container names. Hardcoded IPs and extra_hosts are code smells in containerized environments.
The New Way: Docker Bridge Network
By moving to standard Docker and putting all containers on a shared pcb-network bridge, I was able to utilize Docker's native DNS:
```yaml
networks:
  pcb-network:
    driver: bridge
    name: pcb-network
```

I gave my Caddy reverse proxy network aliases (e.g., minio.dannie.cc) so backend containers could route to each other using standard URLs, entirely bypassing hairpin NAT issues.
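That alias trick looks roughly like this in compose: attach the alias to Caddy on the shared network, so backend containers requesting minio.dannie.cc land on the reverse proxy. A sketch under those assumptions:

```yaml
services:
  caddy:
    image: caddy:2.7-alpine
    networks:
      pcb-network:
        # Containers on pcb-network that ask for minio.dannie.cc resolve
        # to Caddy, which terminates TLS and proxies to the minio container
        aliases:
          - minio.dannie.cc

networks:
  pcb-network:
    driver: bridge
    name: pcb-network
```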
The lesson: Trust the Docker bridge network DNS. It makes your infrastructure instantly portable and resilient to container restarts. I can move this entire stack to any server in minutes now.
Taming the Registry Bloat
Every build gets tagged with a unique Git SHA (dev-<sha>), which means my GHCR storage was going to balloon rapidly. Without intervention, I'd risk hitting GitHub's storage quotas or getting surprise billing.
The Fix: Automated Garbage Collection
I implemented actions/delete-package-versions@v5 at the end of the workflow to automatically prune old dev-tagged images, keeping only the 5 most recent:
```yaml
- name: Delete old package versions
  uses: actions/delete-package-versions@v5
  with:
    package-type: container
    package-name: pcb-frontend-dev
    # Keep the 5 most recent versions; ignore anything not tagged dev-*
    min-versions-to-keep: 5
    ignore-versions: "^(?!dev-).*$"
```

The lesson: Always implement garbage collection in your CI/CD pipelines. Automating registry cleanup saves you from "Quota Exceeded" failed builds down the line.
Conclusion: Scale Requires Clean Foundations
Being a solo senior dev means your time is the company's most valuable resource. Spending that time fixing brittle scripts is a waste. By investing 112 hours into this migration, I've gained back countless hours of future productivity.
The pipeline now:
- Self-heals with health checks that fail fast
- Secures itself with native GitHub tokens (no manual rotation)
- Scales cleanly with Docker DNS and registry garbage collection
- Migrates responsibly with the "two-hop" pattern for data
- Runs reliably on Docker's mature networking stack (goodbye Podman DNS headaches)
This one GitHub Actions workflow now serves as the automated blueprint for the 10 servers. That 112-hour investment doesn't just fix one server; it scales across the entire company.
What's the one piece of "legacy magic" in your stack that you're currently too afraid to touch? Let's talk about the cost of fixing it vs. the cost of keeping it.