Technical Debt is a Choice: How I Finally Tore Down Our Legacy Pipeline
As the solo senior developer managing 15+ servers, I replaced a brittle GitLab-CLI setup with a modern, health-checked GitHub Actions architecture. Here is the 112-hour journey from 'black box' fear to zero-downtime confidence.
112 hours.
That's how long it took me to understand the infrastructure of our DevOps systems, untangle the issues, and then complete a server migration.
For over a year I was managing 15+ servers as a solo senior developer and it was a game of survival. When you're responsible for both infrastructure and features, you quickly learn which parts of your stack are "set and forget" and which are "ticking time bombs."
Why Migrate? The Business Case
This wasn't a vanity project; it was simple calculus:
Cost: The old server was an aging Hetzner VPS with expensive renewal rates and occasional systemic failures. The new hosting provider offered better specs for less money. More importantly, the old infrastructure required constant babysitting: every update caused downtime, and every issue required late-night debugging.
Low Manpower: I'm a single person. I can't be on-call 24/7 for a brittle system. The goal was simple: build a pipeline that either works or fails visibly, so I can sleep at night.
Technical Debt Accumulation: The stack was running deprecated Node 20, aging services with known CVEs, and a Podman setup that required manual IP management. Each month, the "easy" choice was to ignore it. The hard choice was realizing that ignoring technical debt is still a choice; it just has a compounding interest rate.
For a long time, our infrastructure was a "black box" to me. I inherited scripts I didn't write and pipelines that caused "hard downtime" during every update. When we hit the GitLab upgrade crisis, certificate issues, or security patches for services like Mattermost and Taiga, I often found myself staring at logs, asking: "What do I do now?"
I spent too many nights troubleshooting why code worked locally but crashed in production. But this week, after a 112-hour migration, that fear is gone. I stopped using AI to "patch the holes" and started using it as a librarian to teach myself the architecture of the systems I manage.
Today, our stack is on GitHub Actions and a fresh VPS. Here is how I reclaimed my sanity and redefined what "Full-Stack" means for myself.
- The GHCR Permission Trap
- Zero-Downtime & Health Checks
- Rewriting History
- Architectural Spring Cleaning
- The Podman to Docker Exodus
- The 'Two-Hop' Smuggler
- Trusting Docker DNS
- Taming the Registry Bloat
The GHCR Permission Trap
The Problem: The server kept hitting 403 Forbidden errors when pulling private images from the GitHub Container Registry. I initially tried using a Personal Access Token (PAT) to let the server pull private Docker images, but it threw errors: for Organization-scoped private packages, a personal PAT doesn't automatically grant access.
The Fix: I ditched long-lived PATs entirely and leveraged the natively scoped GITHUB_TOKEN. By explicitly adding packages: read and packages: write permissions to the GitHub Actions job, the deployment manages its own security without manual secret rotation.
```yaml
permissions:
  contents: read
  packages: read
```

This is significantly more secure than hardcoded PATs, as GitHub handles the token lifecycle automatically.
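For reference, here is a minimal sketch of how that wiring can look in the deploy job, assuming the widely used docker/login-action. The job and step names are illustrative, not taken from our actual workflow:

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: read
    steps:
      # Authenticate to GHCR with the job-scoped GITHUB_TOKEN (no PAT involved)
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
```

The token expires when the job finishes, so there is nothing to rotate and nothing long-lived to leak.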
Zero-Downtime and the "Health Check" Loop
My old scripts ran a destructive workflow:
```shell
docker stop pcb-frontend-dev
docker rm pcb-frontend-dev
docker pull ghcr.io/.../pcb-frontend-dev:latest
docker run ...
```

If the registry was slow, the site stayed down. That was "hard downtime": users saw a broken page.
The "Pull-Before-Push" Pattern
I refactored this to a "Pull-Before-Push" flow:
```shell
# Pull the new image FIRST
docker compose pull

# Then bring up the new container (graceful swap)
docker compose up -d --remove-orphans
```

Docker Compose is smart enough to gracefully swap containers with millisecond downtime. No more manual docker stop before pulling.
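Compose can also track health at the container level. A healthcheck block like the sketch below (assuming the app's /health endpoint on port 4000, and that curl exists in the image) lets docker ps and dependent services see whether the app is actually up, not just started:

```yaml
services:
  pcb-frontend-dev:
    image: ghcr.io/.../pcb-frontend-dev:latest
    healthcheck:
      # Docker marks the container healthy only once /health responds
      test: ["CMD", "curl", "-sf", "http://localhost:4000/health"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
```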
The Health Check Loop
But I went one step further. A deployment isn't "done" when the container starts; it's done when the app is healthy. I added a polling loop that pings the /health endpoint for 30 seconds after deployment:
```shell
for i in {1..30}; do
  if curl -sf http://localhost:4000/health > /dev/null 2>&1; then
    echo "Health check passed"
    exit 0
  fi
  echo "Waiting for health... attempt $i/30"
  sleep 1
done
echo "Health check failed!"
docker logs pcb-frontend-dev  # Point me directly to the logs
exit 1
```

If it doesn't get a 200 OK, the build fails and alerts me immediately. No more "silent" production crashes where the container starts but the app is broken.
Rewriting History
During the migration, while reviewing the code, I made a disturbing find: another developer's .env.local file had slipped past .gitignore, and past our AI code reviewer, into the Git history a few commits back. Secrets that should never have left our local machines were sitting in the repository.
As the lead developer, I am responsible for overseeing what gets committed, even when I trust an AI code reviewer to catch everything. The responsibility ultimately lands on me. This is exactly the kind of real-life situation where an LLM can't be held responsible for human actions.
The Fix: I rewrote the repository history to purge those secrets entirely, then force-pushed the cleaned branches before making the GitHub move final. A force-push alone doesn't erase old commits; the history has to be rewritten first (tools like git filter-repo exist for exactly this).
I learned that moving to a new CI/CD platform is also a "Security Audit" opportunity. Cleaning up Git history ensures that our new "clean start" doesn't carry old security vulnerabilities with it. The migration forced me to confront what I'd been ignoring: critical security exposures.
Architectural Spring Cleaning
Migration is the best time for "Infrastructure Minimalism." The old server was cluttered with services that didn't belong in the main stack:
- A legacy Handbook (Astro site) that hadn't been updated in months and was not part of the tools our company provided
- Old Proxy Server data that was never cleaned up, originally built up for testing
Instead of moving the mess, I decoupled them. I deleted the orphans and cleaned the Caddyfile.
Bumping the Runtime
We also took the opportunity to bump our runtime to Node 24 (from Node 20, which was deprecated). Technical debt isn't just about code—it's about the runtime. Never migrate a legacy app onto a legacy runtime. Moving to the latest LTS resets your "maintenance clock" for years to come.
From System Services to Containers
In the old setup, both Nginx and Caddy ran as system services outside of Podman:
- Nginx served static files directly on the host
- Caddy handled TLS termination and reverse proxying at the system level
This meant:
- System updates would sometimes break the web server
- Configuration required `sudo` access
- No simple way to track versions or roll back
- Restarting one service risked affecting others
In the new setup, both live inside the Docker ecosystem:
```yaml
services:
  caddy:
    image: caddy:2.7-alpine
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
    networks:
      - pcb-network
    ports:
      - "80:80"
      - "443:443"
```

Now:
- Caddy and Nginx are each just another container
- Zero system-level configuration
- Versions are pinned in the image tag
- Restarting is `docker compose restart caddy`, no sudo needed
- Network is isolated but reachable by all services
The entire web stack is now portable. I can move it to any server by copying the compose.yml file.
Securing MinIO
On the old server, MinIO's ports and its dashboard were exposed publicly. Users' files lived in those buckets; this was a breach waiting to happen.
The Fix: MinIO ports are no longer exposed to the world. Only the internal Docker network can reach it. If an admin needs to access the MinIO console for any reason, we use an SSH tunnel:
```shell
# Open tunnel, access console locally
ssh -L 9001:localhost:9001 deploy@vps

# Then open in browser
# http://localhost:9001
```

No more public ports for sensitive storage. The principle is simple: if it's not needed publicly, don't expose it publicly.
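In compose terms, locking MinIO down is mostly about what you don't write. A sketch of the idea (service name, ports, and volume are illustrative, not our exact config):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    networks:
      - pcb-network
    # The S3 API (9000) has no ports mapping at all: only containers on
    # pcb-network can reach it. The console (9001) binds to loopback only,
    # so it is reachable solely through the SSH tunnel shown above.
    ports:
      - "127.0.0.1:9001:9001"
    volumes:
      - minio_data:/data
```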
The Podman to Docker Exodus
The old server was running Podman in rootless mode, and it was a constant source of pain. Here's what was breaking and how Docker solved it:
Problem 1: DNS Resolution Failures
Podman's rootless DNS struggled to resolve container names. I'd wake up to find services that worked yesterday failing today with cryptic connection errors.
How Docker fixed it: Standard Docker's bridge network has reliable native DNS. Containers resolve each other by name (postgres, redis, minio) without any manual intervention.
Problem 2: Hardcoded IPs Everywhere
Because Podman DNS was unreliable, the codebase was littered with hardcoded IP overrides:
```shell
# I was doing this monstrosity to inject a proxy IP:
sed -i 's/MINIO_ENDPOINT/minio.dannie.cc/g' .env
```

How Docker fixed it: Network aliases work reliably. I can now reference minio.dannie.cc in configs and trust it resolves correctly.
Problem 3: Inconsistent IP Assignments
Container IPs in Podman would shift after restarts. Our extra_hosts configuration needed constant maintenance:
```yaml
# This was our "solution" to Podman's chaos:
extra_hosts:
  - "minio.dannie.cc:x.x.x.2"
  - "postgres.dannie.cc:x.x.x.8"
  - "redis.dannie.cc:x.x.x.4"
```

How Docker fixed it: With Docker's native DNS, I don't need extra_hosts at all. Containers resolve by name, and the network is stable across restarts.
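Concretely, "resolve by name" means a service can point at postgres or redis directly in its connection settings. A hypothetical sketch (the image, credentials, and service names are placeholders, not our real config):

```yaml
services:
  app:
    image: ghcr.io/.../pcb-frontend-dev:latest
    environment:
      # "postgres" and "redis" are resolved by Docker's embedded DNS;
      # no IPs, no extra_hosts
      DATABASE_URL: "postgres://user:pass@postgres:5432/app"
      REDIS_URL: "redis://redis:6379"
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:16
  redis:
    image: redis:7
```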
Problem 4: Timeouts and Hangs
Under load, Podman containers would hang or time out on inter-container communication. The fix was always "restart the container": a band-aid, not a solution.
How Docker fixed it: Docker's networking stack is more mature. The bridge network handles traffic reliably, even under load.
The Lesson
Podman is a great concept (rootless containers, no daemon), but in my experience it's not ready for production workloads that need reliability. When your infrastructure depends on containers being reachable, Docker's battle-tested networking is worth the daemon overhead. Looking back, I would have questioned the choice of Podman in the first place if I'd had the grounding and knowledge back then.
The "Two-Hop" Data Smuggler
I needed to move ~500 files of production data from MinIO buckets. I tried a direct server-to-server mc mirror, but the old server sat behind Cloudflare, which threw a 502 Bad Gateway because it blocked traffic coming from the new Hostinger VPS datacenter IP.
I had to become the bridge:
- Hop 1: `rsync` from the old server to my local machine (my IP is trusted by Cloudflare)
- Hop 2: `rsync` from my machine to the new Hostinger VPS
```shell
# Step 1: Pull to local (trusted IP)
rsync -avz --progress deploy@old-server:/srv/minio/data/bucket/ ./temp-bucket/

# Step 2: Push to new server
rsync -avz --progress ./temp-bucket/ deploy@new-hostinger:/srv/minio/data/bucket/
```

It's a reminder that in DevOps, sometimes the most pragmatic solution is a manual one. Data migrations rarely go exactly as planned; network boundaries, CDNs, and WAFs will actively block server-to-server transfers. Always be prepared to use your local machine as a secure bridge.
Trusting Docker DNS
The Old Way: Hardcoded IPs
My legacy setup relied on brittle configurations:
```yaml
extra_hosts:
  - "minio.dannie.cc:x.x.x.x"
```

I was injecting a proxy IP via a sed command because Podman's rootless DNS struggled to resolve container names. Hardcoded IPs and extra_hosts are code smells in containerized environments.
The New Way: Docker Bridge Network
By moving to standard Docker and putting all containers on a shared pcb-network bridge, I was able to utilize Docker's native DNS:
```yaml
networks:
  pcb-network:
    driver: bridge
    name: pcb-network
```

I gave my Caddy reverse proxy network aliases (e.g., minio.dannie.cc) so backend containers could route to each other using standard URLs, entirely bypassing hairpin NAT issues.
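That alias trick looks roughly like this in compose: attach the alias to Caddy on the shared network, so backend containers requesting minio.dannie.cc land on the reverse proxy. A sketch under those assumptions:

```yaml
services:
  caddy:
    image: caddy:2.7-alpine
    networks:
      pcb-network:
        # Containers on pcb-network that ask for minio.dannie.cc resolve
        # to Caddy, which terminates TLS and proxies to the minio container
        aliases:
          - minio.dannie.cc

networks:
  pcb-network:
    driver: bridge
    name: pcb-network
```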
The lesson: Trust the Docker bridge network DNS. It makes your infrastructure instantly portable and resilient to container restarts. I can move this entire stack to any server in minutes now.
Taming the Registry Bloat
Every build gets tagged with a unique Git SHA (dev-<sha>), which means my GHCR storage was going to balloon rapidly. Without intervention, I'd risk hitting GitHub's storage quotas or getting surprise billing.
The Fix: Automated Garbage Collection
I implemented actions/delete-package-versions@v5 at the end of the workflow to automatically prune old dev-tagged images, keeping only the 5 most recent:
```yaml
- name: Delete old package versions
  uses: actions/delete-package-versions@v5
  with:
    package-type: container
    package-name: pcb-frontend-dev
    # Keep the 5 most recent versions; ignore anything not tagged dev-*
    min-versions-to-keep: 5
    ignore-versions: "^(?!dev-).*$"
```

The lesson: Always implement garbage collection in your CI/CD pipelines. Automating registry cleanup saves you from "Quota Exceeded" failed builds down the line.
Conclusion: Scale Requires Clean Foundations
Being a solo senior dev means your time is the company's most valuable resource. Spending that time fixing brittle scripts is a waste. By investing 112 hours into this migration, I've gained back countless hours of future productivity.
The pipeline now:
- Self-heals with health checks that fail fast
- Secures itself with native GitHub tokens (no manual rotation)
- Scales cleanly with Docker DNS and registry garbage collection
- Migrates responsibly with the "two-hop" pattern for data
- Runs reliably on Docker's mature networking stack (goodbye Podman DNS headaches)
This one GitHub Actions workflow now serves as the automated blueprint for the 10 servers. That 112-hour investment doesn't just fix one server; it scales across the entire company.
What's the one piece of "legacy magic" in your stack that you're currently too afraid to touch? Let's talk about the cost of fixing it vs. the cost of keeping it.