# What I Learned Deploying a Multi-Service RAG App as a Solo Dev
Eight days, four sprints, one shipped MVP — and everything that went wrong along the way. Lessons on Railway networking, Celery workers, HuggingFace caching, CORS, SSE streaming, and provider abstraction.
I built Cognoir, a no-code RAG platform for people who aren’t tech-savvy, by myself — while tutoring, writing for a living, and looking after a toddler at home. The Phase 1 MVP was planned to ship in eight days, with Claude Code as a force multiplier. It shipped. It works. And it very nearly killed me around day five, when Railway decided my three services didn’t need to agree on where Redis actually lived.
This isn’t a success story in the usual sense. It’s a postmortem. There’s a difference.
## The stack
Here’s what I was running. This matters, because most of what broke was specific to this combination of things.
| Layer | MVP (Phase 1) | Current (Phase 2) |
|---|---|---|
| Backend | FastAPI + SQLAlchemy | ← same |
| Frontend | Next.js 14 (Vercel) | ← same |
| Database | Neon Postgres + pgvector | ← same (vector col: 384 → 1024) |
| Storage | Cloudflare R2 | ← same |
| Queue + broker | Celery + Redis | ← same |
| Embeddings | HuggingFace all-MiniLM-L6-v2 | Voyage 4 Large (docs) + Voyage 4 Lite (queries) |
| Reranking | — | Voyage rerank-2.5 |
| LLM | Groq / Llama 3.3 70B | Claude Sonnet (standard) + Claude Opus (deep analysis) |
| Hosting | Railway (API + Celery worker) | ← same |
The MVP ran on Groq and local HuggingFace embeddings. Both free, both fast to build with, both good enough to ship. The plan from the start was to swap them out once the system was stable and retrieval quality actually mattered. That swap happened in Phase 2. But first — the things that broke.
## 1. The Railway trap
Railway is great for solo developers. Fast deploys, generous free tier, good GitHub integration. But it has one quirk that got me badly: services in different Railway projects can’t talk to each other over the internal network.
I set up Redis as its own project and the API as another project. In Railway’s mental model, those are two separate things — because they are. The internal redis:// URL I was handing to my API wasn’t reachable from a different project at all. Everything worked locally, because Docker Compose puts your containers on one network without you having to think about it. I had no idea anything was wrong until production was live and Celery was silently failing to connect.
What I did wrong: Made Redis a separate Railway project, then tried to use its internal URL from the API service in another project. The internal network only exists inside a single project. To talk across projects you’d need the public URL, which defeats the whole point.
The fix was to delete the separate Redis project and add Redis as a service inside the API project. Three services, one project: API, Celery worker, Redis. Once they were in the same place, the internal URL worked and Celery connected.
Lesson: In Railway, everything that needs to talk internally has to live in the same project. Think of a project like a Docker Compose network — same namespace, shared environment. Your API, worker, and cache should be in the same house, not different buildings.
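In env-var terms, the difference looks something like this. The hostnames are illustrative, not my real ones — Railway’s private network resolves `<service>.railway.internal` names, and its public TCP proxy hands out `*.proxy.rlwy.net` addresses:

```
# Same project: the private-network hostname resolves, Celery connects.
REDIS_URL=redis://default:${REDIS_PASSWORD}@redis.railway.internal:6379

# Different project: only the public proxy URL would work, which routes
# broker traffic over the internet and defeats the point entirely.
# REDIS_URL=redis://default:${REDIS_PASSWORD}@your-redis.proxy.rlwy.net:12345
```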
## 2. Celery is not a web server
This sounds obvious. It wasn’t, at 1am, staring at Railway’s Add Service screen wondering why the Celery worker kept crashing every time I deployed.
I tried to run uvicorn and the Celery worker from the same Railway service, by shoving Celery in as a pre-deploy command. Pre-deploy steps run before the main process starts. They’re for migrations — alembic upgrade head, that sort of thing. They are not for processes that are supposed to keep running.
Celery needs its own service, with its own start command:
```
celery -A app.workers.celery_app worker --loglevel=info
```
No public domain. No HTTP. It reads from the Redis queue, processes documents, writes to the database. That’s the whole job.
Lesson: Set up your Railway services the same way you’d set up a Docker Compose file. One process per service. The API handles HTTP. The worker handles the queue. They share environment variables — use Railway’s shared variables so you set them once at the project level instead of three times by hand. This isn’t a nice-to-have. It’s the difference between “I updated the database URL” and “I updated it in two of the three places and spent an hour figuring out which one I missed.”
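The layout the lesson describes, sketched as a Compose file — service names and the build setup are assumptions about the shape, not my actual config:

```yaml
# One process per service; API and worker share an image and an env file.
services:
  api:
    build: .
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
    env_file: .env
    depends_on: [redis]
  worker:
    build: .          # same image as the API, different process
    command: celery -A app.workers.celery_app worker --loglevel=info
    env_file: .env    # shared variables, set once
    depends_on: [redis]
  redis:
    image: redis:7
```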
## 3. The HuggingFace embedding problem — and why it was always going to be temporary
I used all-MiniLM-L6-v2 as the MVP embedding model. Runs locally, no API calls, no cost, 384 dimensions, works fine with pgvector. The problem: HuggingFace downloads the model the first time it’s used. Which means the first request to your production API takes 30 to 40 seconds while the container pulls it down.
In development you never see this. Docker caches it. You forget it’s even happening. In production, the first real user gets a spinner and then a timeout. That’s not a soft launch. That’s a soft disaster.
What I did wrong: Assumed what happens in local Docker would happen in production. Container restarts don’t keep the model cached unless you bake it into the image at build time.
The fix is to download the model into the image while it’s being built, not when it first runs:
```dockerfile
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
```
The image gets about 90MB heavier. Cold starts are fast. For a platform where embedding is on the critical path for every document upload, there’s no other way to do it.
That said — local HuggingFace embeddings were always a development decision, not a production one. The RAG quality was genuinely mediocre. Retrieval was imprecise. Answers were often technically in the right neighbourhood, but not really grounded in the right chunks. There was no reranking to fix bad retrieval either. After Phase 1 shipped, switching to Voyage AI made the quality gap impossible to miss. More on that later.
Lesson: Bake any heavy asset your app needs at runtime into the Docker image. But also — don’t confuse “good enough to ship an MVP” with “good enough to keep.” The embedding model is basically your retrieval engine. It deserves real attention once the infrastructure is stable.
## 4. SSE streaming is harder than it looks
The RAG chat response streams via Server-Sent Events. The answer generates token by token, citations appear inline, the user can see something is happening. In local development, with uvicorn and the Next.js dev server running on the same machine, this was fine.
In production — frontend on Vercel, backend on Railway behind a reverse proxy — the answer would start loading, the browser network tab would show some vague error, and the spinner would run forever.
Two things were wrong at the same time. That’s the worst kind of debugging.
First: ALLOWED_ORIGINS was set to http://localhost:3000. The production Vercel URL wasn’t in it. CORS was rejecting every preflight from the production frontend. The error you get from this — a spinning loader, nothing useful in the console — is not particularly helpful.
Second: FastAPI’s SSE implementation and Next.js’s fetch have different opinions about headers, and those opinions get even more different when there’s a proxy in the middle. The fix was explicitly setting these headers on the SSE response:
```python
headers = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "X-Accel-Buffering": "no",  # tells Railway's proxy to stop buffering
}
```
Lesson: Test SSE through the actual production setup before calling it done. Testing locally with two processes on the same machine hides a whole category of proxy and header problems that only show up in production. A real integration test that fires a real HTTP request with real CORS headers tells you more than any unit test on the streaming endpoint.
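For what it’s worth, the server side of the wire format is small enough to pin down in isolation. This is a minimal sketch — `sse_event` is a hypothetical helper, not from the Cognoir codebase — of the message framing an `EventSource` in the browser expects, alongside the headers from above:

```python
import json
from typing import Optional

# Hypothetical helper (not from the real codebase): formats one message
# in the SSE wire format. A blank line terminates each event.
def sse_event(data: dict, event: Optional[str] = None) -> str:
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"

# The headers discussed above, as they'd be passed to a StreamingResponse.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "X-Accel-Buffering": "no",  # stop the reverse proxy from buffering
}
```

Having the framing in one place also makes it trivially unit-testable, which is exactly what the streaming endpoint itself is not.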
## 5. CORS will always bite you last
I’ve already mentioned CORS once. I’m mentioning it again because it showed up in at least three different forms over the eight-day sprint, and each time it looked a bit different.
The actual problem is that ALLOWED_ORIGINS has to be updated in two places — your local .env and Railway’s shared variables — every time a frontend URL changes. Vercel preview deployments have their own URLs. Your production URL is different from your preview URL. If you’re testing from localhost and a preview deploy at the same time, you need both in the list.
In development, ALLOWED_ORIGINS=* is fine. In production, lock it down to your actual frontend domains. Keep a list of every URL that belongs there. Vercel gives you a stable production URL and a stable preview subdomain. Add both before you deploy.
The more useful lesson is what CORS errors actually look like. The browser message is almost never the real diagnosis. Open the network tab, find the failed OPTIONS preflight, check what header the server sent back, and work backwards from there. The problem is always on the server — a missing header, an origin that’s not in the list, or middleware that’s not running on OPTIONS requests.
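One small thing that helps: parse the origin list from a single comma-separated variable, so there is exactly one string to keep in sync per environment. The variable name matches the post; the helper and the placeholder URL are mine, and trailing slashes are stripped because an origin with one never matches:

```python
import os

# Sketch, assuming ALLOWED_ORIGINS is a comma-separated env var.
# Origins must not carry a trailing slash, so strip it defensively.
def parse_allowed_origins(raw: str) -> list:
    return [o.strip().rstrip("/") for o in raw.split(",") if o.strip()]

origins = parse_allowed_origins(os.environ.get(
    "ALLOWED_ORIGINS",
    "http://localhost:3000,https://your-app.vercel.app",  # placeholder
))
# Then, in FastAPI:
# app.add_middleware(CORSMiddleware, allow_origins=origins,
#                    allow_credentials=True, allow_methods=["*"],
#                    allow_headers=["*"])
```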
## 6. The provider switch — and why the abstraction layer was worth building
After Phase 1 shipped, Groq/Llama and local HuggingFace embeddings had done what they were supposed to do: zero cost to iterate, fast feedback, a working system in production. But the retrieval quality was not good. Answers were often technically in the right neighbourhood, but not actually grounded — right document in the corpus, wrong chunks coming up, citations that were sitting next to the answer instead of being the answer.
Testing with real documents made the decision easy.
What changed:
- LLM: Groq / Llama 3.3 → Claude Sonnet for regular queries + Claude Opus for deep analysis mode
- Embeddings: HuggingFace all-MiniLM-L6-v2 (384d) → Voyage 4 Large for documents, Voyage 4 Lite for queries (1024d, asymmetric)
- Reranking: nothing → Voyage rerank-2.5 after retrieval
The asymmetric embedding strategy is worth explaining briefly. Using a different model for documents versus queries isn’t just about cost. It’s about what each model is actually good at. Voyage 4 Large is built for accuracy at indexing time, when you’re not in a hurry. Voyage 4 Lite is built for speed at query time, when someone is waiting. The vectors are compatible across the same model family, so retrieval still works correctly.
Reranking was the single biggest quality improvement. Vector similarity search retrieves the top chunks that are semantically close to the query. Reranking then reorders those chunks using a cross-encoder model — much more expensive, only practical on a small candidate set of maybe 20 chunks, not the whole corpus. The difference in which chunk ends up at rank 1 is real. Answers got more precise. Citations got more accurate.
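The two-stage shape is easy to sketch in isolation. Both scorers below are stand-ins — stage 1 is plain dot-product similarity, and stage 2 would be the rerank-2.5 API call in production — but the structure is the point: only the small candidate set ever sees the expensive scorer.

```python
# Two-stage retrieval sketch with stand-in scorers. Returns chunk
# indices, best first.
def retrieve_then_rerank(query_vec, chunk_vecs, rerank_score, k=20, final=5):
    # Stage 1: cheap similarity over the whole corpus
    sims = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    candidates = sorted(range(len(chunk_vecs)),
                        key=lambda i: sims[i], reverse=True)[:k]
    # Stage 2: expensive (cross-encoder-style) scoring, candidates only
    return sorted(candidates, key=rerank_score, reverse=True)[:final]
```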
How the abstraction layer made all of this boring to execute:
The whole switch touched two files — core/embeddings.py and core/llm.py — and a set of environment variable updates. Nothing in the RAG pipeline, the ingestion worker, the chat endpoint, or the retrieval service had to change. Every layer called get_embeddings(mode="document") or get_llm() and got back whatever the config said. That was it.
New environment variables:
```
ANTHROPIC_API_KEY=sk-ant-...
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-20250514
LLM_MODEL_DEEP=claude-opus-4-20250514
VOYAGE_API_KEY=pa-...
EMBEDDING_PROVIDER=voyageai
EMBEDDING_MODEL_DOCUMENT=voyage-4-large
EMBEDDING_MODEL_QUERY=voyage-4-lite
EMBEDDING_DIMENSION=1024
```
The one thing you can’t abstract away is a dimension change. Going from 384d to 1024d meant running an ALTER COLUMN on Neon and re-embedding every document. That’s a one-time batch job. But it’s also a production migration, and it needs to be treated like one. Every existing embedding became invalid the moment the column changed. The key is knowing that’s coming and writing the re-processing job before you run the migration, not after.
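The migration itself is short; the table and column names below are assumptions. Since the old 384-d vectors are garbage the moment you switch models, clearing them first sidesteps any cast problems on the type change:

```sql
-- Migration sketch (table/column names are assumptions).
UPDATE chunks SET embedding = NULL;
ALTER TABLE chunks ALTER COLUMN embedding TYPE vector(1024);
-- ...then run the batch re-embedding job to repopulate the column,
-- and rebuild any vector index that referenced the old one.
```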
Lesson: Build the abstraction layer before you need it. It feels like unnecessary architecture on day one. On the day you switch providers — and you will, because the RAG ecosystem is still moving fast — you’ll be glad it’s there.
## What I’d do differently
Infrastructure before code. I wrote application code on day one and spent day five figuring out why my services couldn’t find each other. The Railway setup, shared variables, and service layout should all be confirmed before any application code gets written. Deploy a hello world. Make sure everything can reach everything else. Then start building.
Make local match production. My docker-compose.yml should have matched production exactly — same service names, same networking assumptions, same env var structure. If it runs in Docker Compose with those constraints, it runs on Railway. The gap between “works locally” and “works in prod” is almost always an infrastructure gap, not a code gap.
Treat environment variables like infrastructure. Every env var should be documented and kept in sync between local and production as a deliberate step, not something you sort out when something breaks. A complete .env.example that mirrors the full production variable set would have saved me at least two debugging sessions.
Build the provider abstraction layer first. The Phase 2 provider switch — Groq → Claude Sonnet/Opus, HuggingFace → Voyage 4, adding rerank-2.5 — touched two files and some env vars. Nothing else. That is only possible because the abstraction was built into the MVP from the start. Do this upfront. It will feel like extra work. It isn’t.
## On building alone
There’s a specific kind of tiredness that comes from being the only person who knows why something is broken. No one to think out loud with at 2am who already knows the codebase. Every problem is yours, from the moment you find it to the moment it’s fixed and tested.
Claude Code helped with this — not as a code generator, but as something that holds context. The CLAUDE.md in the project root meant I could start a session with “we’re working on the Celery pipeline today” and it already knew the architecture, the constraints, the decisions that were already made. That’s genuinely useful when you’re building in three-hour windows between other commitments.
But there are limits worth being honest about. Nothing replaces sitting with a broken system and actually thinking. The Railway networking issue wasn’t fixed by a prompt. It was fixed by drawing the service layout on paper and asking: “wait, can service A in project X actually reach service B in project Y?” Obvious once it was written down. Invisible until then.
Phase 1 is shipped. Document processing works. Citations are generated and clickable. The anonymous trial works without signing up. These are real things that exist in production.
Phase 2 is underway. The infrastructure lessons from Phase 1 are already in the CLAUDE.md.
The toddler still wakes up at 6am. Some things can’t be optimized.
Cognoir is a horizontal RAG platform for non-technical users, built by Terran Coders. Phase 1 MVP covers document upload (PDF, DOCX), semantic search via pgvector, and cited answer generation. This post covers the Phase 1 deployment sprint and Phase 2 provider upgrade.