Problem
Most LLM usage today happens through APIs, which abstract away how models are actually run and served.
I wanted to understand how LLMs work at a system level — how they are deployed, exposed, and integrated into applications.
The goal was to run an LLM locally and expose it as a simple, reusable API, like any other backend service.
Architecture
User / Client
↓
FastAPI Server
↓
LLM Runtime (Ollama / Hugging Face)
↓
Response
Optional (future):
User → API → Embeddings → FAISS → Context → LLM → Response
This setup focuses on understanding how models are served and how an API layer interacts with them.
Tech Stack
- FastAPI
- Ollama / Hugging Face Transformers
- Docker
- Python
- FAISS (planned)
Key Decisions
Local model over API-based model: to gain full control, avoid cost, and understand how models are actually served.
FastAPI for serving: lightweight and simple for exposing model inference as an API.
Docker for portability: ensures the setup is reproducible across environments.
Start simple: focus on the core prompt-response flow before adding embeddings or retrieval systems.
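The Docker decision might look something like the sketch below. The file names (`requirements.txt`, `main:app`) are assumptions about the project layout; Ollama itself runs on the host or in a sibling container, so the image only needs HTTP access to it.

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```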
Challenges
- Running LLMs efficiently on local hardware
- Managing model size and performance trade-offs
- Understanding model loading vs inference lifecycle
- Handling latency in responses
- Designing a clean and usable API interface
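One common way to handle the latency challenge is streaming: Ollama can emit partial results as newline-delimited JSON chunks (`"stream": true`), so the client sees tokens as they are generated instead of waiting for the full response. A small sketch of the chunk parsing (the function name is mine, not the project's):

```python
import json
from typing import Iterable, Iterator

def stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    # In streaming mode, Ollama sends one JSON object per line, each
    # carrying a partial "response" and a "done" flag on the final chunk
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break
```

Paired with `requests.post(..., json={..., "stream": True}, stream=True)`, the generator can feed `resp.iter_lines()` straight into a FastAPI `StreamingResponse`, cutting perceived latency even when total generation time is unchanged.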
Result
A working foundation for running and exposing LLMs locally through a backend service.
The project is evolving into a system for experimenting with model serving, API design, and AI infrastructure patterns.
Future Work
- Add embeddings and vector search (FAISS)
- Implement context-aware responses (RAG)
- Optimize performance and latency
- Extend API capabilities
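At its core, the planned FAISS step is nearest-neighbor search over embedding vectors. A pure-Python sketch of that lookup, with a linear scan standing in for the approximate index FAISS would provide once the corpus grows (function names are illustrative):

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity: dot product normalized by vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: List[float],
          docs: List[Tuple[str, List[float]]],
          k: int = 2) -> List[str]:
    # docs: (text, embedding) pairs; FAISS replaces this O(n) scan
    # with an index, but the retrieval semantics are the same
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

In the RAG flow sketched under Architecture, the `top_k` results would be concatenated into the prompt as context before the LLM call.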