Problem
Most LLM usage today happens through APIs, which abstract away how models are actually run and served.
I wanted to understand how LLMs work at a system level — how they are deployed, exposed, and integrated into applications.
The goal was to run an LLM locally and expose it as a simple, reusable API, like any other backend service.
Architecture
User / Client
↓
FastAPI Server
↓
LLM Runtime (Ollama / Hugging Face)
↓
Response
Optional (future):
User → API → Embeddings → FAISS → Context → LLM → Response
This setup focuses on understanding how models are served and how an API layer interacts with them.
Tech Stack
- FastAPI
- Ollama / Hugging Face Transformers
- Docker
- Python
- FAISS (planned)
Key Decisions
Local model over API-based model: to gain full control, avoid cost, and understand how models are actually served.
FastAPI for serving: lightweight and simple for exposing model inference as an API.
Docker for portability: ensures the setup is reproducible across environments.
Start simple: focus on the core prompt-response flow before adding embeddings or retrieval systems.
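The Docker decision might look something like the sketch below. The file names (`requirements.txt`, `main:app`) are assumptions about the project layout; Ollama itself runs on the host or in a sibling container, so the image only needs HTTP access to it.

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```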
Challenges
- Running LLMs efficiently on local hardware
- Managing model size and performance trade-offs
- Understanding model loading vs inference lifecycle
- Handling latency in responses
- Designing a clean and usable API interface
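One common way to handle the latency challenge is streaming: Ollama can emit partial results as newline-delimited JSON chunks (`"stream": true`), so the client sees tokens as they are generated instead of waiting for the full response. A small sketch of the chunk parsing (the function name is mine, not the project's):

```python
import json
from typing import Iterable, Iterator

def stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    # In streaming mode, Ollama sends one JSON object per line, each
    # carrying a partial "response" and a "done" flag on the final chunk
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break
```

Paired with `requests.post(..., json={..., "stream": True}, stream=True)`, the generator can feed `resp.iter_lines()` straight into a FastAPI `StreamingResponse`, cutting perceived latency even when total generation time is unchanged.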
Result
A working foundation for running and exposing LLMs locally through a backend service.
The project is evolving into a system for experimenting with model serving, API design, and AI infrastructure patterns.
Future Work
- Add embeddings and vector search (FAISS)
- Implement context-aware responses (RAG)
- Optimize performance and latency
- Extend API capabilities
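At its core, the planned FAISS step is nearest-neighbor search over embedding vectors. A pure-Python sketch of that lookup, with a linear scan standing in for the approximate index FAISS would provide once the corpus grows (function names are illustrative):

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity: dot product normalized by vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: List[float],
          docs: List[Tuple[str, List[float]]],
          k: int = 2) -> List[str]:
    # docs: (text, embedding) pairs; FAISS replaces this O(n) scan
    # with an index, but the retrieval semantics are the same
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

In the RAG flow sketched under Architecture, the `top_k` results would be concatenated into the prompt as context before the LLM call.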