22 August 2025

RAG (Retrieval-Augmented Generation) is a technique that enhances the capabilities of language models by combining information retrieval with text generation. Instead of relying solely on what the model has memorized during training, RAG allows the model to access external sources of information—like documents, databases, or knowledge bases—at runtime.
The process works in two main steps: first, a retriever searches the external knowledge source for passages relevant to the user's query; second, the language model generates an answer conditioned on both the query and the retrieved passages.
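The two steps can be sketched in a few lines of plain Python. This is a toy illustration, not the project's code: keyword overlap stands in for real vector similarity, and the prompt is printed instead of being sent to a model.

```python
# Minimal sketch of the two RAG steps: retrieve relevant context,
# then generate from a prompt that includes that context.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by word overlap with the query (a toy
    stand-in for embedding similarity search)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 2: condition generation on the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Ollama runs large language models locally.",
    "ChromaDB stores embedding vectors on disk.",
    "FastAPI exposes the chat endpoints.",
]
query = "Where are embedding vectors stored?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

In the real system, `retrieve` is replaced by a vector search over ChromaDB and the prompt is passed to the model running in Ollama.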
In this project, there's no need for API keys or paid services; everything runs locally. All you need is Ollama installed (to run LLaMA 3 or any compatible model) and ChromaDB, which is used to store embedding vectors locally on your machine. That's it: no external dependencies, no cloud setup, just a simple, self-contained RAG system.
This project allows you to upload a PDF, CSV, JSON, or even a webpage URL, then chat with the content you uploaded. It uses local embedding and retrieval to understand your data, letting you ask questions and receive context-aware answers — all without needing internet access or API keys.
Before you start, be aware that Ollama requires a fair amount of system resources, especially for running large language models like llama3. This project uses llama3 for generation and nomic-embed-text for embeddings. Make sure your machine has enough RAM and CPU/GPU capacity to handle model loading and inference smoothly. You can change the model in utility/config.py.
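The exact contents of utility/config.py aren't shown here, but a config module for this setup might look roughly like the following. All names and values are illustrative, not the project's actual identifiers:

```python
# utility/config.py (illustrative sketch; the project's real names may differ)
LLM_MODEL = "llama3"                  # chat model served by Ollama
EMBEDDING_MODEL = "nomic-embed-text"  # embedding model used for chunks
CHROMA_PATH = "chroma"                # on-disk location of the ChromaDB store
CHUNK_SIZE = 1000                     # characters per chunk
CHUNK_OVERLAP = 200                   # overlap between consecutive chunks
```

Swapping llama3 for a smaller model here is the easiest way to reduce memory pressure on modest hardware.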
When you upload a document such as a PDF, CSV, JSON, TXT file, or even provide a webpage URL, the system processes it through a four-step pipeline to make it ready for chat interactions.
The system starts by detecting the file type using a utility function. Based on the extension or URL, it selects the appropriate loader. For PDFs, it uses PyPDFLoader; for CSV files, CSVLoader; for JSON, JSONLoader; for text and markdown files, TextLoader; and for web pages, UnstructuredLoader. These loaders extract the content and basic metadata, returning it as a list of Document objects.
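The dispatch logic boils down to a lookup on the URL scheme or file extension. Here is a dependency-free sketch that returns the loader's class name rather than instantiating it (the function name `pick_loader` is illustrative; the real code imports these loader classes from LangChain):

```python
from pathlib import Path
from urllib.parse import urlparse

# Map each source to the LangChain loader class described above.
# Returning the class name keeps this sketch dependency-free; the
# real code would import and instantiate the loader itself.
def pick_loader(source: str) -> str:
    if urlparse(source).scheme in ("http", "https"):
        return "UnstructuredLoader"
    suffix = Path(source).suffix.lower()
    return {
        ".pdf": "PyPDFLoader",
        ".csv": "CSVLoader",
        ".json": "JSONLoader",
        ".txt": "TextLoader",
        ".md": "TextLoader",
    }.get(suffix, "TextLoader")  # fall back to plain text

print(pick_loader("report.pdf"))           # PyPDFLoader
print(pick_loader("https://example.com"))  # UnstructuredLoader
```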
Once the document is loaded, it is split into smaller chunks for more efficient embedding and semantic retrieval. This is important because large documents cannot be embedded or searched effectively as a whole. Different file types use different splitting strategies: RecursiveJsonSplitter is used for structured JSON files, CharacterTextSplitter for CSV files, and RecursiveCharacterTextSplitter for text-heavy files like PDFs, text documents, and web content.
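In its simplest character-based form, the splitting idea looks like this. It's a stand-in for RecursiveCharacterTextSplitter, which additionally tries to break on paragraph and sentence boundaries before falling back to raw character positions:

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Slice text into fixed-size chunks with a small overlap, so a
    sentence cut at one boundary still appears intact in a neighbor."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = split_text("a" * 250, chunk_size=100, overlap=20)
print(len(chunks))  # 4 chunks, starting at offsets 0, 80, 160, 240
```

The overlap is what preserves retrieval quality: without it, an answer that straddles a chunk boundary would never be fully contained in any single embedded chunk.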
Each chunk is given a unique ID using a combination of its source, page number, and chunk index. This helps track which document and location the generated response is referencing. The metadata is also cleaned to ensure compatibility with ChromaDB, converting non-standard types to strings.
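Both steps are small enough to sketch directly. The function names are illustrative, but the ID format (source, page, chunk index) and the metadata constraint (ChromaDB accepts only strings, ints, floats, and booleans as metadata values) follow what's described above:

```python
def make_chunk_id(source: str, page: int, index: int) -> str:
    """Combine source, page number, and chunk index into a stable ID,
    so answers can be traced back to their exact location."""
    return f"{source}:{page}:{index}"

def clean_metadata(meta: dict) -> dict:
    """ChromaDB metadata values must be str, int, float, or bool;
    anything else is converted to its string representation."""
    allowed = (str, int, float, bool)
    return {k: v if isinstance(v, allowed) else str(v) for k, v in meta.items()}

print(make_chunk_id("data/report.pdf", 3, 0))        # data/report.pdf:3:0
print(clean_metadata({"page": 3, "tags": ["a", "b"]}))
```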
After assigning IDs, the chunks are embedded using the model specified (nomic-embed-text in this case). The resulting vectors are then stored in ChromaDB. If a collection for the document source already exists, it will be updated with any new chunks that were not previously stored. Otherwise, a new collection is created with relevant metadata such as file type and creation timestamp.
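The "add only new chunks" update reduces to a set difference over chunk IDs. Sketched here with plain dictionaries standing in for a ChromaDB collection (the function name is illustrative):

```python
def select_new_chunks(existing_ids: set[str], chunks: dict[str, str]) -> dict[str, str]:
    """Return only the chunks whose IDs are not already stored,
    mirroring how an existing collection is updated incrementally."""
    return {cid: text for cid, text in chunks.items() if cid not in existing_ids}

stored = {"doc.pdf:1:0", "doc.pdf:1:1"}
incoming = {"doc.pdf:1:1": "already stored chunk", "doc.pdf:2:0": "new chunk"}
new = select_new_chunks(stored, incoming)
print(new)  # {'doc.pdf:2:0': 'new chunk'}
```

Because the IDs are deterministic (source, page, index), re-ingesting the same document is idempotent: unchanged chunks are skipped and only genuinely new content is embedded and stored.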
1. POST /chat/new-chat/
Description: Creates a new chat session by uploading and ingesting a file or URL into ChromaDB.
Request Body: The file to upload (PDF, CSV, JSON, or text) or a webpage URL.
Process: Detects the source type, loads and chunks the content, embeds the chunks, and stores them in a ChromaDB collection.
Response: "collection_name", identifying the newly created collection.
2. POST /chat/message/{id}
Description: Asks a question to a specific chat session (document collection).
Path Parameter: id, the collection name of the chat session.
Request Body: The question to ask about the document.
Process: Embeds the question, retrieves the most relevant chunks from the collection, and passes them to the model as context.
Response: StreamingResponse with the model's reply and a list of source IDs used.
3. GET /chat/all-chats
Description: Returns a list of all stored chat collections (document sessions).
Response: A list of the stored collections.
4. DELETE /chat/{id}
Description: Deletes a document collection from ChromaDB.
Path Parameter: id, the name of the collection to delete.
Response: "Deleted"
This set of endpoints forms the complete interface for uploading, querying, listing, and deleting document-backed chat sessions. Everything runs locally with no need for API keys or third-party services.
chroma/
This folder contains the vector database data generated by ChromaDB.
Each subfolder (with UUID-like names) represents a separate collection, and chroma.sqlite3 is the internal SQLite file used by Chroma to store metadata.
data/
This folder stores uploaded source documents, like PDFs, CSVs, JSON files, and plain-text files. These are the actual input files ingested and chunked into the database.
routes/
Handles the FastAPI routing layer.
services/
Holds the core logic and processing classes.
utility/
Contains shared helper functions, config, and integrations.
Thank you!