What Is RAG and Why Use LlamaIndex?
Retrieval Augmented Generation (RAG) is a pattern that connects a language model to an external knowledge base at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from your data store and injects them into the prompt—giving the model accurate, up-to-date, source-grounded context to work with.
RAG is the practical alternative to fine-tuning for most business applications:
- No retraining required when your data changes—update the index, not the model
- Dramatically cheaper than fine-tuning a large model on proprietary data
- Auditable—you can inspect exactly which documents were retrieved for any answer
- Works with private data without sending training data to model providers
LlamaIndex is a leading Python framework for building RAG pipelines. It handles the entire workflow, from document ingestion to query response, with integrations for dozens of data sources, vector stores, and LLM providers.
Installation
Start with a minimal install:
pip install llama-index
For local models via Ollama, also install:
pip install llama-index-llms-ollama llama-index-embeddings-ollama
The 5 Stages of a LlamaIndex Pipeline
Stage 1: Loading
LlamaIndex can load documents from files, directories, databases, Notion, Slack, web pages, and many other sources via its connector ecosystem (LlamaHub).
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
SimpleDirectoryReader handles PDF, DOCX, TXT, Markdown, and other common formats automatically. To load specific files rather than a whole directory, pass a list of paths via the input_files argument instead of a directory path.
Stage 2: Indexing
Indexing splits documents into chunks (nodes) and generates vector embeddings for each chunk. The embeddings are stored in a vector index that supports semantic similarity search.
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
By default, LlamaIndex uses OpenAI's text-embedding-ada-002 for embeddings and splits documents into 1024-token chunks with a 200-token overlap. Both settings are configurable.
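To make the chunk-size and overlap mechanics concrete, here is a pure-Python sliding-window sketch. This is not LlamaIndex's actual splitter (the default SentenceSplitter is sentence-aware and counts model tokens, so its boundaries differ); it only illustrates how overlapping windows partition a token sequence.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    # Each new chunk starts (chunk_size - overlap) tokens after the last,
    # so consecutive chunks share `overlap` tokens of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(25))  # stand-in for 25 tokens of text
chunks = chunk_tokens(tokens, chunk_size=10, overlap=2)
print([(c[0], c[-1]) for c in chunks])  # → [(0, 9), (8, 17), (16, 24)]
```

With the real defaults (1024-token chunks, 200-token overlap), each chunk repeats roughly a fifth of its predecessor, so facts that straddle a boundary still appear intact in at least one chunk.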
Stage 3: Storing (Index Persistence)
Without persistence, you regenerate embeddings on every run—burning API tokens and time. Persist the index to disk after the first build:
import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # First run: build and save
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: load from disk
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
After the first run, loading from storage takes milliseconds and costs zero API tokens.
Stage 4: Querying
Create a query engine from the index and ask questions in natural language:
query_engine = index.as_query_engine()
response = query_engine.query("What are the payment terms in the contract?")
print(response)
Under the hood, LlamaIndex:
- Embeds your query using the same embedding model used for indexing
- Finds the top-k most similar document chunks by cosine similarity
- Builds a prompt containing those chunks as context
- Sends the prompt to your configured LLM
- Returns the LLM's response
By default, the query engine retrieves the top 2 chunks. Increase this with similarity_top_k:
query_engine = index.as_query_engine(similarity_top_k=5)
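The retrieval step described above can be sketched in plain Python. The vectors here are hand-made toys (real embeddings have hundreds of dimensions and come from a learned model), but the ranking math is the same: score every chunk by cosine similarity against the query vector and keep the top k.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    # Rank chunk names by similarity to the query, highest first.
    ranked = sorted(chunk_vecs, key=lambda name: cosine(query_vec, chunk_vecs[name]), reverse=True)
    return ranked[:k]

chunks = {
    "payment terms": [0.9, 0.1, 0.0],
    "termination":   [0.1, 0.9, 0.1],
    "warranty":      [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of "What are the payment terms?"
print(top_k(query, chunks, k=2))  # → ['payment terms', 'warranty']
```

The chunks returned by top_k are what gets pasted into the prompt as context; everything the LLM "knows" about your documents at answer time comes from this list.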
Stage 5: Evaluation
LlamaIndex includes evaluation modules to measure retrieval quality and response faithfulness:
from llama_index.core.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()  # uses Settings.llm as the judge
result = evaluator.evaluate_response(response=response)
print(result.passing)  # True if the response is grounded in the retrieved context
Evaluation is essential before deploying a RAG system to production. Hallucinations that slip through testing tend to compound in production use.
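To make the faithfulness idea concrete, here is a deliberately crude lexical sketch: what fraction of the answer's content words actually appear in the retrieved context? LlamaIndex's FaithfulnessEvaluator asks an LLM judge rather than matching strings, so treat this only as an illustration of the concept, not of the implementation.

```python
def grounding_score(response, context,
                    stopwords=frozenset({"the", "a", "is", "are", "in", "of"})):
    # Fraction of the response's content words that appear in the context.
    # 1.0 means every content word is grounded; lower values suggest
    # the answer introduced material the retrieved chunks never said.
    resp_words = {w for w in response.lower().split() if w not in stopwords}
    ctx_words = set(context.lower().split())
    if not resp_words:
        return 1.0
    return len(resp_words & ctx_words) / len(resp_words)

context = "payment is due within 30 days of invoice"
faithful = "payment due within 30 days"
hallucinated = "payment due within 7 days via wire"
print(grounding_score(faithful, context))      # → 1.0
print(grounding_score(hallucinated, context))  # lower: "7", "via", "wire" are ungrounded
```

A real evaluator must also handle paraphrase and entailment, which is exactly why LlamaIndex delegates the judgment to an LLM.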
Using Local Models via Ollama
For complete data privacy, replace OpenAI with a local model running via Ollama. Pull a model first:
ollama pull llama3.2
ollama pull nomic-embed-text
Then configure LlamaIndex to use your local models:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Now build or load your index as normal
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the main findings.")
print(response)
Your documents and queries never leave your machine. This setup is viable for sensitive business data where cloud API calls are not acceptable.
The Complete Minimal Pipeline
Here is the complete minimal pattern, with persistence, in one block:
import os
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"
DATA_DIR = "./data"

if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader(DATA_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Your question here")
print(response)
Put your documents in ./data, run the script once to build the index, and every subsequent run loads from disk in under a second. Swap in Ollama settings at the top for a fully local, zero-cost-per-query setup.
When RAG Is the Right Choice
Use RAG when:
- Your knowledge base changes frequently (weekly or more)
- You need the model to cite specific source documents
- Your data is private and cannot be sent to a training pipeline
- You need to be operational within days, not weeks
Consider fine-tuning instead when:
- You need the model to adopt a specific tone or communication style
- The knowledge is stable and deeply structural (e.g., a domain-specific grammar)
- You need inference to be extremely fast with no retrieval latency
For most B2B applications—document Q&A, internal knowledge bases, contract review, customer support—RAG with LlamaIndex is the correct starting point. It is faster to build, cheaper to operate, and easier to maintain than any fine-tuning approach.