What Is RAG and Why Use LlamaIndex?
Retrieval Augmented Generation (RAG) is a pattern that connects a language model to an external knowledge base at query time. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents from your data store and injects them into the prompt—giving the model accurate, up-to-date, source-grounded context to work with.
RAG is the practical alternative to fine-tuning for most business applications:
- No retraining required when your data changes—update the index, not the model
- Dramatically cheaper than fine-tuning a large model on proprietary data
- Auditable—you can inspect exactly which documents were retrieved for any answer
- Works with private data without sending training data to model providers
LlamaIndex is a leading Python framework for building RAG pipelines. It handles the entire workflow, from document ingestion to query response, with integrations for dozens of data sources, vector stores, and LLM providers.
Installation
Start with a minimal install:
pip install llama-index
For local models via Ollama, also install:
pip install llama-index-llms-ollama llama-index-embeddings-ollama
The 5 Stages of a LlamaIndex Pipeline
Stage 1: Loading
LlamaIndex can load documents from files, directories, databases, Notion, Slack, web pages, and many other sources via its connector ecosystem (LlamaHub).
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
SimpleDirectoryReader handles PDF, DOCX, TXT, Markdown, and other common formats automatically. To load specific files rather than a whole directory, pass a list of paths via the input_files argument instead of a directory path.
Stage 2: Indexing
Indexing splits documents into chunks (nodes) and generates vector embeddings for each chunk. The embeddings are stored in a vector index that supports semantic similarity search.
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
By default, LlamaIndex uses OpenAI's text-embedding-ada-002 for embeddings and splits documents into 1024-token chunks with a 200-token overlap. Both settings are configurable.
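To make the chunk-size and overlap mechanics concrete, here is a pure-Python sliding-window sketch. This is not LlamaIndex's actual splitter (the default SentenceSplitter is sentence-aware and counts model tokens, so its boundaries differ); it only illustrates how overlapping windows partition a token sequence.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    # Each new chunk starts (chunk_size - overlap) tokens after the last,
    # so consecutive chunks share `overlap` tokens of context.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(25))  # stand-in for 25 tokens of text
chunks = chunk_tokens(tokens, chunk_size=10, overlap=2)
print([(c[0], c[-1]) for c in chunks])  # → [(0, 9), (8, 17), (16, 24)]
```

With the real defaults (1024-token chunks, 200-token overlap), each chunk repeats roughly a fifth of its predecessor, so facts that straddle a boundary still appear intact in at least one chunk.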
Stage 3: Storing (Index Persistence)
Without persistence, you regenerate embeddings on every run—burning API tokens and time. Persist the index to disk after the first build:
import os
from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # First run: build and save
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: load from disk
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
After the first run, loading from storage takes milliseconds and costs zero API tokens.
Stage 4: Querying
Create a query engine from the index and ask questions in natural language:
query_engine = index.as_query_engine()
response = query_engine.query("What are the payment terms in the contract?")
print(response)
Under the hood, LlamaIndex:
- Embeds your query using the same embedding model used for indexing
- Finds the top-k most similar document chunks by cosine similarity
- Builds a prompt containing those chunks as context
- Sends the prompt to your configured LLM
- Returns the LLM's response
By default, the query engine retrieves the top 2 chunks. Increase this with similarity_top_k:
query_engine = index.as_query_engine(similarity_top_k=5)
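The retrieval step described above can be sketched in plain Python. The vectors here are hand-made toys (real embeddings have hundreds of dimensions and come from a learned model), but the ranking math is the same: score every chunk by cosine similarity against the query vector and keep the top k.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by both vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    # Rank chunk names by similarity to the query, highest first.
    ranked = sorted(chunk_vecs, key=lambda name: cosine(query_vec, chunk_vecs[name]), reverse=True)
    return ranked[:k]

chunks = {
    "payment terms": [0.9, 0.1, 0.0],
    "termination":   [0.1, 0.9, 0.1],
    "warranty":      [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy embedding of "What are the payment terms?"
print(top_k(query, chunks, k=2))  # → ['payment terms', 'warranty']
```

The chunks returned by top_k are what gets pasted into the prompt as context; everything the LLM "knows" about your documents at answer time comes from this list.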
Stage 5: Evaluation
LlamaIndex includes evaluation modules to measure retrieval quality and response faithfulness:
from llama_index.core.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()  # uses Settings.llm as the judge
result = evaluator.evaluate_response(response=response)
print(result.passing)  # True if the response is grounded in the retrieved context
Evaluation is essential before deploying a RAG system to production. Hallucinations that slip through testing tend to compound in production use.
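To make the faithfulness idea concrete, here is a deliberately crude lexical sketch: what fraction of the answer's content words actually appear in the retrieved context? LlamaIndex's FaithfulnessEvaluator asks an LLM judge rather than matching strings, so treat this only as an illustration of the concept, not of the implementation.

```python
def grounding_score(response, context,
                    stopwords=frozenset({"the", "a", "is", "are", "in", "of"})):
    # Fraction of the response's content words that appear in the context.
    # 1.0 means every content word is grounded; lower values suggest
    # the answer introduced material the retrieved chunks never said.
    resp_words = {w for w in response.lower().split() if w not in stopwords}
    ctx_words = set(context.lower().split())
    if not resp_words:
        return 1.0
    return len(resp_words & ctx_words) / len(resp_words)

context = "payment is due within 30 days of invoice"
faithful = "payment due within 30 days"
hallucinated = "payment due within 7 days via wire"
print(grounding_score(faithful, context))      # → 1.0
print(grounding_score(hallucinated, context))  # lower: "7", "via", "wire" are ungrounded
```

A real evaluator must also handle paraphrase and entailment, which is exactly why LlamaIndex delegates the judgment to an LLM.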
Using Local Models via Ollama
For complete data privacy, replace OpenAI with a local model running via Ollama. Pull a model first:
ollama pull llama3.2
ollama pull nomic-embed-text
Then configure LlamaIndex to use your local models:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Now build or load your index as normal
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the main findings.")
print(response)
Your documents and queries never leave your machine. This setup is viable for sensitive business data where cloud API calls are not acceptable.
The Complete Minimal Pipeline
Here is the complete minimal pattern, with persistence, in one block:
import os
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"
DATA_DIR = "./data"

if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader(DATA_DIR).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Your question here")
print(response)
Put your documents in ./data, run the script once to build the index, and every subsequent run loads from disk in under a second. Swap in Ollama settings at the top for a fully local, zero-cost-per-query setup.
When RAG Is the Right Choice
Use RAG when:
- Your knowledge base changes frequently (weekly or more)
- You need the model to cite specific source documents
- Your data is private and cannot be sent to a training pipeline
- You need to be operational within days, not weeks
Consider fine-tuning instead when:
- You need the model to adopt a specific tone or communication style
- The knowledge is stable and deeply structural (e.g., a domain-specific grammar)
- You need inference to be extremely fast with no retrieval latency
For most B2B applications—document Q&A, internal knowledge bases, contract review, customer support—RAG with LlamaIndex is the correct starting point. It is faster to build, cheaper to operate, and easier to maintain than any fine-tuning approach.