LlamaIndex for Beginners (2025): A Complete Guide to Building RAG Apps from Zero to Production
By Gautam Soni
Who this is for: Developers who want to connect LLMs to their own data — docs, PDFs, wikis, Notion, databases — and ship something that actually answers questions reliably.
What you’ll build: A minimal RAG bot in Python, then you’ll level it up with better parsing, indexing, retrieval, evaluation, and deployment patterns.
Table of Contents
- What is LlamaIndex and why use it
- Core mental model (Documents → Nodes → Index → Retriever → Query Engine)
- Installation & setup
- 10-minute Quickstart: your first RAG bot
- Chunking, metadata, and retrieval you can trust
- Parsing real-world PDFs with LlamaParse
- Going beyond Q&A: agents and tools
- Evaluate your RAG (don’t skip this)
- Observability, caching, and cost control
- Deploying with FastAPI
- Common pitfalls + fixes
- Where to go next
1) What is LlamaIndex and why use it
LlamaIndex is a data framework for LLM applications. It helps you ingest, transform, index, retrieve, and synthesize answers from your own data across many sources (local files, SaaS apps, databases), and many model/back-end choices (OpenAI, Anthropic, local models, Bedrock, Vertex, etc.). In practice, it gives you composable building blocks to implement Retrieval-Augmented Generation (RAG) and agents with less glue code and more guardrails.
2) Core mental model
Keep this pipeline in your head:
Documents (raw files, pages, DB rows)
→ Nodes (chunked text + metadata)
→ Index (e.g., a vector index)
→ Retriever (find relevant nodes)
→ Query Engine (synthesize final answer with citations)
LlamaIndex lets you swap pieces (different chunkers, vector stores, rerankers, LLMs) without rewriting your app. The canonical “hello world” is a VectorStoreIndex built from documents, queried via a Query Engine.
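To make the mapping concrete, here is the same pipeline spelled out piece by piece. This is a minimal sketch: it assumes a ./data folder of files, an OPENAI_API_KEY in your environment, and the packages from the next section; in practice the quickstart's one-liners do the same thing under the hood.
# the pipeline made explicit (sketch; uses default OpenAI models via Settings)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
docs = SimpleDirectoryReader("data").load_data()                          # Documents
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)   # Nodes
index = VectorStoreIndex(nodes)                                           # Index
retriever = index.as_retriever(similarity_top_k=5)                        # Retriever
query_engine = RetrieverQueryEngine.from_args(retriever)                  # Query Engine
print(query_engine.query("What do these documents cover?"))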
3) Installation & setup
You can install “batteries included” or pick packages à la carte.
# Core + OpenAI LLM + OpenAI embeddings
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
# Optional: local models, vector stores, loaders, etc.
# pip install llama-index-llms-ollama llama-index-embeddings-huggingface
# pip install llama-index-vector-stores-chroma chromadb
# pip install llama-index-readers-notion
Set your API keys via environment variables (e.g., OPENAI_API_KEY). See the official install docs for the latest package names and options.
4) 10-minute Quickstart: your first RAG bot
Below is a minimal example that:
- loads files from a folder,
- builds a vector index,
- answers questions with sources.
Folder layout
./data → put a few .md, .txt, or .pdf files here.
# app.py
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# 1) Configure model + embeddings (swap these for Anthropic, Ollama, etc.)
Settings.llm = OpenAI(model="gpt-4o-mini")  # any fast, low-cost model works fine here
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# 2) Load documents (many readers available via LlamaHub)
docs = SimpleDirectoryReader("data").load_data()
# 3) Build an index
index = VectorStoreIndex.from_documents(docs)
# 4) Create a query engine
qe = index.as_query_engine(response_mode="compact")  # "compact" packs retrieved chunks into fewer LLM calls; sources come back on the response object
# 5) Ask questions
q = "What are the main themes across these documents? Include citations."
ans = qe.query(q)
print(ans)
Run it:
python app.py
Try questions like “Summarize file X” or “Where do we mention GDPR?” The Query Engine pulls the relevant chunks, and the response object carries source_nodes you can surface as citations. For local-LLM quickstarts (Ollama, llama-cpp), see the LlamaIndex local tutorial.
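Prefer to stay fully local? The same quickstart works with swapped-in models. A sketch, assuming you installed the optional Ollama and HuggingFace packages from section 3 and have an Ollama server running locally (the model name llama3 is just an example; use whatever you have pulled):
# local-model variant (sketch): change the Settings, keep the rest of app.py as-is
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.llm = Ollama(model="llama3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")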
5) Chunking, metadata, and retrieval you can trust
Why it matters: Retrieval quality makes or breaks RAG. Three levers:
- Chunking strategy
— Default chunk sizes (e.g., 512–1024 tokens) are fine to start.
— Prefer semantic chunking for long, structured docs (split by headings, sections).
— Keep overlap (e.g., 20–50 tokens) so concepts aren’t cut in half.
- Metadata & filters
— Tag nodes with source, section, page, created_at, etc.
— Use metadata filters at query time (e.g., “only project=alpha” or “after 2024-01-01”).
- Retrieval upgrades
— Hybrid retrieval: combine keyword (BM25) + vector search.
— Reranking: use a reranker model to re-order the top-k results.
— Decomposition / query transforms: break complex questions into sub-queries, then merge the answers.
All of these are first-class citizens in LlamaIndex’s retriever/query engine APIs.
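To make the first two levers concrete, here is a sketch that wires a custom splitter into indexing and a metadata filter into querying. The project key and the question are illustrative; use whatever metadata your loaders attach.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
# chunking: smaller chunks with overlap so concepts aren't split mid-thought
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs, transformations=[splitter])
# metadata filter: only retrieve nodes tagged project=alpha
filters = MetadataFilters(filters=[ExactMatchFilter(key="project", value="alpha")])
qe = index.as_query_engine(similarity_top_k=5, filters=filters)
print(qe.query("What changed in project alpha last quarter?"))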
6) Parsing real-world PDFs with LlamaParse
Real PDFs include tables, footnotes, and messy layouts. LlamaParse (a LlamaIndex service) upgrades parsing so your chunks preserve structure — headings, lists, tables — improving retrieval quality.
from llama_index.core import VectorStoreIndex, Settings
from llama_parse import LlamaParse  # pip install llama-parse
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Requires a LlamaParse API key (set LLAMA_CLOUD_API_KEY); see docs for setup.
parser = LlamaParse(result_type="markdown")
docs = parser.load_data(["./data/annual_report.pdf"])
index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()
print(qe.query("Extract 3 key risks from the report and cite pages."))
Alternative loaders (free/open) exist too — check LlamaHub for hundreds of connectors (Notion, Slack export, Google Drive, S3, databases, etc.).
7) Going beyond Q&A: agents and tools
RAG is great, but sometimes you need actions: run a SQL query, call an API, or browse a knowledge base in steps. LlamaIndex lets you expose tools (including your Query Engine) to an agent that plans + acts.
Typical pattern:
- Make your Query Engine a tool (e.g., “CompanyDocsSearch”),
- Add domain tools (SQL, HTTP),
- Let the agent decide when to search vs. compute.
The “Starter Tutorial (Using OpenAI)” shows this progression from simple RAG to agents.
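A sketch of that pattern, reusing the qe query engine from the quickstart. The tool name CompanyDocsSearch and the toy multiply function are purely illustrative, and newer LlamaIndex releases also offer workflow-based agents; check the docs for the variant that matches your version.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b
docs_tool = QueryEngineTool.from_defaults(
    query_engine=qe,  # the query engine from the quickstart
    name="CompanyDocsSearch",
    description="Answers questions about the company's internal documents.",
)
calc_tool = FunctionTool.from_defaults(fn=multiply)
# the agent decides per step whether to search the docs or call the calculator
agent = ReActAgent.from_tools([docs_tool, calc_tool], verbose=True)
print(agent.chat("Find our 2024 headcount in the docs, then multiply it by 1.1."))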
8) Evaluate your RAG (don’t skip this)
You’ll never trust a RAG system without measurement. LlamaIndex ships evaluation utilities (faithfulness, answer relevancy, context recall). You can also integrate RAGAS — a community toolkit for QA datasets, metrics, and leaderboards. Use both: no single metric captures truth.
Tiny example:
# pseudo-ish example; see docs for exact APIs that fit your version
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
faith = FaithfulnessEvaluator()  # both evaluators use Settings.llm as the judge by default
rel = RelevancyEvaluator()
question = "What are our SOC 2 policies?"
response = qe.query(question)
print("faithfulness:", faith.evaluate_response(response=response).score)
print("relevancy:", rel.evaluate_response(query=question, response=response).score)
For end-to-end examples (including Bedrock), see AWS’s guide integrating LlamaIndex + RAGAS.
⚠️ Caveat: RAGAS is useful but imperfect; combine multiple metrics and manual checks.
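Scaling the snippet above to a small test set is just a loop over the same evaluators. The second question below is a placeholder; write 10–50 real questions from your domain.
test_questions = [
    "What are our SOC 2 policies?",
    "Who owns the incident response process?",  # placeholder; use your own
]
for q in test_questions:
    r = qe.query(q)
    result_faith = faith.evaluate_response(response=r)
    result_rel = rel.evaluate_response(query=q, response=r)
    print(q, "| faithfulness:", result_faith.passing, "| relevancy:", result_rel.passing)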
9) Observability, caching, and cost control
- Tracing & logs: Use LlamaIndex’s callback/tracing ecosystem or hosted solutions to see every step of retrieval and synthesis.
- Caching: Cache embeddings and intermediate retrieval to cut cost/latency.
- Persistent vector stores: Move from in-memory to Chroma, FAISS, pgvector, Pinecone, etc., for warm starts and scalability (a persistence sketch follows this list).
- Cold-start latency: The first query after a restart can be slower — pre-warm or lazy-load indexes. (This is a common complaint with some vector stores.)
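As a starting point, the built-in storage context already gives you persistence without an external vector database. A minimal sketch (swap in Chroma, pgvector, etc. via a vector_store once you outgrow local files):
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
PERSIST_DIR = "./storage"
# build once and persist, so restarts don't re-embed everything
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir=PERSIST_DIR)
# later (e.g., on server startup): reload instead of rebuilding
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)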
10) Deploying with FastAPI (production-ish skeleton)
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
app = FastAPI()
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs)
qe = index.as_query_engine()  # for token streaming, see the SSE variant after this snippet
class Query(BaseModel):
    q: str
@app.post("/ask")
def ask(body: Query):
    resp = qe.query(body.q)
    # Return both text and sources
    return {
        "answer": str(resp),
        "sources": [
            {"node_id": s.node.node_id, "source": s.node.metadata.get("file_name")}
            for s in getattr(resp, "source_nodes", [])
        ],
    }
You can put this behind a lightweight Next.js/SwiftUI/React client with SSE or websockets for streaming.
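If you do want token streaming, a rough SSE variant looks like the sketch below; the endpoint path and event format are up to you.
# streaming variant (sketch): same index, separate engine with streaming=True
from fastapi.responses import StreamingResponse
streaming_qe = index.as_query_engine(streaming=True)
@app.post("/ask/stream")
def ask_stream(body: Query):
    resp = streaming_qe.query(body.q)
    def event_stream():
        for token in resp.response_gen:  # yields text deltas as they arrive
            yield f"data: {token}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")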
11) Common pitfalls + fixes
Embedding/model mismatch
- Don’t change your embedding model without re-embedding your data.
- Record the embedding model name and version in your index metadata.
Chunk size & overlap
- Too small → fragments context; too big → noisy retrieval. Start ~512–1024 tokens with ~50 overlap, then evaluate.
Metadata is missing
- Add source, title, section, page. You’ll want it for filtering and readable citations.
Vector store cold starts
- Persist indexes and pre-warm. If the first query is slow on a given stack (e.g., Chroma), consider spinning the store up before traffic.
Bad PDFs
- Use a better parser (LlamaParse) for tables/headings. Evaluate before/after to verify the lift.
12) Where to go next
- Official Quickstarts (OpenAI & Local LLMs): A great way to see agents layered on top of RAG.
- Vector index & retrieval guides: Deep dives into retrievers, rerankers, and hybrid strategies.
- LlamaHub connectors: Loaders for Notion, Slack, Drive, S3, databases, and more.
- LlamaParse: Advanced PDF parsing for better chunks.
- Evaluation: LlamaIndex evaluators + RAGAS for dataset/metric workflows.
Final checklist before you ship
- Your data is correctly chunked and enriched with metadata
- Retrieval returns the right chunks (spot-check top-k)
- Answers cite sources; hallucinations are measured (faithfulness)
- You’ve run evaluation on a small test set (10–50 examples)
- Index is persisted; cold starts are acceptable
- Costs are bounded (streaming + caching)
- Logs/traces are on for debugging in prod
TL;DR
LlamaIndex abstracts the gnarly parts of RAG — loading, chunking, indexing, retrieving, and synthesis — so you can focus on UX and correctness. Start with a simple VectorStoreIndex, then layer in better parsing (LlamaParse), hybrid retrieval, reranking, evaluators, and production-grade deployment. You’ll go from “AI toy” to “AI tool” much faster.