Loading and searching large text files in RAM
Shared by abc

This is a copy of a chat between Claude and abc. Content may include unverified or unsafe content that do not represent the views of Anthropic. Shared snapshot may contain attachments and data not displayed here.

You said: give me a few easy ways to: load 1GB of .

give me a few easy ways to: load 1GB of .txt / .md files into RAM, and then search through (grep / ripgrep) those files. please explain everything in very detail.

May 20

Claude responded: Great topic.

Great topic. Before jumping into the techniques, it helps to understand why you'd want to load files into RAM in the first place — and what's actually happening under the hood when you do. Let me build up the picture from the ground up.


First: How your OS already caches files in RAM

Your operating system has something called the page cache (sometimes called the "buffer cache"). Every time you read a file from disk, the OS silently copies its contents into RAM and keeps them there. The next time you (or any program) reads that same file, the OS serves the data straight from RAM — no disk access at all. This happens automatically, without you doing anything special.

This means that if you run a search over 1GB of files once, those files are likely already in RAM for your second search. You can verify how much RAM the page cache is using on Linux with:

bash
free -h
# The "buff/cache" column shows page cache usage

And you can deliberately pre-warm the cache (force the OS to load everything into RAM before you search) like this:

bash
# "vmtouch" is a tool that explicitly loads files/dirs into the page cache
sudo apt install vmtouch   # or brew install vmtouch on macOS

vmtouch -t /path/to/your/files/   # -t = "touch" (load into RAM)
vmtouch -v /path/to/your/files/   # -v = show what's cached

After running vmtouch -t, all your files are in RAM. Subsequent grep or ripgrep runs will be purely CPU-bound — no waiting for disk I/O at all. This is often the easiest approach because you don't have to move files anywhere or change your workflow.


Approach 1: ripgrep — the right tool from the start

Before worrying about RAM tricks, it's worth knowing that ripgrep (rg) is already highly optimized for exactly this use case. Here's what makes it special:

Memory-mapped I/O (mmap): Instead of reading files sequentially into a buffer, rg can ask the OS to map the file directly into its virtual address space. The OS then loads only the pages of the file that are actually touched during the search — and those pages come from the page cache if already loaded, or from disk otherwise. You can enable this explicitly:

bash
rg --mmap "your search pattern" /path/to/files/

Parallelism: rg searches multiple files simultaneously across CPU cores, so even if some files need to be fetched from disk, other cores are busy searching files already in RAM.

Smart defaults: rg automatically skips binary files, respects .gitignore, and uses SIMD (CPU vector instructions) for pattern matching — making it 5–10× faster than grep on typical workloads.

A practical example searching all .txt and .md files recursively:

bash
# Search for "TODO" in all .txt and .md files, show filename and line number
rg "TODO" --type txt --type md /path/to/files/

# Or explicitly specify extensions
rg "TODO" -g "*.txt" -g "*.md" /path/to/files/

# Case-insensitive, show 2 lines of context around each match
rg -i "TODO" -C 2 -g "*.{txt,md}" /path/to/files/

Approach 2: A RAM disk with tmpfs (Linux) or /dev/shm

This is the most literal answer to "load files into RAM." A RAM disk is a filesystem that the OS creates entirely in physical memory — there is no backing disk at all. On Linux, /dev/shm is one such filesystem that exists by default (it's used for inter-process shared memory), and you can also create your own tmpfs mount.

bash
# Option A: Use /dev/shm — it already exists, just copy your files there
cp -r /path/to/your/files/ /dev/shm/myfiles/

# Option B: Create a dedicated tmpfs mount (gives you more control over the size)
sudo mkdir /mnt/ramfs
sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramfs   # reserve 2GB of RAM
cp -r /path/to/your/files/ /mnt/ramfs/

# Now search — every read is a RAM read, no disk involved
rg "your pattern" /dev/shm/myfiles/

The key mental model here: when you copy to /dev/shm, the only copy of those bytes lives in RAM. Compare this to the page cache approach (Approach above), where the authoritative copy is still on disk and RAM is just a cache. With tmpfs, if you reboot or the mount is removed, the data is gone. This makes it perfect for ephemeral search tasks ("I want to search this dataset quickly right now") but not for permanent storage.

A note on macOS: macOS doesn't have /dev/shm as a general-purpose RAM disk, but you can create one:

bash
# Create a 2GB RAM disk on macOS (size is in 512-byte blocks, so 2GB = 4194304)
diskutil erasevolume HFS+ "RAMDisk" `hdiutil attach -nomount ram://4194304`

cp -r /path/to/your/files/ /Volumes/RAMDisk/
rg "pattern" /Volumes/RAMDisk/

Approach 3: Concatenate everything into one in-memory buffer, then search

A different mental model entirely: instead of searching many files, what if you smash all the content into one giant string and search that? This is useful when you want to do programmatic processing beyond simple pattern matching, or when you're working in Python/another language.

Here's the shell version — simple but loses filename tracking:

bash
# Pipe all file contents into rg (searches the combined stream)
# Note: you lose which file a match came from with this approach
find /path/to/files -name "*.txt" -o -name "*.md" | xargs cat | grep "pattern"

A better approach in Python, which preserves filenames and loads everything into RAM explicitly:

python
import os
import re
from pathlib import Path

# Step 1: Load all files into RAM as a dictionary: {filepath: content}
corpus = {}
base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):          # rglob walks recursively
    if filepath.suffix in (".txt", ".md"):
        # read_text() loads the entire file into a Python string (in RAM)
        corpus[filepath] = filepath.read_text(encoding="utf-8", errors="ignore")

print(f"Loaded {len(corpus)} files into RAM")

# Step 2: Search — this is now a pure RAM operation, no disk I/O
pattern = re.compile("TODO", re.IGNORECASE)

for filepath, content in corpus.items():
    for line_num, line in enumerate(content.splitlines(), start=1):
        if pattern.search(line):
            print(f"{filepath}:{line_num}: {line.strip()}")

This approach is great when you want to do multiple searches over the same dataset (load once, search many times) or when you need the full power of Python's re module. The tradeoff is that Python strings carry overhead — 1GB of text files might use 2–3GB of RAM once Python wraps them in string objects.


Putting it all together: which approach should you use?

Think of it like this: if you're doing one-off searches from the command line, start with rg --mmap and let the OS page cache do its job — it's the least effort and usually fast enough. If you need guaranteed zero disk I/O (e.g., you're benchmarking, or your disk is very slow), copy to /dev/shm and search from there. If you're writing a program that will search the same corpus many times in a single session, the Python dictionary approach wins because you load once and all subsequent searches cost nothing in I/O.

A useful mental exercise: run rg twice on your 1GB dataset and time both runs with time rg "pattern" /path/. You'll likely find the second run is significantly faster — that's the page cache at work, and it costs you absolutely nothing.

You said: please give me a few more easy techniques / ways / systems .

please give me a few more easy techniques / ways / systems . also, please do explain everything in very detail.

May 20

Claude responded: Great, let's go deeper.

Great, let's go deeper. The previous response covered the foundational ideas — the OS page cache, RAM disks, ripgrep, and loading files into Python dictionaries. Now let's explore techniques that are a bit more nuanced, including parallelizing grep yourself, building a real in-memory search index, using SQLite's full-text search engine entirely in RAM, memory-mapping files with Python's mmap module, and preventing the OS from ever swapping your data to disk. Each of these teaches you something fundamentally different about how computers handle memory and search.


Technique 1: GNU Parallel + grep — building your own parallel search engine

The standard grep command is single-threaded. It processes one file at a time, which means if you have a modern CPU with 8 or 16 cores, 15 of those cores are sitting idle while grep dutifully works through your files one by one. ripgrep solves this automatically, but understanding how to parallelize grep yourself is genuinely illuminating because it teaches you how parallelism works in Unix in general.

The key tool here is GNU parallel, which takes a list of inputs and fans them out across multiple CPU cores simultaneously.

bash
# Install it first
sudo apt install parallel      # Linux
brew install parallel          # macOS

# The basic idea: find all your files, then process N files simultaneously
# The {//} is a placeholder that parallel replaces with each filename
find /path/to/files -name "*.txt" -o -name "*.md" | \
  parallel -j$(nproc) grep -l "your pattern" {}

# Breaking this down:
# - find: produces a list of file paths, one per line
# - parallel: reads that list and spawns jobs
# - -j$(nproc): use as many jobs as you have CPU cores (nproc prints the core count)
# - grep -l: the -l flag means "only print the filename if there's a match, not the line"
# - {}: parallel replaces this with the current filename from the input list

But there's an even smarter way to use parallel with grep — you can give each worker a batch of files instead of one file at a time, which reduces the overhead of spawning many processes:

bash
# --block 10M tells parallel to give each worker ~10MB worth of filenames at once
# --pipe feeds data through stdin rather than as arguments
# This is extremely fast for large corpora
find /path/to/files \( -name "*.txt" -o -name "*.md" \) -print0 | \
  parallel -0 -j$(nproc) --pipe grep -rn "your pattern"

Here's the mental model that makes this click: imagine you have a stack of 1,000 books to search through. The single-threaded approach is one person working through the stack alone. parallel is like hiring 16 workers simultaneously, splitting the stack evenly, and having everyone search their pile at the same time. The total wall-clock time drops by roughly a factor of 16 — not perfectly, because there's some coordination overhead, but very close.

The reason this pairs well with the RAM-loading strategies from before is that once your files are in the page cache (or in /dev/shm), there's no disk I/O bottleneck — so throwing more CPU cores at the problem actually gets you the full speedup, rather than everyone waiting on disk.


Technique 2: SQLite in-memory FTS5 — a real search index in RAM

This is one of the most underused techniques for searching text corpora. SQLite is normally thought of as a disk-based database, but it has two powerful features that change the picture entirely: it can run entirely in RAM (:memory: mode), and it has a built-in full-text search engine called FTS5 (Full Text Search version 5).

Here's why this matters: grep and rg are linear scanners — every search has to read through every byte of every file. An FTS5 index is an inverted index, which means it pre-computes a map of "which files contain which words," so a search doesn't read the files at all — it just looks up the word in the index. For repeated searches over the same dataset, this is transformatively faster.

python
import sqlite3
import os
from pathlib import Path

# Step 1: Create a database entirely in RAM
# The magic string ":memory:" tells SQLite to never touch the disk
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Step 2: Create an FTS5 virtual table
# FTS5 is SQLite's full-text search engine — it builds an inverted index automatically
# The "content" and "filepath" are columns; you can query either
cursor.execute("""
    CREATE VIRTUAL TABLE documents USING fts5(
        filepath,    -- stores the file path so we know where a match came from
        content,     -- the actual text content of the file
        tokenize = 'porter ascii'  -- "porter" applies stemming: "running" matches "run"
    )
""")

# Step 3: Load all your .txt and .md files into the index
base_path = Path("/path/to/your/files")
files_loaded = 0

for filepath in base_path.rglob("*"):
    if filepath.suffix in (".txt", ".md"):
        try:
            text = filepath.read_text(encoding="utf-8", errors="ignore")
            # Each row in the FTS table is one file's worth of content
            cursor.execute(
                "INSERT INTO documents(filepath, content) VALUES (?, ?)",
                (str(filepath), text)
            )
            files_loaded += 1
        except Exception as e:
            print(f"Skipping {filepath}: {e}")

conn.commit()
print(f"Indexed {files_loaded} files into RAM")

# Step 4: Search — this is now O(log N) instead of O(N * file_size)
# FTS5 uses a BM25 ranking algorithm by default (same family as Google's early ranking)
search_term = "database optimization"

results = cursor.execute("""
    SELECT filepath, snippet(documents, 1, '[', ']', '...', 20)
    FROM documents
    WHERE documents MATCH ?
    ORDER BY rank   -- rank sorts by relevance, best matches first
""", (search_term,)).fetchall()

for filepath, snippet in results:
    print(f"\n{filepath}")
    print(f"  ...{snippet}...")

The snippet() function is particularly useful — it extracts the relevant portion of the text around the match, like you'd see in a web search result. The ORDER BY rank line uses BM25 relevance scoring, meaning results where your term appears frequently and in a short document rank higher. This is a genuinely sophisticated search system, running entirely in a few hundred lines of Python with no external dependencies.

The tradeoff to understand: building the index takes time (roughly 3–5 seconds per GB of text), so this technique pays off when you want to do many searches over the same dataset in one session. Load once, search dozens or hundreds of times — each search takes milliseconds regardless of corpus size.


Technique 3: Python's mmap module — memory mapping with surgical precision

In the previous response, the Python dictionary approach loaded every file fully into Python string objects. mmap is a different and more elegant approach: it asks the OS to make a file look like a block of memory, but the OS only loads the parts you actually touch. This is particularly powerful when your files are large but you often only need to search a portion of them.

The key insight is that mmap gives you a file that behaves exactly like a Python bytes object — you can slice it, search it with find(), or apply regex patterns to it — but underneath, the OS is doing lazy loading, page by page.

python
import mmap
import re
import os
from pathlib import Path

def search_file_with_mmap(filepath: Path, pattern: bytes) -> list[tuple[int, bytes]]:
    """
    Search a file using memory-mapping.
    Notice we work with bytes (b"pattern"), not strings — mmap operates at the byte level.
    Returns a list of (byte_offset, matched_line) tuples.
    """
    matches = []

    with open(filepath, "rb") as f:           # open in binary mode
        file_size = os.path.getsize(filepath)

        if file_size == 0:
            return []                          # mmap cannot map a zero-byte file

        # mmap.mmap() creates the memory-mapped view
        # mmap.ACCESS_READ means we won't write to it (safer, allows sharing between processes)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # mm now behaves like a bytes object, but is backed by the OS page cache
            # mm.find() scans for the pattern — the OS loads pages on demand as it scans
            
            # Use re.finditer directly on the mmap object — this works!
            for match in re.finditer(pattern, mm):
                offset = match.start()
                
                # Find the start of the line containing the match
                line_start = mm.rfind(b"\n", 0, offset) + 1
                # Find the end of that line
                line_end = mm.find(b"\n", offset)
                if line_end == -1:
                    line_end = file_size
                
                line = mm[line_start:line_end]
                matches.append((offset, line))

    return matches


# Search all files in a directory
base_path = Path("/path/to/your/files")
# Compile the regex once (outside the loop) — this is much faster than compiling per-file
pattern = re.compile(b"TODO|FIXME|HACK", re.IGNORECASE)

for filepath in base_path.rglob("*"):
    if filepath.suffix in (".txt", ".md"):
        results = search_file_with_mmap(filepath, pattern)
        for offset, line in results:
            print(f"{filepath}:{offset}: {line.decode('utf-8', errors='replace')}")

The reason mmap is interesting to understand is that it blurs the line between "reading a file" and "using RAM." From your program's perspective, the file is just a big byte array in your address space. From the OS's perspective, it's a region of virtual memory backed by the page cache. When your regex engine scans through mm, the CPU generates page faults for each 4KB page it hasn't touched yet — the OS intercepts these faults, loads the relevant page from disk (or serves it from the page cache if it's already there), and the scan continues seamlessly. Your code never explicitly calls read() at all.

The practical advantage over the dictionary approach: if you have 1GB of files but your search pattern only matches in the first 10% of each file, mmap only ever loads that 10% into physical RAM. The dictionary approach always loads 100% of every file upfront.


Technique 4: mlockall — preventing the OS from swapping your data to disk

This is a technique about guaranteeing that your files stay in RAM once loaded, rather than loading them faster. Here's the problem it solves: even if you've loaded everything into the page cache, the OS is free to evict those pages at any moment if it needs RAM for something else — another program, a system update running in the background, etc. When this happens, your next search has to load from disk again, destroying your performance.

mlock and mlockall are system calls that tell the OS "these pages must never be swapped to disk or evicted from RAM." Think of it as putting a "do not disturb" sign on a hotel room — the OS cannot reassign that RAM until you explicitly unlock it.

In Python, you can use ctypes to call these system calls directly:

python
import ctypes
import ctypes.util

# Load the C standard library — this is where mlock/mlockall live
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# MCL_CURRENT = lock all pages currently in this process's address space
# MCL_FUTURE  = lock all pages that will be mapped in the future too
MCL_CURRENT = 1
MCL_FUTURE  = 2

def lock_all_memory():
    """
    Tells the OS to keep ALL of this process's memory in physical RAM.
    Requires root (sudo) or CAP_IPC_LOCK capability on Linux.
    """
    result = libc.mlockall(MCL_CURRENT | MCL_FUTURE)
    if result != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, f"mlockall failed: {os.strerror(errno)}")
    print("Memory locked — OS will not swap this process to disk")

def unlock_all_memory():
    libc.munlockall()
    print("Memory unlocked")


# Usage pattern: lock, then load your data, then search
lock_all_memory()

# Now load your corpus — these pages are guaranteed to stay in RAM
corpus = {}
for filepath in Path("/path/to/files").rglob("*.txt"):
    corpus[filepath] = filepath.read_text(encoding="utf-8", errors="ignore")

# All your searches here are guaranteed to be pure RAM operations
# ... your search code ...

unlock_all_memory()

One important caveat: mlockall requires elevated privileges on most systems because locking too much RAM can starve other processes. You might need to run with sudo, or increase the RLIMIT_MEMLOCK limit for your user:

bash
# Check your current limit (in bytes, -1 means unlimited)
ulimit -l

# Temporarily raise the limit for your session (64MB example)
ulimit -l 65536

# Or permanently edit /etc/security/limits.conf:
# yourusername hard memlock unlimited

A gentler alternative that doesn't require root is to use mlock (lock specific allocations rather than all memory) via a C extension — but for most practical purposes on a development machine, simply running your Python script with sudo is fine.


Technique 5: Building an inverted index yourself — understanding search at its core

All professional search tools (Elasticsearch, Solr, SQLite FTS5, even Google) are built on the same fundamental data structure: the inverted index. Building one yourself in Python takes about 30 lines and gives you a mental model that makes every other search technique suddenly make more sense.

The core idea: instead of "file → words" (which is how files are stored), an inverted index stores "word → list of files containing that word." Searching becomes a dictionary lookup, which is O(1) — constant time regardless of how many files you have.

python
import re
from pathlib import Path
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    """
    Break text into searchable tokens (words).
    This is a naive tokenizer — production tokenizers do much more
    (handle Unicode, stemming, stop words, n-grams, etc.)
    """
    # \b\w+\b matches whole words; lower() makes search case-insensitive
    return re.findall(r'\b\w+\b', text.lower())

def build_inverted_index(directory: str) -> dict[str, set[str]]:
    """
    Build an inverted index from all .txt and .md files in a directory.
    
    The result looks like:
    {
        "python":   {"/files/readme.md", "/files/tutorial.txt", ...},
        "database": {"/files/notes.txt", ...},
        "TODO":     {"/files/main.py", ...},
        ...
    }
    """
    # defaultdict(set) means: if a key doesn't exist yet, create an empty set for it
    # We use a set (not a list) so each file appears at most once per word
    index = defaultdict(set)
    
    base_path = Path(directory)
    
    for filepath in base_path.rglob("*"):
        if filepath.suffix not in (".txt", ".md"):
            continue
        
        try:
            text = filepath.read_text(encoding="utf-8", errors="ignore")
        except Exception:
            continue
        
        for token in tokenize(text):
            # The critical line: map each word back to the file containing it
            index[token].add(str(filepath))
    
    return index


def search(index: dict, query: str) -> set[str]:
    """
    Search the index. Supports multi-word AND queries:
    "python database" returns only files containing BOTH words.
    
    This is how search engines handle multi-word queries —
    they intersect the posting lists for each term.
    """
    terms = tokenize(query)
    
    if not terms:
        return set()
    
    # Start with the files matching the first term
    results = index.get(terms[0], set()).copy()
    
    # Intersect with files matching each subsequent term
    # This implements AND logic: only files with ALL terms survive
    for term in terms[1:]:
        results &= index.get(term, set())   # &= is set intersection
    
    return results


# --- Main usage ---
print("Building index...")
index = build_inverted_index("/path/to/your/files")
print(f"Index contains {len(index)} unique tokens")

# Now search is instant — no file reading at all
results = search(index, "database optimization")
for filepath in sorted(results):
    print(filepath)

# You can run hundreds of searches with no additional I/O cost
results2 = search(index, "TODO error handling")
results3 = search(index, "configuration file path")

This is a beautiful exercise because it makes the cost/benefit of indexing completely tangible. Building the index is expensive (you read every file once). But each search is O(1) — looking up a word in a Python dictionary takes the same amount of time whether the dictionary has 1,000 entries or 10 million. Once built, you could search this index a million times and it would cost nothing in either I/O or compute.

The SQLite FTS5 technique from earlier is essentially a highly optimized, persistent version of exactly this data structure — with ranking algorithms, phrase matching, and prefix search added on top.


Technique 6: ugrep — a modern grep with a built-in index

ugrep is a relatively new tool (think of it as a peer to ripgrep) that has one feature neither grep nor ripgrep has: it can build and use a persistent on-disk index that it then loads into RAM at query time. This means the first search is slow (it builds the index), but all subsequent searches — even across different sessions, because the index persists on disk — are dramatically faster.

bash
# Install ugrep
brew install ugrep        # macOS
sudo apt install ugrep    # Linux (or build from source for latest version)

# Step 1: Build the index (do this once)
# This creates a .ugrep/ directory inside your search path
ugrep --index /path/to/your/files/

# Step 2: Search using the index — ugrep loads the index into RAM and queries it
# The --count flag shows match counts per file
ugrep -r --index "your pattern" /path/to/your/files/

# Rebuild the index after files change
ugrep --index --check /path/to/your/files/   # only re-indexes modified files

The conceptual difference from the other techniques: instead of loading the raw file contents into RAM, ugrep's index loads a compact summary of those contents. The index is much smaller than the original files (often 5–10% of the original size), so the entire index for 1GB of text might only need 50–100MB of RAM. This makes it practical to keep the index hot in the page cache even when working with large corpora.


Putting the techniques in perspective

The right mental framework for choosing between these approaches is to ask two questions: how many times will I search this dataset? and do I need exact line matches or relevance ranking?

If you're doing a single one-off search, rg --mmap with a warmed page cache is unbeatable — zero setup, zero overhead. If you're doing repeated exact searches in a scripting context, GNU parallel + grep exploits all your CPU cores with minimal setup. If you're building something interactive that needs relevance ranking (best results first), the SQLite FTS5 approach gives you a proper search engine for free. If you need guaranteed RAM residency with no possibility of swapping, mlockall is the right layer to add on top of any other technique. And if you want to truly understand what all the tools are doing under the hood, building the inverted index yourself is the most educational 30 minutes you can spend.

You said: okay great.

okay great. again, please give me a few more easy techniques / ways / systems . also, please do explain everything in very detail.

May 20

Claude responded: Excellent — let's go into four more techniques that each teach you something genuinely new about the problem.

Excellent — let's go into four more techniques that each teach you something genuinely new about the problem. The previous responses covered the OS page cache, RAM disks, parallelism, SQLite FTS5, memory-mapping, and inverted indexes. What's left to explore is how to fit more data into RAM by compressing it, how to use a modern analytical database engine to search text, how to use a fundamentally different data structure called a suffix array that handles patterns that inverted indexes struggle with, and how to tell the OS in advance what you're about to need so it can fetch it for you in the background while your program does other work.


Technique 1: A compressed in-memory corpus — trading CPU cycles for RAM space

Here's a problem that none of the previous techniques address: what if your dataset is too big to fit in RAM uncompressed? The standard answer is "use disk," but there's a middle path that's often overlooked — you can compress the files in RAM, store the compressed bytes, and then decompress each file on-the-fly at search time. Modern compression algorithms like LZ4 and Zstandard (zstd) are so fast that decompression speed exceeds typical RAM bandwidth, meaning you can often search a compressed corpus faster than an uncompressed one, because you're moving fewer bytes around in memory.

To understand why this works, think about what the CPU does during a search. It has to move data from RAM into its cache registers before it can inspect bytes. RAM bandwidth (the speed at which data travels from the memory chips into the CPU) is a real bottleneck on modern hardware — typically around 40–50 GB/s on a modern desktop. A compressor like LZ4 can achieve 4–5× compression on English text, meaning 1GB of files becomes ~250MB in RAM. Moving 250MB through the memory bus takes one-fifth as long as moving 1GB, and LZ4's decompressor runs at several GB/s per core, so the decompression cost is smaller than the I/O savings. This is the same reason CPUs have L1/L2/L3 caches — smaller + faster consistently beats larger + slower.

python
import lz4.frame          # pip install lz4
import re
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

# --- Step 1: Build a compressed corpus in RAM ---
# Instead of storing the raw text string, we store its compressed bytes.
# The dictionary maps filepath -> compressed_bytes (NOT raw text).

compressed_corpus = {}
base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        raw_bytes = filepath.read_bytes()          # read the file as raw bytes
        # lz4.frame.compress() returns compressed bytes
        # compression_level=0 is the fastest (LZ4 "fast" mode) — still ~3-4x compression on text
        compressed = lz4.frame.compress(raw_bytes, compression_level=0)
        compressed_corpus[str(filepath)] = compressed
    except Exception as e:
        print(f"Skipping {filepath}: {e}")

# Let's see how much RAM we're saving
original_size  = sum(len(lz4.frame.decompress(v)) for v in compressed_corpus.values())
compressed_size = sum(len(v) for v in compressed_corpus.values())
print(f"Original:   {original_size  / 1e6:.1f} MB")
print(f"Compressed: {compressed_size / 1e6:.1f} MB")
print(f"Ratio:      {original_size / compressed_size:.1f}x")


# --- Step 2: Search --- 
# For each file, decompress on-the-fly, then run your pattern over the raw text.
# The decompression happens in microseconds per file.

pattern = re.compile(rb"TODO|FIXME|HACK", re.IGNORECASE)   # note: bytes pattern (rb"...")

def search_compressed(filepath_str: str, compressed_data: bytes):
    """Decompress one file and search it. Returns list of matching lines."""
    raw = lz4.frame.decompress(compressed_data)   # back to original bytes, instantly
    matches = []
    for line_num, line in enumerate(raw.splitlines(), start=1):
        if pattern.search(line):
            matches.append((line_num, line.decode("utf-8", errors="replace")))
    return matches

# Use a thread pool to decompress and search files in parallel
# (Decompression releases the GIL, so threads actually run concurrently here)
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {
        pool.submit(search_compressed, fp, data): fp
        for fp, data in compressed_corpus.items()
    }
    for future, filepath in futures.items():
        results = future.result()
        for line_num, line in results:
            print(f"{filepath}:{line_num}: {line.strip()}")

One thing worth thinking about: this technique essentially moves the bottleneck from "RAM size" to "CPU speed," which is almost always the right trade on modern hardware. CPUs have gotten dramatically faster over the decades while RAM capacity growth has been more modest. If you want to go even further, Zstandard (zstd) achieves better compression ratios than LZ4 at a modest speed cost — and it has a "dictionary training" feature where you train it on a sample of your files first, and it compresses subsequent files more aggressively because it recognizes your domain's vocabulary. For a corpus of English prose or source code, this can push compression ratios to 8–10×.


Technique 2: DuckDB in-memory — an analytical SQL engine for text search

DuckDB is one of the most exciting database tools of the last few years. It's an embedded analytical database (like SQLite, but designed for analytical queries over large datasets rather than transactional workloads) that runs entirely in-process — no server, no setup. It also has a genuinely powerful full-text search extension, vectorized execution (it processes thousands of rows at a time using SIMD CPU instructions), and it runs entirely in RAM when you point it at :memory:.

The reason DuckDB is interesting as a search tool — rather than just using SQLite FTS5 from the previous response — is that it's designed for a different kind of query. SQLite FTS5 is great at "find files containing these keywords." DuckDB is great at answering questions like "find all lines containing this pattern, then count how many times each unique word appears in those lines, grouped by file extension." It treats your text corpus like a structured dataset and lets you run complex analytics over it using standard SQL, all in RAM.

python
import duckdb           # pip install duckdb
from pathlib import Path

# Step 1: Create an in-memory DuckDB connection
# The empty string "" means in-memory; ":memory:" also works
con = duckdb.connect(database=":memory:")

# Step 2: Load all files into a DuckDB table
# We'll load line by line so we can do per-line searches later
# (as opposed to SQLite FTS5 where we stored one row per file)

# First, build a list of (filepath, line_number, line_text) tuples
rows = []
base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        for line_num, line in enumerate(
            filepath.read_text(encoding="utf-8", errors="ignore").splitlines(),
            start=1
        ):
            rows.append((str(filepath), line_num, line))
    except Exception:
        continue

# DuckDB can ingest a Python list of tuples extremely quickly
# executemany is fast but for really large datasets, use duckdb's read_csv or Arrow
con.execute("""
    CREATE TABLE corpus (
        filepath    VARCHAR,
        line_number INTEGER,
        content     VARCHAR
    )
""")
con.executemany("INSERT INTO corpus VALUES (?, ?, ?)", rows)
print(f"Loaded {len(rows):,} lines into DuckDB")

# Install and load the full-text search extension
con.execute("INSTALL fts")
con.execute("LOAD fts")

# Build an FTS index on the content column
# This creates an inverted index under the hood, stored in RAM
con.execute("PRAGMA create_fts_index('corpus', 'rowid', 'content')")

Now the interesting part — the kinds of queries you can run that go beyond what grep or even SQLite FTS can easily express:

python
# Basic FTS search — returns rows where content matches the query
results = con.execute("""
    SELECT filepath, line_number, content,
           fts_main_corpus.match_bm25(rowid, 'database optimization') AS score
    FROM corpus
    WHERE score IS NOT NULL
    ORDER BY score DESC
    LIMIT 20
""").fetchall()

for filepath, line_num, content, score in results:
    print(f"[score={score:.3f}] {filepath}:{line_num}  {content.strip()}")


# Here's where DuckDB really shines over grep — analytical queries:

# "Which files have the most TODO/FIXME comments?"
con.execute("""
    SELECT filepath,
           COUNT(*) AS annotation_count
    FROM corpus
    WHERE regexp_matches(content, '(TODO|FIXME|HACK|XXX)')
    GROUP BY filepath
    ORDER BY annotation_count DESC
    LIMIT 10
""").show()

# "What's the average line length per file? Which files are most dense?"
con.execute("""
    SELECT filepath,
           COUNT(*)            AS total_lines,
           AVG(len(content))   AS avg_line_length,
           MAX(len(content))   AS longest_line
    FROM corpus
    GROUP BY filepath
    ORDER BY avg_line_length DESC
""").show()

# "Find all lines matching a pattern, then extract the matched portion"
con.execute("""
    SELECT filepath, line_number,
           regexp_extract(content, '[A-Z_]{3,}', 0) AS constant_name
    FROM corpus
    WHERE regexp_matches(content, '[A-Z_]{3,}')
    LIMIT 20
""").show()

The mental model to carry away from DuckDB is that it turns your text corpus into a queryable dataset rather than just a collection of files to scan. Once your data is in the table, you're not just searching — you're doing analytics. That distinction becomes very useful when you have questions like "show me the distribution of line lengths" or "which 10 files changed the most between two indexing runs" — questions that grep simply has no way to answer.


Technique 3: Suffix arrays — the data structure that handles patterns inverted indexes can't

All the indexing techniques in the previous responses — SQLite FTS5, the hand-built inverted index, DuckDB FTS — share one fundamental limitation: they're all word-based. They tokenize your text into words at index-build time, and they can only find things that align with word boundaries. If you want to search for func_ (a prefix, not a whole word), or tion (a suffix), or ab.*cd (an arbitrary regex pattern), word-based indexes can't help you — they fall back to linear scanning every file.

Suffix arrays solve this differently. A suffix array is a sorted list of every possible suffix of your text. To understand what a suffix is: given the string "hello", its suffixes are "hello", "ello", "llo", "lo", and "o". If you sort all suffixes of a large text, you get a data structure where you can binary-search for any substring in O(log N) time, regardless of word boundaries.

Let's build intuition with a tiny example before looking at code. Imagine your entire corpus is the string "the cat sat on the mat". Its suffixes, sorted alphabetically, look something like:

" cat sat on the mat"
" mat"
" on the mat"
" sat on the mat"
" the cat sat on the mat"
" the mat"
"at"
"at on the mat"
"at sat on the mat"
"cat sat on the mat"
...
"the cat sat on the mat"
"the mat"

Now if you want to find all occurrences of "the", you binary-search this sorted list for the first entry starting with "the" and the last entry starting with "the". Everything in between is a match. This is O(log N) — blazing fast — and it works for any substring, not just whole words.

python
# Building a suffix array in pure Python is educational but slow for large texts.
# For real use, we use the 'pysais' or 'sais' library which uses the SA-IS algorithm
# (Suffix Array Induced Sorting) — O(N) construction time.
# pip install pysais
import pysais
import bisect
from pathlib import Path

# Step 1: Concatenate ALL files into one giant byte string
# We use a separator character that never appears in real text (byte 0x00, the null byte)
# to mark the boundary between files, so a match can't span two files.
SEPARATOR = b"\x00"
file_boundaries = {}   # maps (start_offset, end_offset) -> filepath
all_bytes = bytearray()

for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        content = filepath.read_bytes()
        start = len(all_bytes)
        all_bytes.extend(content)
        all_bytes.extend(SEPARATOR)
        file_boundaries[(start, len(all_bytes) - 1)] = str(filepath)
    except Exception:
        continue

print(f"Total corpus size: {len(all_bytes) / 1e6:.1f} MB")

# Step 2: Build the suffix array
# This is O(N) time and O(N) space using the SA-IS algorithm
# For 1GB of text, this takes a few seconds and uses ~5GB RAM (the array stores int64 indices)
# NOTE: suffix arrays have high memory overhead — roughly 5x the text size
print("Building suffix array (this takes a moment for large corpora)...")
sa = pysais.sais(bytes(all_bytes))   # sa is a list of integer offsets into all_bytes
print(f"Suffix array built: {len(sa)} entries")

# Step 3: Search using binary search
# bisect.bisect_left finds the first position in the sorted suffix array
# where the suffix starts with our pattern

def find_all_occurrences(pattern: bytes, text: bytes, suffix_array: list) -> list[int]:
    """
    Find all byte offsets where `pattern` appears in `text`.
    Uses binary search on the suffix array — O(P * log N) where P = pattern length.
    """
    n = len(text)
    p = len(pattern)

    # Binary search for the leftmost suffix that starts with `pattern`
    lo = bisect.bisect_left(
        suffix_array, 0,
        key=lambda i: text[i:i+p] >= pattern
    )
    # Binary search for the rightmost such suffix
    hi = bisect.bisect_left(
        suffix_array, 0,
        key=lambda i: text[i:i+p] > pattern
    )

    # All entries between lo and hi are matches
    return [suffix_array[i] for i in range(lo, hi)]


# Search for an arbitrary substring — no word boundaries required!
pattern = b"func_parse"
offsets = find_all_occurrences(pattern, bytes(all_bytes), sa)

for offset in offsets:
    # Find which file this offset belongs to
    for (start, end), filepath in file_boundaries.items():
        if start <= offset < end:
            # Extract the surrounding line for context
            line_start = bytes(all_bytes).rfind(b"\n", 0, offset) + 1
            line_end   = bytes(all_bytes).find(b"\n", offset)
            line = all_bytes[line_start:line_end].decode("utf-8", errors="replace")
            local_line_num = bytes(all_bytes).count(b"\n", start, offset) + 1
            print(f"{filepath}:{local_line_num}: {line.strip()}")
            break

The tradeoff to understand with suffix arrays is memory. A suffix array stores one integer (typically 8 bytes) per byte of text, so a 1GB corpus needs an 8GB suffix array. This makes them impractical for very large corpora but excellent for small-to-medium ones (tens to hundreds of MB). The construction algorithm also temporarily needs several times the text size in working memory. In exchange, you get the ability to find any substring — including partial words, partial identifiers, arbitrary byte sequences — in logarithmic time. This is the structure underlying tools like agrep and the genome sequence search tools used in bioinformatics, where "word boundaries" have no meaning and every possible substring matters.


Technique 4: posix_fadvise and async prefetching — telling the OS what you're about to need

Every technique so far has been about either loading data into RAM yourself or searching data that's already there. This technique is different in a subtle and important way: it's about communicating your intentions to the OS kernel so that it can load data into the page cache asynchronously, in the background, while your program is busy doing something else.

The system call is called posix_fadvise, and it lets you say things like "I'm about to read this entire file sequentially" or "I'll need this file's contents soon — please start fetching it from disk now" or even "I'm done with this file, you can evict its pages from the cache." The OS uses these hints to schedule I/O in a smarter way than it would if it had to react to your reads as they happen.

The key constant to understand is POSIX_FADV_WILLNEED, which means "start reading this file into the page cache right now, don't wait for me to ask for it." This is called an asynchronous prefetch — the disk I/O happens in the background while your program continues executing. When your program eventually reads the file, the data is already in RAM and the read completes instantly.

python
import os
import ctypes
import ctypes.util
from pathlib import Path
import time

# Load libc to access posix_fadvise directly
# (Python's os module doesn't expose fadvise natively)
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# These constants are defined in <fcntl.h> on Linux
POSIX_FADV_NORMAL     = 0   # no advice — default behavior
POSIX_FADV_RANDOM     = 1   # we'll access this file in random order
POSIX_FADV_SEQUENTIAL = 2   # we'll read this file from start to finish
POSIX_FADV_WILLNEED   = 3   # please prefetch this file into the page cache NOW
POSIX_FADV_DONTNEED   = 4   # you can evict this file's pages, we're done with it
POSIX_FADV_NOREUSE    = 5   # we'll only read each page once (stream processing)

def prefetch_file(filepath: str):
    """
    Tell the OS to start loading `filepath` into the page cache immediately,
    without blocking. Returns right away — prefetch happens in the background.
    """
    fd = os.open(filepath, os.O_RDONLY)
    try:
        file_size = os.path.getsize(filepath)
        # posix_fadvise(fd, offset, length, advice)
        # offset=0, length=0 means "the whole file"
        result = libc.posix_fadvise(
            ctypes.c_int(fd),
            ctypes.c_long(0),      # offset: start from byte 0
            ctypes.c_long(0),      # length: 0 means "entire file"
            ctypes.c_int(POSIX_FADV_WILLNEED)
        )
        if result != 0:
            print(f"fadvise failed for {filepath}")
    finally:
        os.close(fd)

def mark_file_done(filepath: str):
    """
    Tell the OS we're finished with this file.
    The OS can now reclaim those pages for other processes.
    This is good hygiene in long-running search scripts.
    """
    fd = os.open(filepath, os.O_RDONLY)
    try:
        libc.posix_fadvise(
            ctypes.c_int(fd),
            ctypes.c_long(0),
            ctypes.c_long(0),
            ctypes.c_int(POSIX_FADV_DONTNEED)
        )
    finally:
        os.close(fd)


# --- The key usage pattern: prefetch batch N+1 while processing batch N ---
# This is called "double buffering" or "prefetch pipelining"

import re
from itertools import islice

def chunked(iterable, size):
    """Yield successive chunks of `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

all_files = [
    str(fp) for fp in Path("/path/to/your/files").rglob("*")
    if fp.suffix in (".txt", ".md")
]

pattern = re.compile(r"TODO|FIXME", re.IGNORECASE)
BATCH_SIZE = 50   # process 50 files at a time

file_chunks = list(chunked(all_files, BATCH_SIZE))

for i, current_batch in enumerate(file_chunks):

    # Prefetch the NEXT batch while we process the current one
    # The OS starts reading next_batch's files from disk right now,
    # in the background — by the time we get to that batch, they'll be in RAM.
    if i + 1 < len(file_chunks):
        next_batch = file_chunks[i + 1]
        for fp in next_batch:
            prefetch_file(fp)          # non-blocking — returns immediately

    # Process the current batch (these reads hit the page cache if we prefetched earlier)
    for filepath in current_batch:
        try:
            text = Path(filepath).read_text(encoding="utf-8", errors="ignore")
            for line_num, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    print(f"{filepath}:{line_num}: {line.strip()}")
        except Exception:
            pass

    # Mark current batch as done — let the OS reclaim its pages
    for filepath in current_batch:
        mark_file_done(filepath)

The mental image that makes this click is a conveyor belt in a factory. Without prefetching, a worker waits for a part to arrive, processes it, waits for the next part, and so on — lots of idle waiting. With prefetching, the conveyor belt is always moving: while the worker processes part N, parts N+1 through N+5 are already traveling toward them. The worker is never idle because the pipeline is always full.

On modern NVMe SSDs this matters less than it did with spinning hard disks (SSDs have microsecond latency rather than millisecond latency), but it matters enormously for network-attached storage, slower SSDs under load, or any situation where your files are on a device that benefits from parallel I/O requests. The OS can issue multiple read requests to the disk simultaneously only if it knows they're coming — POSIX_FADV_WILLNEED is how you tell it.

There's also a Python-level way to do this without calling posix_fadvise directly, using asyncio with aiofiles:

python
import asyncio
import aiofiles      # pip install aiofiles
import re

pattern = re.compile(r"TODO|FIXME", re.IGNORECASE)

async def search_file_async(filepath: str) -> list[tuple[str, int, str]]:
    """
    Read and search a single file asynchronously.
    aiofiles uses a thread pool under the hood to avoid blocking the event loop.
    """
    matches = []
    async with aiofiles.open(filepath, encoding="utf-8", errors="ignore") as f:
        content = await f.read()
    for line_num, line in enumerate(content.splitlines(), start=1):
        if pattern.search(line):
            matches.append((filepath, line_num, line.strip()))
    return matches

async def search_all_async(all_files: list[str]):
    """
    Launch all file reads concurrently.
    While one file's I/O is waiting, other files' processing continues.
    """
    # asyncio.gather launches all coroutines and runs them concurrently
    all_results = await asyncio.gather(*[
        search_file_async(fp) for fp in all_files
    ])
    for file_results in all_results:
        for filepath, line_num, line in file_results:
            print(f"{filepath}:{line_num}: {line}")

# Run it
all_files = [str(fp) for fp in Path("/path/to/your/files").rglob("*")
             if fp.suffix in (".txt", ".md")]
asyncio.run(search_all_async(all_files))

The asyncio approach is conceptually cleaner than calling posix_fadvise directly — you're expressing "do all these reads at the same time" rather than "please prefetch these files," and the event loop figures out the scheduling. The posix_fadvise approach gives you more precise control, which matters when memory is tight and you want to carefully manage which pages stay in the cache.


The broader picture

Stepping back, what these four techniques collectively teach is that "loading files into RAM and searching them" is not a single thing — it's a design space with multiple independent axes to tune:

The first axis is how much RAM you need. The compressed corpus technique (Technique 1) is specifically about bending this axis — fitting more data into less RAM by trading CPU time. The suffix array technique (Technique 3) bends it the other way — it needs more RAM than the raw files, in exchange for richer search capabilities.

The second axis is what kinds of queries you want to support. Inverted indexes (SQLite FTS5, DuckDB FTS) are ideal for keyword and phrase queries with relevance ranking. Suffix arrays are ideal for arbitrary substring and regex search. DuckDB's SQL layer adds analytical queries that no pure search tool can express.

The third axis is when the data enters RAM. Every technique in both responses loads data reactively — you ask for it, it arrives. The posix_fadvise / asyncio technique (Technique 4) is specifically about making loading proactive — pushing data into RAM before you need it, hiding the latency behind useful work.

Understanding these axes separately means you can mix and match: you might use a compressed corpus (Technique 1) stored in /dev/shm (from the previous response), loaded via async prefetching (Technique 4), and searched using DuckDB (Technique 2) for structured queries and a suffix array (Technique 3) for regex patterns. Each layer addresses a different constraint.

You said: okay great.

okay great. again, please give me a few more easy techniques / ways / systems . also, please do explain everything in very detail.

May 20

Claude responded: Excellent — let's go into four more techniques that each open up a genuinely new dimension of the problem.

Excellent — let's go into four more techniques that each open up a genuinely new dimension of the problem. The previous responses have covered the OS page cache, RAM disks, ripgrep, parallelism, SQLite FTS5, memory mapping, inverted indexes, compressed corpora, DuckDB, suffix arrays, and prefetching with posix_fadvise. What remains unexplored is how to share a single copy of your corpus across multiple processes without copying it, how to search for thousands of patterns simultaneously in a single pass through the data, how to use a probabilistic data structure to skip files that definitely won't match before you've even opened them, and how to give the OS fine-grained advice about how you intend to access memory-mapped regions. Each of these teaches you something that none of the earlier techniques touched.


Technique 1: multiprocessing.shared_memory — one copy of the corpus, many readers

Here is a subtle but important problem with every parallelism technique from the earlier responses. When you use Python's multiprocessing module to spawn worker processes, each worker gets its own copy of the data. If you load 1GB of text into the parent process and spawn 8 workers, Python copies that 1GB into each worker — so you're actually using 9GB of RAM (one for the parent, eight for the workers). This completely defeats the purpose of parallelism when your dataset is large.

Python 3.8 introduced multiprocessing.shared_memory, which solves this elegantly. The idea is that you allocate a single block of RAM at the OS level — not inside any one Python process, but as a named shared memory segment — and then multiple processes can all map that same physical RAM into their own address spaces. Every worker reads from the exact same physical memory pages. No copying happens at all. From each worker's perspective, the data looks like a local byte array, but it's actually the same RAM that every other worker is also reading.

The mental model that makes this click is a whiteboard in a meeting room. The old approach (copying data to each worker) is like making a photocopy of the whiteboard for every person in the meeting. The shared memory approach is like having everyone stand in the same room reading the same whiteboard. The whiteboard exists once; everyone reads it simultaneously.

python
import re
import os
from pathlib import Path
from multiprocessing import shared_memory, Pool, cpu_count
from multiprocessing.shared_memory import SharedMemory

# ── Step 1: Load the entire corpus into a shared memory block ──────────────
# We'll concatenate all files into one giant byte string, just like in the
# suffix array approach, using a null byte as a file boundary marker.

SEPARATOR = b"\x00"

# Each entry maps (start, end) byte offsets → filepath string
file_registry = {}
all_bytes = bytearray()

for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        content = filepath.read_bytes()
        start = len(all_bytes)
        all_bytes.extend(content)
        all_bytes.extend(SEPARATOR)
        file_registry[(start, len(all_bytes) - 1)] = str(filepath)
    except Exception:
        continue

# Allocate the shared memory block — this lives at the OS level, outside any process
shm = shared_memory.SharedMemory(create=True, size=len(all_bytes))

# Copy our data into the shared block (we do this exactly once)
shm.buf[:len(all_bytes)] = all_bytes
print(f"Corpus loaded into shared memory: {shm.name!r}, {len(all_bytes)/1e6:.1f} MB")


# ── Step 2: Define the worker function ────────────────────────────────────
# This function runs inside each worker process.
# It attaches to the EXISTING shared memory block by name — no data is copied.

def search_region(args):
    """
    Search a slice of the shared memory corpus.
    Each worker gets a (start, end) range to be responsible for.
    """
    shm_name, start, end, pattern_str, registry = args

    # Attach to the shared memory by name — this is just a pointer, not a copy
    local_shm = SharedMemory(name=shm_name)
    # Wrap the shared buffer in a memoryview for efficient slicing
    data = memoryview(local_shm.buf)[start:end]

    pattern = re.compile(pattern_str.encode(), re.IGNORECASE)
    matches = []

    # Convert our slice to bytes for regex scanning
    # (memoryview slices don't copy — they're views into the same RAM)
    raw = bytes(data)
    for match in re.finditer(pattern, raw):
        offset = start + match.start()
        # Find the line containing this match
        line_start = raw.rfind(b"\n", 0, match.start()) + 1
        line_end   = raw.find(b"\n", match.start())
        line_end   = line_end if line_end != -1 else len(raw)
        line = raw[line_start:line_end].decode("utf-8", errors="replace")
        matches.append((offset, line.strip()))

    local_shm.close()  # detach (does NOT free the memory — the owner still holds it)
    return matches


# ── Step 3: Divide the corpus into regions and search in parallel ──────────
# Each worker gets a roughly equal slice of the byte range to search.
# Because everyone reads from the same physical RAM, there is zero memory overhead.

n_workers = cpu_count()
corpus_size = len(all_bytes)
chunk_size  = corpus_size // n_workers

regions = [
    (shm.name, i * chunk_size, min((i + 1) * chunk_size, corpus_size), "TODO|FIXME", file_registry)
    for i in range(n_workers)
]

with Pool(processes=n_workers) as pool:
    all_results = pool.map(search_region, regions)

# Flatten and print results
for region_matches in all_results:
    for offset, line in region_matches:
        # Look up which file this byte offset belongs to
        for (start, end), filepath in file_registry.items():
            if start <= offset < end:
                print(f"{filepath}: {line}")
                break

# ── Step 4: Clean up the shared memory block ──────────────────────────────
# This is important — shared memory persists until explicitly freed,
# even if your Python process exits. Always clean up.
shm.close()
shm.unlink()   # unlink() removes the OS-level segment entirely

The key insight to sit with is the shm.unlink() call at the end. On Linux, shared memory segments are named OS resources, similar to files — they exist independently of any process. If your program crashes before calling unlink(), the shared memory segment stays allocated, consuming RAM until the system reboots or you manually remove it. This is different from every other Python data structure, which the garbage collector cleans up automatically. It's a small price for a capability that's genuinely powerful: multi-process parallelism over a large corpus with zero memory overhead.


Technique 2: The Aho-Corasick automaton — searching for thousands of patterns in one pass

Every technique so far has assumed you're searching for one pattern at a time. But what if you have 10,000 patterns — a list of error codes, a dictionary of sensitive words, a set of known malware signatures — and you want to find any of them in your corpus? The naïve approach is to run your search tool 10,000 times, once per pattern. That means reading every file 10,000 times, which is wildly inefficient.

The Aho-Corasick algorithm solves this elegantly. It compiles all your patterns into a single finite-state automaton (think of it as a flowchart) that processes the text exactly once and simultaneously checks for all patterns. No matter how many patterns you have — ten or ten million — the text is still only read once.

To understand how it works, first think about how a simple string search works. Looking for "cat" in text means: at each position, check if the next three characters are c, a, t. If any check fails, move one position forward and try again. This is O(N × M) where N is text length and M is pattern length.

Aho-Corasick builds a trie (a prefix tree) of all your patterns, then adds "failure links" — shortcuts that say "if this match attempt fails at position K, jump to this other state which represents the longest suffix of what you've matched so far that is also a prefix of some other pattern." This means the automaton never backs up in the text. It makes exactly N state transitions for a text of length N, regardless of how many patterns you're searching for. It runs in O(N + M + Z) time where Z is the number of matches found.

python
# pip install pyahocorasick
import ahocorasick
import re
from pathlib import Path

# ── Step 1: Build the Aho-Corasick automaton from your pattern list ────────
# Imagine you have a file of patterns — one per line
patterns = [
    "TODO", "FIXME", "HACK", "XXX",
    "password", "secret", "api_key",
    "deprecated", "unsafe", "overflow",
    # ... you could have thousands more here with no performance penalty
]

# The automaton object compiles all patterns into one state machine
A = ahocorasick.Automaton()

for idx, pattern in enumerate(patterns):
    # add_word(key, value): the key is the pattern, the value is whatever you want
    # to associate with a match — here we store the pattern string and its index
    A.add_word(pattern.lower(), (idx, pattern))

# make_automaton() finalises the structure, adding the failure links that make it fast
# After this call, the automaton is immutable but can be used for searching
A.make_automaton()

print(f"Automaton built from {len(patterns)} patterns")


# ── Step 2: Load corpus into RAM and search ────────────────────────────────
# Note that we lower-case the text before scanning to make search case-insensitive.
# This is fine because we're only searching, not modifying.

corpus = {}
for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        corpus[str(filepath)] = filepath.read_text(encoding="utf-8", errors="ignore").lower()
    except Exception:
        continue

print(f"Loaded {len(corpus)} files")


# ── Step 3: Search — one pass through each file, ALL patterns simultaneously ─
for filepath, text in corpus.items():
    # A.iter(text) is a generator that yields (end_index, value) for each match.
    # end_index is the index of the LAST character of the match in `text`.
    # The automaton processes text character-by-character, O(N) regardless of patterns.
    
    file_matches = []
    for end_index, (pattern_idx, pattern) in A.iter(text):
        start_index = end_index - len(pattern) + 1
        
        # Find the line number by counting newlines before this position
        line_num = text.count("\n", 0, start_index) + 1
        
        # Extract the surrounding line for context
        line_start = text.rfind("\n", 0, start_index) + 1
        line_end   = text.find("\n", end_index)
        line_end   = line_end if line_end != -1 else len(text)
        line = text[line_start:line_end].strip()
        
        file_matches.append((line_num, pattern, line))
    
    for line_num, pattern, line in file_matches:
        print(f"{filepath}:{line_num} [{pattern}] {line}")

The thing that makes Aho-Corasick genuinely exciting is its scaling behaviour. If you benchmark searching for 1 pattern vs. 100 patterns vs. 10,000 patterns over the same corpus, you'll find that the runtime barely changes as the pattern count grows. The automaton construction time grows with the number of patterns, but the search time is determined almost entirely by the size of your text. This is a completely different scaling curve from any regex-based approach, where adding more alternatives to a pattern like cat|dog|fish|... does slow things down proportionally.

A good mental exercise: think about a spam filter that needs to check every incoming email for 50,000 known spam phrases. With regex alternation, each email takes 50,000 pattern comparisons. With Aho-Corasick, each email is scanned exactly once, and all 50,000 patterns are checked simultaneously. That's the practical power of this algorithm.


Technique 3: Bloom filters — probabilistically ruling out files before you search them

This technique is philosophically different from everything else in these posts. Instead of making search faster, it's about making search smarter by answering the question "which files can I safely skip entirely?" before doing any expensive scanning.

A Bloom filter is a probabilistic data structure that answers membership queries — "does this set contain this element?" — in O(1) time with zero false negatives and a tunable false positive rate. The critical guarantee is: if the Bloom filter says "this file does not contain the word 'TODO'", it is absolutely, provably right — there is no chance of a false negative. If it says "this file might contain 'TODO'", there's a small (and tunable) chance it's wrong — a false positive — but you only need to actually read that file to confirm.

To understand how a Bloom filter works, imagine a bit array of, say, 1,000 bits, all starting at 0. When you add a word to the filter, you run it through several different hash functions and set the bits at those positions to 1. When you query whether a word is in the filter, you run the same hash functions and check those positions — if any of them is 0, the word is definitely not in the set. If all of them are 1, the word is probably in the set (with a false positive rate that depends on how full the filter is and how many hash functions you used).

The reason this pairs so well with the search-in-RAM problem is that Bloom filters are tiny. A filter with a 1% false positive rate for 100,000 unique words takes only about 114KB of RAM. You can keep one Bloom filter per file in RAM — representing the entire vocabulary of each file — and quickly check which files could possibly contain your search term before opening any of them.

python
# pip install pybloom-live
from pybloom_live import BloomFilter
from pathlib import Path
import re

# ── Step 1: Build one Bloom filter per file and store them all in RAM ──────
# The filter represents the set of unique words in each file.
# error_rate=0.01 means ~1% false positive rate — tunable to your needs.

def tokenize(text: str) -> set[str]:
    """Extract unique lowercase words from text."""
    return set(re.findall(r'\b[a-z]+\b', text.lower()))

file_filters: dict[str, BloomFilter] = {}
corpus_words: dict[str, set[str]] = {}   # we'll use this for verification

base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        words = tokenize(text)
        
        if not words:
            continue
        
        # capacity=len(words) tells the filter how many items to expect,
        # so it can size itself optimally. error_rate is the false positive rate.
        bf = BloomFilter(capacity=max(len(words), 100), error_rate=0.01)
        
        for word in words:
            bf.add(word)
        
        file_filters[str(filepath)] = bf
        corpus_words[str(filepath)] = words   # keep for verification only
        
    except Exception:
        continue

total_filter_bytes = sum(bf.num_bits // 8 for bf in file_filters.values())
print(f"Built {len(file_filters)} Bloom filters, total RAM: {total_filter_bytes / 1e3:.1f} KB")


# ── Step 2: Use filters to pre-screen files before expensive full scan ──────
# This is the two-phase search pattern:
# Phase 1 (fast, RAM only): ask the Bloom filter — can this file possibly contain the term?
# Phase 2 (slow, may involve I/O): only for files that passed Phase 1, do a real scan.

def search_with_bloom_prefilter(query_word: str, files_to_check: list[str]) -> list[str]:
    """
    Returns files that definitely contain `query_word`.
    Uses Bloom filters to skip files that provably don't contain it.
    """
    query = query_word.lower()
    
    definitely_not = 0
    candidates = []
    
    for filepath in files_to_check:
        bf = file_filters.get(filepath)
        if bf is None:
            continue
        
        if query not in bf:
            # The filter says NO with 100% certainty — skip this file entirely
            definitely_not += 1
        else:
            # The filter says MAYBE — add to candidates for real scanning
            candidates.append(filepath)
    
    print(f"Bloom filter eliminated {definitely_not} files ({definitely_not/len(files_to_check)*100:.1f}%)")
    print(f"Scanning {len(candidates)} candidate files...")
    
    # Phase 2: real scan of candidates only
    confirmed = []
    for filepath in candidates:
        # corpus_words is already in RAM from indexing time, so this is free
        if query in corpus_words.get(filepath, set()):
            confirmed.append(filepath)
    
    return confirmed


all_files = list(file_filters.keys())
results = search_with_bloom_prefilter("optimization", all_files)
print(f"\nFiles actually containing 'optimization': {len(results)}")
for filepath in results:
    print(f"  {filepath}")

The practical impact of Bloom filter pre-screening depends on how selective your query is. If you're searching for a common word like "the", almost every file will pass the filter (true positives) and you save very little. If you're searching for a rare term like "Aho-Corasick", most files will be immediately eliminated (true negatives) and you might skip 99% of files without opening them. In the best case — rare terms, large corpus — Bloom filters can turn an O(N) scan into something much closer to O(1).

This is exactly the same idea that Apache Cassandra, Apache Parquet, and many other large-scale storage systems use to avoid reading data blocks that can't possibly contain a query result. The concept scales from a Python script on your laptop all the way to distributed databases handling petabytes.


Technique 4: madvise — fine-grained memory advice for mapped regions

The last technique brings together memory-mapping (from the second response) and the prefetching idea (from the third response) and gives you direct, surgical control over how the OS treats individual pages within a mapped region. While posix_fadvise gives advice about files being read via read() system calls, madvise gives advice about memory regions that have been mapped into your address space via mmap. These are two different subsystems in the kernel, and madvise gives you more fine-grained control because you're working directly with virtual memory pages rather than file I/O.

The key system call is madvise(addr, length, advice), where advice is one of several constants that change the OS's memory management behaviour for that region. The most useful ones are MADV_SEQUENTIAL (I'm going to scan this region front-to-back — prefetch aggressively and discard pages I've already passed), MADV_RANDOM (I'm going to jump around — don't bother prefetching), MADV_WILLNEED (prefetch these pages into RAM right now, in the background), and MADV_DONTNEED (I'm done with this region — you can evict these pages immediately).

Think of madvise as giving the OS a reading schedule. Without it, the OS has to guess: when you access page 1 of a mapping, it assumes you might want page 2 next and prefetches a few pages ahead — called "read-ahead." But if you're actually jumping to page 500 next, that prefetch was wasted. madvise lets you tell the OS exactly what the access pattern will be so it can schedule memory operations optimally.

python
import mmap
import ctypes
import ctypes.util
import os
import re
from pathlib import Path

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# madvise advice constants (from <sys/mman.h> on Linux)
MADV_NORMAL     = 0   # default: expect sequential access with some read-ahead
MADV_RANDOM     = 1   # expect random access: disable read-ahead entirely
MADV_SEQUENTIAL = 2   # expect sequential: prefetch aggressively, discard early pages
MADV_WILLNEED   = 3   # prefetch this region into RAM right now (async, non-blocking)
MADV_DONTNEED   = 4   # I'm done: evict this region's pages to free RAM
MADV_FREE       = 8   # pages are reusable — OS can reclaim them lazily (Linux 4.5+)

def advise(mm: mmap.mmap, offset: int, length: int, advice: int):
    """
    Give the OS memory management advice about a region of a mmap object.
    mm: the mmap object
    offset: byte offset within the mapped region (must be page-aligned)
    length: number of bytes the advice covers
    advice: one of the MADV_* constants above
    """
    # Get the base address of the mmap region as a raw integer
    # ctypes.addressof works on the mmap's buffer object
    base_addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
    target_addr = base_addr + offset
    
    result = libc.madvise(
        ctypes.c_void_p(target_addr),
        ctypes.c_size_t(length),
        ctypes.c_int(advice)
    )
    if result != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, f"madvise failed: {os.strerror(errno)}")


# ── The main use case: a two-pass search strategy ─────────────────────────
# Pass 1: memory-map all files and issue MADV_WILLNEED for all of them
#         simultaneously. The OS starts loading all files into the page cache
#         in the background while we do other work.
# Pass 2: search each file. By the time we get to it, its pages are likely
#         already in RAM. After searching, issue MADV_DONTNEED to free pages
#         we're done with.

def two_pass_search(directory: str, pattern_str: str):
    pattern = re.compile(pattern_str.encode(), re.IGNORECASE)
    base_path = Path(directory)
    
    # Collect all files first
    all_files = [
        fp for fp in base_path.rglob("*")
        if fp.suffix in (".txt", ".md") and fp.stat().st_size > 0
    ]
    
    # Open all files and memory-map them
    open_files = []
    for filepath in all_files:
        f = open(filepath, "rb")
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        open_files.append((filepath, f, mm))
    
    print(f"Memory-mapped {len(open_files)} files")
    
    # ── Pass 1: Prefetch ALL files simultaneously ──────────────────────────
    # MADV_SEQUENTIAL tells the OS: I'll scan this file front to back.
    # This does two things: (1) it starts async prefetching into the page cache,
    # and (2) it tells the OS it can aggressively discard pages we've already
    # scanned, keeping memory pressure low.
    for filepath, f, mm in open_files:
        try:
            advise(mm, 0, len(mm), MADV_SEQUENTIAL)
            # MADV_WILLNEED additionally says "start loading RIGHT NOW"
            advise(mm, 0, len(mm), MADV_WILLNEED)
        except OSError:
            pass   # madvise is advisory — failure is non-fatal
    
    print("Prefetch hints issued — OS is loading files into RAM in the background")
    
    # ── Pass 2: Search each file ───────────────────────────────────────────
    # The MADV_WILLNEED hint means that by the time we get to later files
    # in this loop, they're likely already in the page cache.
    for filepath, f, mm in open_files:
        file_has_matches = False
        
        for match in re.finditer(pattern, mm):
            if not file_has_matches:
                print(f"\n{filepath}:")
                file_has_matches = True
            
            # Find the surrounding line
            offset = match.start()
            line_start = mm.rfind(b"\n", 0, offset) + 1
            line_end   = mm.find(b"\n", offset)
            line_end   = line_end if line_end != -1 else len(mm)
            line = mm[line_start:line_end].decode("utf-8", errors="replace")
            line_num   = mm[:offset].count(b"\n") + 1
            print(f"  {line_num}: {line.strip()}")
        
        # ── Tell the OS we're done with this file ──────────────────────────
        # MADV_DONTNEED marks these pages as immediately reclaimable.
        # Without this, the pages stay in the page cache until the OS
        # decides to evict them for other reasons — using RAM you no longer need.
        try:
            advise(mm, 0, len(mm), MADV_DONTNEED)
        except OSError:
            pass
        
        mm.close()
        f.close()


two_pass_search("/path/to/your/files", r"TODO|FIXME")

There's a subtlety worth dwelling on here: the difference between MADV_WILLNEED and MADV_SEQUENTIAL even though both cause prefetching. MADV_WILLNEED is a one-time instruction: "fetch these pages now." MADV_SEQUENTIAL is an ongoing policy: "as I scan through this region, keep prefetching aggressively ahead of my current position and drop pages I've already passed." Using both together is the most aggressive possible prefetch strategy — it's useful when you know you're going to scan every byte of every file and you want the OS to pipeline the I/O as tightly as possible.

The MADV_DONTNEED call at the end of each file is worth highlighting as a habit. Without it, your search script might use 1GB of page cache for files it has already finished with, competing with other processes (or later files in your own search) for RAM. Explicitly releasing pages as you finish with them keeps your working set small and your overall performance more consistent, especially on machines under memory pressure.


How these four fit together with everything before

Looking across all twelve techniques now, a clear taxonomy emerges that is worth making explicit.

The first group is about where data lives: the OS page cache, RAM disks (/dev/shm, tmpfs), mlockall to prevent eviction, and shared memory (multiprocessing.shared_memory) all address the question of how to get bytes into RAM and keep them there. They operate at the level of physical memory management and are largely independent of what you do with the data once it's there.

The second group is about data representation: the raw-bytes dictionary, the compressed corpus, Apache Arrow columns, and suffix arrays are all different ways to arrange your data in RAM. Each representation makes some operations cheap and others expensive, and the right choice depends entirely on your query patterns.

The third group is about search algorithms: linear scanning (grep, ripgrep), inverted indexes (SQLite FTS5, DuckDB, the hand-built index), the Aho-Corasick automaton for multi-pattern search, suffix arrays for arbitrary substrings, and Bloom filters for pre-screening. These operate on top of whatever representation you've chosen and determine the computational complexity of each query.

The fourth group is about I/O scheduling: posix_fadvise, madvise, asyncio, and prefetch pipelining are all about communicating your intentions to the OS so that I/O latency is hidden behind useful computation. These matter most when your working set is larger than physical RAM and data must be fetched from disk on demand.

The most powerful thing you can do with this knowledge is combine techniques from different groups, because they are largely orthogonal — improvements in one group don't interfere with improvements in another. You might use compressed shared memory (groups 1 + 2) searched with an Aho-Corasick automaton (group 3) over files pre-fetched with madvise (group 4). Each layer addresses a different bottleneck, and the combined system is faster than any single technique could make it.

You said: okay great.

okay great. again, please give me a few more easy techniques / ways / systems . also, please do explain everything in very detail.

May 20

Claude responded: Great — here are four more techniques, each opening up a genuinely different angle from everything covered so far.

Great — here are four more techniques, each opening up a genuinely different angle from everything covered so far. The previous twelve techniques have dealt with where data lives in memory, how to represent it, how to schedule I/O, and which algorithmic structures to search it with. What we haven't covered yet is how to do substring search with a small, practical index (the trigram index), how to exploit the CPU's SIMD hardware through a columnar memory layout (Apache Arrow), how to use a complete production-quality search library that runs entirely in RAM (Whoosh), and how to call into the C standard library's hand-optimised, SIMD-accelerated string search function directly from Python without writing any C yourself (memmem).


Technique 1: The trigram index — substring search with a tiny, practical inverted index

The word-based inverted index from earlier responses is fast and memory-efficient, but it has a fundamental limitation: it only finds whole words. If you search for parse_f, func_, or tion — partial words, identifiers with underscores, camelCase fragments — the word-based index is useless and you have to fall back to scanning every file. The suffix array from an earlier response handles this correctly, but its memory cost is roughly 8× the corpus size, making it impractical for large datasets.

The trigram index sits beautifully between these two extremes. A trigram is any three consecutive characters in a string. The word "hello" contains the trigrams "hel", "ell", and "llo". The key insight is that if a file contains the string "hello", it must by definition contain all three of those trigrams. This means you can build a character-level inverted index — mapping each trigram to the set of files that contain it — and use it as a filter before doing any actual file scanning. Any file missing even one of your query's trigrams cannot possibly contain the query string, so it can be skipped entirely without being read.

The number of possible trigrams is bounded at 26³ ≈ 17,576 for lowercase ASCII letters, which means the index stays compact regardless of corpus size. This is the technique that powers PostgreSQL's pg_trgm extension, Sourcegraph's zoekt code search engine, and Russ Cox's original codesearch tool (which he wrote specifically to make searching the entire Google codebase practical).

python
import re
from collections import defaultdict
from pathlib import Path

def get_trigrams(text: str) -> set[str]:
    """
    Extract every unique 3-character sequence from the text.
    We lowercase first so the index is case-insensitive.
    
    "hello world" → {"hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"}
    """
    text = text.lower()
    # A set comprehension — we want unique trigrams, not counts
    return {text[i:i+3] for i in range(len(text) - 2)}

# ── Phase 1: Build the index (done once, kept in RAM) ─────────────────────
# The index maps each trigram string → set of filepaths containing it.
# defaultdict(set) automatically creates an empty set for any new trigram key.
trigram_index: dict[str, set[str]] = defaultdict(set)

# We also keep the raw contents in RAM for the verification scan (Phase 3).
# This is the same dictionary approach from the first response.
file_contents: dict[str, str] = {}

base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        file_contents[str(filepath)] = text
        
        for trigram in get_trigrams(text):
            trigram_index[trigram].add(str(filepath))
    except Exception:
        continue

print(f"Index built: {len(trigram_index)} unique trigrams across {len(file_contents)} files")


# ── Phase 2 & 3: Search (fast candidate selection, then verification) ──────

def trigram_search(query: str) -> list[tuple[str, int, str]]:
    """
    Search for an exact substring using the trigram index.
    Returns a list of (filepath, line_number, line_text) tuples.
    """
    query_lower = query.lower()
    query_trigrams = get_trigrams(query_lower)
    
    # Edge case: queries shorter than 3 characters have no trigrams.
    # For these, we have no choice but to scan every file.
    if not query_trigrams:
        print("Query too short for trigram filtering — falling back to full scan")
        candidates = set(file_contents.keys())
    else:
        # The core of the algorithm: intersect the posting lists.
        # Start with the files matching the first trigram, then keep only
        # those that also match the second trigram, third, and so on.
        # This is identical to the AND logic in the word-based inverted index,
        # just operating on character n-grams instead of words.
        candidates = None
        for trigram in query_trigrams:
            files_with_trigram = trigram_index.get(trigram, set())
            if candidates is None:
                candidates = files_with_trigram.copy()
            else:
                # Set intersection: &= keeps only files present in BOTH sets.
                # Critically, we process the most selective (smallest) trigrams
                # first to prune the candidate set as aggressively as possible.
                candidates &= files_with_trigram
            
            if not candidates:
                return []   # early exit — nothing can possibly match
        
        candidates = candidates or set()
        total = len(file_contents)
        print(f"Trigram filter: {total - len(candidates)} of {total} files eliminated "
              f"({(total - len(candidates)) / total * 100:.1f}% skipped)")
    
    # Phase 3: Verification scan — trigram filters have false positives.
    # A file could contain all three trigrams "hel", "ell", "llo" without
    # containing "hello" if those trigrams appear in different parts of the file
    # (e.g., "shelled" contains "hel" and "ell" but not "llo" next to them).
    # The verification scan catches this by doing an exact substring check.
    results = []
    for filepath in candidates:
        content = file_contents.get(filepath, "")
        for line_num, line in enumerate(content.splitlines(), start=1):
            if query_lower in line.lower():
                results.append((filepath, line_num, line.strip()))
    
    return results


# Run a search
for filepath, line_num, line in trigram_search("parse_result"):
    print(f"{filepath}:{line_num}: {line}")

There is one subtlety worth sitting with here: the false positive rate of the trigram filter. A file passes the filter if it contains all of the query's trigrams, but those trigrams might not be adjacent in the file — they could appear in completely different paragraphs. So the trigram phase is a filter, not a final answer. You always need the verification scan in Phase 3. In practice, false positive rates are low (typically under 5% for reasonably selective queries), meaning the verification scan is a small fraction of the total work. The payoff is enormous: for a query like "parse_result", which has eight trigrams, any file missing even one of them is eliminated instantly by a set intersection with no file I/O at all.


Technique 2: Apache Arrow in-memory columnar layout — SIMD-vectorized string search

All the previous techniques — Python dictionaries, compressed corpora, inverted indexes — store data in a row-oriented layout. For a line-by-line corpus, that means each line is stored as an individual Python string object, with its own memory allocation header, reference count, and pointer. If you have ten million lines, you have ten million separate heap allocations. The CPU has to chase pointers all over memory, and its cache gets polluted by all the object overhead surrounding the actual character data.

Apache Arrow takes a fundamentally different approach: it stores data in a columnar layout, meaning all the content strings are packed into a single contiguous buffer of bytes, with a separate buffer of integer offsets that marks where each string starts and ends. There's no per-string object overhead. The character data for all ten million lines sits in one uninterrupted block of memory, which is exactly the layout that SIMD (Single Instruction, Multiple Data) CPU instructions are designed to exploit.

SIMD instructions — called AVX2 or NEON depending on your CPU architecture — can compare 32 bytes against a pattern in a single clock cycle instead of one byte at a time. When PyArrow's compute module runs a substring search, it's not running a Python loop. It calls into a compiled Arrow C++ kernel that processes the entire string column using these vectorized instructions, checking many characters simultaneously. The speedup over a Python-level loop can be 10–50× for large columns.

python
import pyarrow as pa
import pyarrow.compute as pc
from pathlib import Path

# ── Step 1: Load the corpus into an Arrow Table ───────────────────────────
# We store data as a table with three columns:
# "filepath" (string), "line_number" (int32), "content" (string).
# Each row is one line from one file.
# 
# The key difference from a Python list of tuples: Arrow stores all "content"
# strings packed into a single contiguous buffer, not as scattered heap objects.

filepaths    = []
line_numbers = []
contents     = []

for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        for line_num, line in enumerate(text.splitlines(), start=1):
            filepaths.append(str(filepath))
            line_numbers.append(line_num)
            contents.append(line)
    except Exception:
        continue

# pa.table() arranges these Python lists into the columnar Arrow format.
# The type annotations ensure efficient memory packing:
# pa.string() uses a variable-length binary buffer (the contiguous layout described above).
# pa.int32() uses a fixed-width 4-byte integer array.
table = pa.table({
    "filepath":    pa.array(filepaths,    type=pa.string()),
    "line_number": pa.array(line_numbers, type=pa.int32()),
    "content":     pa.array(contents,     type=pa.string()),
})

total_bytes = table.nbytes
print(f"Table loaded: {len(table):,} rows, {total_bytes / 1e6:.1f} MB in RAM")


# ── Step 2: Search using vectorized compute operations ────────────────────
# pc.match_substring scans the entire "content" column in one call,
# returning a BooleanArray where True means that row matches.
# Internally, Arrow's C++ runtime uses SIMD to process many bytes at once.

def arrow_search(table: pa.Table, pattern: str, ignore_case: bool = True) -> pa.Table:
    """
    Search all lines for `pattern` using Arrow's vectorized compute kernel.
    Returns a filtered table containing only matching rows.
    """
    # pc.match_substring returns a pa.ChunkedArray of booleans
    # ignore_case=True applies case-folding at the C++ level — no Python overhead
    mask = pc.match_substring(table["content"], pattern, ignore_case=ignore_case)
    
    # table.filter() applies the boolean mask to all columns simultaneously,
    # returning a new table with only the matching rows.
    # This is also vectorized — no Python loop touches individual rows.
    return table.filter(mask)


results = arrow_search(table, "TODO")
print(f"\nFound {len(results):,} matching lines")

# Convert results to Python for display (only the matched rows, not the full table)
for row in results.to_pylist():
    print(f"{row['filepath']}:{row['line_number']}: {row['content'].strip()}")


# ── Bonus: regex search using pc.match_substring_regex ────────────────────
# Arrow also supports regex patterns with the same vectorized performance.
# This uses the RE2 engine (Google's linear-time regex library) under the hood,
# meaning even complex patterns are guaranteed to run in O(N) time.

regex_mask    = pc.match_substring_regex(table["content"], r"func_\w+\(", ignore_case=False)
regex_results = table.filter(regex_mask)
print(f"\nRegex matches: {len(regex_results):,} lines")


# ── Bonus 2: multi-column analytics, combining search with aggregation ─────
# Because the data is in a proper table, you can combine search and analytics
# in ways that grep cannot express at all.
# "How many TODO/FIXME lines does each file have?"

annotation_mask = pc.match_substring_regex(table["content"], r"TODO|FIXME|HACK")
annotation_table = table.filter(annotation_mask)

# Group by filepath and count matches per file
import pyarrow.acero as ac   # Arrow's streaming execution engine (optional, advanced)

counts = (
    annotation_table
    .group_by("filepath")
    .aggregate([("line_number", "count")])
    .sort_by([("line_number_count", "descending")])
)
print("\nFiles with most annotations:")
for row in counts.to_pylist()[:10]:
    print(f"  {row['line_number_count']:>4}  {row['filepath']}")

The memory layout difference is worth visualising concretely. Imagine storing the string "hello" as a Python str object. Python allocates roughly 49 bytes for that object: a reference count (8 bytes), a type pointer (8 bytes), a length field (8 bytes), a hash field (8 bytes), some flags and encoding metadata, and finally the 5 characters themselves. In Arrow's string array, the same string costs exactly 5 bytes of character data plus 4 bytes for its offset in the offsets array — about 9 bytes total, or roughly 5× more compact. When you multiply this across millions of strings, the difference in cache efficiency is dramatic: Arrow's contiguous buffer fits far more actual string data into each cache line, so the CPU spends much less time waiting for memory fetches.


Technique 3: Whoosh with RamStorage — a complete production search library running entirely in RAM

Every indexing technique so far has been built from scratch — we wrote the tokenizer, the posting list, the scoring logic. Whoosh is a pure-Python full-text search library that gives you all of these things, fully assembled and production-tested, with no external C dependencies. Its RamStorage backend keeps the entire index structure in Python's heap rather than on disk, which makes it immediately relevant to everything we've been discussing.

Whoosh is interesting not just as a tool but as a way to understand what a real search engine actually consists of. When you use it, you see concepts like schemas (typed field definitions), analyzers (the pipeline that transforms raw text into tokens), BM25F scoring (the modern relevance algorithm that weights term frequency and document length together), and searcher objects (the query execution context). These are the same concepts that Elasticsearch, Solr, and Lucene use — Whoosh is essentially a Lucene clone in Python — so understanding Whoosh gives you a mental model for industrial-scale search engines.

python
# pip install whoosh
from whoosh import index as whoosh_index
from whoosh.fields import Schema, TEXT, ID, NUMERIC, STORED
from whoosh.analysis import StemmingAnalyzer, StandardAnalyzer
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser, MultifieldParser
from whoosh.writing import AsyncWriter
from pathlib import Path

# ── Step 1: Define a Schema ───────────────────────────────────────────────
# A schema is a typed description of your "documents" (in our case, files).
# Each field has a type that controls how it's indexed and searched.
#
# TEXT(stored=True, analyzer=...)  — full-text indexed, tokenized by the analyzer
# ID(stored=True, unique=True)     — exact value, not tokenized (good for file paths)
# NUMERIC(stored=True)             — numeric field, supports range queries
# STORED                           — stored but not indexed (for retrieval only)
#
# The StemmingAnalyzer applies Porter stemming:
# "running" and "run" and "runs" all map to the same stem, so they match each other.
# This is great for prose text.
# StandardAnalyzer does tokenisation + lowercasing but no stemming,
# which is better for code (you usually don't want "function" to match "func").

schema = Schema(
    filepath    = ID(stored=True, unique=True),
    filename    = TEXT(stored=True, analyzer=StandardAnalyzer()),
    extension   = ID(stored=True),
    content     = TEXT(stored=False,  # don't store the content — saves RAM
                       analyzer=StemmingAnalyzer()),
    line_count  = NUMERIC(stored=True, numtype=int),
)

# ── Step 2: Create the index in a RamStorage backend ─────────────────────
# RamStorage is a Whoosh storage class that implements the same interface
# as its disk-based FileStorage, but keeps all index files in Python dicts.
# The entire inverted index, term frequencies, BM25 statistics, and document
# store all live in RAM. Nothing touches the filesystem.

ram_store = RamStorage()
ix = ram_store.create_index(schema)

# ── Step 3: Index all files ───────────────────────────────────────────────
# The writer is a transactional object: documents are buffered until
# writer.commit() is called, at which point the index is updated atomically.
# AsyncWriter uses a background thread for the commit, keeping the main
# thread free — useful for indexing while also doing other work.

writer = ix.writer(limitmb=256)   # limitmb controls how much RAM the write buffer uses

for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        writer.add_document(
            filepath   = str(filepath),
            filename   = filepath.name,
            extension  = filepath.suffix.lstrip("."),
            content    = text,
            line_count = text.count("\n") + 1,
        )
    except Exception as e:
        print(f"Skipping {filepath}: {e}")

writer.commit()   # finalises the index — builds term dictionaries, computes BM25 stats
print(f"Index committed. {ix.doc_count()} documents indexed.")


# ── Step 4: Search ────────────────────────────────────────────────────────
# The searcher is the query execution context. It must be used as a context
# manager so that Whoosh can properly manage its internal caches.

with ix.searcher() as searcher:

    # --- Basic keyword search ---
    # QueryParser handles user-facing query syntax:
    # "hello world"  → AND query (both words required)
    # "hello OR world" → OR query
    # "hel*"          → prefix/wildcard query
    # '"exact phrase"' → phrase query (matches words in sequence)
    parser = QueryParser("content", schema=ix.schema)
    query  = parser.parse("database optimization")
    
    results = searcher.search(query, limit=20)
    print(f"\nFound {len(results)} results (showing top {min(20, len(results))}):")
    
    for hit in results:
        # hit.score is the BM25F relevance score — higher is more relevant
        # BM25F accounts for term frequency, document length, and inverse document frequency
        print(f"  [{hit.score:.3f}] {hit['filepath']} ({hit['line_count']} lines)")
    
    # --- Multi-field search: search filename AND content simultaneously ---
    multi_parser = MultifieldParser(["filename", "content"], schema=ix.schema)
    multi_query  = multi_parser.parse("config")
    multi_results = searcher.search(multi_query, limit=10)
    
    # --- Wildcard / prefix search ---
    wildcard_query = parser.parse("func_pars*")   # matches func_parse, func_parser, etc.
    wildcard_results = searcher.search(wildcard_query, limit=10)
    
    # --- Fuzzy search: find terms within edit distance 1 of "optimise" ---
    # Useful for typos or British/American spelling differences
    fuzzy_query  = parser.parse("optimise~1")     # ~1 means Levenshtein distance ≤ 1
    fuzzy_results = searcher.search(fuzzy_query, limit=10)
    
    # --- Highlight matching terms in context ---
    # Whoosh can extract snippets around matches, with matched terms highlighted
    for hit in results[:3]:
        # To use highlights, we need the content stored — but we set stored=False above.
        # In a real application, you'd either store=True or re-read the file here.
        print(f"\n{hit['filepath']}:")

The thing that makes Whoosh instructive beyond its practical usefulness is that its RamStorage class makes the normally-invisible index structure visible as Python objects. If you inspect ram_store._filedict after indexing, you'll find a dictionary of virtual "files" with names like MAIN_0.toc, MAIN_0.seg, and so on — these are the same logical files that Whoosh would write to disk in its normal mode, but here they're stored as BytesIO objects in RAM. The inverted index, the term vectors, the document fields, the BM25 statistics — all of it is in that dictionary. This makes RamStorage a beautiful example of the abstraction layer concept in software design: the entire rest of Whoosh doesn't know or care whether its "files" are on disk or in memory.


Technique 4: memmem via ctypes — calling the C library's SIMD-accelerated substring search

This is the lowest-level technique in the entire series and teaches you something fundamental about the relationship between Python and the C runtime that underlies it. Python's bytes.find() method is reasonably fast, but it's implemented as a general-purpose string search in Python's C source code. The C standard library on your operating system has its own implementation of substring search called memmem(), and on modern systems — particularly Linux with glibc — this function is hand-written in assembly using AVX2 SIMD instructions that can compare 32 bytes per clock cycle. For large texts, this can be 3–5× faster than Python's built-in find().

The reason Python doesn't use memmem internally is partly historical and partly portability: memmem is a POSIX extension (it's not in the C standard), so it's not available on Windows. But on Linux and macOS, it's right there in libc, and you can call it directly from Python using ctypes — the same ctypes we used for mlock and madvise in earlier responses.

The deeper lesson here is about where your code actually runs. When Python executes b"needle" in haystack_bytes, it enters a Python C function, which loops through the haystack a byte at a time in a fairly straightforward way. When your code calls libc.memmem(), it enters a piece of assembly code that the glibc developers have optimised specifically for your CPU's vector instruction set, using techniques like checking 32-byte alignment, processing 256 bits at a time with AVX2, and falling back gracefully on older CPUs. The same Python-level operation, but one path uses your CPU's full width while the other uses it one byte at a time.

python
import ctypes
import ctypes.util
import mmap
import os
import re
from pathlib import Path

# ── Step 1: Load memmem from the C standard library ───────────────────────
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the function signature so ctypes knows how to marshal arguments.
# memmem's C signature:
#   void* memmem(const void* haystack, size_t haystacklen,
#                const void* needle,   size_t needlelen);
# It returns a pointer to the first occurrence of needle in haystack,
# or NULL (None in Python) if not found.
libc.memmem.restype  = ctypes.c_void_p
libc.memmem.argtypes = [
    ctypes.c_void_p,  ctypes.c_size_t,   # haystack, haystacklen
    ctypes.c_void_p,  ctypes.c_size_t,   # needle, needlelen
]


def find_all_memmem(data: bytes, needle: bytes) -> list[int]:
    """
    Find all byte offsets of `needle` in `data` using the C library's memmem().
    Returns a list of integer byte offsets.
    
    On glibc (Linux), memmem() uses AVX2 SIMD instructions to compare
    32 bytes at a time — typically 3-5x faster than Python's bytes.find()
    on large haystacks.
    """
    offsets = []
    haystack_len = len(data)
    needle_len   = len(needle)
    
    if needle_len == 0 or haystack_len == 0:
        return []
    
    # ctypes needs a raw pointer to the start of the bytes buffer.
    # c_char.from_buffer_copy would copy data — we use from_buffer for zero-copy.
    # We pin the bytes object to prevent garbage collection while we hold a pointer to it.
    haystack_buf = (ctypes.c_char * haystack_len).from_buffer_copy(data)
    needle_buf   = (ctypes.c_char * needle_len).from_buffer_copy(needle)
    
    base_addr    = ctypes.addressof(haystack_buf)
    search_start = 0
    
    while search_start < haystack_len:
        # Call memmem for the remaining portion of the haystack.
        # Each call returns a pointer to the next occurrence, or None.
        result_ptr = libc.memmem(
            ctypes.c_void_p(base_addr + search_start),
            ctypes.c_size_t(haystack_len - search_start),
            needle_buf,
            ctypes.c_size_t(needle_len),
        )
        
        if result_ptr is None:
            break   # no more occurrences
        
        # Convert the returned pointer to a byte offset into the original buffer
        match_offset = result_ptr - base_addr
        offsets.append(match_offset)
        
        # Advance past this match (by 1, not needle_len, to catch overlapping matches)
        search_start = match_offset + 1
    
    return offsets


def offset_to_line(data: bytes, offset: int) -> tuple[int, bytes]:
    """
    Given a byte offset into a file's bytes, return the line number and line content.
    """
    # Count newlines before the offset to get the 1-based line number
    line_num    = data.count(b"\n", 0, offset) + 1
    line_start  = data.rfind(b"\n", 0, offset) + 1
    line_end    = data.find(b"\n", offset)
    line_end    = line_end if line_end != -1 else len(data)
    return line_num, data[line_start:line_end]


# ── Step 2: Load corpus into RAM and search with memmem ───────────────────
needle = b"parse_result"   # the literal bytes we're searching for

base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        data    = filepath.read_bytes()
        offsets = find_all_memmem(data, needle)
        
        for offset in offsets:
            line_num, line = offset_to_line(data, offset)
            print(f"{filepath}:{line_num}: {line.decode('utf-8', errors='replace').strip()}")
    except Exception:
        continue


# ── Step 3: Combine memmem with memory-mapped files for maximum efficiency ─
# Instead of loading the file into a Python bytes object (which copies it),
# we memory-map it and call memmem on the mapped buffer directly.
# This means the file data comes straight from the page cache — zero copies.

def search_file_mmap_memmem(filepath: Path, needle: bytes) -> list[tuple[int, str]]:
    """
    Memory-map a file and search it with memmem — the minimum-overhead approach.
    No Python bytes object is created; the search operates directly on mapped pages.
    """
    results = []
    file_size = os.path.getsize(filepath)
    if file_size == 0:
        return []
    
    with open(filepath, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Obtain a ctypes pointer to the mmap buffer without copying.
            # c_char.from_buffer creates a ctypes array that shares memory
            # with the mmap object — no bytes are copied.
            buf      = (ctypes.c_char * file_size).from_buffer(mm)
            base_ptr = ctypes.addressof(buf)
            
            search_start = 0
            while search_start < file_size:
                result_ptr = libc.memmem(
                    ctypes.c_void_p(base_ptr + search_start),
                    ctypes.c_size_t(file_size - search_start),
                    needle,
                    ctypes.c_size_t(len(needle)),
                )
                if result_ptr is None:
                    break
                
                offset     = result_ptr - base_ptr
                line_num   = mm[:offset].count(b"\n") + 1
                line_start = mm.rfind(b"\n", 0, offset) + 1
                line_end   = mm.find(b"\n", offset)
                line_end   = line_end if line_end != -1 else file_size
                line       = mm[line_start:line_end].decode("utf-8", errors="replace")
                
                results.append((line_num, line.strip()))
                search_start = offset + 1
    
    return results


# Run the zero-copy, SIMD-accelerated search
for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    matches = search_file_mmap_memmem(filepath, b"parse_result")
    for line_num, line in matches:
        print(f"{filepath}:{line_num}: {line}")

It's worth understanding what from_buffer does on the ctypes.c_char array call above, because it's the crucial detail that makes this technique zero-copy. When you write (ctypes.c_char * N).from_buffer(mm), ctypes creates a thin wrapper around the memory already allocated by the mmap object — it does not allocate new memory or copy any bytes. The base_ptr variable then points directly into the page cache region that the OS is managing. When memmem scans through this region, it's scanning the same physical RAM that the OS is using as a page cache for the file — the most direct possible path from storage to CPU.


Looking at the full picture across all sixteen techniques

Now that we've accumulated sixteen techniques across four responses, it's worth stepping back to see the landscape as a whole, because the techniques cluster into a set of orthogonal problems that you can solve independently and compose together.

The first cluster is the data placement problem — where do the bytes actually live? This is addressed by the OS page cache, vmtouch, RAM disks, mlockall, and shared memory. These are all about ensuring that the bytes you want to search are in physical RAM when you need them, rather than on disk or in swap.

The second cluster is the data representation problem — what shape are the bytes in? Raw strings in a Python dictionary, compressed bytes, Arrow columnar arrays, memory-mapped regions, and suffix arrays are all different answers to this. The right choice depends on how you'll access the data: sequential scans favour contiguous layouts (Arrow, mmap); repeated random access favours dictionaries; memory pressure favours compression.

The third cluster is the search algorithm problem — how do you find the pattern in the data? Linear scan (grep, memmem), inverted word index, trigram index, suffix array, Aho-Corasick automaton, Bloom filter pre-screening, and Whoosh/DuckDB/SQLite FTS5 are all different algorithms with different tradeoffs between index build cost, query time, supported query types, and memory usage.

The fourth cluster is the I/O scheduling problem — how do you hide the latency of loading data from disk? This is addressed by posix_fadvise, madvise, asyncio prefetching, and the pipelining pattern. These only matter when your working set doesn't fit in RAM; once everything is in RAM, this cluster becomes irrelevant.

The most powerful insight is that these four clusters are independent: a change in one doesn't affect the others. You can combine Bloom filter pre-screening (cluster 3) with compressed shared memory (clusters 1 + 2) accessed via madvise-guided sequential scanning (cluster 4), and each piece contributes its improvement without interfering with the others. This is what makes the problem space feel endless — because it genuinely is. Each axis has multiple good options, and the right combination depends entirely on the shape of your specific data and your specific query patterns.

You said: okay great.

okay great. again, please give me a few more easy techniques / ways / systems . also, please do explain everything in very detail.

May 20

Claude responded: Great — here are four more techniques that each open up a genuinely new dimension of the problem space.

Great — here are four more techniques that each open up a genuinely new dimension of the problem space. Looking across the twenty techniques covered so far, there are still some important territories unexplored: a production-grade Rust-powered search engine that substantially outperforms everything Python-native, Intel's compiler-based approach to multi-pattern regex that is architecturally different from Aho-Corasick, the Rabin-Karp rolling hash which teaches a fundamentally different algorithmic idea about substring search, and finally — the most conceptually radical departure of the entire series — semantic vector search, which abandons the idea of matching characters altogether and instead searches by meaning. Each of these will teach you something that none of the previous twenty techniques touched.


Technique 1: Tantivy-py with RamDirectory — a Rust-powered search engine in RAM

Every search library covered so far has been either pure Python (Whoosh, the hand-built inverted index) or a C extension (SQLite FTS5, DuckDB). There is a third category that has become increasingly important: libraries written in Rust, which combine the memory safety guarantees of a modern systems language with performance that matches or exceeds hand-tuned C. Tantivy is the most mature Rust-based full-text search library, and it has Python bindings called tantivy-py. Its RamDirectory backend keeps the entire index in memory, making it directly applicable to everything we have been discussing.

To understand why Tantivy is faster than Whoosh, it helps to understand where Whoosh's time goes. Whoosh is pure Python, which means every operation — tokenizing a string, inserting a term into a posting list, computing a BM25 score — involves the Python interpreter's bytecode dispatch loop, object reference counting, and the Global Interpreter Lock. These overheads are small per operation but they add up to something significant when you are processing millions of terms. Tantivy performs the same logical operations in compiled Rust code, which means no interpreter overhead, no GIL, no reference counting, and aggressive compiler optimisation including SIMD vectorisation of the hot paths. For indexing and search over large corpora, Tantivy is typically five to fifteen times faster than Whoosh on the same hardware.

The second thing that makes Tantivy interesting is its use of the same architectural ideas as Apache Lucene — the library underlying Elasticsearch and Solr. When you learn to use Tantivy, you are learning a mental model that transfers directly to understanding industrial-scale search infrastructure.

python
# pip install tantivy
import tantivy
from pathlib import Path

# ── Step 1: Define a schema ────────────────────────────────────────────────
# Tantivy's schema is very similar to Whoosh's: you declare each field
# and its type before indexing any documents.
#
# SchemaBuilder is a builder pattern — you add fields one by one,
# then call .build() to finalise the schema into an immutable object.
builder = tantivy.SchemaBuilder()

# TEXT fields are tokenised and full-text indexed.
# stored=True means the original value is retrievable from the index —
# set this to False for large fields (like content) to save RAM.
# tokenizer_name="en_stem" applies English stemming (like Whoosh's StemmingAnalyzer).
builder.add_text_field("filepath",  stored=True)
builder.add_text_field("content",   stored=False,  tokenizer_name="en_stem")
builder.add_text_field("filename",  stored=True,   tokenizer_name="default")

# INTEGER fields support exact match and range queries.
builder.add_integer_field("line_count", stored=True, indexed=True)

schema = builder.build()

# ── Step 2: Create the index in a RamDirectory ────────────────────────────
# RamDirectory is the key to keeping everything in RAM.
# It implements Tantivy's Directory trait entirely in memory —
# all index segments, posting lists, term dictionaries, and stored fields
# live as Rust Vec<u8> buffers, never touching the filesystem.
#
# Compare this to tantivy.Index.create_in_dir("/some/path") which would
# write to disk. The API is identical; only the storage backend differs.
index = tantivy.Index(schema, path=None)   # path=None means RamDirectory


# ── Step 3: Index all files ────────────────────────────────────────────────
# The IndexWriter is the transactional entry point for adding documents.
# heap_size controls the size of the in-memory write buffer before a segment
# is flushed internally — larger values mean fewer, larger segments,
# which generally improves search performance at the cost of more peak RAM.
writer = index.writer(heap_size=128_000_000)   # 128 MB write buffer

for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        # tantivy.Document is a dictionary-like object.
        # Each key must match a field name in the schema.
        doc = tantivy.Document(
            filepath   = str(filepath),
            filename   = filepath.name,
            content    = text,
            line_count = text.count("\n") + 1,
        )
        writer.add_document(doc)
    except Exception as e:
        print(f"Skipping {filepath}: {e}")

# commit() finalises the current segment, merges posting lists,
# and makes all indexed documents searchable.
# In Tantivy (as in Lucene), an index consists of multiple "segments"
# that are periodically merged — this is why indexing can be faster
# than a single large sort: you accumulate small sorted runs and merge them.
writer.commit()

# reload() makes the searcher aware of segments committed since the index was opened.
index.reload()
print(f"Index committed to RamDirectory.")


# ── Step 4: Search ─────────────────────────────────────────────────────────
# The Searcher is the read-side of the index.
# BM25F scoring is computed in Rust at query time — no Python involved
# until the results are handed back as Python objects.
searcher = index.searcher()

# QueryParser handles user-facing query syntax, just like Whoosh's.
# "database optimization"   → both terms required (AND)
# "database OR optimization" → either term
# '"exact phrase"'           → phrase query
# "optim*"                   → prefix query
query_parser = tantivy.QueryParser.for_index(index, ["content"])

# --- Basic search ---
query   = query_parser.parse_query("database optimization")
results = searcher.search(query, limit=10)

print(f"\nTop results:")
for score, doc_address in results.hits:
    doc = searcher.doc(doc_address)
    # score is the BM25F relevance score — higher means more relevant
    print(f"  [{score:.3f}] {doc['filepath'][0]}  ({doc['line_count'][0]} lines)")

# --- Fuzzy search (handles typos) ---
# The ~1 suffix means Levenshtein edit distance ≤ 1
fuzzy_query   = query_parser.parse_query("optimise~1")
fuzzy_results = searcher.search(fuzzy_query, limit=5)

# --- Phrase search ---
phrase_query   = query_parser.parse_query('"error handling"')
phrase_results = searcher.search(phrase_query, limit=5)

# --- Range query on a numeric field ---
# Find all files with between 100 and 500 lines
range_query   = query_parser.parse_query("line_count:[100 TO 500]")
range_results = searcher.search(range_query, limit=20)

The concept worth taking away from Tantivy is the segment architecture. Rather than maintaining one monolithic inverted index, Tantivy (and Lucene) maintain a collection of small immutable index segments. When you call writer.commit(), the current write buffer is flushed as a new segment. Periodically, the library merges smaller segments into larger ones in the background — this is called segment merging, and it keeps query performance stable over time by preventing the number of segments from growing unboundedly. The reason immutability is so valuable here is that multiple search threads can read a segment concurrently without any locking, because they know the segment will never change under them. This is the same architectural insight that makes copy-on-write data structures so powerful throughout computer science.


Technique 2: Intel Hyperscan — SIMD-compiled multi-pattern regex

In an earlier response, we covered Aho-Corasick as a way to search for thousands of literal string patterns simultaneously in one pass. Hyperscan solves the same problem but for full regular expressions rather than just literal strings, and it does so using a fundamentally different mechanism. While Aho-Corasick builds a state machine at runtime in regular memory, Hyperscan actually compiles your regex patterns into native machine code using SIMD (AVX2/AVX-512) instructions. The compiled database is then loaded into RAM, and scanning text against it involves running that machine code directly, with no interpreter overhead of any kind.

To understand why this is architecturally different from Aho-Corasick, think about what a traditional regex engine does. It maintains a state machine in memory (similar to Aho-Corasick) and transitions between states as it reads each character. Each transition is a memory lookup — "given that I'm in state 47 and the next character is 'a', which state do I go to?" With enough patterns, this state machine becomes enormous and the memory access pattern becomes essentially random, which is very cache-unfriendly. Hyperscan takes a different approach: it analyses your patterns at compile time and generates specialised machine code that exploits the specific structure of your particular set of patterns, using SIMD to test many characters and many states simultaneously. The generated code fits in CPU instruction cache rather than data cache, which is typically much faster.

Intel originally developed Hyperscan for network intrusion detection systems — tools that need to scan millions of packets per second against thousands of signatures. It is now open-source and available on any modern x86 CPU.

python
# pip install hyperscan
import hyperscan
from pathlib import Path
import re

# ── Step 1: Compile your pattern database ─────────────────────────────────
# The Database is the compiled artifact — this is the expensive step
# (analogous to A.make_automaton() in Aho-Corasick, but more computationally
# intensive because it compiles to native machine code).
# Once compiled, the Database object lives in RAM and can be used to scan
# as many texts as you like, very quickly.

# Define your patterns as a list of byte strings.
# Each pattern is a full PCRE-compatible regular expression.
# The key insight: you can have thousands of patterns here and scanning
# time does NOT grow proportionally — it's bounded by text length, not pattern count.
patterns = [
    b"TODO[:\\s]",
    b"FIXME[:\\s]",
    b"HACK[:\\s]",
    b"func_[a-z_]+\\(",        # function calls starting with func_
    b"class [A-Z][a-zA-Z]+",   # class definitions
    b"import [a-z_]+",         # import statements
    b"password\\s*=\\s*['\"]", # potential hardcoded passwords
    b"api[_\\-]?key",          # potential API key references
    # ... you could add thousands more here
]

# flags controls matching behaviour per-pattern.
# hyperscan.HS_FLAG_CASELESS is case-insensitive matching.
# hyperscan.HS_FLAG_MULTILINE makes ^ and $ match line boundaries.
# hyperscan.HS_FLAG_DOTALL makes . match newlines too.
flags = [hyperscan.HS_FLAG_CASELESS | hyperscan.HS_FLAG_MULTILINE] * len(patterns)

# ids are integer identifiers returned when a pattern matches,
# so your callback knows which pattern triggered.
ids = list(range(len(patterns)))

# hyperscan.Database() compiles the patterns into SIMD machine code.
# This takes a moment but happens once; the result is reused for all scans.
db = hyperscan.Database()
db.compile(
    expressions = patterns,
    flags       = flags,
    ids         = ids,
    mode        = hyperscan.HS_MODE_BLOCK,  # BLOCK mode for in-memory buffers
)
print(f"Hyperscan database compiled: {len(patterns)} patterns")


# ── Step 2: Define a match callback ───────────────────────────────────────
# Hyperscan uses a callback-based API: instead of returning a list of matches,
# it calls your function each time a pattern matches.
# This avoids allocating a potentially huge list of match objects,
# keeping memory usage low even when there are millions of matches.
#
# The callback signature is fixed:
#   id       — which pattern matched (the integer from our ids list above)
#   from_    — start byte offset of the match (may be 0 if pattern has variable prefix)
#   to       — end byte offset of the match (exclusive)
#   flags    — match flags (usually 0)
#   context  — the extra object you pass to scan(), useful for accumulating results

def on_match(id_, from_, to, flags, context):
    """Called by Hyperscan for every pattern match found."""
    filepath, data, match_list = context
    
    # Recover the matching text from the byte offsets
    matched_text = data[from_:to].decode("utf-8", errors="replace")
    
    # Find the line containing this match
    line_start = data.rfind(b"\n", 0, from_) + 1
    line_end   = data.find(b"\n", to)
    line_end   = line_end if line_end != -1 else len(data)
    line       = data[line_start:line_end].decode("utf-8", errors="replace")
    line_num   = data.count(b"\n", 0, from_) + 1
    
    match_list.append({
        "pattern_id":   id_,
        "pattern":      patterns[id_].decode(),
        "line_num":     line_num,
        "line":         line.strip(),
        "matched_text": matched_text,
    })
    
    # Returning a non-None value from the callback stops scanning early.
    # Returning None continues scanning — we always return None here.


# ── Step 3: Load corpus and scan ──────────────────────────────────────────
corpus = {}
for filepath in Path("/path/to/your/files").rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        corpus[str(filepath)] = filepath.read_bytes()
    except Exception:
        continue

print(f"Scanning {len(corpus)} files with {len(patterns)} patterns simultaneously...")

for filepath, data in corpus.items():
    match_list = []
    
    # db.scan() is the hot path — it runs the compiled SIMD machine code
    # over the data buffer once, calling on_match for every pattern that fires.
    # The third argument is the "context" object passed to your callback.
    db.scan(
        data,
        match_event_handler = on_match,
        context             = (filepath, data, match_list),
    )
    
    for m in match_list:
        print(f"{filepath}:{m['line_num']} [pattern: {m['pattern']}]  {m['line']}")

The conceptual distinction between Hyperscan and Aho-Corasick is subtle but important, and it is worth dwelling on. Aho-Corasick works well for literal strings but cannot handle full regular expressions — you cannot put func_[a-z_]+\( into an Aho-Corasick automaton directly, because it requires quantifiers and character classes that a pure string automaton cannot represent. You would have to enumerate every possible word matching that regex, which is infinite. Hyperscan handles arbitrary PCRE patterns including backreferences, lookaheads, and quantifiers, all compiled to native code. The tradeoff is that Hyperscan's compilation step is substantially slower than building an Aho-Corasick automaton, so it only makes sense when you will reuse the compiled database across many texts. If you are scanning a large corpus that fits in RAM, the compilation cost (which happens once) is amortised across all scans, and the per-scan speed is extraordinary — Hyperscan was designed to classify gigabits-per-second of network traffic in real time on commodity server hardware.


Technique 3: The Rabin-Karp rolling hash — a completely different algorithm for substring search

Every substring search technique covered so far either scans text linearly (grep, memmem) or consults a pre-built index (trigram index, suffix array, inverted index). Rabin-Karp is a third approach: it uses a rolling hash to scan text in a way that is fundamentally different from both of those categories, and it generalises naturally to something that the other algorithms struggle with — searching for any one of a large set of fixed-length patterns simultaneously.

To build the intuition from first principles, start with the naive approach to substring search. To find the word "cat" in a long text, you could check every possible 3-character window: look at characters 0–2, then 1–3, then 2–4, and so on. At each window you compare all three characters against "cat". This is O(N × M) where N is the text length and M is the pattern length.

Now here is the key insight of Rabin-Karp: instead of comparing characters, what if you computed a hash of each window and compared the hash against the hash of your pattern? If the hash matches, you do the full character comparison to confirm (handling the rare false positive). If the hash doesn't match, you skip the window immediately without comparing any characters. The question becomes: can you compute the hash of the next window cheaply, given that you already know the hash of the current window? With most hash functions, the answer is no — you'd have to rehash all M characters from scratch. But with a polynomial rolling hash, the answer is yes: you can remove the contribution of the oldest character and add the contribution of the new character in O(1) time. This is what makes it a "rolling" hash.

The polynomial hash of a window [c₀, c₁, c₂] is computed as c₀ × B² + c₁ × B¹ + c₂ × B⁰ for some base B and modulus. Rolling to the next window [c₁, c₂, c₃] means subtracting c₀ × B², multiplying the whole thing by B, and adding c₃. Each slide costs exactly three arithmetic operations regardless of window size.

python
from pathlib import Path

# ── The rolling hash implementation ───────────────────────────────────────
# We use a polynomial hash with a large prime modulus to keep collision rates low.
# BASE and MOD are chosen to give good distribution over typical ASCII text.

BASE = 256        # treat each byte as a digit in base 256 (natural for byte strings)
MOD  = (1 << 61) - 1   # a Mersenne prime — very efficient for modular arithmetic


def rabin_karp_single(text: bytes, pattern: bytes) -> list[int]:
    """
    Find all occurrences of `pattern` in `text` using the Rabin-Karp algorithm.
    Returns a list of byte offsets where the pattern starts.
    
    Time complexity: O(N + M) average, O(N*M) worst case (degenerate hash collisions).
    In practice with a good hash, worst-case behaviour is vanishingly rare.
    """
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []
    
    # Precompute BASE^(m-1) mod MOD — we need this to "remove" the oldest character.
    # Think of it as the "place value" of the leftmost character in our hash window.
    h = pow(BASE, m - 1, MOD)
    
    # Compute the initial hash of the pattern and the first window of text.
    pattern_hash = 0
    window_hash  = 0
    for i in range(m):
        pattern_hash = (pattern_hash * BASE + pattern[i]) % MOD
        window_hash  = (window_hash  * BASE + text[i])    % MOD
    
    matches = []
    
    for i in range(n - m + 1):
        # Hash comparison is O(1) — this is the fast path
        if window_hash == pattern_hash:
            # Full character comparison to handle hash collisions (false positives)
            # This is O(M) but only happens when hashes match, which is rare
            if text[i:i+m] == pattern:
                matches.append(i)
        
        # Roll the hash forward by one position:
        # 1. Subtract the contribution of the character leaving the window (text[i])
        # 2. Multiply by BASE (shift all remaining characters up one place value)
        # 3. Add the contribution of the new character entering the window (text[i+m])
        if i < n - m:
            window_hash = (window_hash - text[i] * h) % MOD
            window_hash = (window_hash * BASE + text[i + m]) % MOD
    
    return matches


# ── The real power: searching for MANY patterns simultaneously ─────────────
# This is where Rabin-Karp genuinely shines over simpler substring search.
# If all patterns have the same length M, you can:
# 1. Hash all patterns and store them in a Python set (O(1) lookup).
# 2. Slide one window across the text, computing the rolling hash once per position.
# 3. At each position, check if the window's hash is in the pattern hash set — O(1).
# 4. Only do the expensive character comparison when there's a hash match.
#
# The result: you search for K patterns in O(N) time — the same cost as searching
# for ONE pattern. This is a qualitatively different scaling curve from any
# approach that runs a separate search per pattern.

def rabin_karp_multi(text: bytes, patterns: list[bytes]) -> dict[bytes, list[int]]:
    """
    Search for multiple patterns of the SAME length simultaneously.
    One pass through the text, all patterns checked at every position.
    
    This is the key use case where Rabin-Karp is uniquely advantageous.
    """
    if not patterns:
        return {}
    
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns), "All patterns must have the same length"
    
    # Build a hash set of all pattern hashes.
    # Because Python's set uses a hash table internally, membership testing is O(1).
    pattern_hashes = {}
    for p in patterns:
        h_val = 0
        for byte in p:
            h_val = (h_val * BASE + byte) % MOD
        # Map hash → list of patterns (to handle hash collisions gracefully)
        pattern_hashes.setdefault(h_val, []).append(p)
    
    results = {p: [] for p in patterns}
    h       = pow(BASE, m - 1, MOD)
    n       = len(text)
    
    if n < m:
        return results
    
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * BASE + text[i]) % MOD
    
    for i in range(n - m + 1):
        # One O(1) hash lookup checks against ALL K patterns simultaneously.
        # This is the payoff of the rolling hash: position i costs O(1) regardless of K.
        if window_hash in pattern_hashes:
            window = text[i:i+m]
            for candidate_pattern in pattern_hashes[window_hash]:
                if window == candidate_pattern:
                    results[candidate_pattern].append(i)
        
        if i < n - m:
            window_hash = (window_hash - text[i] * h) % MOD
            window_hash = (window_hash * BASE + text[i + m]) % MOD
    
    return results


# ── Example: search for every 8-character hex string in a source file ──────
# (Useful for finding things like git commit hashes, UUIDs, magic numbers)
# Generate patterns programmatically — something impractical with most search tools
import itertools, random

# For demonstration: search for a set of specific 8-character tokens
search_tokens = [b"deadbeef", b"cafebabe", b"feedface", b"0badf00d"]

for filepath in Path("/path/to/your/files").rglob("*.txt"):
    data = filepath.read_bytes().lower()
    matches = rabin_karp_multi(data, search_tokens)
    for pattern, offsets in matches.items():
        for offset in offsets:
            line_num = data.count(b"\n", 0, offset) + 1
            print(f"{filepath}:{line_num} found {pattern.decode()!r}")

The deeper lesson that Rabin-Karp teaches is about the relationship between hashing and search. The rolling hash is the same core idea behind many other important algorithms: the rsync protocol uses rolling hashes to find matching chunks between files without transmitting full file contents; the Rabin fingerprinting scheme in content-defined chunking (used by backup tools like restic and Bup) uses rolling hashes to split files into variable-size chunks at natural boundaries; plagiarism detection systems use rolling hashes to find shared k-grams between documents. Once you understand the rolling hash, you recognise it everywhere.

The important caveat about Rabin-Karp is that the same-length restriction for multi-pattern search is a genuine limitation. If your patterns have different lengths, you either need to group them by length and run one pass per group, or you should use Aho-Corasick or Hyperscan instead. For the specific case of many fixed-length patterns — which comes up more often than you might expect — Rabin-Karp is the cleanest solution.


Technique 4: Semantic vector search with sentence-transformers and FAISS

Every technique in this entire series has been searching for text by matching characters. You give the algorithm a pattern — a literal string, a regex, a set of keywords — and it finds places where those exact characters appear in your files. This is called lexical search, and it has a fundamental limitation: it can only find what you literally wrote, not what you meant. If your files contain the word "automobile" and you search for "car", lexical search finds nothing. If your files discuss "the process of neural network training" and you search for "machine learning optimisation", lexical search might miss it entirely because the vocabulary is different even though the meaning overlaps strongly.

Semantic search takes a completely different approach. Instead of matching characters, it maps every piece of text onto a point in a high-dimensional vector space, where texts with similar meanings end up near each other regardless of what words they use. The key component is a neural network called a sentence transformer — a model trained on hundreds of millions of text pairs to produce embeddings (vectors of floating-point numbers, typically 384 or 768 dimensions) such that semantically similar texts are geometrically close in that space. Searching then becomes a nearest-neighbour query: embed your search query, then find the document vectors that are closest to it in the embedding space.

The library that handles the nearest-neighbour search efficiently in RAM is FAISS (Facebook AI Similarity Search). FAISS stores all your document vectors in memory and can search millions of them in milliseconds using highly optimised linear algebra and, for very large corpora, approximate nearest-neighbour algorithms that trade a small amount of accuracy for dramatic speed improvements.

python
# pip install sentence-transformers faiss-cpu numpy
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from pathlib import Path

# ── Step 1: Load the embedding model ──────────────────────────────────────
# SentenceTransformer downloads and caches a pre-trained neural network.
# "all-MiniLM-L6-v2" is a popular choice: it's small (80MB), fast,
# and produces 384-dimensional embeddings with excellent quality.
#
# The model runs entirely in RAM (and optionally on a GPU if you have one).
# It has already learned the semantic structure of English during training
# on a massive corpus — you don't train it yourself.
print("Loading embedding model...")
model = SentenceTransformer("all-MiniLM-L6-v2")
EMBEDDING_DIM = 384   # this model produces 384-dimensional vectors


# ── Step 2: Build the corpus — one "chunk" per paragraph ──────────────────
# A crucial design decision in semantic search is what constitutes one
# "document" in the index. Options include: one entry per file, per line,
# per paragraph, or per fixed-size sliding window.
#
# Per-paragraph is usually the sweet spot: short enough to be specific,
# long enough to carry semantic meaning. A single line is often too short
# (lacks context); a whole file is too long (the embedding averages everything,
# losing specificity).

chunks     = []   # list of text strings to embed
metadata   = []   # parallel list of (filepath, paragraph_index) for retrieval

def split_into_paragraphs(text: str) -> list[str]:
    """Split text on blank lines, filter out very short chunks."""
    raw = [p.strip() for p in text.split("\n\n")]
    return [p for p in raw if len(p) > 50]   # skip very short paragraphs

base_path = Path("/path/to/your/files")

for filepath in base_path.rglob("*"):
    if filepath.suffix not in (".txt", ".md"):
        continue
    try:
        text = filepath.read_text(encoding="utf-8", errors="ignore")
        paragraphs = split_into_paragraphs(text)
        for para_idx, para in enumerate(paragraphs):
            chunks.append(para)
            metadata.append((str(filepath), para_idx))
    except Exception:
        continue

print(f"Corpus: {len(chunks)} paragraphs from {len(set(m[0] for m in metadata))} files")


# ── Step 3: Embed all chunks using the neural network ─────────────────────
# model.encode() runs all chunks through the transformer network and returns
# a (N, 384) numpy float32 array — one 384-dimensional vector per chunk.
#
# batch_size controls how many chunks are processed simultaneously.
# Larger batches are faster (better GPU/CPU utilisation) but use more RAM.
# show_progress_bar=True shows a tqdm progress bar — useful for large corpora.
#
# This is the most computationally expensive step. On CPU, expect roughly
# 1,000–5,000 sentences per second depending on hardware.
# On a GPU, this can be 10–50× faster.
print("Embedding corpus (this may take a minute)...")
embeddings = model.encode(
    chunks,
    batch_size       = 64,
    show_progress_bar = True,
    convert_to_numpy  = True,
    normalize_embeddings = True,   # normalise to unit length for cosine similarity
)
# embeddings.shape is (len(chunks), 384) — a 2D float32 numpy array

print(f"Embeddings shape: {embeddings.shape}, RAM: {embeddings.nbytes / 1e6:.1f} MB")


# ── Step 4: Build a FAISS index and add the embeddings ────────────────────
# FAISS provides many index types with different speed/accuracy/RAM tradeoffs.
# 
# IndexFlatIP is the simplest: it stores all vectors as a flat array and
# computes exact inner products (dot products) for every search.
# Because we normalised the embeddings, inner product equals cosine similarity.
# This index is exact — no approximation — and appropriate for up to ~1M vectors.
# For larger corpora, you would use IndexIVFFlat (inverted file index) or
# IndexHNSWFlat (hierarchical navigable small world graph), which trade
# a small amount of accuracy for dramatically faster search.
#
# IndexFlatIP lives entirely in RAM — FAISS never writes to disk unless you
# explicitly call faiss.write_index().
index_faiss = faiss.IndexFlatIP(EMBEDDING_DIM)

# faiss.normalize_L2 does in-place L2 normalisation — already done above,
# but it's good practice to ensure consistency.
faiss.normalize_L2(embeddings)

# add() inserts all vectors into the index at once.
# FAISS stores them as a contiguous float32 buffer, very cache-friendly.
index_faiss.add(embeddings)
print(f"FAISS index built: {index_faiss.ntotal} vectors in RAM")


# ── Step 5: Search by meaning ──────────────────────────────────────────────
def semantic_search(query: str, top_k: int = 5) -> list[dict]:
    """
    Find the `top_k` paragraphs most semantically similar to `query`.
    
    The key thing to understand: `query` does not need to share any words
    with the matching paragraphs. The model maps both the query and the
    stored paragraphs into the same semantic space, so proximity in that
    space means proximity in meaning.
    """
    # Embed the query using the same model — this produces one 384-dim vector.
    query_vector = model.encode(
        [query],
        normalize_embeddings = True,
        convert_to_numpy     = True,
    )   # shape: (1, 384)
    
    # index.search() finds the top_k nearest vectors to query_vector.
    # Returns:
    #   distances — (1, top_k) array of similarity scores (higher = more similar)
    #   indices   — (1, top_k) array of integer indices into our chunks list
    distances, indices = index_faiss.search(query_vector, top_k)
    
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:   # FAISS returns -1 when fewer than top_k results exist
            continue
        filepath, para_idx = metadata[idx]
        results.append({
            "score":      float(dist),    # cosine similarity: 1.0 = identical meaning
            "filepath":   filepath,
            "para_index": para_idx,
            "text":       chunks[idx],
        })
    
    return results


# Run some semantic searches — notice the queries use DIFFERENT vocabulary
# than what might appear in the files, yet still find relevant passages.
print("\n--- Semantic search: 'ways to make code run faster' ---")
for r in semantic_search("ways to make code run faster", top_k=3):
    print(f"\n[{r['score']:.3f}] {r['filepath']} (para {r['para_index']})")
    print(f"  {r['text'][:200]}...")

print("\n--- Semantic search: 'dealing with errors and exceptions' ---")
for r in semantic_search("dealing with errors and exceptions", top_k=3):
    print(f"\n[{r['score']:.3f}] {r['filepath']} (para {r['para_index']})")
    print(f"  {r['text'][:200]}...")

# Notice: a file containing "exception handling" will be found
# even if your query said "error management" — because the model
# understands these phrases mean the same thing.

There is a deeper idea here that connects semantic search back to everything we have discussed. Every technique in this series has been designed to answer the question "does this exact pattern appear in this text?" Semantic search answers a fundamentally different question: "does this meaning appear in this text?" The shift from the first question to the second is one of the defining transitions in the history of information retrieval. All of lexical search — grep, inverted indexes, trigram indexes, suffix arrays, Aho-Corasick — is optimised for the first question. The entire field of neural information retrieval, which has grown explosively since the introduction of transformer models in 2017, is optimised for the second.

The practical implication for your corpus of text files is this: lexical search and semantic search are complementary, not competing. Lexical search is perfect when you know exactly what you are looking for — a specific function name, an error code, a configuration key. Semantic search is better when you have an idea or a concept but are not sure how it was expressed — "where in these documents did someone discuss performance tuning?" A production search system often runs both and combines the scores, a technique called hybrid retrieval, where BM25 scores from a lexical index are linearly combined with cosine similarity scores from a vector index. Elasticsearch's Relevance now supports this directly; so do newer tools like Weaviate, Qdrant, and Milvus.


Looking across all twenty techniques

With four responses now behind us, it is worth stepping back to notice the arc of the overall journey, because the techniques did not just accumulate — they each shifted what it means to "search."

The first techniques in the series (page cache, RAM disks, mlockall, shared memory, madvise, prefetching) were fundamentally about physics: getting bytes from storage into the part of the memory hierarchy where the CPU can act on them quickly. None of them changed what search meant; they just changed how fast the bytes arrived.

The middle techniques (inverted indexes, trigram indexes, suffix arrays, Aho-Corasick, Bloom filters) were about data structures and algorithms: rearranging the bytes at index time so that the search operation itself became faster or more powerful. These changed the computational complexity of search without changing its definition.

The last techniques in this response begin to change the definition of search itself. Tantivy and Whoosh add relevance ranking — not just "does this file contain this word" but "how relevant is this file to this query." Hyperscan adds full regular expression matching as a first-class multi-pattern primitive. Rabin-Karp reveals that hashing is a general-purpose tool for matching, not just for dictionaries. And semantic search abandons character-level matching entirely and instead searches for meaning in a learned geometric space.

Each of these transitions corresponds to a real moment in the history of information retrieval: from grep to inverted indexes in the 1970s and 80s, from inverted indexes to ranked retrieval in the 1990s, from keyword search to semantic search in the 2010s and 2020s. You have, in four responses, traced the entire arc of how humanity has thought about finding information in text.

You said: of all the techniques / ways / systems you've provided, do they easily work in Linux and Windows ?

  1. of all the techniques / ways / systems you've provided, do they easily work in Linux and Windows ?

  2. give me one list of those, which will not work in Windows at all. give me another list of those, which will work, but are hard to apply in Windows.

  3. why did you use Python in all of the example codes above, instead of the Go Programming Language (golang) ? golang is very fast, way more faster than Python. golang has good cross-platform support: Linux, Windows, BSDs.

  4. teach me a few easy ways to run your above Python codes in Linux and Windows. give me two seperate guides for Linux and Windows.

May 20

Claude responded: These are four really important questions, and they deserve thorough answers.

These are four really important questions, and they deserve thorough answers. Let me work through each one carefully.


Question 1: Do these techniques work on Linux and Windows?

The honest answer is: most of them work on both, but a meaningful minority are deeply tied to the Linux kernel and will either fail completely on Windows or require substantial rewriting. The fundamental reason for this split is that many of the most powerful low-level techniques — madvise, posix_fadvise, memmem, mlockall, /dev/shm — are part of the POSIX standard, which Linux implements fully and Windows does not implement at all. Windows has its own equivalents for some of these (for example, VirtualLock instead of mlockall, and PrefetchVirtualMemory instead of madvise), but they have different names, different signatures, and different calling conventions, so the Python code as written simply will not run on Windows without being rewritten from scratch.

The techniques that use pure Python libraries — Whoosh, DuckDB, SQLite, the hand-built inverted index, Rabin-Karp, Aho-Corasick, sentence-transformers — are completely cross-platform because they do not make any system calls directly. They stay within Python's own runtime, which abstracts away the OS differences. Python's mmap module is a good example of this: it works on both Linux and Windows, because the Python developers wrote separate implementations for each OS under the hood, and your code just calls mmap.mmap() without needing to know which OS it is running on.


Question 2: The two lists

Techniques that will not work on Windows at all

These techniques use Linux-specific kernel APIs or filesystem features that simply do not exist on Windows. Running this code on Windows would produce an error immediately, and fixing it would require rewriting the technique using a completely different Windows API.

vmtouch will not work on Windows. It is a command-line tool that exists only for Linux and macOS. There is no Windows port and no direct equivalent built into Windows.

/dev/shm and tmpfs RAM disk will not work on Windows. The path /dev/shm is a Linux-specific virtual filesystem that the kernel exposes. Windows has no concept of a filesystem path like this, and it has no built-in tmpfs equivalent.

posix_fadvise will not work on Windows. This is a POSIX system call that talks directly to the Linux kernel's I/O scheduler. The Windows kernel has no posix_fadvise function and no equivalent accessible by the same name via ctypes.

madvise will not work on Windows. Same situation — it is a POSIX system call for advising the kernel's virtual memory subsystem. The code using libc.madvise() via ctypes will fail because Windows's C runtime (msvcrt) does not export this function.

memmem via ctypes will not work on Windows. memmem is a GNU C Library (glibc) extension. Windows uses Microsoft's C runtime (msvcrt), which does not include memmem. The ctypes call will fail when trying to look up the symbol.

mlockall via ctypes will not work on Windows as written. Windows has VirtualLock which does something similar, but it requires completely different arguments and a completely different code structure. The libc.mlockall(MCL_CURRENT | MCL_FUTURE) call will simply fail.

GNU Parallel + grep will not work on Windows natively. Neither parallel nor grep is part of Windows. You can install them through Cygwin or WSL (Windows Subsystem for Linux), but those are essentially Linux environments running inside Windows — at that point you are not really running on Windows in the conventional sense.


Techniques that work on Windows but are hard to apply

These techniques are cross-platform in principle, but getting them to run on Windows involves extra installation steps, compiler requirements, or behavioural differences that can be frustrating.

Hyperscan is officially supported on Windows but the installation process is significantly harder. On Linux you can install it with pip install hyperscan and it mostly just works. On Windows, Intel's Hyperscan library has complex build dependencies including CMake, Boost, and a Visual Studio C++ toolchain. The pip package sometimes works, sometimes requires you to build the library from source yourself, and the documentation assumes a Linux environment.

FAISS (faiss-cpu) has improved on Windows in recent years, and pip install faiss-cpu now often succeeds. However it has historically been unreliable on Windows — older versions of the pip package were Linux-only, and even today you can encounter issues depending on your Python version, Visual C++ redistributable version, and numpy version. On Linux it just works.

Python multiprocessing works on both platforms, but the shared_memory and Pool-based techniques from the earlier responses have an important behavioural difference on Windows that can break your code. On Linux, Python's multiprocessing uses fork() to create child processes, which means the child inherits the parent's memory space instantly. On Windows, multiprocessing uses spawn instead, which starts a fresh Python interpreter and re-imports your script from scratch. This means any worker function you pass to a Pool must be importable as a module-level function — if it is defined inside an if __name__ == "__main__": block or uses a closure over local variables, it will fail on Windows with a confusing PicklingError or AttributeError. The fix is straightforward (always define worker functions at module level and guard your main code with if __name__ == "__main__":), but it catches many people off guard.

The RAM disk concept works on Windows, but requires third-party software. Tools like ImDisk or OSFMount let you create a RAM-backed drive on Windows (it appears as, say, R:\), and then you can copy your files there and search them. But unlike Linux where tmpfs is built into the kernel and takes one command to set up, on Windows you need to download, install, and configure a separate application first.

Tantivy-py works on Windows but the installation occasionally requires the Rust toolchain to be present, because the pip package sometimes needs to compile a native extension. On Linux this is usually handled automatically by a pre-built binary wheel; on Windows the pre-built wheel may not be available for your exact Python version, and you will need to install Rust and its build tools first.


Question 3: Why Python instead of Go?

This is a genuinely excellent question and you are right that Go is faster than Python for most tasks. The choice of Python across all the examples was deliberate, but it was made for teaching reasons rather than performance reasons. Let me explain the thinking honestly, and then explain where Go would actually be the better choice.

The first reason is library availability. The most powerful techniques in the series — FAISS for vector search, sentence-transformers for neural embeddings, DuckDB, Apache Arrow, Whoosh, Tantivy-py, Hyperscan — all have first-class, well-maintained Python bindings. Go equivalents either do not exist, are much less mature, or would require you to call into C libraries via Go's cgo mechanism, which adds considerable complexity and largely negates Go's deployment simplicity advantage (one of Go's best features — a single static binary — disappears the moment you use cgo). When you are trying to teach a concept like "here is how to build an in-memory FTS5 index", you want the code to focus on the concept, not on the scaffolding needed to make an immature library work.

The second reason is ecosystem for teaching. Python's interactive nature — you can paste code into a REPL, run one line at a time, inspect objects mid-execution — makes it ideal for exploratory learning. When you are building intuition about what a suffix array looks like or how Bloom filters behave, being able to add a print() call anywhere and immediately see the internal state is invaluable. Go's compiled nature makes this kind of exploration slower and more cumbersome.

The third reason is familiarity. Python is the most widely used language for data-adjacent and systems-scripting work. Most people reading about text search in memory are likely already comfortable with Python.

However, you are absolutely correct that Go would be the better language for production implementations of most of these techniques. Here is a concrete breakdown of Go's advantages and where they apply most strongly.

Go compiles to native machine code with no garbage collection pauses of the kind Python's GC can cause. Its goroutines are much cheaper than Python threads, and unlike Python, Go has no Global Interpreter Lock — all CPU cores are genuinely available simultaneously for parallel work. The parallel grep technique, the multiprocessing shared memory search, the Aho-Corasick scanner — all of these would be substantially faster in Go and substantially simpler to parallelise. In Python, you fight the GIL constantly; in Go, you just launch goroutines.

Go also has excellent cross-platform support including Windows, without the POSIX compatibility headaches described above. Go's syscall and golang.org/x/sys packages provide OS-specific system calls in a structured way, with separate implementations for Linux, macOS, and Windows — so you can write mmap-based code in Go that actually works correctly on all three platforms, which is much harder in Python where mmap's Windows behaviour has subtle differences.

To give you a concrete feel for what the Go equivalents look like, here is how the Aho-Corasick multi-pattern search and the basic parallel file search would look in Go, since those are two of the most practically useful techniques and Go handles them particularly elegantly.

go
// Multi-pattern search using Go's aho-corasick library
// go get github.com/cloudflare/ahocorasick
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
    "sync"

    "github.com/cloudflare/ahocorasick"
)

func main() {
    // Build the Aho-Corasick automaton from your pattern list.
    // This compiles all patterns into one state machine — exactly
    // the same concept as the Python version, but in compiled Go.
    patterns := []string{
        "TODO", "FIXME", "HACK", "password", "api_key",
    }

    // Convert to byte slices — Go's Aho-Corasick library works at the byte level
    bytePatterns := make([][]byte, len(patterns))
    for i, p := range patterns {
        bytePatterns[i] = []byte(strings.ToLower(p))
    }

    matcher := ahocorasick.NewMatcher(bytePatterns)

    // Walk all .txt and .md files — filepath.WalkDir is Go's recursive directory walk
    var wg sync.WaitGroup
    // A buffered channel acts as a semaphore, limiting to 8 concurrent goroutines.
    // This is the Go equivalent of Python's ThreadPoolExecutor(max_workers=8).
    sem := make(chan struct{}, 8)

    filepath.WalkDir("/path/to/your/files", func(path string, d os.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return nil
        }
        ext := strings.ToLower(filepath.Ext(path))
        if ext != ".txt" && ext != ".md" {
            return nil
        }

        wg.Add(1)
        sem <- struct{}{} // acquire slot

        // Each file is searched in its own goroutine — genuinely parallel,
        // no GIL, all CPU cores used simultaneously
        go func(filePath string) {
            defer wg.Done()
            defer func() { <-sem }() // release slot

            data, err := os.ReadFile(filePath)
            if err != nil {
                return
            }

            lower := strings.ToLower(string(data))
            hits := matcher.Match([]byte(lower))

            for _, patternIdx := range hits {
                fmt.Printf("%s: matched pattern %q\n", filePath, patterns[patternIdx])
            }
        }(path)

        return nil
    })

    wg.Wait()
}

And here is what the in-memory ripgrep-style parallel search looks like in Go, using goroutines and channels — which is much cleaner than Python's multiprocessing equivalent:

go
package main

import (
    "bufio"
    "fmt"
    "os"
    "path/filepath"
    "regexp"
    "strings"
    "sync"
)

// result holds one match so goroutines can send results back to main
type result struct {
    filepath string
    lineNum  int
    line     string
}

func searchFile(path string, pattern *regexp.Regexp, results chan<- result, wg *sync.WaitGroup) {
    defer wg.Done()

    f, err := os.Open(path)
    if err != nil {
        return
    }
    defer f.Close()

    // bufio.Scanner reads line by line without loading the whole file into RAM.
    // For an in-RAM approach, use os.ReadFile and strings.Split instead.
    scanner := bufio.NewScanner(f)
    lineNum := 0
    for scanner.Scan() {
        lineNum++
        line := scanner.Text()
        if pattern.MatchString(line) {
            // Send the match to the results channel — non-blocking because
            // the channel is buffered. The goroutine never waits.
            results <- result{path, lineNum, line}
        }
    }
}

func main() {
    pattern := regexp.MustCompile(`(?i)TODO|FIXME|HACK`)
    results := make(chan result, 1000) // buffered channel
    var wg sync.WaitGroup

    // Producer: walk the filesystem and launch one goroutine per file
    go func() {
        filepath.WalkDir("/path/to/your/files", func(path string, d os.DirEntry, err error) error {
            if err != nil || d.IsDir() {
                return nil
            }
            ext := strings.ToLower(filepath.Ext(path))
            if ext == ".txt" || ext == ".md" {
                wg.Add(1)
                go searchFile(path, pattern, results, &wg)
            }
            return nil
        })
        // After all goroutines are launched, wait for them to finish
        // then close the results channel to signal the consumer to stop
        wg.Wait()
        close(results)
    }()

    // Consumer: print results as they arrive from the channel
    for r := range results {
        fmt.Printf("%s:%d: %s\n", r.filepath, r.lineNum, r.line)
    }
}

Notice how much simpler the Go parallelism is. There is no GIL to fight, no multiprocessing vs threading decision to make, no Pool or Executor boilerplate. You just write go searchFile(...) and the runtime handles the scheduling across all CPU cores.

The honest summary is: Python was the right language for teaching these concepts because its ecosystem is richer for the libraries involved, and it is more familiar to more readers. Go would be the right language for deploying these techniques in production, especially if cross-platform support and genuine parallelism are priorities. For the techniques that are pure algorithms — the inverted index, Rabin-Karp, Aho-Corasick, the trigram index, Bloom filters — Go implementations would be straightforwardly faster and would work identically on Linux, Windows, and macOS without any POSIX compatibility headaches.


Question 4: How to run the Python codes on Linux and Windows

These are two separate guides. Both start from the assumption that you have never set up a Python environment for these specific libraries before.


Guide A: Running the code on Linux

The Linux setup is the simpler of the two, because Python is usually already present and the system package manager handles most dependencies.

Step 1: Verify or install Python.

Open a terminal and type python3 --version. On Ubuntu 22.04 or later, Fedora, Debian, and most modern Linux distributions, Python 3.10 or higher is already installed. If it is not, install it with your package manager.

bash
# On Ubuntu / Debian
sudo apt update && sudo apt install python3 python3-pip python3-venv

# On Fedora / RHEL / CentOS
sudo dnf install python3 python3-pip

# On Arch Linux
sudo pacman -S python python-pip

Step 2: Create a virtual environment.

A virtual environment is an isolated Python installation for your project. This is critically important because it means the libraries you install for this project will not conflict with libraries installed globally or for other projects. Think of it as a clean room — everything inside the virtual environment is self-contained.

bash
# Navigate to wherever you want your project to live
mkdir ~/text-search-project
cd ~/text-search-project

# Create the virtual environment — this creates a folder called "venv"
python3 -m venv venv

# Activate it — your terminal prompt will change to show "(venv)"
# This tells you that you are now "inside" the virtual environment
source venv/bin/activate

After activating, any pip install command installs packages only inside this virtual environment, not globally. When you are done working, type deactivate to leave the environment.

Step 3: Install the libraries.

Install everything you need in one go. Most of these will install pre-compiled binary wheels, so you do not need a C compiler for them.

bash
pip install --upgrade pip    # always upgrade pip first

# Core search and data libraries
pip install lz4 pyarrow duckdb whoosh tantivy pyahocorasick pybloom-live pysais

# For the semantic search technique
pip install sentence-transformers faiss-cpu

# For Hyperscan (may require build tools on some Linux systems)
pip install hyperscan

# For async file reading
pip install aiofiles

If hyperscan fails to install, you may need the C++ build tools:

bash
sudo apt install build-essential cmake    # Ubuntu/Debian
sudo dnf install gcc-c++ cmake           # Fedora

Step 4: Run a script.

Save one of the example scripts to a file, say search.py. Make sure your virtual environment is activated (you should see (venv) in your prompt), then run it:

bash
python3 search.py

For the techniques that use system calls requiring elevated privileges — specifically mlockall — you need to either run with sudo or raise your memlock limit:

bash
# Option A: run with sudo (simpler, fine for testing)
sudo venv/bin/python3 search.py

# Option B: raise the memlock limit for your user (better for regular use)
# Edit /etc/security/limits.conf and add this line:
# yourusername hard memlock unlimited
# Then log out and log back in for the change to take effect

For vmtouch, install it separately as a system tool:

bash
sudo apt install vmtouch     # Ubuntu/Debian
sudo dnf install vmtouch     # Fedora (may need to enable EPEL first)
# Or build from source: https://github.com/hoytech/vmtouch

Step 5: Make the POSIX-specific scripts work.

The techniques using posix_fadvise, madvise, and memmem call directly into libc. On Linux, ctypes.util.find_library("c") reliably finds glibc, so these calls just work. The only requirement is that you are on a reasonably modern Linux system (kernel 4.x or later for features like MADV_FREE). No additional setup is needed.


Guide B: Running the code on Windows

Windows requires a bit more care because Python is not pre-installed, some libraries need the Microsoft C++ build tools, and the POSIX-specific techniques will simply not run — you will need to skip or comment out those sections.

Step 1: Install Python on Windows.

Go to python.org/downloads and download the latest Python 3.11 or 3.12 installer for Windows. When the installer runs, there is one critically important checkbox: "Add Python to PATH". Make sure this is checked before clicking Install Now. Without it, Windows will not know where to find Python when you type python in a terminal.

Alternatively, if you use the Windows Package Manager (winget), you can install Python from PowerShell:

powershell
winget install Python.Python.3.12

After installation, close and reopen any terminal windows, then verify:

powershell
python --version
# Should print something like: Python 3.12.x

Note that on Windows, the command is python rather than python3. The installer sets this up correctly.

Step 2: Install the Microsoft C++ Build Tools.

Several libraries (including pyahocorasick, lz4, and potentially tantivy) need to compile C or C++ code during installation. On Linux this is handled by gcc which is usually already present. On Windows you need Microsoft's compiler, which is distributed as part of Visual Studio Build Tools.

Go to visualstudio.microsoft.com/visual-cpp-build-tools/ and download the Build Tools installer. When it runs, select the "C++ build tools" workload. You do not need the full Visual Studio IDE — just the build tools component. This download is about 1–2GB.

Step 3: Create a virtual environment.

Open PowerShell (not the old cmd.exe — PowerShell is much better for development work). Navigate to your project folder and create a virtual environment, just as on Linux but with slightly different activation syntax:

powershell
# Navigate to your project folder
mkdir C:\Users\YourName\text-search-project
cd C:\Users\YourName\text-search-project

# Create the virtual environment
python -m venv venv

# Activate it — the activation script is in a different location on Windows
venv\Scripts\Activate.ps1

If PowerShell refuses to run the activation script with an error about "execution policy", you need to allow script execution first. This is a Windows security setting. Run this once as an administrator:

powershell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

After activation, you should see (venv) at the start of your PowerShell prompt.

Step 4: Install the cross-platform libraries.

Install the libraries that actually work on Windows. Notice the list is shorter than the Linux version — we omit the POSIX-specific ones.

powershell
pip install --upgrade pip

# These all work on Windows without issues
pip install lz4 pyarrow duckdb whoosh pybloom-live aiofiles

# Aho-Corasick — may need the C++ build tools from Step 2
pip install pyahocorasick

# Tantivy — may need the Rust toolchain (see note below)
pip install tantivy

# For semantic search
pip install sentence-transformers faiss-cpu

# DuckDB and SQLite work perfectly on Windows
pip install duckdb

If tantivy fails to install on Windows because it cannot find a Rust compiler, you will need to install Rust first. Go to rustup.rs and run the installer. After installation, close and reopen your terminal, then try pip install tantivy again.

Step 5: Adjust the code for Windows — which sections to skip.

This is the most important practical step. Before running any of the scripts on Windows, you need to identify and skip the sections that use POSIX-only APIs. Here is a clear breakdown of what to do with each affected technique.

For the mlockall technique, the libc.mlockall() call will fail. Comment out the entire lock_all_memory() function and its call. The rest of the script (loading files into a dictionary, searching them) will still work perfectly without memory locking.

For the posix_fadvise technique, the entire prefetching section needs to be commented out. The prefetch_file() and mark_file_done() functions will fail. The actual file reading and searching logic that comes after is fine — you just lose the prefetching optimisation.

For the madvise technique, comment out the advise() function and all calls to it. The mmap.mmap() usage itself works on Windows — only the madvise hints need to be removed.

For the memmem technique, the whole approach needs replacement. On Windows, instead of calling libc.memmem(), use Python's built-in bytes.find() method, which is slower but correct and cross-platform:

python
# Replace find_all_memmem(data, needle) with this on Windows:
def find_all_bytes_find(data: bytes, needle: bytes) -> list[int]:
    """Cross-platform replacement for memmem — uses Python's built-in bytes.find()."""
    offsets = []
    start = 0
    while True:
        idx = data.find(needle, start)
        if idx == -1:
            break
        offsets.append(idx)
        start = idx + 1
    return offsets

For the /dev/shm and tmpfs techniques, you have two options. The simpler option is to skip the RAM disk entirely and just work with files on your regular disk — you lose the performance benefit but the search code still works. The more complete option is to install ImDisk Toolkit (free, from sourceforge.net/projects/imdisk-toolkit), create a RAM disk (it will appear as R:\ or similar), and change your paths to point there. The Python code itself does not need to change — just the path you give it.

For the vmtouch technique, there is no direct Windows equivalent. Skip it entirely.

Step 6: Run the script.

With the virtual environment activated and the POSIX-specific sections commented out:

powershell
python search.py

For scripts that use multiprocessing (the shared memory and parallel search techniques), remember the important Windows difference mentioned earlier — always guard your main code:

python
# This is required on Windows for any script using multiprocessing.
# Without it, Windows will recursively spawn new processes when it imports your script,
# causing a crash. On Linux this is not strictly required but is good practice.
if __name__ == "__main__":
    # All your multiprocessing code goes here
    main()

One final practical note about WSL. If you find the Windows limitations frustrating — especially losing posix_fadvise, madvise, memmem, and vmtouch — the cleanest solution on a Windows machine is to use Windows Subsystem for Linux (WSL2). WSL2 runs a real Linux kernel inside Windows, which means all the POSIX techniques work correctly. You install it once (wsl --install in an administrator PowerShell), and then you follow the Linux guide above inside the WSL terminal. Your Windows files are accessible at /mnt/c/ from inside WSL. Many developers who work primarily on Windows but need Linux system calls use WSL2 as their primary development environment precisely for this reason.

Turn on web search in Search and tools menu. Otherwise, links provided may not be accurate or up to date.

You said: make me a list of your given techniques above, which requires me to use 'sudo' / become root in Linux, or have Admin priviledge in Windows.

make me a list of your given techniques above, which requires me to use 'sudo' / become root in Linux, or have Admin priviledge in Windows.

May 20

Claude responded: This is an important practical question, and the good news is that the answer is a surprisingly short list.

This is an important practical question, and the good news is that the answer is a surprisingly short list. The vast majority of the techniques across all four responses run entirely in user space — they are pure Python library calls or algorithmic operations that the OS has no reason to restrict. Elevated privileges only become necessary when you are asking the kernel to do something that could affect system stability or starve other processes of resources — specifically, reserving physical RAM so the OS cannot reclaim it, or mounting a new filesystem.

It also helps to distinguish between two separate moments where you might need sudo: at installation time (a one-off apt install or pip install that touches system directories), and at runtime (your actual Python script needing elevated privileges every time it runs). Installation-time sudo is a minor inconvenience. Runtime sudo is a more significant design constraint, because it means you cannot run your script as a normal user in production. The list below focuses primarily on runtime privilege requirements, with installation notes where they are relevant.


Techniques that require sudo / root on Linux, or Admin on Windows at runtime

Creating a tmpfs RAM disk is the most obvious one. The command sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramfs calls directly into the kernel's filesystem management layer, and mounting filesystems is a privileged operation on every Unix-like system. A non-root user simply cannot execute a mount call. On Windows, the equivalent — installing and using ImDisk or OSFMount to create a RAM drive — requires Administrator privileges for the same conceptual reason: creating a new block device is a system-level operation.

Note carefully that using /dev/shm does not require sudo. That path already exists as a world-writable tmpfs mount that the OS creates at boot time. You only need root if you want to create your own tmpfs mount at a custom path.

mlockall via ctypes requires root on Linux, or at minimum a specific kernel capability called CAP_IPC_LOCK. The reason the OS restricts this is straightforward and worth understanding: if any ordinary user could call mlockall and lock all their process memory into RAM permanently, a malicious or buggy program could exhaust all physical RAM on a multi-user system and bring it down for everyone. The kernel therefore requires you to either be root, or have your memory lock limit (RLIMIT_MEMLOCK) explicitly raised by a system administrator in /etc/security/limits.conf. Editing that file itself requires sudo. On Windows, the equivalent privilege is called "Lock pages in memory" (SeLockMemoryPrivilege), and by default only the Administrator account holds it — you can grant it to a normal user through the Local Security Policy editor, but that action itself requires Admin.

vmtouch with the -l flag requires sudo on Linux. This is the same underlying reason as mlockall — locking pages in the kernel's page cache so they cannot be evicted is a privileged operation. It is important to note that vmtouch -t (which simply reads files to pull them into the page cache, without locking them) does not require sudo. The page cache loads naturally as a side-effect of reading, which any user can do. It is only the locking step — preventing eviction — that triggers the privilege requirement.

Raising RLIMIT_MEMLOCK permanently via /etc/security/limits.conf requires sudo to edit the file. The temporary shell-level workaround — running ulimit -l unlimited in your terminal session before launching your script — does not always require sudo, but it is constrained by the hard limit set for your user. If the hard limit is low (which it is by default on many distributions), you cannot raise it beyond that without root editing the limits file first.


Techniques that only need sudo for one-time installation, not for running your scripts

Several techniques require sudo apt install (or dnf install) to get the command-line tool onto the system, but once installed, your scripts and commands run as a normal user with no further elevation needed.

vmtouch itself (the -t flag, without -l) falls into this category. ugrep is the same — sudo apt install ugrep once, then ugrep commands run as a normal user. GNU parallel is the same — sudo apt install parallel once, then parallel invocations need no elevation. Installing the C++ build tools needed to compile hyperscan from source requires sudo apt install build-essential cmake, but once the library is compiled and installed, the Hyperscan Python code runs as a normal user. On Windows, installing the Visual Studio Build Tools requires an Admin account, but again this is a one-time setup step.


Everything else — no elevated privileges needed at all

To make this concrete, here is the complete list of techniques that run entirely as a normal user on both Linux and Windows, with no sudo or Admin required at any stage beyond the initial pip install: the Python dictionary corpus, all mmap usage (Python's mmap module operates on files you already have read permission for), posix_fadvise (you are advising the kernel about your own file descriptors, which requires no privilege), madvise (same reasoning — you are advising about your own process's mapped memory), memmem via ctypes (you are calling a C library function, not a privileged system call), multiprocessing.shared_memory (Python manages this at the user level), the hand-built inverted index, the trigram index, suffix arrays, Bloom filters, Aho-Corasick, Rabin-Karp, the compressed corpus with LZ4, SQLite FTS5 in-memory, DuckDB in-memory, Apache Arrow, Whoosh with RamStorage, Tantivy-py with RamDirectory, Hyperscan at runtime, and the FAISS semantic search system.

The mental model to take away is this: the kernel reserves privileged operations for things that cross process boundaries or affect system-wide resource pools — mounting filesystems affects every user on the machine, and locking RAM affects the availability of physical memory for all other processes. Operations that are entirely contained within your own process — opening files you have permission to read, allocating heap memory, running algorithms, calling library functions — are ordinary user-space operations that the kernel has no reason to restrict.

You said: similarly, this time, make some lists for FreeBSD :

similarly, this time, make some lists for FreeBSD :

  1. make me one list of those of your given techniques above, which will not work in FreeBSD at all. give me another list of those, which will work, but are hard to apply in FreeBSD.

  2. make me a list of those, which requires me to use 'sudo' / become root in FreeBSD.

May 20

Claude responded: FreeBSD is a fascinating case to analyse, because it sits in a very different position from Windows when it comes to these techniques.

FreeBSD is a fascinating case to analyse, because it sits in a very different position from Windows when it comes to these techniques. Unlike Windows, FreeBSD is a fully POSIX-compliant Unix system with a proper kernel virtual memory subsystem, a mature C library, and a direct conceptual lineage from the same BSD Unix that influenced many of these APIs in the first place. So the compatibility picture is much more favourable than Windows overall — but FreeBSD is not Linux, and there are enough differences in system call constants, library availability, and default system configuration to cause real problems if you run the code as written without adjustment.

The single most important conceptual thing to understand before reading these lists is the distinction between a technique not working and a technique needing adaptation. Many of the techniques call into the C library using ctypes, using hardcoded numeric constants that were looked up from Linux header files. On FreeBSD, the same system calls often exist — madvise, mlockall, posix_fadvise — but some of the constant values that control their behaviour differ between the two kernels. The call itself works; the wrong constant value just silently does the wrong thing, which is actually more dangerous than an outright crash because you might not notice. This is why several techniques land in the "hard to apply" category rather than the "won't work" category — the mechanism exists, but using it correctly requires going and looking up the FreeBSD-specific header values rather than blindly reusing the Linux ones.


Part 1: What works and what doesn't on FreeBSD

Techniques that will not work on FreeBSD at all

The first technique in this category is /dev/shm as a RAM disk path. The code from the earlier responses uses /dev/shm as a hardcoded path, treating it as a tmpfs filesystem that is always present. On Linux with systemd, this path is mounted automatically at boot as a world-writable shared memory filesystem. FreeBSD does not do this. The path /dev/shm simply does not exist on a default FreeBSD installation, so any code that copies files to /dev/shm/ or reads from it will fail with a "no such file or directory" error immediately. It is important to understand that the concept works fine on FreeBSD — you can absolutely create a tmpfs mount and use it as a RAM disk — but the specific path /dev/shm and the assumption that it pre-exists are Linux-specific. You would need to manually mount a tmpfs at a path of your choosing, which puts this firmly in the "needs significant adaptation" territory and makes the code non-portable as written.

The second non-working technique is Intel Hyperscan via the hyperscan pip package. Intel's Hyperscan library officially targets Linux and Windows. The hyperscan pip package ships pre-compiled binary wheels that are built for Linux's ELF binary format. FreeBSD uses its own binary format and its own system ABI, so those pre-built wheels are simply incompatible. You could theoretically build Hyperscan from its C++ source code on FreeBSD, but the build system and CMake configuration are designed and tested for Linux, and getting it to compile on FreeBSD requires patching and working through build errors that the Hyperscan maintainers do not officially support. In practice, this technique is effectively unavailable on FreeBSD without a substantial porting effort.

The third non-working technique is FAISS via the faiss-cpu pip package. FAISS, Facebook's vector similarity search library, is developed and tested primarily on Linux. The pip package provides pre-compiled wheels for Linux and macOS, but not for FreeBSD. On FreeBSD, pip install faiss-cpu will attempt to compile FAISS from source if it cannot find a wheel, and this compilation fails due to missing Linux-specific build assumptions and differences in how FreeBSD's compiler toolchain handles certain SIMD intrinsics. This means the semantic search technique — the most conceptually novel one in the entire series — is effectively blocked on FreeBSD unless you are willing to invest significant effort in porting the FAISS build system. If you need vector search on FreeBSD, a practical workaround is to use hnswlib instead (pip install hnswlib), which is a pure C++ approximate nearest-neighbour library that compiles cleanly on FreeBSD and provides similar functionality to the flat FAISS index used in the examples.

The fourth non-working item is GNU Parallel as a system command. This is slightly different from the others because it is a command-line tool rather than a Python library, but since it featured prominently in the parallelism technique, it belongs on this list. FreeBSD does not include GNU Parallel in its base system. You can install it from the ports collection with pkg install parallel, but even then, FreeBSD's default grep is BSD grep rather than GNU grep, which has subtly different command-line flags. The combination of needing to install both GNU Parallel and ensuring you are calling the right grep version makes this technique work but with meaningful friction.


Techniques that will work on FreeBSD, but are hard to apply

The first hard-to-apply technique is the madvise code using hardcoded Linux constant values. FreeBSD's madvise system call exists and works well — FreeBSD has had excellent madvise support for a very long time. The problem is purely with the numeric constants. In the earlier code, these were defined as:

python
MADV_NORMAL     = 0
MADV_RANDOM     = 1
MADV_SEQUENTIAL = 2
MADV_WILLNEED   = 3
MADV_DONTNEED   = 4
MADV_FREE       = 8   # This is the Linux value — WRONG on FreeBSD

On FreeBSD, MADV_FREE is defined as 5, not 8. This means that if you run the code as written on FreeBSD and call advise(mm, 0, len(mm), MADV_FREE), you will actually be passing the value 8 to the kernel, which on FreeBSD corresponds to a different constant entirely (MADV_NOSYNC, which tells the kernel not to synchronise this mapping to disk — completely unrelated to what you intended). The kernel will not raise an error; it will simply do the wrong thing silently. To fix this, you need to look up the correct values from FreeBSD's <sys/mman.h> header file and use them in your Python code. The fix itself is simple once you know about the problem, but the difficulty is that this kind of silent misbehaviour is easy to miss.

The second hard-to-apply technique is mlockall and the memory locking limit system. The mlockall system call itself works identically on FreeBSD — it is a POSIX standard call and FreeBSD implements it correctly. The difficulty is in how you raise the memory lock limit for a normal user when you do not want to run your script as root every time. On Linux, you edit /etc/security/limits.conf and add a line like yourusername hard memlock unlimited. FreeBSD has a completely different resource limit system. Instead of limits.conf, FreeBSD uses /etc/login.conf, and the syntax and structure are entirely different. You would need to add an entry to a login class definition in that file, then rebuild the login database with cap_mkdb /etc/login.conf, and then ensure your user is assigned to that login class. This is not conceptually difficult, but it is a different enough system that anyone familiar only with the Linux approach will need to read FreeBSD-specific documentation to get it right.

The third hard-to-apply area is tantivy-py and other libraries requiring Rust compilation. On Linux, pip install tantivy almost always succeeds immediately because pre-built binary wheels are available. On FreeBSD, the pip package will not find a compatible wheel and will attempt to compile the Rust source code locally. This means you first need to install Rust via pkg install rust or via rustup, then install Tantivy. The compilation succeeds on FreeBSD in most cases, but it takes several minutes and occasionally runs into version compatibility issues between the Rust edition that Tantivy's Cargo.toml specifies and the version of Rust available in FreeBSD's ports tree. The same issue applies to pysais for suffix arrays and to pyahocorasick, both of which compile C extensions that work on FreeBSD but require the correct development toolchain to be installed first via pkg install python3 py39-pip gcc.

The fourth hard-to-apply technique is vmtouch. This is actually much more positive than the previous items — vmtouch's author explicitly supports FreeBSD, and pkg install vmtouch works cleanly. The difficulty is minor: the -l flag (for locking pages) behaves slightly differently on FreeBSD in terms of which kernel resources it consumes, and the output statistics it prints reference some Linux-specific /proc filesystem paths that do not exist on FreeBSD. For basic usage — loading files into the page cache with vmtouch -t — it works perfectly. The locking functionality works but requires root, just as on Linux.

The fifth hard-to-apply area is multiprocessing.shared_memory. FreeBSD implements POSIX shared memory using its own shared memory subsystem, which is different from Linux's in one practically important way: the size of the shared memory pool is controlled by the kern.ipc.shmall and kern.ipc.shmmax sysctl variables, which have conservative defaults on many FreeBSD installations. If you try to create a large shared memory segment — say, for a 1GB corpus — you may hit these limits and receive an OSError: [Errno 22] Invalid argument from SharedMemory(create=True, size=...). The fix is to raise these limits with sysctl kern.ipc.shmmax=2147483648 before running your script, which requires root. Once the limits are set appropriately, the shared memory code itself runs correctly.


Part 2: Techniques requiring sudo or root on FreeBSD

The list of runtime privilege requirements on FreeBSD is very similar to Linux, with one important structural difference: FreeBSD manages resource limits through /etc/login.conf rather than /etc/security/limits.conf, which means the one-time administrative setup to allow a normal user to perform privileged operations is done differently, even when the privilege concept is the same.

Mounting a tmpfs filesystem requires root on FreeBSD, just as on Linux. The command sudo mount -t tmpfs tmpfs /mnt/ramfs works on FreeBSD with the same syntax, and the same reasoning applies — mounting filesystems is a system-wide operation that affects all users, so the kernel requires root. This is the runtime requirement for the RAM disk technique. If you use only /dev/shm on Linux you avoid this requirement because the mount is pre-existing, but on FreeBSD there is no pre-existing equivalent, so you must mount one yourself, which always requires root.

mlockall at runtime requires root on FreeBSD, or a raised memorylocked limit in /etc/login.conf. This is identical in principle to the Linux situation, just with a different configuration file. The default memorylocked limit for normal users on FreeBSD is quite small — often 64KB or less — which means any attempt to lock a meaningful amount of memory into RAM will fail with ENOMEM unless you are root or have had your limit raised by an administrator. Running the mlockall script with sudo python3 search.py is the simplest fix for testing. For a permanent fix without root at runtime, you need a root-level edit to /etc/login.conf.

vmtouch -l (the locking flag) requires root on FreeBSD for exactly the same reason as mlockall — it uses mlock under the hood to pin page cache entries, which is a privileged operation. vmtouch -t (which simply reads files into the page cache without locking them) does not require root on FreeBSD, just as on Linux.

Raising shared memory kernel limits via sysctl requires root. If your shared memory code fails due to FreeBSD's conservative default kern.ipc.shmmax value, fixing it requires sudo sysctl kern.ipc.shmmax=2147483648. To make this persistent across reboots, you would add kern.ipc.shmmax=2147483648 to /etc/sysctl.conf, which requires root access to edit. Once this is done, the shared memory code itself runs as a normal user.

Installing packages with pkg requires root, just as apt or dnf does on Linux. Commands like sudo pkg install vmtouch ripgrep parallel rust are one-time setup steps rather than runtime requirements, but it is worth noting that on a FreeBSD system where you do not have root access at all — a shared server, for example — you cannot install these system-level tools and would need to ask an administrator to do it for you. The pure Python library installations via pip inside a virtual environment do not require root, exactly as on Linux.

The encouraging takeaway is that FreeBSD's privilege model is conceptually identical to Linux's — the kernel restricts the same categories of operations (filesystem mounting, physical memory locking, shared memory pool sizing) for the same good reasons. The main practical difference is that configuring the system to allow your normal user to perform these operations involves editing FreeBSD's own configuration files (/etc/login.conf, /etc/sysctl.conf) rather than the Linux-specific ones (/etc/security/limits.conf). Once you know where FreeBSD keeps its configuration, the privilege requirements are just as manageable as on Linux.