used claude code to continue iterating on my indexing code. it's hard to build out a whole pipeline without a clear picture of the architecture. honestly the biggest mistake i made was having claude code do all the heavy lifting. i should've spent a few hours sitting down and sketching out a good framework. but i'm also new to this so i didn't know what a good framework looks like. i guess it's an iterative process. now i know using asyncio is better for this task, and it should have resume functionality, and it should have no side effects.
what are side effects?
side effects = hidden changes a function makes to state outside itself (modifying global state, mutating its arguments, writing to files or a database without returning the result)
bad - with side effects:
def process_chunks(file_path):
    chunks = chunk_file(file_path)
    for chunk in chunks:
        embedding = generate_embedding(chunk)
        db.insert(chunk, embedding)  # ← side effect: writes to DB immediately
    return len(chunks)

# if this crashes halfway, you get:
# - duplicate entries on retry
# - partial data in database
# - no way to know what succeeded
good - no side effects:
def process_chunks(file_path):
    chunks = chunk_file(file_path)
    results = []
    for chunk in chunks:
        embedding = generate_embedding(chunk)
        results.append({"chunk": chunk, "embedding": embedding})
    return results  # ← caller decides what to do with results

# if this crashes, just retry - no corruption
# can batch insert all at once
# easy to validate before committing to DB
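here's what "caller decides" looks like in practice — a sketch with stand-in stubs (`chunk_file`, `generate_embedding`, and the `insert_many` method are all hypothetical, not the real pipeline's names):

```python
# stand-in chunker: split text on whitespace
def chunk_file(text):
    return text.split()

# stand-in embedding: a fake 1-dim vector
def generate_embedding(chunk):
    return [float(len(chunk))]

def process_chunks(file_path):
    chunks = chunk_file(file_path)
    results = []
    for chunk in chunks:
        results.append({"chunk": chunk, "embedding": generate_embedding(chunk)})
    return results

# a toy in-memory DB so the example runs end to end
class FakeDB:
    def __init__(self):
        self.rows = []

    def insert_many(self, rows):
        self.rows.extend(rows)

db = FakeDB()
results = process_chunks("hello world")
# validate BEFORE committing anything
assert all(r["embedding"] for r in results)
# one batch write at the edge of the program, not buried inside a loop
db.insert_many(results)
```

the side effect still happens eventually — data has to land in the DB — but it's pushed to one obvious place at the top level, where you can validate first and retry safely.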
why this matters for pipelines:
- can retry any step without corrupting state
- can run steps in parallel safely
- can resume from crashes without side effects piling up
- testing is simple: input → output, no setup needed
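the three lessons (asyncio, resume, no side effects) fit together in one small sketch. everything here is hypothetical scaffolding — `process` stands in for the real embed-a-chunk step, and persistence of the `done` set (e.g. dumping it to json) is left to the caller so `run` itself stays side-effect free:

```python
import asyncio

async def embed(chunk):
    await asyncio.sleep(0)  # stand-in for a real embedding API call
    return [float(len(chunk))]

async def process(chunk):
    # pure: input in, output out, nothing written anywhere
    return {"chunk": chunk, "embedding": await embed(chunk)}

async def run(chunks, done):
    # resume: skip anything already in the done set
    todo = [c for c in chunks if c not in done]
    # parallel is safe precisely because process() has no side effects
    results = await asyncio.gather(*(process(c) for c in todo))
    # return the new done set; the caller persists it to disk
    return results, done | set(todo)

# first run: processes everything
results, done = asyncio.run(run(["a", "bb"], set()))
# second run with a bigger input resumes: only "ccc" is processed
more, done = asyncio.run(run(["a", "bb", "ccc"], done))
```

crash between the two runs and nothing is corrupted — the done set tells you exactly where to pick up, and re-running a chunk just recomputes the same pure output.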