used claude code to continue iterating on my indexing code. it's hard to build out a whole pipeline without a clear picture of the architecture. honestly the biggest mistake i made was having claude code do all the heavy lifting. i should've spent a few hours sitting down and sketching out a good framework. but i'm also new to this so i didn't know what a good framework looks like. i guess it's an iterative process. now i know using asyncio is better for this task, and it should have resume functionality, and it should have no side effects.
what are side effects?
side effects = hidden changes a function makes to state outside itself (modifying global state, mutating its arguments, writing to files or a database without returning the result)
bad - with side effects:
def process_chunks(file_path):
    chunks = chunk_file(file_path)
    for chunk in chunks:
        embedding = generate_embedding(chunk)
        db.insert(chunk, embedding)  # ← side effect: writes to DB immediately
    return len(chunks)

# if this crashes halfway, you get:
# - duplicate entries on retry
# - partial data in database
# - no way to know what succeeded
good - no side effects:
def process_chunks(file_path):
    chunks = chunk_file(file_path)
    results = []
    for chunk in chunks:
        embedding = generate_embedding(chunk)
        results.append({"chunk": chunk, "embedding": embedding})
    return results  # ← caller decides what to do with results

# if this crashes, just retry - no corruption
# can batch insert all at once
# easy to validate before committing to DB
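here's what "caller decides" looks like in practice — a sketch with stand-in stubs (`chunk_file`, `generate_embedding`, and the `insert_many` method are all hypothetical, not the real pipeline's names):

```python
# stand-in chunker: split text on whitespace
def chunk_file(text):
    return text.split()

# stand-in embedding: a fake 1-dim vector
def generate_embedding(chunk):
    return [float(len(chunk))]

def process_chunks(file_path):
    chunks = chunk_file(file_path)
    results = []
    for chunk in chunks:
        results.append({"chunk": chunk, "embedding": generate_embedding(chunk)})
    return results

# a toy in-memory DB so the example runs end to end
class FakeDB:
    def __init__(self):
        self.rows = []

    def insert_many(self, rows):
        self.rows.extend(rows)

db = FakeDB()
results = process_chunks("hello world")
# validate BEFORE committing anything
assert all(r["embedding"] for r in results)
# one batch write at the edge of the program, not buried inside a loop
db.insert_many(results)
```

the side effect still happens eventually — data has to land in the DB — but it's pushed to one obvious place at the top level, where you can validate first and retry safely.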
why this matters for pipelines:
- can retry any step without corrupting state
- can run steps in parallel safely
- can resume from crashes without side effects piling up
- testing is simple: input → output, no setup needed
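the three lessons (asyncio, resume, no side effects) fit together in one small sketch. everything here is hypothetical scaffolding — `process` stands in for the real embed-a-chunk step, and persistence of the `done` set (e.g. dumping it to json) is left to the caller so `run` itself stays side-effect free:

```python
import asyncio

async def embed(chunk):
    await asyncio.sleep(0)  # stand-in for a real embedding API call
    return [float(len(chunk))]

async def process(chunk):
    # pure: input in, output out, nothing written anywhere
    return {"chunk": chunk, "embedding": await embed(chunk)}

async def run(chunks, done):
    # resume: skip anything already in the done set
    todo = [c for c in chunks if c not in done]
    # parallel is safe precisely because process() has no side effects
    results = await asyncio.gather(*(process(c) for c in todo))
    # return the new done set; the caller persists it to disk
    return results, done | set(todo)

# first run: processes everything
results, done = asyncio.run(run(["a", "bb"], set()))
# second run with a bigger input resumes: only "ccc" is processed
more, done = asyncio.run(run(["a", "bb", "ccc"], done))
```

crash between the two runs and nothing is corrupted — the done set tells you exactly where to pick up, and re-running a chunk just recomputes the same pure output.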