learned about batch calls using the b10 performance client today
the problem it solves: even with async, you're bottlenecked by
- python's GIL (no true parallelism)
- no smart batching
- no request hedging (p99 latency kills you)
what is request hedging?
imagine you send requests and 99% of them come back in 100ms, but 1% take 5s due to a network blip or a slow replica
request hedging is: after X ms, send a duplicate request, whichever finishes first wins, the slow one gets cancelled
it's like calling an uber, a waymo, and a lyft, and whichever arrives first, you get on, the rest you cancel. (wouldn't that be a great app)
the catch: it costs extra requests. you can cap this at a budget with b10
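here's roughly what that looks like by hand with plain asyncio. a minimal sketch of the race-and-cancel idea, not what b10 does internally; fetch() and the url are placeholders i made up:

import asyncio

async def fetch(url: str) -> str:
    # stand-in for a real HTTP call (aiohttp / httpx in practice)
    await asyncio.sleep(0.1)
    return f"response from {url}"

async def hedged_request(url: str, hedge_delay: float = 0.5) -> str:
    primary = asyncio.create_task(fetch(url))
    # give the primary hedge_delay seconds; if it hasn't finished, fire a duplicate
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()
    backup = asyncio.create_task(fetch(url))
    # first one back wins, the loser gets cancelled
    done, pending = await asyncio.wait({primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request("https://example.com/embed")))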
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://api.openai.com",
    api_key="your-key"
)

texts = ["doc " + str(i) for i in range(100000)]

response = client.embed(
    input=texts,
    model="text-embedding-3-small",
    batch_size=128,                # pack by count
    max_chars_per_request=50000,   # or by chars (hits limit first)
    max_concurrent_requests=256,
    hedge_delay=0.5                # send duplicate after 0.5s
)
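to see how batch_size and max_chars_per_request interact, here's a toy packer (my own sketch, not the library's internals). whichever limit is hit first closes the current batch, which is what "hits limit first" means above:

def pack(texts, batch_size=128, max_chars_per_request=50000):
    # greedy packing: close the batch once either the count cap
    # or the char cap would be exceeded
    batches, current, chars = [], [], 0
    for t in texts:
        if current and (len(current) >= batch_size or chars + len(t) > max_chars_per_request):
            batches.append(current)
            current, chars = [], 0
        current.append(t)
        chars += len(t)
    if current:
        batches.append(current)
    return batches

docs = ["doc " + str(i) for i in range(1000)]
print(len(pack(docs)), "requests instead of 1000 single calls")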
went to the ferry building and tried lunette, the cambodian restaurant. the pork noodle soup was decent, esp for the $28 price tag. i had high expectations after watching that yt video, but i cannot trust youtubers. went to the main library and picked up two books from the bookstore, then worked out of there for a few hours. walked to ikea to get some meatballs, then worked out of saluhall, a modern food hall with tasteful decor and lights. i sat there getting more work done before i rushed to pick up the chicken rice i ordered from Gai and Rice and to catch my waymo home.