learned about batch calls using the b10 performance client today
the problem it solves: even with async, you're bottlenecked by
- python's GIL (no true parallelism)
- no smart batching
- no request hedging (p99 latency kills you)
what is request hedging?
imagine you send requests and 99% of them come back in 100ms, but 1% take 5s due to a network blip or a slow replica
request hedging is: after X ms, send a duplicate request, whichever finishes first wins, the slow one gets cancelled
it's like calling an uber, a waymo, and a lyft, and whichever arrives first, you get on, the rest you cancel. (wouldn't that be a great app)
the catch: it costs extra requests. you can cap this at a budget with b10
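here's roughly what that looks like by hand with plain asyncio. a minimal sketch of the race-and-cancel idea, not what b10 does internally; fetch() and the url are placeholders i made up:

import asyncio

async def fetch(url: str) -> str:
    # stand-in for a real HTTP call (aiohttp / httpx in practice)
    await asyncio.sleep(0.1)
    return f"response from {url}"

async def hedged_request(url: str, hedge_delay: float = 0.5) -> str:
    primary = asyncio.create_task(fetch(url))
    # give the primary hedge_delay seconds; if it hasn't finished, fire a duplicate
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()
    backup = asyncio.create_task(fetch(url))
    # first one back wins, the loser gets cancelled
    done, pending = await asyncio.wait({primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request("https://example.com/embed")))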
from baseten_performance_client import PerformanceClient

client = PerformanceClient(
    base_url="https://api.openai.com",
    api_key="your-key"
)

texts = ["doc " + str(i) for i in range(100000)]

response = client.embed(
    input=texts,
    model="text-embedding-3-small",
    batch_size=128,                # pack by count
    max_chars_per_request=50000,   # or by chars (hits limit first)
    max_concurrent_requests=256,
    hedge_delay=0.5                # send duplicate after 0.5s
)
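to see how batch_size and max_chars_per_request interact, here's a toy packer (my own sketch, not the library's internals). whichever limit is hit first closes the current batch, which is what "hits limit first" means above:

def pack(texts, batch_size=128, max_chars_per_request=50000):
    # greedy packing: close the batch once either the count cap
    # or the char cap would be exceeded
    batches, current, chars = [], [], 0
    for t in texts:
        if current and (len(current) >= batch_size or chars + len(t) > max_chars_per_request):
            batches.append(current)
            current, chars = [], 0
        current.append(t)
        chars += len(t)
    if current:
        batches.append(current)
    return batches

docs = ["doc " + str(i) for i in range(1000)]
print(len(pack(docs)), "requests instead of 1000 single calls")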
went to the ferry building and tried lunette, the cambodian restaurant. the pork noodle soup was decent, esp for the $28 price tag. i had high expectations after watching that yt video, but i cannot trust youtubers. went to the main library and picked up two books from the bookstore, then worked out of there for a few hours. walked to ikea to get some meatballs, then worked out of saluhall, a modern food hall with tasteful decor and lights. i sat there getting more work done before i rushed to pick up the chicken rice i ordered from Gai and Rice and to catch my waymo home.