Explain when you would use asyncio, threading, and multiprocessing when developing a FastAPI app that acts as a proxy for LLM inference.
Sorry for the novel—I got kind of into this question.
LLM inference is a miracle of our age, but when it comes to talking to it, it’s just an API that pends for a while and sends you back some data. That means we need to treat it like any other high-latency API. Hence, we can abstract away the LLM-ness of this problem and just ask ourselves whether what we’re building is real-time, near real-time (streaming), or batch.
Most of the time, we’re probably going to want to use asyncio for real-time and streaming middleware APIs, regardless of the service they’re proxying. The benefit of this tool is that it operates in a single thread, suspending tasks while they pend. This means that we can use single-core pods/VMs in our deployment, which gives us better control over scaling. IIRC, all the modern API frameworks support some form of cooperative multitasking, but FastAPI is the one built around Python’s native asyncio. This is because Flask (typically paired with gevent) and Tornado (which has its own event loop) came of age when Python’s native asyncio was pretty bad and the syntax for coroutines was even worse. Full disclosure that I haven’t had occasion to think about this scenario in a few years.
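To make that concrete, here’s a rough sketch of the async proxy shape I have in mind. The upstream URL, routes, and payload shape are placeholders rather than any particular provider’s API; the point is just that the event loop suspends each request while the LLM pends, so one worker can juggle many slow calls at once.

```python
# Minimal sketch of an async FastAPI proxy. UPSTREAM_URL and the payload
# shape are hypothetical, not a real provider's API.
import httpx
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

UPSTREAM_URL = "https://llm.example.internal/v1/completions"  # placeholder


@app.post("/proxy/complete")
async def proxy_complete(payload: dict):
    # The coroutine is suspended while the upstream call pends, so the
    # single-threaded event loop keeps serving other requests.
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(UPSTREAM_URL, json=payload)
        return upstream.json()


@app.post("/proxy/stream")
async def proxy_stream(payload: dict):
    # Relay the upstream response to the caller chunk by chunk (streaming).
    async def relay():
        async with httpx.AsyncClient(timeout=None) as client:
            async with client.stream("POST", UPSTREAM_URL, json=payload) as resp:
                async for chunk in resp.aiter_bytes():
                    yield chunk

    return StreamingResponse(relay(), media_type="text/event-stream")
```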
Multiprocessing is fantastic for batch processing, because it spawns completely independent Python interpreters for each batch. And since each process is independent, it’s trivial to use it alongside asyncio, which is fantastic for when we need to make tons of 3rd-party API requests and our task is simple enough to keep on one machine. I actually ran into this exact situation while I was at Etsy. They had a job that was using 500 (almost totally idle) Apache Beam workers making queries to a vector database. I replaced them with just one, and achieved higher throughput. The downside is that you need to serialize (pickle) whatever state you hand to the workers and whatever results they hand back.
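Here’s roughly what that multiprocessing-plus-asyncio pattern looks like. The API URL, chunk sizes, and worker count are made up for illustration; the idea is that each process runs its own event loop and fans out many concurrent requests for its slice of the batch.

```python
# Sketch: each worker process runs an independent asyncio event loop and
# fires off its slice of the batch concurrently. API_URL is a placeholder.
import asyncio
from multiprocessing import Pool

import httpx

API_URL = "https://vector-db.example.internal/query"  # placeholder


async def fetch_one(client: httpx.AsyncClient, item: str) -> dict:
    resp = await client.post(API_URL, json={"query": item})
    return resp.json()


async def fetch_batch(items: list[str]) -> list[dict]:
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_one(client, i) for i in items))


def process_chunk(items: list[str]) -> list[dict]:
    # Each process owns its own event loop. Arguments and results are
    # pickled across the process boundary, which is the serialization
    # cost mentioned above.
    return asyncio.run(fetch_batch(items))


if __name__ == "__main__":
    work = [f"item-{i}" for i in range(10_000)]
    chunks = [work[i:i + 1_000] for i in range(0, len(work), 1_000)]
    with Pool(processes=8) as pool:
        results = pool.map(process_chunk, chunks)
```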
As for threading, I’ll be honest: in almost 18 years of writing Python, I have almost never found it to be the right tool. Technically, you can use it to create some interesting flows involving shared state that could theoretically outperform asyncio or multiprocessing in terms of wall clock time. But between debugging complexity and GIL issues, this has almost never been worth it. I say “almost” because I vaguely remember rigging up some complicated thing involving alarms and timers, but I don’t recall what.
I understand that the latest release of Python makes the GIL optional. I have not looked into this at all, in part because the complexity and bug risk are still there. Here’s the way I look at it: if you are so performance sensitive that you need to think about threads, you should probably be looking at writing this piece of your infra in a systems language like Go or Rust. If it needs tight integration into Python, you could write it in C. In any of those scenarios, it will probably be less painful to maintain.
Note that for batch workloads, you can trivially switch between multiprocessing and threading by using concurrent.futures, which is by far my preferred way to work with either of those. By yielding the results as they complete, you can transparently parallelize a piece of your code and hand it to any downstream code that expects an iterator.
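For what it’s worth, that pattern is about ten lines. This is just a sketch, and `process_item` is a hypothetical stand-in for whatever per-item work you’re doing, but it shows how swapping the executor class is the only change needed to move between processes and threads, and how yielding completed results gives downstream code a plain iterator.

```python
# Sketch of the concurrent.futures pattern: swap ProcessPoolExecutor for
# ThreadPoolExecutor without touching the surrounding code, and yield
# results as they complete so consumers just see an iterator.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator


def parallel_map(fn: Callable, items: Iterable, use_processes: bool = True) -> Iterator:
    # With processes, fn and its arguments must be picklable (module-level).
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls() as executor:
        futures = [executor.submit(fn, item) for item in items]
        for future in as_completed(futures):
            yield future.result()


# Usage: any downstream code that expects an iterator works unchanged.
# for result in parallel_map(process_item, work_items):
#     handle(result)
```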