Free Threading — Python's way to "goroutines", not really.
Published on Apr 25, 2026
Python’s Global Interpreter Lock (GIL) has long been the main impediment for anyone trying to write truly concurrent Python. Threads? Locked with the GIL. CPU-bound parallelism with threads? Nope. AsyncIO? Cooperative, not parallel. Every Python developer eventually hits this wall and either reaches for multiprocessing (heavy, serialization overhead) or accepts the limitations and spawns multiple workers.
But Python 3.13+ shipped something that changes this: free-threading — an experimental build that removes the GIL entirely. And if you squint hard enough, threads in a no-GIL build start to look a lot like Go’s goroutines: lightweight execution units running truly in parallel, communicating via channels (queues, in Python’s case), sharing memory where convenient. Well, not really — they differ by a lot. But it’s the closest Python gets.
The GIL Problem
The GIL ensures only one thread executes Python bytecode at a time. This makes CPython’s memory management simple but kills true parallelism for CPU-bound work. AsyncIO gives you concurrency (cooperative multitasking on a single thread) but not parallelism. Standard threads give you… well, not much either, thanks to the GIL.
The key insight: remove the GIL, and threads suddenly work. Python 3.13 introduced free-threaded builds (also called “no-GIL” or python3.13t), and by Python 3.14 the feature has matured significantly. In a free-threaded build, multiple OS threads can execute Python bytecode simultaneously — no lock, no contention, no serial bottleneck.
This is exactly what PEP 703 delivers.
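You can check at runtime which build you are on. A small sketch using sysconfig's Py_GIL_DISABLED config variable and, on 3.13+, sys._is_gil_enabled() (guarded below so it also runs on older interpreters):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on a free-threaded (t) build, 0 or None on a standard build
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# On 3.13+, sys._is_gil_enabled() reports whether the GIL is active right now
# (a t-build can still re-enable it, e.g. via PYTHON_GIL=1)
gil_active = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True

print(f"free-threaded build: {free_threaded_build}, GIL active: {gil_active}")
```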
Free-Threading: No GIL, True Threads
Free-threading is a build-time option. Install a t-suffixed Python:

```bash
# Install the no-GIL Python interpreter
uv python install 3.14t

# Run your script with it
uv run --python 3.14t python freethread.py
```

In a free-threaded build:
- Multiple threads execute Python bytecode truly in parallel
- Memory is shared across threads (no serialization, no deep copies)
- Standard threading module works out of the box — no new API to learn
- queue.Queue is all you need for thread-safe communication
The downside? Thread safety is now your responsibility: you’ll need locks, queues, or other synchronization primitives. This is the same tradeoff Go developers make daily with goroutines and sync.Mutex.
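A minimal sketch of that responsibility: the classic shared-counter race, made deterministic with a threading.Lock. On a GIL build this often happens to work without the lock; on a free-threaded build the unlocked version can silently lose increments.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        # Without the lock, `counter += 1` is a read-modify-write race
        # once threads truly run in parallel.
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000: deterministic only because of the lock
```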
The Goroutine Analogy
Here’s how the two models line up:
| Concept | Go | Python Free-Threading |
|---|---|---|
| Execution unit | goroutine | thread + asyncio loop |
| Scheduler | Go runtime (M:N, work-stealing) | OS threads (1:1) |
| Communication | channels | queue.Queue |
| Shared memory | available via sync.Mutex | available via threading.Lock |
| Creation cost | ~2 KB (growable stack) | ~32 KB minimum stack + interpreter share |
| Parallelism | yes | yes (on free-threaded build) |
The real magic happens when you combine free-threaded builds with asyncio. A goroutine is lightweight — you spawn thousands, each doing async I/O. A Python thread is heavier — you might spawn 4–8 — but each one runs its own asyncio event loop, handling thousands of coroutines internally. It’s like having a handful of Go’s M (machine threads), each managing many G (goroutines). See the pattern?
Let’s Build It: Parallel PokeAPI Fetcher
Enough theory, show me the damn code! Let’s build a concurrent Pokemon data fetcher using 4 worker threads (no GIL), each running an asyncio event loop, coordinated through standard queues — Python’s answer to Go channels, thread-safe by design.
```python
import asyncio
import httpx
import queue
from threading import Thread


async def fetch_single(name, client):
    """Fetch a single pokemon."""
    url = f"https://pokeapi.co/api/v2/pokemon/{name}"
    resp = await client.get(url, timeout=30)
    data = resp.json()
    types = [t['type']['name'] for t in data['types']]
    return f"Worker fetched {name}: {types}"


def worker_entrypoint(work_queue, result_queue, worker_id):
    """Worker thread entrypoint."""

    async def process_batch(names, client, semaphore):
        async def fetch_with_semaphore(name):
            async with semaphore:
                try:
                    return await fetch_single(name, client)
                except Exception as e:
                    return f"Error fetching {name}: {e}"

        tasks = [fetch_with_semaphore(name) for name in names]
        results = await asyncio.gather(*tasks)
        for result in results:
            result_queue.put(f"[Worker {worker_id}] {result}")

    async def run_worker():
        limits = httpx.Limits(max_connections=100, max_keepalive_connections=100)
        semaphore = asyncio.Semaphore(100)
        async with httpx.AsyncClient(limits=limits) as client:
            while True:
                work_item = work_queue.get()
                if work_item is None:
                    break
                await process_batch(work_item, client, semaphore)
                work_queue.task_done()

    asyncio.run(run_worker())
    result_queue.put(None)


def master_producer(work_queue, limit=40):
    """Producer that fetches the pokemon list and distributes work."""

    async def fetch_and_distribute():
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"https://pokeapi.co/api/v2/pokemon?limit={limit}")
            data = resp.json()
            names = [entry['name'] for entry in data['results']]
            chunk_size = len(names) // 4
            for i in range(0, len(names), chunk_size):
                work_queue.put(names[i:i + chunk_size])
            # Wait until all work is processed
            work_queue.join()
            # Stop signals
            for _ in range(4):
                work_queue.put(None)

    asyncio.run(fetch_and_distribute())


if __name__ == "__main__":
    limit = 1000
    work_q = queue.Queue()
    results_q = queue.Queue()

    threads = []
    for i in range(4):
        t = Thread(target=worker_entrypoint, args=(work_q, results_q, i))
        t.start()
        threads.append(t)

    producer_t = Thread(target=master_producer, args=(work_q, limit))
    producer_t.start()
    threads.append(producer_t)

    # Collect results until all workers signal done
    workers_done = 0
    while workers_done < 4:
        result = results_q.get()
        if result is None:
            workers_done += 1
        else:
            print(f"[RESULT] {result}")

    for t in threads:
        t.join()
```

Walking Through the Architecture
The setup has two layers of concurrency:
Layer 1 — Threads (parallelism):
- 4 worker threads + 1 producer thread, all running on separate OS threads
- No GIL means they execute Python bytecode truly in parallel
- A single shared results_q collects output — threads share memory, no serialization needed
- queue.Queue is thread-safe by design: internal locks prevent race conditions on put()/get(). This is exactly what Go channels provide — safe communication between concurrent execution units.
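The queue-as-channel idea boils down to a few lines. A minimal sketch, using None as an ad-hoc close signal (a convention of this sketch, not a queue.Queue feature):

```python
import queue
import threading

ch: queue.Queue = queue.Queue()   # plays the role of a Go channel
out: list[str] = []

def worker():
    while True:
        item = ch.get()           # blocks, like `<-ch` in Go
        if item is None:          # sentinel value standing in for "channel closed"
            break
        out.append(item.upper())

t = threading.Thread(target=worker)
t.start()
for word in ("bulbasaur", "charmander", "squirtle"):
    ch.put(word)
ch.put(None)
t.join()

print(out)  # ['BULBASAUR', 'CHARMANDER', 'SQUIRTLE']
```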
Layer 2 — asyncio (concurrency):
- Inside each worker thread, an asyncio event loop manages hundreds of HTTP requests concurrently
- asyncio.gather() fans out all Pokemon fetches within a batch, waiting on I/O cooperatively
- A Semaphore(100) limits concurrent connections per worker
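The semaphore pattern generalizes. A small sketch of a hypothetical limited_gather helper (not a stdlib function) that caps in-flight coroutines the same way the fetcher does:

```python
import asyncio

async def limited_gather(coros, limit=100):
    """Run coroutines concurrently, but never more than `limit` at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:           # a slot must be free before the coroutine starts
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

async def fake_fetch(i):
    await asyncio.sleep(0.01)     # stand-in for an HTTP request
    return i

results = asyncio.run(limited_gather([fake_fetch(i) for i in range(50)], limit=10))
print(len(results))  # 50, in submission order (gather preserves ordering)
```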
The “Not Really” Caveats
Free-threaded threads are not goroutines. Here’s why the analogy has limits:
- Thread weight vs. goroutine weight. A goroutine starts at ~2 KB and grows as needed. A Python thread has a minimum stack of 32 KB (threading.stack_size). You can spawn 100,000 goroutines without breaking a sweat; 10,000 Python threads would be pushing it.
- No work-stealing scheduler. Go's runtime distributes goroutines across OS threads dynamically with work-stealing — a goroutine can start on one thread and finish on another. Python threads map 1:1 to OS threads with no runtime-level load balancing.
- Cooperative vs. preemptive scheduling. Since Go 1.14, the Go scheduler is preemptive. Python's asyncio is cooperative — if a coroutine hogs the event loop without an `await`, everything blocks.
- Thread safety is your job. Goroutines encourage the "share memory by communicating" philosophy. Python threads have queues, but they also give you full shared-memory access — which means data races are possible.
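The cooperative-scheduling caveat is easy to demonstrate. In this sketch one coroutine blocks the event loop with a synchronous time.sleep, so a coroutine that only needs 50 ms cannot finish until the hog yields:

```python
import asyncio
import time

async def hog():
    time.sleep(0.2)          # blocking call: nothing else on the loop runs meanwhile

async def polite():
    await asyncio.sleep(0.05)  # would finish in 50 ms on an idle loop

async def main():
    start = time.perf_counter()
    await asyncio.gather(hog(), polite())
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s")     # ~0.25s total: hog froze the loop and delayed polite
```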
What about Subinterpreters?
If free-threading’s shared-memory model makes you nervous, or if you need strong isolation between workers, Python 3.14 also ships subinterpreters in the stdlib (the concurrent.interpreters module) via PEP 734. Here’s the same fetcher rebuilt on subinterpreters:
```python
import asyncio

import httpx
from concurrent import interpreters  # Python 3.14+ (PEP 734)

# fetch_single() is the same helper as in the free-threaded version above.


def worker_entrypoint(work_queue, result_queue: interpreters.Queue):
    """Synchronous entrypoint for the subinterpreter to launch the event loop."""

    async def process_batch(names, client):
        tasks = [fetch_single(name, client) for name in names]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for name, result in zip(names, results):
            if isinstance(result, Exception):
                result_queue.put(f"Error fetching {name}: {result}")
            else:
                result_queue.put(result)

    async def run_worker():
        limits = httpx.Limits(max_connections=100,
                              max_keepalive_connections=100)
        async with httpx.AsyncClient(limits=limits) as client:
            while True:
                work_batch = work_queue.get()
                if work_batch is None:
                    break
                await process_batch(work_batch, client)

    asyncio.run(run_worker())
    result_queue.put(None)


async def master_async_producer(work_queue, limit=40):
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://pokeapi.co/api/v2/pokemon?limit={limit}")
        data = resp.json()
        names = [entry['name'] for entry in data['results']]
        chunk_size = len(names) // 4
        for i in range(0, len(names), chunk_size):
            work_queue.put(names[i:i + chunk_size])
        for _ in range(4):
            work_queue.put(None)


def master_entrypoint(work_queue, limit):
    asyncio.run(master_async_producer(work_queue, limit))


if __name__ == "__main__":
    limit = 1000
    work_q = interpreters.create_queue()
    results_qs = [interpreters.create_queue() for _ in range(4)]

    interps = []
    threads = []
    for i in range(4):
        interp = interpreters.create()
        interps.append(interp)
        t = interp.call_in_thread(worker_entrypoint, work_q, results_qs[i])
        threads.append(t)

    master_interp = interpreters.create()
    interps.append(master_interp)
    master_t = master_interp.call_in_thread(master_entrypoint, work_q, limit)
    threads.append(master_t)

    total_results = 0
    queues_alive = 4
    while queues_alive > 0:
        for q in results_qs:
            try:
                result = q.get(timeout=0.1)
                if result is None:
                    queues_alive -= 1
                else:
                    print(f"[RESULT] {result}")
                    total_results += 1
            except interpreters.QueueEmpty:
                pass

    for t in threads:
        t.join()
    for i in interps:
        i.close()
```

Notice the extra ceremony:
- 5 queues instead of 2 — each worker needs its own result queue (no shared memory)
- Serialization tax — all communication goes through interpreters.Queue, which pickles data across interpreter boundaries.
- Explicit cleanup — interpreters must be explicitly closed.
Subinterpreters work on standard Python — no special build required. This is the real value proposition if you cannot switch to the experimental t build.
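A minimal taste of the API, guarded so the sketch only exercises it on 3.14+ (where concurrent.interpreters ships):

```python
import sys

ok = None  # None means this interpreter is too old to try
if sys.version_info >= (3, 14):
    from concurrent import interpreters

    interp = interpreters.create()
    # Each interpreter gets its own modules, its own globals, and its own GIL
    interp.exec("x = sum(range(10))")
    interp.close()
    ok = True

print(f"subinterpreter demo ran: {ok}")
```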
Benchmarks: Does It Actually Deliver?
| Scenario | Approach | Avg Time | Peak Memory | CPU Time |
|---|---|---|---|---|
| Standard Python | Threading (GIL) | 9.00s | 85 MB | 1.2s |
| No-GIL (3.14t) | Free-threading | 5.45s | 283 MB | 6.55s |
| Standard Python | Subinterpreters | 6.01s | 386 MB | 19.66s |
With the GIL removed, threads are 1.65x faster just from flipping the build — though at roughly 3x the memory and noticeably more CPU time. Subinterpreters beat GIL-locked threads on wall time but pay a heavy CPU and memory tax for cross-interpreter serialization.
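If you want to collect comparable numbers yourself, a rough harness might look like this sketch. Note that tracemalloc only tracks Python-level allocations, so its peak will undercount real process RSS:

```python
import time
import tracemalloc

def bench(fn):
    """Measure wall time, CPU time, and peak Python-level allocations for fn()."""
    tracemalloc.start()
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    _, peak = tracemalloc.get_traced_memory()  # peak traced bytes since start()
    tracemalloc.stop()
    return wall, cpu, peak

wall, cpu, peak = bench(lambda: [i * i for i in range(200_000)])
print(f"wall={wall:.3f}s cpu={cpu:.3f}s peak={peak / 1e6:.1f} MB")
```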
When to Use What
| If you need… | Use… |
|---|---|
| Maximum throughput (I/O or CPU) | Free-Threading |
| Shared data structures (numpy, lists) | Free-Threading |
| Strict security/isolation | Subinterpreters |
| Plugin systems / sandboxing | Subinterpreters |
| Legacy C-extensions (GIL-dependent) | Subinterpreters |
The Bigger Picture
Free-threading marks a philosophical shift. The thread + asyncio pattern — multiple parallel event loops communicating through queues — is the closest Python has ever come to Go’s concurrency model. With free-threading maturing through Python 3.14+, it’s finally time to stop reaching for multiprocessing by default.
References

- PEP 703 — Making the Global Interpreter Lock Optional
- Python docs — Free-Threaded CPython
- PEP 734 — Multiple Interpreters in the Stdlib