Free Threading — Python's way to "goroutines", not really.

Published on Apr 25, 2026

#python#freethreading#concurrency#parallelism

Python’s Global Interpreter Lock (GIL) has long been the impediment for anyone trying to write truly concurrent Python. Threads? Serialized by the GIL. CPU-bound parallelism with threads? Nope. AsyncIO? Cooperative, not parallel. Every Python developer eventually hits this wall and either reaches for multiprocessing (heavy, with serialization overhead) or accepts the limitation and spawns multiple worker processes.

But Python 3.13+ shipped something that changes this: free-threading — an experimental build that removes the GIL entirely. And if you squint hard enough, threads in a no-GIL build start to look a lot like Go’s goroutines: lightweight execution units running truly in parallel, communicating via channels (queues, in Python’s case), sharing memory where convenient. Well, not really — they differ by a lot. But it’s the closest Python gets.


The GIL Problem

The GIL ensures only one thread executes Python bytecode at a time. This makes CPython’s memory management simple but kills true parallelism for CPU-bound work. AsyncIO gives you concurrency (cooperative multitasking on a single thread) but not parallelism. Standard threads give you… well, not much either, thanks to the GIL.

The key insight: remove the GIL, and threads suddenly work. Python 3.13 introduced free-threaded builds (also called “no-GIL” or python3.13t), and by Python 3.14 the feature has matured significantly. In a free-threaded build, multiple OS threads can execute Python bytecode simultaneously — no lock, no contention, no serial bottleneck.

This is exactly what PEP 703 delivers.


Free-Threading: No GIL, True Threads

Free-threading is a build-time option. Install a t-suffixed Python:

# Install the no-GIL Python interpreter
uv python install 3.14t

# Run your script with it
uv run --python 3.14t python freethread.py

In a free-threaded build:

  • Multiple threads execute Python bytecode truly in parallel
  • Memory is shared across threads (no serialization, no deep copies)
  • Standard threading module works out of the box — no new API to learn
  • queue.Queue is all you need for thread-safe communication
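You can check at runtime which mode you're in. A minimal sketch; note that sys._is_gil_enabled() only exists on Python 3.13+, so the check falls back gracefully on older versions:

```python
import sys
import sysconfig


def gil_status() -> str:
    """Report whether this interpreter was built, and is running, without the GIL."""
    # Py_GIL_DISABLED is 1 on free-threaded ("t") builds of Python 3.13+.
    built_free = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists only on 3.13+; older versions return None here.
    checker = getattr(sys, "_is_gil_enabled", None)
    running_free = (checker is not None) and not checker()
    if running_free:
        return "free-threaded (GIL disabled)"
    if built_free:
        return "free-threaded build, but GIL re-enabled at runtime"
    return "standard build (GIL active)"


print(gil_status())
```

Free-threaded builds can re-enable the GIL at runtime (e.g. via the PYTHON_GIL environment variable, or when a C extension requests it), which is why the build flag and the runtime check are reported separately.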

The downside? Thread safety is now your responsibility: you’ll need locks, queues, or other synchronization primitives. This is the same tradeoff Go developers make daily with goroutines and sync.Mutex.
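To make that concrete, here is the classic race: counter += 1 is a read-modify-write, and on a free-threaded build two threads really can interleave it. A toy sketch in which the lock keeps the final count deterministic:

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:  # without this, counter += 1 is a read-modify-write race
            counter += 1


threads = [threading.Thread(target=add_many, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: deterministic only because of the lock
```

Even with the GIL, the increment can race across bytecode boundaries; removing the GIL just widens the window dramatically.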


The Goroutine Analogy

Here’s how the two models line up:


| Concept        | Go                              | Python Free-Threading                    |
|----------------|---------------------------------|------------------------------------------|
| Execution unit | goroutine                       | thread + asyncio loop                    |
| Scheduler      | Go runtime (M:N, work-stealing) | OS threads (1:1)                         |
| Communication  | channels                        | queue.Queue                              |
| Shared memory  | available via sync.Mutex        | available via threading.Lock             |
| Creation cost  | ~2 KB (growable stack)          | ~32 KB minimum stack + interpreter share |
| Parallelism    | yes                             | yes (on free-threaded build)             |

The real magic happens when you combine free-threaded builds with asyncio. A goroutine is lightweight — you spawn thousands, each doing async I/O. A Python thread is heavier — you might spawn 4–8 — but each one runs its own asyncio event loop, handling thousands of coroutines internally. It’s like having a handful of Go’s M (machine threads), each managing many G (goroutines). See the pattern?
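Stripped of HTTP details, that pattern (a handful of threads, one event loop each, fed from a shared queue) can be sketched with nothing but the stdlib; asyncio.sleep stands in for real I/O:

```python
import asyncio
import queue
import threading


def loop_worker(jobs: queue.Queue, results: queue.Queue) -> None:
    """One OS thread = one event loop, like one Go M running many G."""

    async def handle(item: int) -> str:
        await asyncio.sleep(0)  # stand-in for real async I/O
        return f"done:{item}"

    async def drain() -> None:
        while True:
            item = jobs.get()    # blocking get is fine: nothing else is pending
            if item is None:     # sentinel: shut this worker down
                return
            results.put(await handle(item))

    asyncio.run(drain())         # each thread owns its own loop


jobs: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()
workers = [threading.Thread(target=loop_worker, args=(jobs, results)) for _ in range(2)]
for w in workers:
    w.start()
for i in range(6):
    jobs.put(i)
for _ in workers:
    jobs.put(None)               # one sentinel per worker
for w in workers:
    w.join()

collected = sorted(results.get() for _ in range(6))
print(collected)
```

The same skeleton, with httpx and real batching added, becomes the fetcher below.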


Let’s Build It: Parallel PokeAPI Fetcher

Enough theory, show me the damn code! Let’s build a concurrent Pokemon data fetcher: 4 worker threads (no GIL), each running an asyncio event loop, coordinated through standard queues, Python’s answer to Go channels. Queues are thread-safe by design.

import asyncio
import httpx
import queue
from threading import Thread


async def fetch_single(name, client):
    """Fetch a single pokemon."""
    url = f"https://pokeapi.co/api/v2/pokemon/{name}"
    resp = await client.get(url, timeout=30)
    data = resp.json()
    types = [t['type']['name'] for t in data['types']]
    return f"Worker fetched {name}: {types}"


def worker_entrypoint(work_queue, result_queue, worker_id):
    """Worker thread entrypoint."""

    async def process_batch(names, client, semaphore):
        async def fetch_with_semaphore(name):
            async with semaphore:
                try:
                    return await fetch_single(name, client)
                except Exception as e:
                    return f"Error fetching {name}: {e}"

        tasks = [fetch_with_semaphore(name) for name in names]
        results = await asyncio.gather(*tasks)

        for result in results:
            result_queue.put(f"[Worker {worker_id}] {result}")

    async def run_worker():
        limits = httpx.Limits(max_connections=100, max_keepalive_connections=100)
        semaphore = asyncio.Semaphore(100)
        async with httpx.AsyncClient(limits=limits) as client:
            while True:
                work_item = work_queue.get()  # blocking; fine here, no other tasks are pending between batches
                if work_item is None:
                    break

                await process_batch(work_item, client, semaphore)
                work_queue.task_done()

    asyncio.run(run_worker())
    result_queue.put(None)


def master_producer(work_queue, limit=40):
    """Producer that fetches pokemon list and distributes work."""

    async def fetch_and_distribute():
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"https://pokeapi.co/api/v2/pokemon?limit={limit}")
            data = resp.json()
            names = [entry['name'] for entry in data['results']]

            chunk_size = max(1, len(names) // 4)  # at least 1, avoids a zero step for tiny result sets
            for i in range(0, len(names), chunk_size):
                work_queue.put(names[i:i + chunk_size])

            # Wait until all work is processed
            work_queue.join()

            # Stop signals
            for _ in range(4):
                work_queue.put(None)

    asyncio.run(fetch_and_distribute())


if __name__ == "__main__":
    limit = 1000
    work_q = queue.Queue()
    results_q = queue.Queue()

    threads = []

    for i in range(4):
        t = Thread(target=worker_entrypoint, args=(work_q, results_q, i))
        t.start()
        threads.append(t)

    producer_t = Thread(target=master_producer, args=(work_q, limit))
    producer_t.start()
    threads.append(producer_t)

    # Collect results until all workers signal done
    workers_done = 0
    while workers_done < 4:
        result = results_q.get()
        if result is None:
            workers_done += 1
        else:
            print(f"[RESULT] {result}")

    for t in threads:
        t.join()


Walking Through the Architecture

The setup has two layers of concurrency:

Layer 1 — Threads (parallelism):
  • 4 worker threads + 1 producer thread, all running on separate OS threads
  • No GIL means they execute Python bytecode truly in parallel
  • A single shared results_q collects output — threads share memory, no serialization needed
  • queue.Queue is thread-safe by design: internal locks prevent race conditions on put()/get(). This is exactly what Go channels provide — safe communication between concurrent execution units.

Layer 2 — asyncio (concurrency):
  • Inside each worker thread, an asyncio event loop manages hundreds of HTTP requests concurrently
  • asyncio.gather() fans out all Pokemon fetches within a batch, waiting on I/O cooperatively
  • A Semaphore(100) limits concurrent connections per worker

The “Not Really” Caveats

Free-threaded threads are not goroutines. Here’s why the analogy has limits:

  1. Thread weight vs. goroutine weight. A goroutine starts at ~2 KB and grows as needed. A Python thread has a minimum stack of 32 KB (threading.stack_size). You can spawn 100,000 goroutines without breaking a sweat; 10,000 Python threads would be pushing it.
  2. No work-stealing scheduler. Go's runtime distributes goroutines across OS threads dynamically with work-stealing — a goroutine can start on one thread and finish on another. Python threads map 1:1 to OS threads with no runtime-level load balancing.
  3. Cooperative vs. preemptive scheduling. Since Go 1.14, the Go scheduler is preemptive. Python's asyncio is cooperative — if a coroutine hogs the event loop without an `await`, everything blocks.
  4. Thread safety is your job. Goroutines encourage the "share memory by communicating" philosophy. Python threads have queues, but they also give you full shared-memory access — which means data races are possible.
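Caveat 3 is easy to demonstrate: a coroutine that does synchronous work without awaiting stalls every other coroutine on the same loop. A small sketch:

```python
import asyncio
import time


async def hog() -> None:
    # Synchronous busy work: never awaits, so it monopolizes the loop.
    time.sleep(0.2)


async def polite() -> float:
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # should resume after ~10 ms...
    return time.perf_counter() - start


async def main() -> float:
    task = asyncio.create_task(polite())
    await asyncio.sleep(0)     # let polite() reach its await
    await hog()                # stalls the whole loop for ~200 ms
    return await task


delay = asyncio.run(main())
print(f"polite() resumed after {delay * 1000:.0f} ms")  # ~200 ms, not ~10 ms
```

Go's preemptive scheduler would simply move other goroutines to another thread; asyncio has no such escape hatch within a single loop.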

What about Subinterpreters?

If free-threading’s shared-memory model makes you nervous, or if you need strong isolation between workers, Python also offers subinterpreters, standardized by PEP 734 and exposed as concurrent.interpreters in Python 3.14.


import asyncio
from concurrent import interpreters  # stdlib home of PEP 734 (Python 3.14+)

import httpx

# fetch_single() is reused from the free-threading example above.


def worker_entrypoint(work_queue, result_queue: interpreters.Queue):
    """Synchronous entrypoint for the subinterpreter to launch the event loop."""

    async def process_batch(names, client):
        tasks = [fetch_single(name, client) for name in names]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        for name, result in zip(names, results):
            if isinstance(result, Exception):
                result_queue.put(f"Error fetching {name}: {result}")
            else:
                result_queue.put(result)

    async def run_worker():
        limits = httpx.Limits(max_connections=100,
                              max_keepalive_connections=100)
        async with httpx.AsyncClient(limits=limits) as client:
            while True:
                work_batch = work_queue.get()
                if work_batch is None:
                    break

                await process_batch(work_batch, client)

    asyncio.run(run_worker())
    result_queue.put(None)


async def master_async_producer(work_queue, limit=40):

    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://pokeapi.co/api/v2/pokemon?limit={limit}")
        data = resp.json()
        names = [entry['name'] for entry in data['results']]

        chunk_size = max(1, len(names) // 4)  # at least 1, avoids a zero step for tiny result sets
        for i in range(0, len(names), chunk_size):
            work_queue.put(names[i:i + chunk_size])

        for _ in range(4):
            work_queue.put(None)


def master_entrypoint(work_queue, limit):
    asyncio.run(master_async_producer(work_queue, limit))


if __name__ == "__main__":
    limit = 1000
    work_q = interpreters.create_queue()
    results_qs = [interpreters.create_queue() for _ in range(4)]

    interps = []
    threads = []

    for i in range(4):
        interp = interpreters.create()
        interps.append(interp)
        t = interp.call_in_thread(worker_entrypoint, work_q, results_qs[i])
        threads.append(t)

    master_interp = interpreters.create()
    interps.append(master_interp)
    master_t = master_interp.call_in_thread(master_entrypoint, work_q, limit)
    threads.append(master_t)

    total_results = 0
    queues_alive = 4
    while queues_alive > 0:
        for q in results_qs:
            try:
                result = q.get(timeout=0.1)
                if result is None:
                    queues_alive -= 1
                else:
                    print(f"[RESULT] {result}")
                    total_results += 1
            except interpreters.QueueEmpty:
                pass  # timeout expired; poll the next queue

    for t in threads:
        t.join()
    for i in interps:
        i.close()

Notice the extra ceremony:

  • 5 queues instead of 2 — each worker needs its own result queue (no shared memory)
  • Serialization tax — all communication goes through interpreters.Queue, which pickles data across interpreter boundaries.
  • Explicit cleanup — interpreters must be explicitly closed.

Subinterpreters work on standard Python — no special build required. This is the real value proposition if you cannot switch to the experimental t build.
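For a feel of the API without the fetcher machinery, here is a minimal round-trip. It assumes Python 3.14+, where PEP 734 landed as concurrent.interpreters, and falls back to a sentinel string elsewhere:

```python
# Assumes Python 3.14+, where PEP 734's module lives at concurrent.interpreters.
try:
    from concurrent import interpreters
except ImportError:  # standard builds before 3.14 lack the module
    interpreters = None


def demo() -> str:
    if interpreters is None:
        return "unavailable"
    interp = interpreters.create()
    try:
        # call() runs the callable inside the subinterpreter; arguments and
        # the return value cross the boundary via pickling.
        result = interp.call(eval, "2 + 2")
        return f"result={result}"
    finally:
        interp.close()


print(demo())
```

Note the explicit close(): unlike threads, interpreters hold real resources that must be released.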


Benchmarks: Does It Actually Deliver?


| Scenario        | Approach        | Avg Time | Peak Memory | CPU Time |
|-----------------|-----------------|----------|-------------|----------|
| Standard Python | Threading (GIL) | 9.00 s   | 85 MB       | 1.2 s    |
| No-GIL (3.14t)  | Free-threading  | 5.45 s   | 283 MB      | 6.55 s   |
| Standard Python | Subinterpreters | 6.01 s   | 386 MB      | 19.66 s  |

With the GIL removed, threads are 1.65x faster just from flipping the build. Subinterpreters beat GIL-locked threads but carry a heavy CPU and memory tax from cross-interpreter serialization.


When to Use What

| If you need…                          | Use…            |
|---------------------------------------|-----------------|
| Maximum throughput (I/O or CPU)       | Free-Threading  |
| Shared data structures (numpy, lists) | Free-Threading  |
| Strict security/isolation             | Subinterpreters |
| Plugin systems / sandboxing           | Subinterpreters |
| Legacy C-extensions (GIL-dependent)   | Subinterpreters |

The Bigger Picture

Free-threading marks a philosophical shift. The thread + asyncio pattern — multiple parallel event loops communicating through queues — is the closest Python has ever come to Go’s concurrency model. With free-threading maturing through Python 3.14+, it’s finally time to stop reaching for multiprocessing by default.


PEP 703 — Making the Global Interpreter Lock Optional
Python docs — Free-Threaded CPython
PEP 734 — Multiple Interpreters in the Stdlib

© 2026 Shubham Biswas. All Rights Reserved