Optimization: pools and slots you might need
Published on May 4, 2026
Python is fast enough—until you’re handling thousands of requests per second, each paying for expensive class instantiation and for reading and mutating instance attributes many times over. Before you give up on the language, hold on—there are pools and slots of improvement, two optimization techniques hiding in plain sight: object pooling and __slots__.
Object pooling isn’t Python-specific — any language with expensive object construction benefits from it. __slots__ is a Python language feature whose performance characteristics are CPython-specific; it attacks the per-instance __dict__ overhead. One slashes repeated construction cost; the other slashes memory and accelerates attribute access. Together, they can transform a sluggish service into something blazingly fast.
Let’s start with the pattern that works everywhere.
Object Pooling: Stop Paying Construction Cost Twice
Consider this scenario: a Celery worker processing messages for 20 customer organizations. Each message requires a ProcessingService instance that loads org-specific config, database connections, and tenant models on __init__.
import time
import random
class ProcessingService:
"""Simulates a service that is expensive to construct per org.
Contains deliberate reference cycles (logging callbacks, event
handlers, connection pool registries) to trigger generational GC —
the exact scenario where object pooling reduces GC pressure.
"""
def __init__(self, org_id: str, build_cost: float = 0.01, process_cost: float = 0.001):
self._build_cost = build_cost
self._process_cost = process_cost
self._org_id = org_id
self._message = None
self._ready = False
self._handlers = []
self._conn_pool = {}
self._logger = self
self._init()
def _init(self):
        time.sleep(self._build_cost)  # emulate config load and other build costs
# Reference cycles (to trigger GC without pooling):
# Handler registry — closures capture self
self._handlers = [
lambda s=self: None,
lambda s=self: None,
]
# Connection pool dict referencing self
self._conn_pool = {"owner": self}
# Logger attribute referencing self
self._logger = self
self._ready = True
def hydrate(self, message: dict):
self._message = message
def process(self) -> dict:
if not self._ready:
raise RuntimeError("Service not initialized")
time.sleep(self._process_cost)
return {
"org_id": self._org_id,
"message_id": self._message.get("id", "?") if self._message else "?",
"passed": random.random() > 0.05,
"details": {"checks": random.randint(3, 10)},
}
def reset(self):
self._message = None
def close(self):
self._ready = False
self._handlers.clear()
self._conn_pool.clear()
        self._logger = None

Without pooling: 1,000 messages × 10ms config load = 10 seconds of cumulative construction. With 8 concurrent workers, your wall time is dominated by __init__.
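Before building a pool, it’s worth sanity-checking that construction really dominates. A quick, illustrative measurement (not part of the benchmark suite):

import time

start = time.perf_counter()
for _ in range(100):
    svc = ProcessingService(org_id="org_001")  # pays the ~10ms _init every time
    svc.close()
print(f"100 constructions: {time.perf_counter() - start:.2f}s")  # ≈1s of pure build cost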
The Pool Implementation
The pool is generic and reusable: it accepts a factory callable that creates objects on demand. Here’s the full implementation:
import threading
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar
T = TypeVar("T")
@dataclass
class PoolStats:
max_size: int = 0
idle: int = 0
active: int = 0
total_timeouts: int = 0
peak_active: int = 0
def snapshot(self) -> "PoolStats":
return PoolStats(
max_size=self.max_size,
idle=self.idle,
active=self.active,
total_timeouts=self.total_timeouts,
peak_active=self.peak_active,
)
class ObjectPool(Generic[T]):
"""Thread-safe object pool with dynamic resizing."""
def __init__(self, factory: Callable[[], T], max_size: int = 10):
if max_size < 1:
raise ValueError("max_size must be >= 1")
self._factory = factory
self._max_size = max_size
self._lock = threading.Lock()
self._condition = threading.Condition(self._lock)
self._pool: list[T] = []
self._active_count = 0
self._shutdown = False
self._total_timeouts = 0
self._peak_active = 0
@contextmanager
def acquire(self, timeout: float | None = None):
obj = None
try:
obj = self._acquire(timeout)
yield obj
finally:
if obj is not None:
self._release(obj)
    def _acquire(self, timeout: float | None = None) -> T:
        with self._condition:
            deadline = None if timeout is None else time.monotonic() + timeout
            while True:
                # Re-check shutdown on every wakeup, whether or not the wait
                # used a timeout — otherwise a timed waiter could keep
                # building objects against a shut-down pool.
                if self._shutdown:
                    raise RuntimeError("Pool is shut down")
                if self._pool:
                    obj = self._pool.pop()
                    self._active_count += 1
                    self._peak_active = max(self._peak_active, self._active_count)
                    return obj
                if self._active_count < self._max_size:
                    obj = self._factory()  # note: runs while holding the pool lock
                    self._active_count += 1
                    self._peak_active = max(self._peak_active, self._active_count)
                    return obj
                if timeout is not None:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        self._total_timeouts += 1
                        raise TimeoutError(
                            f"Timed out waiting for pooled object "
                            f"(timeout={timeout}s, pool_size={self._max_size})"
                        )
                    self._condition.wait(timeout=remaining)
                else:
                    self._condition.wait()
def _release(self, obj: T):
with self._condition:
self._active_count -= 1
if self._shutdown:
self._destroy_object(obj)
return
if len(self._pool) >= self._max_size:
self._destroy_object(obj)
return
self._pool.append(obj)
self._condition.notify()
def _destroy_object(self, obj: T):
if hasattr(obj, "close"):
try:
obj.close()
except Exception:
pass
def resize(self, new_max_size: int):
if new_max_size < 1:
raise ValueError("max_size must be >= 1")
with self._condition:
self._max_size = new_max_size
while len(self._pool) > new_max_size:
obj = self._pool.pop()
self._destroy_object(obj)
self._condition.notify_all()
@property
def stats(self) -> PoolStats:
with self._lock:
return PoolStats(
max_size=self._max_size,
idle=len(self._pool),
active=self._active_count,
total_timeouts=self._total_timeouts,
peak_active=self._peak_active,
)
def shutdown(self):
with self._condition:
self._shutdown = True
while self._pool:
obj = self._pool.pop()
self._destroy_object(obj)
            self._condition.notify_all()

The idea: every time a new request comes into the worker (Celery or any other worker setup), it acquires an existing object from the pool, uses it to process the message, and returns the object to the pool afterwards. In practice each org can have different config or attributes, so instances aren’t interchangeable across orgs. The solution is a dict of pools keyed by org_id — one ObjectPool per org, each with an org-specific factory that bakes in the org’s config (note the org=org default argument below: it binds each factory to its own org at definition time instead of late-binding the loop variable). This is just a usage pattern on top of the generic pool — nothing in ObjectPool knows about orgs.
pools: dict[str, ObjectPool] = {}
for org in ["org_001", "org_002", "org_003"]:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(org_id=org),
max_size=4,
    )

The worker then picks the right pool:
def process_message(msg: dict):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
        return svc.process()

How We Benchmarked — Object Pooling
Benchmarks are written with pytest-benchmark — a pytest plugin that auto-calibrates iteration count per round, produces min/max/mean/median/stddev statistics, and supports --benchmark-compare to track performance changes across commits.
Config: 200 messages, 4 concurrent workers, 10 customer orgs (4 pool slots each), 10ms build cost, 0.5ms process cost.
import concurrent.futures
import os
import random
from object_pooling import ObjectPool, ProcessingService
ORG_IDS = [f"org_{i:03d}" for i in range(100)]
BUILD_COST = 0.01 # 10ms — org config + DB connection load
PROCESS_COST = 0.0005 # 0.5ms per-message validation
NUM_REQUESTS = 200
NUM_WORKERS = 4
NUM_ORGS = 10
POOL_SIZE_PER_ORG = 4
def _run_batch(messages, worker_fn):
"""Distribute messages across NUM_WORKERS threads."""
chunk_size = max(1, len(messages) // NUM_WORKERS)
chunks = [messages[i:i + chunk_size]
for i in range(0, len(messages), chunk_size)]
def _worker(chunk):
for msg in chunk:
worker_fn(msg)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as ex:
list(ex.map(_worker, chunks))
MESSAGE_IDS = [f"msg_{i:06d}" for i in range(10000)]
def _make_messages(n: int = NUM_REQUESTS, orgs: int = NUM_ORGS) -> list[dict]:
"""Generate synthetic messages spread across orgs."""
org_pool = ORG_IDS[:orgs]
return [
{
"id": random.choice(MESSAGE_IDS),
"org_id": random.choice(org_pool),
"payload": os.urandom(64).hex(),
}
for _ in range(n)
]
def _make_per_org_pools(orgs: list[str]) -> dict[str, ObjectPool]:
"""Create one ObjectPool per org, each with org-specific factory."""
pools = {}
for org in orgs:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(
org_id=org, build_cost=BUILD_COST, process_cost=PROCESS_COST
),
max_size=POOL_SIZE_PER_ORG,
)
    return pools

No-pool benchmark — each request pays the full 10ms config load:
def test_bench_alloc(benchmark):
def run():
msgs = _make_messages()
def worker(msg):
svc = ProcessingService(
org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
)
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch(msgs, worker)
    benchmark(run)

Pooled benchmark — per-org pools, instances created lazily on first acquire (10 orgs × 4 slots):
def test_bench_pools(benchmark):
def run():
orgs = ORG_IDS[:NUM_ORGS]
msgs = _make_messages(orgs=NUM_ORGS)
pools = _make_per_org_pools(orgs)
def worker(msg):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch(msgs, worker)
for p in pools.values():
p.shutdown()
    benchmark(run)

We also benchmark single-request latency — measuring per-message cost rather than batch throughput:
def test_bench_single_alloc(benchmark):
"""Per-message cost without pool."""
def run():
msg = _make_messages(n=1)[0]
svc = ProcessingService(org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST)
svc.hydrate(msg)
svc.process()
svc.reset()
benchmark(run)
def test_bench_single_pool(benchmark):
"""Per-message cost with pool."""
pool = ObjectPool(
factory=lambda: ProcessingService(
org_id=ORG_IDS[0],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=4,
)
def run():
msg = _make_messages(n=1, orgs=1)[0]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
benchmark(run)
    pool.shutdown()

Finally, we benchmark at higher volume to see how the pooling benefit scales — 1,000 messages, 8 workers, 20 pool slots per org:
def _run_batch_scaled(messages, worker_fn, workers):
chunk_size = max(1, len(messages) // workers)
chunks = [messages[i:i + chunk_size]
for i in range(0, len(messages), chunk_size)]
def _worker(chunk):
for msg in chunk:
worker_fn(msg)
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
list(ex.map(_worker, chunks))
def _make_scaled_pools(orgs, pool_size):
pools = {}
for org in orgs:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(
org_id=org, build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=pool_size,
)
return pools
def test_bench_alloc_1k(benchmark):
"""No pool — 1000 requests, 8 workers."""
N, WORKERS = 1000, 8
def run():
msgs = _make_messages(n=N, orgs=NUM_ORGS)
def worker(msg):
svc = ProcessingService(
org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
)
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch_scaled(msgs, worker, WORKERS)
benchmark(run)
def test_bench_pools_1k(benchmark):
"""Per-org pools — 1000 requests, 8 workers, 20 slots/org."""
N, WORKERS, POOL_SIZE = 1000, 8, 20
orgs = ORG_IDS[:NUM_ORGS]
def run():
msgs = _make_messages(n=N, orgs=NUM_ORGS)
pools = _make_scaled_pools(orgs, POOL_SIZE)
def worker(msg):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch_scaled(msgs, worker, WORKERS)
for p in pools.values():
p.shutdown()
    benchmark(run)

Same messages. Same workers. Same class. The only difference: construct once vs. construct per-request.
Benchmark Results — Batch Throughput
| Volume | alloc (no pool) | pools (per-org) | Speedup |
|---|---|---|---|
| 200 msg, 4 workers, 4 slots/org | 534.7 ms | 167.2 ms | 3.20× |
| 1000 msg, 8 workers, 20 slots/org | 1,334.5 ms | 350.4 ms | 3.81× |
Benchmark Results — Single Request (Per-Message Latency)
| Metric | alloc (no pool) | pools (per-org) | Change |
|---|---|---|---|
| Mean time | 10.67 ms | 573 µs | 18.6× faster |
| Throughput (ops/s) | 93.69 | 1,744.00 | 18.6× higher |
| Median | 10.67 ms | 572 µs | 18.7× faster |
The GC Bonus
ProcessingService contains deliberate reference cycles. The handler closures — lambda s=self: None — capture self in their closure scope. self holds _handlers, _handlers holds closures, closures hold a reference back to self. That’s a real cycle that appears in production without anyone intending it: event handlers, callbacks, retry hooks. The self._logger = self cycle is also realistic — services registering themselves as log handlers. CPython uses reference counting as its primary memory management, but reference cycles cannot be freed by refcounting. They require the cyclic garbage collector to detect and free them.
Without pooling, short-lived objects with cycles go out of scope after each request. Refcounting can’t free them, so they accumulate in GC generations until the collector runs. With pooling, instances stay alive in their pools — never become garbage — no cycles to collect.
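You can watch that requirement directly — a minimal check using the ProcessingService above:

import gc

gc.collect()                       # start from a clean slate
svc = ProcessingService(org_id="org_001")
del svc                            # refcount drops, but the cycles keep it alive
print(gc.collect())                # > 0: only the cyclic collector could free it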
The benchmark includes an instrumented comparison using gc.callbacks to count cyclic GC events at different message volumes (same alloc vs. pools setup as the batch tests):
import gc
import time

def _count_gc_events(fn):
gc.collect()
events = 0
pause_ns = 0
def cb(phase, _):
nonlocal events, pause_ns
if phase == "start":
events += 1
pause_ns -= time.monotonic_ns()
elif phase == "stop":
pause_ns += time.monotonic_ns()
gc.callbacks.append(cb)
fn()
gc.callbacks.remove(cb)
    return events, pause_ns / 1_000_000

Results across message volumes:
| Messages | alloc GC events | alloc pause | pools GC events | pools pause |
|---|---|---|---|---|
| 200 | 0 | 0.0ms | 0 | 0.0ms |
| 500 | 1 | 4.6ms | 0 | 0.0ms |
| 1,000 | 3 | 4.5ms | 0 | 0.0ms |
| 2,000 | 7 | 4.3ms | 0 | 0.0ms |
| 5,000 | 17 | 5.5ms | 0 | 0.0ms |
With pooling: zero GC events at every volume. Without pooling: cyclic GC kicks in around 500 messages, with roughly one collection event per 300 additional messages.
The cumulative pauses stay in the single-digit milliseconds, and construction cost dominates the speedup at these volumes — GC isn’t the headline. But the trend is real: at higher throughput or with larger cycle graphs (ML pipelines, real connection pools), the pauses accumulate and the CPU overhead of tracing object graphs grows. Worth knowing, even if it’s not why you’d reach for pooling first.
Dynamic Resize Under Load
Pools aren’t static. test_bench_resize demonstrates scaling pool capacity while processing messages — resizing from 2 → 8 → 4 in a single benchmark round. The full scenario completes in 32.9ms mean across 610 rounds with minimal variance (stddev 0.28ms); the resize itself only holds the pool lock briefly.
def test_bench_resize(benchmark):
"""Dynamic resize — scale from 2→8→4 under concurrent load."""
org = ORG_IDS[0]
msgs = _make_messages(n=500, orgs=1)
def run():
pool = ObjectPool(
factory=lambda: ProcessingService(
org_id=org,
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=2,
)
def worker(msg):
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
# Process some to warm up at size=2
for _ in range(10):
worker(msgs[0])
pool.resize(8) # scale up: more capacity
for _ in range(20):
worker(msgs[1])
pool.resize(4) # scale down: excess idle destroyed,
# active instances stay until released
for _ in range(10):
worker(msgs[2])
pool.shutdown()
    benchmark(run)

At max_size=2 under concurrent load, slots saturate. Resizing to 8 adds capacity immediately — blocked workers wake up without a restart. Shrinking back to 4 destroys excess idle instances; active ones remain until released.
When to Reach for Object Pooling
| Condition | Why it matters |
|---|---|
| High construction cost | Config loading, DB connections, ML models |
| Reference cycles | Callbacks, registries, circular refs → GC piles up, pool eliminates it |
| High throughput | Continuous requests → construction cost dominates |
| Per-tenant config | Each tenant has different setup → per-tenant pools, not shared |
If none of these apply — object is cheap to create, has no cycles, low request rate — pooling adds complexity for no gain. Use plain instantiation.
Note: instances are created lazily on first acquire() — no upfront construction cost. This avoids paying for orgs that never receive traffic. For production, create pools at worker startup so the first real request doesn’t pay the build cost.
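Here’s a sketch of that warm-up, assuming Celery — worker_process_init fires once per worker process; the org list and the one-instance pre-build policy are illustrative:

from celery.signals import worker_process_init

pools: dict[str, ObjectPool] = {}

@worker_process_init.connect
def warm_pools(**kwargs):
    for org in ["org_001", "org_002", "org_003"]:  # illustrative org list
        pool = ObjectPool(
            factory=lambda org=org: ProcessingService(org_id=org),
            max_size=4,
        )
        with pool.acquire():  # force one eager construction per org
            pass
        pools[org] = pool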
Pitfalls of Object Pooling
Pooling is a tradeoff: you trade construction cost for the responsibility of managing object lifecycle. Get the lifecycle wrong and you trade one class of bug for another.
1. Stale state and connections between requests
This is the most common failure mode. When an instance returns to the pool, it carries whatever state the previous request left behind. The reset() method in ProcessingService clears _message — but in a real service, incomplete resets are subtle: a transaction left uncommitted, a cache warmed to the previous org’s data, a flag never cleared. The next request acquires an instance that looks clean but isn’t. A database connection held in the pool for 30 minutes may have been silently dropped by the server’s wait_timeout — the pool has no visibility into this.
The fix is a discipline problem, not a code one. reset() must be exhaustive and reviewed whenever __init__ gains a new attribute. Validate connections on acquire with a cheap liveness check, and set pool TTLs shorter than your server’s idle timeout. A good test acquires an instance, runs a request, releases it, acquires it again, and asserts it’s identical to a freshly constructed one.
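A minimal sketch of that test, using the classes above (which private attributes to compare is illustrative — a real service asserts whatever state reset() owns):

def test_reused_instance_matches_fresh_one():
    pool = ObjectPool(factory=lambda: ProcessingService(org_id="org_001"),
                      max_size=1)
    with pool.acquire() as svc:
        svc.hydrate({"id": "msg_000001"})
        svc.process()
        svc.reset()
    fresh = ProcessingService(org_id="org_001")
    with pool.acquire() as reused:  # max_size=1 → same instance as before
        assert reused._message == fresh._message  # both None after reset
        assert reused._ready == fresh._ready
    pool.shutdown()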
2. Per-org pools and cross-org contamination
The dict-of-pools pattern is correct, but only if the routing logic is correct. A bug that assigns msg["org_id"] to the wrong pool key — a typo, a missing default, a race condition in pool creation — hands org_001’s service instance to an org_002 request. With pooled state (loaded config, cached tenant models, connection credentials), that’s a data isolation failure, not just a logic bug. Treat pool routing as security-critical code.
3. Reference cycles and close()
_destroy_object calls close() when an instance is evicted. But if close() breaks cycles incompletely — for instance, it clears _handlers but not _logger — the instance doesn’t become garbage. The cyclic GC still has to collect it, and you lose the GC reduction benefit pooling was supposed to deliver. A complete close() should null out every reference that participates in a cycle. The test: after close(), gc.collect() should not find the instance in the collected set.
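That test fits in a few lines. If close() misses a cycle, refcounting alone can’t free the instance and the weak reference outlives the del:

import weakref

svc = ProcessingService(org_id="org_001")
svc.close()              # should break every cycle
ref = weakref.ref(svc)
del svc
assert ref() is None     # a live ref here means close() left a cycle intact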
4. Pool exhaustion and hidden backpressure
acquire(timeout=30.0) will block for up to 30 seconds if all slots are taken. Under sustained load, this becomes invisible queuing: requests pile up waiting for a slot, latency climbs, and the pool’s stats.total_timeouts counter starts ticking. Without monitoring on that counter and on active/idle ratios, you won’t know the pool is the bottleneck until requests start timing out. Instrument the stats endpoint and alert on timeout rate, not just error rate.
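A sketch of that instrumentation, built on the stats property the pool already exposes (the threshold and the alert callable are placeholders):

def check_pool_health(pool: ObjectPool, alert) -> None:
    s = pool.stats
    if s.total_timeouts > 0:
        alert(f"pool timeouts: {s.total_timeouts}")  # requests are queuing
    if s.active / s.max_size > 0.8:
        alert(f"pool nearly saturated: {s.active}/{s.max_size} active")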
Object pooling solves the “create once, reuse many” problem in any language. Now let’s look at something Python-specific: what if the object itself is bloated?
The Hidden Cost of __dict__
By default, every instance of a Python class carries a __dict__ — a per-instance dictionary that stores attributes. It’s convenient (dynamic attributes, monkey-patching, ORM lazy loading) but expensive. There are two ways to measure that cost, and they answer different questions:
| Tool | What it measures | NoSlots (100 attrs) |
|---|---|---|
| sys.getsizeof | Shallow — the container envelope only; does not recurse into attribute values | 3,376 bytes |
| tracemalloc | Full heap — every allocation: dict, keys, values, instance | 6.74 KB |
The benchmarks below use tracemalloc because it reflects actual RAM cost. But first, looking at just the container wrapper — __slots__ eliminates the __dict__ hash table entirely:
| Component | Size (bytes) |
|---|---|
| NoSlots (header + __dict__) | 3,376 |
| WithSlots (header + array) | 832 |
| Wrapper savings | 75% |
The NoSlots wrapper is 3,376 bytes total: a 48-byte instance header plus a 3,328-byte __dict__ hash table (CPython keeps the table under a ~2/3 load factor, so 100 entries force the table’s capacity well past 100 slots, each entry costing 24 bytes plus index overhead — exact layout varies by version). __slots__ replaces the dict with a dense C array of 100 pointers at 800 bytes, plus a 32-byte instance header, for 832 bytes total. That 75% reduction is just the envelope — sys.getsizeof doesn’t count the actual attribute values. When you measure the full heap with tracemalloc (same benchmark below), memory drops from 6.74 KB to 3.97 KB, a 40.9% savings.
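The same accounting is visible at any scale. A three-attribute toy class shows the envelope difference (exact numbers vary by CPython version):

import sys

class P:  # dict-backed
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

class Q:  # slot-backed
    __slots__ = ("a", "b", "c")
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

p, q = P(), Q()
print(sys.getsizeof(p), sys.getsizeof(p.__dict__))  # header, then dict table
print(sys.getsizeof(q))                             # header + 3-pointer array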
What __slots__ Actually Does
__slots__ tells CPython: “this class has a fixed set of attributes, allocate a C array for them instead of a dict.” No per-instance dict, no per-attribute dict entries, no hash table overhead.
class NoSlots:
def __init__(self):
self.a = 1
self.b = 2.0
self.c = "hello"
# ... 97 more attributes
class WithSlots:
__slots__ = ("a", "b", "c", ...) # 100 attribute names
def __init__(self):
self.a = 1
self.b = 2.0
self.c = "hello"
        # ... 97 more attributes

How We Benchmarked — __slots__
The test classes generate 100 attributes per instance with mixed types to simulate real DTOs:
N_ATTRS = 100
_attr_names = [f"attr_{i}" for i in range(N_ATTRS)]
class NoSlots:
def __init__(self):
for i, name in enumerate(_attr_names):
if i % 5 == 0:
val = i * 7 # int
elif i % 5 == 1:
val = i * 3.1415 # float
elif i % 5 == 2:
val = f"val_{i}" # short string
elif i % 5 == 3:
val = i % 2 == 0 # bool
else:
val = (i, i + 1, i + 2) # small tuple
setattr(self, name, val)
class WithSlots:
__slots__ = tuple(_attr_names)
def __init__(self):
for i, name in enumerate(_attr_names):
# same mixed-type initialization as NoSlots
...
            setattr(self, name, val)

Memory is measured via tracemalloc — tracking actual heap allocations, not estimates. Timing uses timeit for stable micro-benchmarks. Profiling uses cProfile + gprof2dot for call-graph visualization:
import timeit
import tracemalloc
N_INSTANCES = 10_000
N_REPEATS = 5
N_LOOPS = 100
def mem_usage(cls, n=N_INSTANCES):
"""Measure per-instance memory via tracemalloc."""
tracemalloc.start()
objs = [cls() for _ in range(n)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
per_obj = current / n / 1024 # KB
return per_obj, current / 1024**2 # MB
def bench_create(cls):
"""Time N_LOOPS object creations, repeat N_REPEATS times."""
def _create():
for _ in range(N_LOOPS):
cls()
return _create
def bench_access(instances):
"""Time reading attr_0, attr_50, attr_99 across N_INSTANCES objects."""
def _access():
for obj in instances:
_ = obj.attr_0
_ = obj.attr_50
_ = obj.attr_99
return _access
def bench_setattr(instances):
"""Time writing to 3 attributes across N_INSTANCES objects."""
def _setattr():
for obj in instances:
obj.attr_0 = 42
obj.attr_50 = 42
obj.attr_99 = 42
    return _setattr
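The harness functions above define the workloads; a driver along these lines (a sketch — the benchmark suite’s exact invocation may differ) wires them to timeit:

def run_timings(cls):
    instances = [cls() for _ in range(N_INSTANCES)]
    create = min(timeit.repeat(bench_create(cls), repeat=N_REPEATS, number=1))
    read = min(timeit.repeat(bench_access(instances), repeat=N_REPEATS, number=1))
    write = min(timeit.repeat(bench_setattr(instances), repeat=N_REPEATS, number=1))
    return create, read, write

The benchmark: 10,000 instances, 100 small attributes each (int, float, short string, bool, small tuple). Here’s what changes: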
Benchmark Results — 10,000 Small Objects
Memory:
| Class | Per Object (KB) | Total 10,000 objs (MB) |
|---|---|---|
| NoSlots | 6.74 | 65.8 |
| WithSlots | 3.97 | 38.8 |
| Savings | 40.9% | 27.0 MB |
Attribute Access:
| Operation | NoSlots | WithSlots | Speedup |
|---|---|---|---|
| Creation | 0.011 s | 0.010 s | 1.08× |
| Read | 0.029 s | 0.004 s | 7.70× |
| Write | 0.012 s | 0.005 s | 2.32× |
Reads are nearly 8× faster. A __dict__ read is a full dict lookup: hash the key, probe the table, compare keys, resolve collisions. A slot read eliminates all of that. Each slot name becomes a slot descriptor on the type (not the instance). When you access obj.attr_0, CPython resolves the descriptor through the type’s attribute cache, and its __get__ reads from object->slots_array[slot_offset] — a pre-computed index into a dense C array. There’s still one pointer indirection (type → descriptor → slot offset), but it’s far cheaper than dict.__getitem__.
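You can poke at that machinery directly — the slot names really do live on the type as descriptors:

class Point:
    __slots__ = ("x", "y")

print(type(Point.x))       # <class 'member_descriptor'>
p = Point()
p.x = 3
print(Point.x.__get__(p))  # 3 — the same descriptor path p.x takes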
cProfile Confirms It
Profiling 200,000 setattr calls across 2,000 objects:
| Function | NoSlots (cumtime) | WithSlots (cumtime) |
|---|---|---|
| setattr | 0.050 s | 0.035 s |
| __init__ | 0.118 s | 0.091 s |
| Total | 0.129 s | 0.096 s |
Slots-based setattr is 1.43× faster per call. __init__ is 1.3× faster. Same function call count (202,002) — just less work inside each call.
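For reference, a profile like that can be captured as follows (a sketch matching the stated shape: 2,000 objects × 100 attributes = 200,000 setattr calls; gprof2dot renders the dumped stats separately):

import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
objs = [WithSlots() for _ in range(2_000)]  # profiled __init__ cost
for obj in objs:
    for name in _attr_names:
        setattr(obj, name, 0)               # 200,000 setattr calls
pr.disable()
pstats.Stats(pr).sort_stats("cumulative").print_stats(5)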
When __slots__ Does Nothing
The benchmark was also run with numpy 20×20 float64 matrices (325 KB each) as attribute values:
| Metric | NoSlots | WithSlots | Difference |
|---|---|---|---|
| Memory per obj | 328.65 KB | 325.82 KB | 0.8% |
| Read time | 0.842 s | 1.103 s | slower |
The __dict__ overhead (~3.3 KB) is noise against 325 KB of data — less than 1% of total object size. Saving 2.5 KB when each object already holds 329 KB doesn’t matter. Worse, reads actually regressed slightly. For large-data objects, the “read” is just returning a reference to an already-allocated numpy array — the lookup mechanism itself is the entire cost. CPython’s dict lookup is highly optimized for repeated attribute access on stable objects (version-tagged hash tables skip re-hashing when the dict hasn’t been mutated), and the slot descriptor overhead — the cached type lookup, __get__ call, and slot offset dereference — becomes measurable when data access itself is negligible. __slots__ helps when the object wrapper is a significant fraction of total size. When data dwarfs the wrapper, focus optimization effort on the data structures themselves.
The ~10 KB-per-instance threshold isn’t arbitrary — it’s the inflection point between the two benchmarks above. At 6.74 KB, __slots__ saved 40.9%. At 328 KB, it saved 0.8%. Between those extremes sits a judgment call. If your object data exceeds ~10 KB, the wrapper overhead is already down to roughly 25% of the total, and slots’ diminishing returns make it a lower-priority optimization.
Pitfalls of __slots__
__slots__ is a class-level declaration with class-level consequences. The attribute access speedup is real, but so are the constraints it imposes — and several of them only surface at the boundaries: inheritance, serialization, and third-party code that assumes __dict__ exists.
1. Inheritance breaks slots silently
If a slotted class inherits from an unslotted one, the subclass gets a __dict__ anyway — the parent’s __dict__ is inherited and the child’s __slots__ declaration is effectively ignored for memory purposes. No error, no warning, just the overhead you thought you eliminated.
class Base:
pass # has __dict__
class Child(Base):
__slots__ = ("x", "y") # __dict__ still present, inherited from Base The fix: every class in the inheritance chain must declare __slots__. For a chain where the root class needs no slots of its own, declare an empty tuple:
class Base:
__slots__ = ()
class Child(Base):
__slots__ = ("x", "y") # now truly no __dict__ This is easy to get wrong when adding a mixin or base class later — the new parent silently reintroduces __dict__ and you don’t notice until you measure again.
2. __weakref__ is opt-in
By default, slotted classes can’t be weakly referenced. This matters if anything in your stack uses weakref — some caches, ORM session managers, and observer patterns rely on it. The fix is explicit: add "__weakref__" to __slots__. But if you’re inheriting from a class that already includes it, adding it again raises a TypeError. Check your chain before adding it.
class Record:
__slots__ = ("id", "value", "__weakref__") # opt back in 3. Pickling requires __getstate__ / __setstate__
The oldest pickle protocols (0 and 1) read __dict__ to serialize instance state; with __slots__ there is no __dict__, so they raise TypeError on slotted instances. Protocol 2+ handles slots automatically, but relying on the default couples your serialized form to the class layout. If your objects cross process boundaries — Celery tasks, multiprocessing queues, joblib — implementing both methods keeps serialization explicit and stable:
class Record:
__slots__ = ("id", "value")
def __getstate__(self):
return {s: getattr(self, s) for s in self.__slots__}
def __setstate__(self, state):
for k, v in state.items():
            setattr(self, k, v)
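A quick round-trip confirms the state survives the boundary:

import pickle

r = Record()
r.id, r.value = 1, "a"
clone = pickle.loads(pickle.dumps(r))
assert (clone.id, clone.value) == (1, "a")

4. Dynamic attributes are gone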
__slots__ eliminates __dict__, which means you can’t attach arbitrary attributes at runtime. This is usually the point — but it breaks any code that relies on it: ORMs that lazy-load relationships onto instances, test fixtures that monkey-patch attributes, dataclasses with field(default_factory=...) in edge cases, and debugging tools that annotate objects in-place. If a library you don’t control adds attributes to your instances, __slots__ will raise AttributeError at runtime, not at definition time.
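The failure is immediate and explicit — using the Record class from the pickling example:

r = Record()
r.id = 1         # fine: "id" is a declared slot
r.timestamp = 0  # AttributeError: 'Record' object has no attribute 'timestamp'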
5. The single-class optimization trap
__slots__ only pays off when you instantiate the class many times. Declaring __slots__ on a singleton, a config object, or a class that gets instantiated a handful of times per process adds complexity and constraints for zero measurable gain. The decision rule from the benchmark holds: if you’re not creating hundreds of thousands of instances, profile first and reach for __slots__ only if memory or attribute access shows up in the results.
When to use?
| | Object Pooling | __slots__ |
|---|---|---|
| Scope | Any language (Go, Java, C++, Rust…) | CPython only |
| What it fixes | Repeated expensive construction | Per-instance __dict__ overhead |
| Memory | Eliminates GC from cyclic garbage | ~40% savings on small objects |
| Speed | 3× batch throughput, 18× per-message | 3–8× faster attribute reads |
| Best for | Long-lived services with expensive init | Millions of small DTOs, events, records |
| Cost | Pool management complexity, state hygiene | No dynamic attributes, no __weakref__ |
| When to skip | Cheap construction, no cycles, low load | Objects holding large arrays/buffers |
Both techniques are free — no dependencies, no build steps, just a different mental model of where your program spends its time. Pooling is the right lever when construction is expensive; slots is the right lever when you’re paying for memory and attribute access you didn’t know you had. Profile first. The fix is usually simpler than switching languages.