Optimization: pools and slots you might need
Published on May 4, 2026
Python is fast enough—until you’re handling thousands of requests per second, each paying for expensive class instantiation and for reading and mutating instance attributes many times over. Before you give up on the language, hold on—there are pools and slots of improvement, two optimization techniques hiding in plain sight: object pooling and __slots__.
Object pooling isn’t Python-specific — any language with expensive object construction benefits from it. __slots__ is a Python language feature whose performance characteristics are CPython-specific; it attacks the per-instance __dict__ overhead. One slashes repeated construction cost; the other slashes memory and accelerates attribute access. Together, they can transform a sluggish service into something blazingly fast.
Let’s start with the pattern that works everywhere.
Object Pooling: Stop Paying Construction Cost Twice
Consider this scenario: a Celery worker processing messages for 20 customer organizations. Each message requires a ProcessingService instance that loads org-specific config, database connections, and tenant models on __init__.
import time
import random
class ProcessingService:
"""Simulates a service that is expensive to construct per org.
Contains deliberate reference cycles (logging callbacks, event
handlers, connection pool registries) to trigger generational GC —
the exact scenario where object pooling reduces GC pressure.
"""
def __init__(self, org_id: str, build_cost: float = 0.01, process_cost: float = 0.001):
self._build_cost = build_cost
self._process_cost = process_cost
self._org_id = org_id
self._message = None
self._ready = False
self._handlers = []
self._conn_pool = {}
self._logger = self
self._init()
def _init(self):
        time.sleep(self._build_cost)  # emulate config load and other build costs
# Reference cycles (to trigger GC without pooling):
# Handler registry — closures capture self
self._handlers = [
lambda s=self: None,
lambda s=self: None,
]
# Connection pool dict referencing self
self._conn_pool = {"owner": self}
# Logger attribute referencing self
self._logger = self
self._ready = True
def hydrate(self, message: dict):
self._message = message
def process(self) -> dict:
if not self._ready:
raise RuntimeError("Service not initialized")
time.sleep(self._process_cost)
return {
"org_id": self._org_id,
"message_id": self._message.get("id", "?") if self._message else "?",
"passed": random.random() > 0.05,
"details": {"checks": random.randint(3, 10)},
}
def reset(self):
self._message = None
def close(self):
self._ready = False
self._handlers.clear()
self._conn_pool.clear()
        self._logger = None

Without pooling: 1,000 messages × 10ms config load = 10 seconds of cumulative construction. With 8 concurrent workers, your wall time is dominated by __init__.
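Before building a pool, it’s worth sanity-checking that construction really dominates. A quick, illustrative measurement (not part of the benchmark suite):

import time

start = time.perf_counter()
for _ in range(100):
    svc = ProcessingService(org_id="org_001")  # pays the ~10ms _init every time
    svc.close()
print(f"100 constructions: {time.perf_counter() - start:.2f}s")  # ≈1s of pure build cost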
The Pool Implementation
The pool is generic and reusable: it accepts a factory callable that creates objects on demand. Here’s the full implementation:
import threading
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar
T = TypeVar("T")
@dataclass
class PoolStats:
max_size: int = 0
idle: int = 0
active: int = 0
total_timeouts: int = 0
peak_active: int = 0
def snapshot(self) -> "PoolStats":
return PoolStats(
max_size=self.max_size,
idle=self.idle,
active=self.active,
total_timeouts=self.total_timeouts,
peak_active=self.peak_active,
)
class ObjectPool(Generic[T]):
"""Thread-safe object pool with dynamic resizing."""
def __init__(self, factory: Callable[[], T], max_size: int = 10):
if max_size < 1:
raise ValueError("max_size must be >= 1")
self._factory = factory
self._max_size = max_size
self._lock = threading.Lock()
self._condition = threading.Condition(self._lock)
self._pool: list[T] = []
self._active_count = 0
self._shutdown = False
self._total_timeouts = 0
self._peak_active = 0
@contextmanager
def acquire(self, timeout: float | None = None):
obj = None
try:
obj = self._acquire(timeout)
yield obj
finally:
if obj is not None:
self._release(obj)
    def _acquire(self, timeout: float | None = None) -> T:
        with self._condition:
            deadline = None if timeout is None else time.monotonic() + timeout
            while True:
                # Re-check shutdown on every wakeup, whether or not the wait
                # used a timeout — otherwise a timed waiter could keep
                # building objects against a shut-down pool.
                if self._shutdown:
                    raise RuntimeError("Pool is shut down")
                if self._pool:
                    obj = self._pool.pop()
                    self._active_count += 1
                    self._peak_active = max(self._peak_active, self._active_count)
                    return obj
                if self._active_count < self._max_size:
                    obj = self._factory()  # note: runs while holding the pool lock
                    self._active_count += 1
                    self._peak_active = max(self._peak_active, self._active_count)
                    return obj
                if timeout is not None:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        self._total_timeouts += 1
                        raise TimeoutError(
                            f"Timed out waiting for pooled object "
                            f"(timeout={timeout}s, pool_size={self._max_size})"
                        )
                    self._condition.wait(timeout=remaining)
                else:
                    self._condition.wait()
def _release(self, obj: T):
with self._condition:
self._active_count -= 1
if self._shutdown:
self._destroy_object(obj)
return
if len(self._pool) >= self._max_size:
self._destroy_object(obj)
return
self._pool.append(obj)
self._condition.notify()
def _destroy_object(self, obj: T):
if hasattr(obj, "close"):
try:
obj.close()
except Exception:
pass
def resize(self, new_max_size: int):
if new_max_size < 1:
raise ValueError("max_size must be >= 1")
with self._condition:
self._max_size = new_max_size
while len(self._pool) > new_max_size:
obj = self._pool.pop()
self._destroy_object(obj)
self._condition.notify_all()
@property
def stats(self) -> PoolStats:
with self._lock:
return PoolStats(
max_size=self._max_size,
idle=len(self._pool),
active=self._active_count,
total_timeouts=self._total_timeouts,
peak_active=self._peak_active,
)
def shutdown(self):
with self._condition:
self._shutdown = True
while self._pool:
obj = self._pool.pop()
self._destroy_object(obj)
            self._condition.notify_all()

The idea: every time a new request comes into the worker (Celery or any other worker setup), it acquires an existing object from the pool, uses it to process the message, and returns the object to the pool afterwards. In practice each org can have different config or attributes, so instances aren’t interchangeable across orgs. The solution is a dict of pools keyed by org_id — one ObjectPool per org, each with an org-specific factory that bakes in the org’s config (note the org=org default argument below: it binds each factory to its own org at definition time instead of late-binding the loop variable). This is just a usage pattern on top of the generic pool — nothing in ObjectPool knows about orgs.
pools: dict[str, ObjectPool] = {}
for org in ["org_001", "org_002", "org_003"]:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(org_id=org),
max_size=4,
    )

The worker then picks the right pool:
def process_message(msg: dict):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
        return svc.process()

How We Benchmarked — Object Pooling
Benchmarks are written with pytest-benchmark — a pytest plugin that auto-calibrates iteration count per round, produces min/max/mean/median/stddev statistics, and supports --benchmark-compare to track performance changes across commits.
Config: 200 messages, 4 concurrent workers, 10 customer orgs (4 pool slots each), 10ms build cost, 0.5ms process cost.
import concurrent.futures
import os
import random
from object_pooling import ObjectPool, ProcessingService
ORG_IDS = [f"org_{i:03d}" for i in range(100)]
BUILD_COST = 0.01 # 10ms — org config + DB connection load
PROCESS_COST = 0.0005 # 0.5ms per-message validation
NUM_REQUESTS = 200
NUM_WORKERS = 4
NUM_ORGS = 10
POOL_SIZE_PER_ORG = 4
def _run_batch(messages, worker_fn):
"""Distribute messages across NUM_WORKERS threads."""
chunk_size = max(1, len(messages) // NUM_WORKERS)
chunks = [messages[i:i + chunk_size]
for i in range(0, len(messages), chunk_size)]
def _worker(chunk):
for msg in chunk:
worker_fn(msg)
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as ex:
list(ex.map(_worker, chunks))
MESSAGE_IDS = [f"msg_{i:06d}" for i in range(10000)]
def _make_messages(n: int = NUM_REQUESTS, orgs: int = NUM_ORGS) -> list[dict]:
"""Generate synthetic messages spread across orgs."""
org_pool = ORG_IDS[:orgs]
return [
{
"id": random.choice(MESSAGE_IDS),
"org_id": random.choice(org_pool),
"payload": os.urandom(64).hex(),
}
for _ in range(n)
]
def _make_per_org_pools(orgs: list[str]) -> dict[str, ObjectPool]:
"""Create one ObjectPool per org, each with org-specific factory."""
pools = {}
for org in orgs:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(
org_id=org, build_cost=BUILD_COST, process_cost=PROCESS_COST
),
max_size=POOL_SIZE_PER_ORG,
)
    return pools

No-pool benchmark — each request pays the full 10ms config load:
def test_bench_alloc(benchmark):
def run():
msgs = _make_messages()
def worker(msg):
svc = ProcessingService(
org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
)
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch(msgs, worker)
    benchmark(run)

Pooled benchmark — per-org pools, instances created lazily on first acquire (10 orgs × 4 slots):
def test_bench_pools(benchmark):
def run():
orgs = ORG_IDS[:NUM_ORGS]
msgs = _make_messages(orgs=NUM_ORGS)
pools = _make_per_org_pools(orgs)
def worker(msg):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch(msgs, worker)
for p in pools.values():
p.shutdown()
    benchmark(run)

We also benchmark single-request latency — measuring per-message cost rather than batch throughput:
def test_bench_single_alloc(benchmark):
"""Per-message cost without pool."""
def run():
msg = _make_messages(n=1)[0]
svc = ProcessingService(org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST)
svc.hydrate(msg)
svc.process()
svc.reset()
benchmark(run)
def test_bench_single_pool(benchmark):
"""Per-message cost with pool."""
pool = ObjectPool(
factory=lambda: ProcessingService(
org_id=ORG_IDS[0],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=4,
)
def run():
msg = _make_messages(n=1, orgs=1)[0]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
benchmark(run)
    pool.shutdown()

Finally, we benchmark at higher volume to see how the pooling benefit scales — 1,000 messages, 8 workers, 20 pool slots per org:
def _run_batch_scaled(messages, worker_fn, workers):
chunk_size = max(1, len(messages) // workers)
chunks = [messages[i:i + chunk_size]
for i in range(0, len(messages), chunk_size)]
def _worker(chunk):
for msg in chunk:
worker_fn(msg)
with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
list(ex.map(_worker, chunks))
def _make_scaled_pools(orgs, pool_size):
pools = {}
for org in orgs:
pools[org] = ObjectPool(
factory=lambda org=org: ProcessingService(
org_id=org, build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=pool_size,
)
return pools
def test_bench_alloc_1k(benchmark):
"""No pool — 1000 requests, 8 workers."""
N, WORKERS = 1000, 8
def run():
msgs = _make_messages(n=N, orgs=NUM_ORGS)
def worker(msg):
svc = ProcessingService(
org_id=msg["org_id"],
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
)
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch_scaled(msgs, worker, WORKERS)
benchmark(run)
def test_bench_pools_1k(benchmark):
"""Per-org pools — 1000 requests, 8 workers, 20 slots/org."""
N, WORKERS, POOL_SIZE = 1000, 8, 20
orgs = ORG_IDS[:NUM_ORGS]
def run():
msgs = _make_messages(n=N, orgs=NUM_ORGS)
pools = _make_scaled_pools(orgs, POOL_SIZE)
def worker(msg):
pool = pools[msg["org_id"]]
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
_run_batch_scaled(msgs, worker, WORKERS)
for p in pools.values():
p.shutdown()
    benchmark(run)

Same messages. Same workers. Same class. The only difference: construct once vs. construct per-request.
Benchmark Results — Batch Throughput
| Volume | alloc (no pool) | pools (per-org) | Speedup |
|---|---|---|---|
| 200 msg, 4 workers, 4 slots/org | 534.7 ms | 167.2 ms | 3.20× |
| 1000 msg, 8 workers, 20 slots/org | 1,334.5 ms | 350.4 ms | 3.81× |
Benchmark Results — Single Request (Per-Message Latency)
| Metric | alloc (no pool) | pools (per-org) | Change |
|---|---|---|---|
| Mean time | 10.67 ms | 573 µs | 18.6× faster |
| Throughput (ops/s) | 93.69 | 1,744.00 | 18.6× higher |
| Median | 10.67 ms | 572 µs | 18.7× faster |
The GC Bonus
ProcessingService contains deliberate reference cycles. The handler closures — lambda s=self: None — capture self in their closure scope. self holds _handlers, _handlers holds closures, closures hold a reference back to self. That’s a real cycle that appears in production without anyone intending it: event handlers, callbacks, retry hooks. The self._logger = self cycle is also realistic — services registering themselves as log handlers. CPython uses reference counting as its primary memory management, but reference cycles cannot be freed by refcounting. They require the cyclic garbage collector to detect and free them.
Without pooling, short-lived objects with cycles go out of scope after each request. Refcounting can’t free them, so they accumulate in GC generations until the collector runs. With pooling, instances stay alive in their pools — never become garbage — no cycles to collect.
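You can watch that requirement directly — a minimal check using the ProcessingService above:

import gc

gc.collect()                       # start from a clean slate
svc = ProcessingService(org_id="org_001")
del svc                            # refcount drops, but the cycles keep it alive
print(gc.collect())                # > 0: only the cyclic collector could free it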
The benchmark includes an instrumented comparison using gc.callbacks to count cyclic GC events at different message volumes (same alloc vs. pools setup as the batch tests):
import gc
import time

def _count_gc_events(fn):
gc.collect()
events = 0
pause_ns = 0
def cb(phase, _):
nonlocal events, pause_ns
if phase == "start":
events += 1
pause_ns -= time.monotonic_ns()
elif phase == "stop":
pause_ns += time.monotonic_ns()
gc.callbacks.append(cb)
fn()
gc.callbacks.remove(cb)
    return events, pause_ns / 1_000_000

Results across message volumes:
| Messages | alloc GC events | alloc pause | pools GC events | pools pause |
|---|---|---|---|---|
| 200 | 0 | 0.0ms | 0 | 0.0ms |
| 500 | 1 | 4.6ms | 0 | 0.0ms |
| 1,000 | 3 | 4.5ms | 0 | 0.0ms |
| 2,000 | 7 | 4.3ms | 0 | 0.0ms |
| 5,000 | 17 | 5.5ms | 0 | 0.0ms |
With pooling: zero GC events at every volume. Without pooling: cyclic GC kicks in around 500 messages, with roughly one collection event per 300 additional messages.
The cumulative pauses stay in the single-digit milliseconds, and construction cost dominates the speedup at these volumes — GC isn’t the headline. But the trend is real: at higher throughput or with larger cycle graphs (ML pipelines, real connection pools), the pauses accumulate and the CPU overhead of tracing object graphs grows. Worth knowing, even if it’s not why you’d reach for pooling first.
Dynamic Resize Under Load
Pools aren’t static. test_bench_resize demonstrates scaling pool capacity while processing messages — resizing from 2 → 8 → 4 in a single benchmark round. The full scenario completes in 32.9ms mean across 610 rounds with minimal variance (stddev 0.28ms); the resize itself only holds the pool lock briefly.
def test_bench_resize(benchmark):
"""Dynamic resize — scale from 2→8→4 under concurrent load."""
org = ORG_IDS[0]
msgs = _make_messages(n=500, orgs=1)
def run():
pool = ObjectPool(
factory=lambda: ProcessingService(
org_id=org,
build_cost=BUILD_COST,
process_cost=PROCESS_COST,
),
max_size=2,
)
def worker(msg):
with pool.acquire(timeout=30.0) as svc:
svc.hydrate(msg)
svc.process()
svc.reset()
# Process some to warm up at size=2
for _ in range(10):
worker(msgs[0])
pool.resize(8) # scale up: more capacity
for _ in range(20):
worker(msgs[1])
pool.resize(4) # scale down: excess idle destroyed,
# active instances stay until released
for _ in range(10):
worker(msgs[2])
pool.shutdown()
    benchmark(run)

At max_size=2 under concurrent load, slots saturate. Resizing to 8 adds capacity immediately — blocked workers wake up without a restart. Shrinking back to 4 destroys excess idle instances; active ones remain until released.
When to Reach for Object Pooling
| Condition | Why it matters |
|---|---|
| High construction cost | Config loading, DB connections, ML models |
| Reference cycles | Callbacks, registries, circular refs → GC piles up, pool eliminates it |
| High throughput | Continuous requests → construction cost dominates |
| Per-tenant config | Each tenant has different setup → per-tenant pools, not shared |
If none of these apply — object is cheap to create, has no cycles, low request rate — pooling adds complexity for no gain. Use plain instantiation.
Note: instances are created lazily on first acquire() — no upfront construction cost. This avoids paying for orgs that never receive traffic. For production, create pools at worker startup so the first real request doesn’t pay the build cost.
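Here’s a sketch of that warm-up, assuming Celery — worker_process_init fires once per worker process; the org list and the one-instance pre-build policy are illustrative:

from celery.signals import worker_process_init

pools: dict[str, ObjectPool] = {}

@worker_process_init.connect
def warm_pools(**kwargs):
    for org in ["org_001", "org_002", "org_003"]:  # illustrative org list
        pool = ObjectPool(
            factory=lambda org=org: ProcessingService(org_id=org),
            max_size=4,
        )
        with pool.acquire():  # force one eager construction per org
            pass
        pools[org] = pool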
Pitfalls of Object Pooling
Pooling is a tradeoff: you trade construction cost for the responsibility of managing object lifecycle. Get the lifecycle wrong and you trade one class of bug for another.
1. Stale state and connections between requests
This is the most common failure mode. When an instance returns to the pool, it carries whatever state the previous request left behind. The reset() method in ProcessingService clears _message — but in a real service, incomplete resets are subtle: a transaction left uncommitted, a cache warmed to the previous org’s data, a flag never cleared. The next request acquires an instance that looks clean but isn’t. A database connection held in the pool for 30 minutes may have been silently dropped by the server’s wait_timeout — the pool has no visibility into this.
The fix is a discipline problem, not a code one. reset() must be exhaustive and reviewed whenever __init__ gains a new attribute. Validate connections on acquire with a cheap liveness check, and set pool TTLs shorter than your server’s idle timeout. A good test acquires an instance, runs a request, releases it, acquires it again, and asserts it’s identical to a freshly constructed one.
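A minimal sketch of that test, using the classes above (which private attributes to compare is illustrative — a real service asserts whatever state reset() owns):

def test_reused_instance_matches_fresh_one():
    pool = ObjectPool(factory=lambda: ProcessingService(org_id="org_001"),
                      max_size=1)
    with pool.acquire() as svc:
        svc.hydrate({"id": "msg_000001"})
        svc.process()
        svc.reset()
    fresh = ProcessingService(org_id="org_001")
    with pool.acquire() as reused:  # max_size=1 → same instance as before
        assert reused._message == fresh._message  # both None after reset
        assert reused._ready == fresh._ready
    pool.shutdown()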
2. Per-org pools and cross-org contamination
The dict-of-pools pattern is correct, but only if the routing logic is correct. A bug that assigns msg["org_id"] to the wrong pool key — a typo, a missing default, a race condition in pool creation — hands org_001’s service instance to an org_002 request. With pooled state (loaded config, cached tenant models, connection credentials), that’s a data isolation failure, not just a logic bug. Treat pool routing as security-critical code.
3. Reference cycles and close()
_destroy_object calls close() when an instance is evicted. But if close() breaks cycles incompletely — for instance, it clears _handlers but not _logger — the instance doesn’t become garbage. The cyclic GC still has to collect it, and you lose the GC reduction benefit pooling was supposed to deliver. A complete close() should null out every reference that participates in a cycle. The test: after close(), gc.collect() should not find the instance in the collected set.
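That test fits in a few lines. If close() misses a cycle, refcounting alone can’t free the instance and the weak reference outlives the del:

import weakref

svc = ProcessingService(org_id="org_001")
svc.close()              # should break every cycle
ref = weakref.ref(svc)
del svc
assert ref() is None     # a live ref here means close() left a cycle intact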
4. Pool exhaustion and hidden backpressure
acquire(timeout=30.0) will block for up to 30 seconds if all slots are taken. Under sustained load, this becomes invisible queuing: requests pile up waiting for a slot, latency climbs, and the pool’s stats.total_timeouts counter starts ticking. Without monitoring on that counter and on active/idle ratios, you won’t know the pool is the bottleneck until requests start timing out. Instrument the stats endpoint and alert on timeout rate, not just error rate.
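A sketch of that instrumentation, built on the stats property the pool already exposes (the threshold and the alert callable are placeholders):

def check_pool_health(pool: ObjectPool, alert) -> None:
    s = pool.stats
    if s.total_timeouts > 0:
        alert(f"pool timeouts: {s.total_timeouts}")  # requests are queuing
    if s.active / s.max_size > 0.8:
        alert(f"pool nearly saturated: {s.active}/{s.max_size} active")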
Object pooling solves the “create once, reuse many” problem in any language. Now let’s look at something Python-specific: what if the object itself is bloated?
The Hidden Cost of __dict__
By default, every instance of a Python class carries a __dict__ — a per-instance dictionary that stores attributes. It’s convenient (dynamic attributes, monkey-patching, ORM lazy loading) but expensive. There are two ways to measure that cost, and they answer different questions:
| Tool | What it measures | NoSlots (100 attrs) |
|---|---|---|
| sys.getsizeof | Shallow — the container envelope only; does not recurse into attribute values | 3,376 bytes |
| tracemalloc | Full heap — every allocation: dict, keys, values, instance | 6.74 KB |
The benchmarks below use tracemalloc because it reflects actual RAM cost. But first, looking at just the container wrapper — __slots__ eliminates the __dict__ hash table entirely:
| Component | Size (bytes) |
|---|---|
| NoSlots (header + __dict__) | 3,376 |
| WithSlots (header + array) | 832 |
| Wrapper savings | 75% |
The NoSlots wrapper is 3,376 bytes total: a 48-byte instance header plus a 3,328-byte __dict__ hash table (CPython keeps the table under a ~2/3 load factor, so 100 entries force the table’s capacity well past 100 slots, each entry costing 24 bytes plus index overhead — exact layout varies by version). __slots__ replaces the dict with a dense C array of 100 pointers at 800 bytes, plus a 32-byte instance header, for 832 bytes total. That 75% reduction is just the envelope — sys.getsizeof doesn’t count the actual attribute values. When you measure the full heap with tracemalloc (same benchmark below), memory drops from 6.74 KB to 3.97 KB, a 40.9% savings.
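The same accounting is visible at any scale. A three-attribute toy class shows the envelope difference (exact numbers vary by CPython version):

import sys

class P:  # dict-backed
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

class Q:  # slot-backed
    __slots__ = ("a", "b", "c")
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

p, q = P(), Q()
print(sys.getsizeof(p), sys.getsizeof(p.__dict__))  # header, then dict table
print(sys.getsizeof(q))                             # header + 3-pointer array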
What __slots__ Actually Does
__slots__ tells CPython: “this class has a fixed set of attributes, allocate a C array for them instead of a dict.” No per-instance dict, no per-attribute dict entries, no hash table overhead.
class NoSlots:
def __init__(self):
self.a = 1
self.b = 2.0
self.c = "hello"
# ... 97 more attributes
class WithSlots:
__slots__ = ("a", "b", "c", ...) # 100 attribute names
def __init__(self):
self.a = 1
self.b = 2.0
self.c = "hello"
        # ... 97 more attributes

How We Benchmarked — __slots__
The test classes generate 100 attributes per instance with mixed types to simulate real DTOs:
N_ATTRS = 100
_attr_names = [f"attr_{i}" for i in range(N_ATTRS)]
class NoSlots:
def __init__(self):
for i, name in enumerate(_attr_names):
if i % 5 == 0:
val = i * 7 # int
elif i % 5 == 1:
val = i * 3.1415 # float
elif i % 5 == 2:
val = f"val_{i}" # short string
elif i % 5 == 3:
val = i % 2 == 0 # bool
else:
val = (i, i + 1, i + 2) # small tuple
setattr(self, name, val)
class WithSlots:
__slots__ = tuple(_attr_names)
def __init__(self):
for i, name in enumerate(_attr_names):
# same mixed-type initialization as NoSlots
...
            setattr(self, name, val)

Memory is measured via tracemalloc — tracking actual heap allocations, not estimates. Timing uses timeit for stable micro-benchmarks. Profiling uses cProfile + gprof2dot for call-graph visualization:
import timeit
import tracemalloc
N_INSTANCES = 10_000
N_REPEATS = 5
N_LOOPS = 100
def mem_usage(cls, n=N_INSTANCES):
"""Measure per-instance memory via tracemalloc."""
tracemalloc.start()
objs = [cls() for _ in range(n)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
per_obj = current / n / 1024 # KB
return per_obj, current / 1024**2 # MB
def bench_create(cls):
"""Time N_LOOPS object creations, repeat N_REPEATS times."""
def _create():
for _ in range(N_LOOPS):
cls()
return _create
def bench_access(instances):
"""Time reading attr_0, attr_50, attr_99 across N_INSTANCES objects."""
def _access():
for obj in instances:
_ = obj.attr_0
_ = obj.attr_50
_ = obj.attr_99
return _access
def bench_setattr(instances):
"""Time writing to 3 attributes across N_INSTANCES objects."""
def _setattr():
for obj in instances:
obj.attr_0 = 42
obj.attr_50 = 42
obj.attr_99 = 42
    return _setattr
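The harness functions above define the workloads; a driver along these lines (a sketch — the benchmark suite’s exact invocation may differ) wires them to timeit:

def run_timings(cls):
    instances = [cls() for _ in range(N_INSTANCES)]
    create = min(timeit.repeat(bench_create(cls), repeat=N_REPEATS, number=1))
    read = min(timeit.repeat(bench_access(instances), repeat=N_REPEATS, number=1))
    write = min(timeit.repeat(bench_setattr(instances), repeat=N_REPEATS, number=1))
    return create, read, write

The benchmark: 10,000 instances, 100 small attributes each (int, float, short string, bool, small tuple). Here’s what changes: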
Benchmark Results — 10,000 Small Objects
Memory:
| Class | Per Object (KB) | Total 10,000 objs (MB) |
|---|---|---|
| NoSlots | 6.74 | 65.8 |
| WithSlots | 3.97 | 38.8 |
| Savings | 40.9% | 27.0 MB |
Attribute Access:
| Operation | NoSlots | WithSlots | Speedup |
|---|---|---|---|
| Creation | 0.011 s | 0.010 s | 1.08× |
| Read | 0.029 s | 0.004 s | 7.70× |
| Write | 0.012 s | 0.005 s | 2.32× |
Reads are nearly 8× faster. A __dict__ read is a full dict lookup: hash the key, probe the table, compare keys, resolve collisions. A slot read eliminates all of that. Each slot name becomes a slot descriptor on the type (not the instance). When you access obj.attr_0, CPython resolves the descriptor through the type’s attribute cache, and its __get__ reads from object->slots_array[slot_offset] — a pre-computed index into a dense C array. There’s still one pointer indirection (type → descriptor → slot offset), but it’s far cheaper than dict.__getitem__.
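You can poke at that machinery directly — the slot names really do live on the type as descriptors:

class Point:
    __slots__ = ("x", "y")

print(type(Point.x))       # <class 'member_descriptor'>
p = Point()
p.x = 3
print(Point.x.__get__(p))  # 3 — the same descriptor path p.x takes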
cProfile Confirms It
Profiling 200,000 setattr calls across 2,000 objects:
| Function | NoSlots (cumtime) | WithSlots (cumtime) |
|---|---|---|
| setattr | 0.050 s | 0.035 s |
| __init__ | 0.118 s | 0.091 s |
| Total | 0.129 s | 0.096 s |
Slots-based setattr is 1.43× faster per call. __init__ is 1.3× faster. Same function call count (202,002) — just less work inside each call.
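For reference, a profile like that can be captured as follows (a sketch matching the stated shape: 2,000 objects × 100 attributes = 200,000 setattr calls; gprof2dot renders the dumped stats separately):

import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
objs = [WithSlots() for _ in range(2_000)]  # profiled __init__ cost
for obj in objs:
    for name in _attr_names:
        setattr(obj, name, 0)               # 200,000 setattr calls
pr.disable()
pstats.Stats(pr).sort_stats("cumulative").print_stats(5)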
When __slots__ Does Nothing
The benchmark was also run with numpy 20×20 float64 matrices (325 KB each) as attribute values:
| Metric | NoSlots | WithSlots | Difference |
|---|---|---|---|
| Memory per obj | 328.65 KB | 325.82 KB | 0.8% |
| Read time | 0.842 s | 1.103 s | slower |
The __dict__ overhead (~3.3 KB) is noise against 325 KB of data — less than 1% of total object size. Saving 2.5 KB when each object already holds 329 KB doesn’t matter. Worse, reads actually regressed slightly. For large-data objects, the “read” is just returning a reference to an already-allocated numpy array — the lookup mechanism itself is the entire cost. CPython’s dict lookup is highly optimized for repeated attribute access on stable objects (version-tagged hash tables skip re-hashing when the dict hasn’t been mutated), and the slot descriptor overhead — the cached type lookup, __get__ call, and slot offset dereference — becomes measurable when data access itself is negligible. __slots__ helps when the object wrapper is a significant fraction of total size. When data dwarfs the wrapper, focus optimization effort on the data structures themselves.
The ~10 KB-per-instance threshold isn’t arbitrary — it’s the inflection point between the two benchmarks above. At 6.74 KB, __slots__ saved 40.9%. At 328 KB, it saved 0.8%. Between those extremes sits a judgment call. If your object data exceeds ~10 KB, the wrapper overhead is already down to roughly 25% of the total, and slots’ diminishing returns make it a lower-priority optimization.
Pitfalls of __slots__
__slots__ is a class-level declaration with class-level consequences. The attribute access speedup is real, but so are the constraints it imposes — and several of them only surface at the boundaries: inheritance, serialization, and third-party code that assumes __dict__ exists.
1. Inheritance breaks slots silently
If a slotted class inherits from an unslotted one, the subclass gets a __dict__ anyway — the parent’s __dict__ is inherited and the child’s __slots__ declaration is effectively ignored for memory purposes. No error, no warning, just the overhead you thought you eliminated.
class Base:
pass # has __dict__
class Child(Base):
__slots__ = ("x", "y") # __dict__ still present, inherited from Base The fix: every class in the inheritance chain must declare __slots__. For a chain where the root class needs no slots of its own, declare an empty tuple:
class Base:
__slots__ = ()
class Child(Base):
__slots__ = ("x", "y") # now truly no __dict__ This is easy to get wrong when adding a mixin or base class later — the new parent silently reintroduces __dict__ and you don’t notice until you measure again.
2. __weakref__ is opt-in
By default, slotted classes can’t be weakly referenced. This matters if anything in your stack uses weakref — some caches, ORM session managers, and observer patterns rely on it. The fix is explicit: add "__weakref__" to __slots__. But if you’re inheriting from a class that already includes it, adding it again raises a TypeError. Check your chain before adding it.
class Record:
__slots__ = ("id", "value", "__weakref__") # opt back in 3. Pickling requires __getstate__ / __setstate__
The oldest pickle protocols (0 and 1) read __dict__ to serialize instance state; with __slots__ there is no __dict__, so they raise TypeError on slotted instances. Protocol 2+ handles slots automatically, but relying on the default couples your serialized form to the class layout. If your objects cross process boundaries — Celery tasks, multiprocessing queues, joblib — implementing both methods keeps serialization explicit and stable:
class Record:
__slots__ = ("id", "value")
def __getstate__(self):
return {s: getattr(self, s) for s in self.__slots__}
def __setstate__(self, state):
for k, v in state.items():
            setattr(self, k, v)
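A quick round-trip confirms the state survives the boundary:

import pickle

r = Record()
r.id, r.value = 1, "a"
clone = pickle.loads(pickle.dumps(r))
assert (clone.id, clone.value) == (1, "a")

4. Dynamic attributes are gone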
__slots__ eliminates __dict__, which means you can’t attach arbitrary attributes at runtime. This is usually the point — but it breaks any code that relies on it: ORMs that lazy-load relationships onto instances, test fixtures that monkey-patch attributes, dataclasses with field(default_factory=...) in edge cases, and debugging tools that annotate objects in-place. If a library you don’t control adds attributes to your instances, __slots__ will raise AttributeError at runtime, not at definition time.
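The failure is immediate and explicit — using the Record class from the pickling example:

r = Record()
r.id = 1         # fine: "id" is a declared slot
r.timestamp = 0  # AttributeError: 'Record' object has no attribute 'timestamp'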
5. The single-class optimization trap
__slots__ only pays off when you instantiate the class many times. Declaring __slots__ on a singleton, a config object, or a class that gets instantiated a handful of times per process adds complexity and constraints for zero measurable gain. The decision rule from the benchmark holds: if you’re not creating hundreds of thousands of instances, profile first and reach for __slots__ only if memory or attribute access shows up in the results.
When to use?
| | Object Pooling | __slots__ |
|---|---|---|
| Scope | Any language (Go, Java, C++, Rust…) | CPython only |
| What it fixes | Repeated expensive construction | Per-instance __dict__ overhead |
| Memory | Eliminates GC from cyclic garbage | ~40% savings on small objects |
| Speed | 3× batch throughput, 18× per-message | 3–8× faster attribute reads |
| Best for | Long-lived services with expensive init | Millions of small DTOs, events, records |
| Cost | Pool management complexity, state hygiene | No dynamic attributes, no __weakref__ |
| When to skip | Cheap construction, no cycles, low load | Objects holding large arrays/buffers |
Both techniques are free — no dependencies, no build steps, just a different mental model of where your program spends its time. Pooling is the right lever when construction is expensive; slots is the right lever when you’re paying for memory and attribute access you didn’t know you had. Profile first. The fix is usually simpler than switching languages.