
    Ultimate Gradient Checkpointing Performance Guide for Neural Networks

    Published on Aug 17, 2025


Anyone working on LLM training or inference optimization knows the GPU memory bottleneck: you start training a big model, and the run dies when activations and optimizer states overwhelm memory. Want to keep building without buying more hardware? Gradient checkpointing stores fewer activations during the forward pass and recomputes them during backpropagation, cutting activation storage and freeing GPU memory so you can scale models alongside gradient accumulation, mixed precision, or model parallelism.

    To train larger, more powerful neural networks efficiently on limited hardware without running into a memory bottleneck, Inference's AI inference APIs let you enable checkpointing and memory optimization with simple settings, handling recomputation and resource management so you can focus on model quality rather than low-level tuning.

    What are the Fundamental Concepts of Gradient Checkpointing and How Does it Work?


    Activation checkpointing, often called gradient checkpointing, is a memory-saving trick for training deep neural networks. During a regular forward pass, the model stores many intermediate activations so the backward pass can compute gradients.

    Checkpointing clears the internal intermediate tensors for selected modules and keeps only the module inputs and outputs. Then, during the backward pass, the library recomputes those freed activations on demand so gradients can be computed.

    The result:

    Lower peak memory use at the cost of extra compute from recomputation.

    How Checkpointing Works Step By Step, With A Simple Forward And Backward Example

    Start with a small example: four layers A, B, C, D executed in that order during the forward pass.

    Normal training without checkpointing

    • Forward: compute A -> store its activations, compute B -> store, compute C -> store, compute D -> store
    • Backward: use stored activations for D, then C, then B, then A to compute gradients.
    • Peak memory holds activations for A through D

    With checkpointing applied to B and C

    • Forward: compute A and keep its activations; compute B then free internal intermediates but keep B input and B output; compute C, then free internals but keep C input and C output; compute D and keep its activations
    • Backward: process D backward using stored activations; when coming to C, recompute C forward from its input to regenerate internal activations then do C backward; next recompute B forward as needed, then do B backward; finally, do A backward.

    This pattern lowers the peak memory required during the forward pass and the early part of the backward pass because internal tensors from checkpointed modules are not held in memory. The tradeoff is the extra forward work to recompute those activations during backward.
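
To make the pattern concrete, here is a minimal sketch using the plain torch.utils.checkpoint API (the layer shapes are arbitrary placeholders; the smdistributed wrappers discussed later follow the same idea):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Stand-in layers A, B, C, D; shapes are arbitrary for illustration.
A = nn.Linear(128, 128)
B = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))
C = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128))
D = nn.Linear(128, 128)

x = torch.randn(32, 128, requires_grad=True)

h = A(x)                                   # A's activations stay resident
h = checkpoint(B, h, use_reentrant=False)  # only B's input/output kept; internals freed
h = checkpoint(C, h, use_reentrant=False)  # same for C
out = D(h)                                 # D's activations stay resident

out.sum().backward()  # B and C are re-run forward here, on demand, before their backward
```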

    Memory Versus Compute Tradeoff Explained Clearly

    Why trade compute for memory? On modern GPUs, memory can become the limiting factor for model size or batch size. Checkpointing lets you fit larger models or bigger batches by sacrificing extra compute time spent recomputing activations.

The effective peak memory can drop substantially because only the checkpointed module inputs and outputs and the activations outside checkpointed regions remain resident. Expect runtime to increase roughly in proportion to the amount and cost of recomputation; for deep stacks of cheap layers the overhead can be moderate, while for heavy layers it can be larger.

    What Exactly is Stored And What Gets Freed When a Module is Checkpointed

    When you checkpoint a module

    • Kept: the module input tensors and the module output tensors.
    • Freed during forward: any intermediate tensors that would normally be retained inside the module.
    • Recreated during backward: those intermediate tensors are recomputed from the saved module inputs when needed for gradient calculation.

    This pattern reduces peak activation memory between the checkpoint and the next saved point because the backward of later layers finishes before the recomputation happens.
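
If you want to see that effect directly, a rough measurement sketch like the following (using the plain torch.utils.checkpoint API and assuming a CUDA device; model width, depth, and batch size are arbitrary) compares peak allocated memory for the same forward and backward pass with and without checkpointing:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def peak_training_memory(use_ckpt: bool) -> int:
    torch.cuda.reset_peak_memory_stats()
    blocks = [nn.Sequential(nn.Linear(2048, 2048), nn.ReLU()) for _ in range(16)]
    model = nn.Sequential(*blocks).cuda()
    x = torch.randn(256, 2048, device="cuda", requires_grad=True)
    if use_ckpt:
        # Keep only segment-boundary tensors; internals are recomputed in backward.
        out = checkpoint_sequential(model, 4, x, use_reentrant=False)
    else:
        out = model(x)
    out.sum().backward()
    return torch.cuda.max_memory_allocated()

if torch.cuda.is_available():
    print("baseline peak bytes:    ", peak_training_memory(False))
    print("checkpointed peak bytes:", peak_training_memory(True))
```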

    Code Examples for Practical Use in Plain Pytorch Style

Example 1: Checkpoint a single module call in a model definition

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # This call of fc1 will be checkpointed
        x = checkpoint(self.fc1, x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)
```

Example 2: Checkpoint a sequential block

```python
import torch.nn as nn
import torch.nn.functional as F
from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1, 20, 5),
            nn.ReLU(),
            nn.Conv2d(20, 64, 5),
            nn.ReLU()
        )

    def forward(self, x):
        # This call of self.seq will be checkpointed
        x = checkpoint_sequential(self.seq, x)
        return F.log_softmax(x, 1)
```

    Using Checkpointing With A Prebuilt Model Such As Hugging Face Transformers

    When you import a model from a library and you want to checkpoint transformer layers, do the following:

    • Initialize smdistributed.modelparallel with smp.init.
    • Wrap the model with smp.DistributedModel.
    • Identify the container object that holds sequential transformer layers.
    • Call smp.set_activation_checkpointing on that container.
```python
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()

model = AutoModelForCausalLM(*args, **kwargs)
model = smp.DistributedModel(model)

# Call the set_activation_checkpointing API
transformer_layers = model.module.module.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy='each')
```

    Practical Rules And Restrictions When Using Smdistributed.Modelparallel Checkpointing

    • Granularity: Activation checkpointing works at module granularity. For most torch.nn modules, you can checkpoint a module tree only if it is fully contained within a single pipeline partition.
    • Sequential special case: For torch.nn.Sequential, each nested module tree inside the sequential block must lie entirely within one partition for checkpointing to work.
    • Partitioning interactions: If a module is split across ranks so descendants live on different ranks, the library will ignore the checkpoint request for that module and log a warning that it will not be checkpointed.
    • Manual partitioning: When you partition models manually, watch these constraints and assign layers so checkpointable groups remain in a single partition.
    • Logs: With automated partitioning, check the training job logs for the partition assignment lines that start with Partition assignments: to see where modules landed.

    Allreduce And Compatibility Notes

    • The smdistributed.modelparallel library supports both overlapping and non-overlapping allreduce operations together with activation checkpointing.
    • PyTorch native checkpointing API is not compatible with smdistributed.modelparallel.
    • Use the smdistributed.modelparallel checkpoint APIs instead.

    Version Requirement And Availability

    Activation checkpointing support in the SageMaker model parallelism library is available starting with v1.6.0 and later. Confirm you are on that version or a newer one before relying on these APIs.

    How The Library Handles Recomputation Timing And Peak Memory

    Checkpointed modules free internal tensors during forward so the backward of later modules can complete with a lower memory footprint. When you recompute checkpointed tensors during backward, the layers beyond that checkpoint have already completed backward, so peak memory is often lower than without checkpointing. The recompute cost will be paid at that point in the backward timeline.

    Questions You Should Ask When Deciding To Use Checkpointing

    • Which layers are the heaviest in memory, and how expensive are they to recompute? Checkpoint memory-heavy but compute-light layers first to limit the runtime hit.
    • How much extra runtime is acceptable for the memory gain? Estimate recompute cost versus memory saved.
    • Is the module placement across pipeline partitions compatible with module-level checkpointing? Check the partition assignments in logs.
    • Do downstream libraries or APIs require PyTorch native checkpointing? If so, plan to replace that with sm-distributed model-parallel checkpoint APIs for compatibility.

    When To Reach For Activation Checkpointing During Model Development

    You should try activation checkpointing when memory prevents the batch size or model size you need, and you prefer to spend extra compute instead of moving to distributed data parallel or sharded states. It works well as a targeted optimization for blocks of your model that consume a lot of activation memory and that can be grouped inside a single partition.

    What are The Reasons for Implementing Gradient Checkpointing?


    Gradient checkpointing exists to solve a plain problem: GPU memory runs out long before compute does. The primary benefit is reducing the dynamic memory consumed by activations so that you can fit larger models or larger batches on limited hardware. That means you can run bigger experiments on a 16 GB GPU, push batch sizes up to something that converges, or train deeper networks without buying more memory.

    Secondary benefits follow:

    • Fewer out-of-memory errors
    • The ability to use fewer gradient accumulation steps
    • More headroom for mixed precision or data parallel overhead

    The tradeoff is extra computation from recomputing activations, but in many workflows that trade compute for memory and enable otherwise impossible runs, the exchange is worth it.

    How Model Memory Allocation Works In Training

    Static memory is mostly model weights. Even small modern networks carry millions of parameters, and production models reach hundreds of millions to billions. That static weight footprint sets a floor.

    Dynamic memory holds activations and autograd bookkeeping. Every forward pass produces activations that autograd keeps, so backward can compute gradients. Those activation tensors are stored per sample in the batch and grow with model depth and batch size. When activations dominate, you must reduce batch size or shrink the model to fit the GPU.

    Static Memory Scale and a Practical Hardware Example

    Weights dominate the static bill. For reference, models with 100 to 150 million parameters are near the practical memory limit for an NVIDIA T4 with 16 GB.

    Larger backbones require GPUs with more memory or memory reduction techniques. If weights fill most of the device RAM, you still need to trim activation memory to increase batch sizes or add layers.
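
As a back-of-the-envelope check, here is a sketch of the arithmetic (the 125M parameter count and the fp32-Adam layout are illustrative assumptions):

```python
# Rough static-memory estimate for fp32 training with Adam.
# Assumes fp32 weights, fp32 gradients, and two fp32 Adam moment buffers.
params = 125_000_000            # illustrative parameter count
bytes_per_value = 4             # fp32

weights   = params * bytes_per_value
gradients = params * bytes_per_value
adam_m    = params * bytes_per_value
adam_v    = params * bytes_per_value

total_gib = (weights + gradients + adam_m + adam_v) / 1024**3
print(f"static footprint ~ {total_gib:.1f} GiB")   # ~1.9 GiB before any activations
```

Weights, gradients, and optimizer state land around 2 GiB here; on a 16 GB card, the remaining headroom goes to activations at realistic batch sizes plus framework overhead, which is why activation memory is usually the first thing to trim.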

    What Gradient Checkpointing Does And Why It Saves Memory

    Gradient checkpointing keeps only a subset of activations during the forward pass and discards the rest. During backward, PyTorch recomputes the discarded activations on the fly from saved inputs, then computes gradients. That reduces the activation memory footprint at the expense of extra forward work.

    This moves you along the classic compute for memory tradeoff curve. For deep models, recomputation often costs less than adding more GPU memory or accepting tiny batch sizes that hurt convergence.

    How the Algorithm Works Under the Hood in Autograd

    During forward, checkpointing stores the callable and its input tensors instead of all intermediate activations. At backward, autograd uses the saved inputs and the callable to recompute outputs and intermediate values as needed, then applies the chain rule.

This tends to roughly double the forward computation for the checkpointed segments, since those parts run twice. The checkpoint API implements this by memoizing the inputs at forward time and rerunning the function during the backward pass.

    Pytorch Apis: Checkpoint And Checkpoint_sequential And How To Use Them

    PyTorch exposes two main entry points: checkpoint_sequential and checkpoint. checkpoint_sequential slices a Sequential model into N segments and checkpoints every segment but the last. It is convenient and straightforward for linear stacks, but gives you little control over boundary placement.

The checkpoint function accepts any callable and its arguments and is flexible for arbitrary module graphs. Use checkpoint in modular architectures where you want precise control over which functions to recompute.

Example: checkpoint_sequential with a tiny MLP

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 5),
    nn.ReLU()
)

input_var = torch.randn(1, 100, requires_grad=True)
segments = 2
out = checkpoint_sequential(model, segments, input_var)
out.sum().backward()
```

Example: using checkpoint inside a custom forward

```python
from torch.utils.checkpoint import checkpoint

class CIFAR10Model(nn.Module):
    # Sub-modules (cnn_block_1, cnn_block_2, dropout layers, flatten, linearize, out)
    # are defined in __init__, omitted here for brevity.
    def forward(self, X):
        X = self.cnn_block_1(X)
        X = self.dropout_1(X)
        X = checkpoint(self.cnn_block_2, X)
        X = self.dropout_2(X)
        X = self.flatten(X)
        X = self.linearize(X)
        X = self.dropout_3(X)
        X = self.out(X)
        return X
```

    When to Choose One Api Over The Other

If your model is a simple sequential stack, checkpoint_sequential is quick to apply. For models with branches or custom control flow, use checkpoint to mark specific modules or functions; that exact control lets you avoid checkpointing layers that rely on nondeterministic behavior.

    Gotchas and Incompatibilities You Must Watch For

Do not checkpoint layers whose behavior changes between runs. Dropout and batch normalization are common offenders: dropout can draw a different mask on recompute unless RNG state is preserved, and batch normalization updates its running statistics a second time when its forward is re-run. To use checkpointing, move dropout and batchnorm outside checkpointed segments or refactor so the checkpointed callable remains deterministic.

Another gotcha is RNG state.

    When you recompute a segment, RNG state must be handled so that random operations produce consistent values. Detached tensors and in-place operations can also break the checkpoint contract.

    Checkpointing early in the model may mistakenly freeze weights if the input tensors are incorrectly marked as having no gradient. That can prevent training on those layers, so test carefully when applying a checkpoint at the start of the network.
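
For the plain torch.utils.checkpoint API, two arguments address these gotchas: preserve_rng_state (on by default) replays the saved RNG state during recompute, and use_reentrant=False avoids the silent-freeze failure mode when inputs carry no gradient. A minimal sketch, with an arbitrary toy block:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.Dropout(p=0.5), nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

# preserve_rng_state=True (the default) replays the same dropout mask on recompute;
# use_reentrant=False is the non-reentrant variant that handles gradient flow more robustly.
out = checkpoint(block, x, use_reentrant=False, preserve_rng_state=True)
out.sum().backward()
assert x.grad is not None  # gradients flow back through the checkpointed block
```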

    Algorithmic Complexity And The Original Paper

The 2016 paper Training Deep Nets With Sublinear Memory Cost describes a recomputation strategy that reduces dynamic memory from O(n) to O(sqrt(n)) with well-chosen checkpoint placement.

    That means you can squeeze deep networks into far less memory by checkpointing strategically. The empirical results in the paper show dramatic reductions, for example, compressing an ImageNet variant from 48 GB to 7 GB.
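
The sqrt(n) shape comes from a simple balance, sketched below (not the paper's full analysis): split n layers into k segments and keep only the k segment-boundary activations plus the internals of the one segment being recomputed, so roughly k + n/k activations are resident, which is minimized at k = sqrt(n).

```python
import math

n = 1024                    # number of layers, illustrative
k = round(math.sqrt(n))     # segment count that balances the two terms
resident = k + n // k       # boundary activations kept + one segment recomputed at a time
print(k, resident)          # 32 segments -> roughly 64 resident activations instead of 1024
```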

    Practical Performance Impact And Benchmark Signals

    Checkpointing typically reduces peak memory by a large fraction and raises runtime by a modest factor. For example, enabling gradient checkpointing on a BERT-based sentiment model cut peak memory by about 60 percent while increasing training time by roughly 25 percent.

    That shift was helpful because it allowed larger batch sizes that would not have fit otherwise. Careful segmentation produced batch size increases from 24 to 132. Those numbers depend on architecture, precision mode, and which layers you checkpoint.

    Transformer Models And Flipping The Switch In Practice

    Popular transformer libraries include checkpointing support. For Hugging Face transformers, you can enable gradient checkpointing by setting the configuration flag to True when creating the model.

Example:

```python
import transformers

cfg = transformers.PretrainedConfig.get_config_dict("bert-base-uncased")[0]
cfg["output_hidden_states"] = True
cfg["gradient_checkpointing"] = True
cfg = transformers.BertConfig.from_dict(cfg)
self.bert = transformers.BertModel.from_pretrained("bert-base-uncased", config=cfg)
```

That single flag often integrates checkpointing across encoder layers so you can benefit without rewriting the forward pass.
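
In recent versions of the transformers library, the same switch is usually flipped through a model method rather than the config dict; a minimal sketch (the model choice is illustrative):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.gradient_checkpointing_enable()    # checkpoints each encoder layer's activations
# model.gradient_checkpointing_disable() turns it off again, e.g. for inference benchmarks
```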

    Best Practices To Get The Most From Checkpointing

    Decide what to checkpoint based on memory hot spots. Checkpoint the deep compute-heavy blocks and avoid checkpointing layers that use nondeterministic behavior. Combine checkpointing with mixed precision training to multiply memory savings.

    If activation memory still blocks you, add gradient accumulation, optimizer state partitioning, or model parallel techniques like ZeRO stage approaches. Profile memory and time with representative batch sizes before and after enabling checkpointing, so you measure real impact.
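
Here is a hedged sketch of how checkpointing, mixed precision, and gradient accumulation fit together in one PyTorch training loop; the model, data, and hyperparameters are toy placeholders:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.heavy = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
        self.head = nn.Linear(256, 10)

    def forward(self, x):
        # The memory-heavy block is checkpointed and recomputed during backward.
        x = checkpoint(self.heavy, x, use_reentrant=False)
        return self.head(x)

model = TinyNet().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler(enabled=(device == "cuda"))
accum_steps = 4  # effective batch size = microbatch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step in range(8):  # stand-in for iterating over a real data loader
    x = torch.randn(16, 256, device=device)
    y = torch.randint(0, 10, (16,), device=device)
    with autocast(enabled=(device == "cuda")):      # fp16 activations where safe
        loss = criterion(model(x), y) / accum_steps
    scaler.scale(loss).backward()                   # checkpointed block re-runs forward here
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```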

    Actionable Checklist Before Enabling Checkpointing

    • Identify which modules use dropout or batchnorm and move them out of checkpointed regions.
    • Test gradient flow on small inputs to avoid silent freezing of parameters.
    • Profile forward and backward time to understand the computation penalty.
    • Combine with mixed precision to lower the float32 activation cost.
    • Use gradient accumulation if you need a larger effective batch size without raising peak memory.

    Common Questions Practitioners Ask

    • Want to fit a 1 billion parameter model on a 16 GB GPU? Checkpointing alone usually will not be enough, because weights plus gradients and optimizer states already approach or exceed device memory.
    • Want to raise batch size to improve training stability? Checkpointing often buys you that headroom.
    • Can you checkpoint everything? No. Avoid layers that break when recomputed.
    • How much slower will training be? Typically tens of percent rather than multiples, but measure on your model and data.

    References And Where To Read More

    The PyTorch autograd documentation explains how saved tensors and the computation graph work with checkpointing. The original paper provides the formal algorithm and complexity analysis. Search for activation checkpointing, activation recomputation, memory-efficient backprop, and recompute strategies to find more implementations and community notes.

    How Do I Implement Gradient Checkpointing for My Neural Network?


Gradient checkpointing, also called rematerialization or remat, avoids storing most intermediate activations during the forward pass and recomputes them on the backward pass. Wrap the expensive block so its intermediate values are not kept in memory; when autodiff needs them, JAX reruns the block's forward computation from the saved inputs.

    That trade cuts activation memory at the cost of extra FLOPs and wall time, and it can let you train broader or deeper models on the same device.

    How Jax.Checkpoint Works in Practice and Quick Examples

    Wrap the heavy function or module with jax.checkpoint (alias jax.remat). Use the checkpointed function instead of the original in your forward path. Example patterns and tips:

    • Basic wrap: compute_block_ckpt = jax.checkpoint(compute_block); z = compute_block_ckpt(x, W1, W2)
    • When functions are pure and deterministic, remat is safe. Pass RNG keys explicitly if your block uses randomness.
    • Use static_argnums for any args that are compile-time constants to avoid recompiling.
    • Consider prevent_cse=True if you need to prevent common subexpression elimination that could remove the remat boundary.
    • In Flax, use flax.linen.remat on module call methods to remat whole modules.

    Quick code note:

wrap the gradient function in jax.jit, and call .block_until_ready() on the outputs when measuring runtime so that JAX's asynchronous dispatch does not skew the timing.
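
Putting those notes together, a minimal runnable sketch (the toy block and the shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

def compute_block(x, W1, W2):
    # A deterministic, pure block: two matmuls with a nonlinearity in between.
    h = jnp.tanh(x @ W1)
    return jnp.sum((h @ W2) ** 2)

# Rematerialize the block: its internals are recomputed during the backward pass.
compute_block_ckpt = jax.checkpoint(compute_block)

grad_fn = jax.jit(jax.grad(compute_block_ckpt, argnums=(1, 2)))

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
x = jax.random.normal(k1, (64, 256))
W1 = jax.random.normal(k2, (256, 1024))
W2 = jax.random.normal(k3, (1024, 256))

gW1, gW2 = grad_fn(x, W1, W2)
gW1.block_until_ready()  # wait for async dispatch before timing or inspecting results
```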

    Measure Memory and Compute Impact What to Track and How

    Track peak device memory, end-to-end step time, and FLOPs if available. Use these tools:

    • nvidia-smi or ROCm tools for GPU memory and utilization.
    • jax.profiler.trace and TensorBoard for timelines and memory traces.
    • Time the compiled workload with time.time() and block_until_ready to get accurate wall time.

    Record:

    • Peak device memory per step
    • Forward time, backward time, total step time
    • Gradient equivalence: max absolute difference between gradients

    Watch for increased memory spikes during recomputation of simultaneous blocks.

    Heuristics to Pick Where to Remat and How Aggressive to Be

    Use these rules of thumb:

    • Target blocks that produce very large activations relative to their parameter size. For example, wide linear layers and attention logits.
    • Favor blocks where the forward cost is small compared with the backward cost. The recompute overhead ratio is roughly forward / (forward + backward); if backward is much larger than forward, remat overhead is modest.
    • If activation size per sample times batch size dominates memory, prioritize remating that block or reducing batch size.
    • Start small: remat only the top N blocks or the heaviest single block, measure, then expand.
    • Aim for a runtime overhead budget: Typical acceptable overhead is 5 to 50 percent, depending on training schedule and resource constraints.

    Estimate Recomputation Overhead With A Formula

Let F be the forward FLOPs of the rematted section and B the backward FLOPs for the whole step. The extra cost ratio is roughly F / (F + B), so the total cost multiplier is roughly 1 + F / (F + B). If backward dominates (B >> F), the multiplier is close to 1. This helps you decide whether the extra FLOPs are worth the saved memory.
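
For instance, with illustrative numbers rather than a benchmark:

```python
# Illustrative numbers only: a rematted section with F = 10 GFLOPs of forward work
# inside a step whose total forward + backward cost is F + B = 100 GFLOPs.
F = 10.0   # forward FLOPs of the rematted section (recomputed once in backward)
B = 90.0   # backward FLOPs of the whole step
multiplier = 1 + F / (F + B)
print(multiplier)  # 1.10 -> roughly a 10 percent runtime increase for the saved memory
```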

    Checkpointing Granularity Strategies

    • Module-level remat: wrap whole layers or blocks. Least bookkeeping, easy to reason about.
    • Segment remat: partition N layers into S segments and remat each segment. This is useful for transformers: remat every K layers (see the sketch after this list).
    • Layerwise remat: remat every layer when memory is very tight. This gives maximal savings with high compute overhead.
    • Mixed scheme: remat only specific high-activation layers like feedforward big hidden layers and attention score computations, leaving small layers intact.
    • Test a few granularities: each choice changes peak memory and step time differently.
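
The segment-remat sketch referenced above, in JAX (the layer count, widths, and the segment length of four are arbitrary):

```python
import jax
import jax.numpy as jnp

def layer(x, W):
    return jnp.tanh(x @ W)

@jax.checkpoint
def segment(x, weights):
    # One remat boundary around K consecutive layers: only the segment's input and
    # output stay resident; everything inside is recomputed during backward.
    for W in weights:
        x = layer(x, W)
    return x

def forward(x, all_weights, layers_per_segment=4):
    for i in range(0, len(all_weights), layers_per_segment):
        x = segment(x, all_weights[i:i + layers_per_segment])
    return x

def loss_fn(all_weights, x):
    return jnp.sum(forward(x, all_weights) ** 2)

all_weights = tuple(jax.random.normal(jax.random.PRNGKey(i), (256, 256)) for i in range(16))
x = jax.random.normal(jax.random.PRNGKey(99), (32, 256))
grads = jax.grad(loss_fn)(all_weights, x)
```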

    Common Pitfalls And How to Avoid Them

    • RNG and nondeterminism: pass and split PRNGKeys explicitly into the rematted block so randomness is reproduced consistently on recompute (see the sketch after this list).
    • Side effects: remat assumes pure functions. Do not remat blocks that mutate global state, update running stats implicitly, or perform IO.
    • BatchNorm and streaming stats: avoid rematting modules that update running averages unless you separate the statistic update from the forward values.
    • Collective ops and communication: rematting across device collectives can duplicate cross-host communication during recompute and inflate latency. Prefer rematting inside a shard or local device block to avoid extra collectives.
    • Mixed precision differences: recomputation may use different rounding if precision contexts change between forward and recompute; keep precision consistent.
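
The RNG sketch referenced above: passing the PRNGKey as an explicit argument means the recomputed forward reuses the same key and reproduces the same mask (the dropout-style block is an illustrative example):

```python
import jax
import jax.numpy as jnp

@jax.checkpoint
def dropout_block(x, W, key, rate=0.1):
    # The PRNG key is an explicit argument, so the recomputed forward during backward
    # reuses the same key and reproduces the same dropout mask.
    h = jnp.tanh(x @ W)
    keep = jax.random.bernoulli(key, 1.0 - rate, h.shape)
    return jnp.where(keep, h / (1.0 - rate), 0.0)

def loss_fn(W, x, key):
    return jnp.sum(dropout_block(x, W, key) ** 2)

key = jax.random.PRNGKey(42)
x = jax.random.normal(jax.random.PRNGKey(0), (32, 128))
W = jax.random.normal(jax.random.PRNGKey(1), (128, 128))
grad_W = jax.grad(loss_fn)(W, x, key)
```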

    Advanced JAX Knobs and Policies

    • jax.checkpoint has options: static_argnums, prevent_cse, and policy. Use policy to control which primitives are rematted when your function contains a mix of cheap and expensive primitives.
    • Use prevent_cse=True if JAX optimizations are folding or removing the remat boundary.
    • In pjit or pmap contexts, test remat behavior with your mesh because remat can change communication patterns and memory distribution.
    • When using named axes or sharding, ensure recomputation does not cause unexpected recompilation across different sharding specs.

    Combine Checkpointing With Other Memory Techniques

    • Activation offloading: move some activations to host memory and remat only the heaviest ones.
    • Mixed precision: reduce activation sizes with float16 or bfloat16 while remating to cut memory further.
    • Gradient accumulation: use smaller microbatches to reduce per-step activation memory, then accumulate gradients to reach a larger effective batch size.
    • Parameter sharding and model parallelism: cut per-device parameter memory so remat can be used in tandem to fit activations.
    • Activation compression: quantize activations before storing if the recompute cost is high and some error is tolerable.

    A Practical Tuning Plan You Can Run In An Hour

    • Identify candidate blocks: instrument a single forward to print activation shapes or use model inspector utilities.
    • Baseline: run one step without remat, record peak memory and total step time.
    • Single-block test: remat the single largest block. Re-run and record memory and time. Check gradients match.
    • Segment expansion: remat progressively more blocks or set K-layer intervals. Track marginal gains and overhead.
    • Combine: add mixed precision or gradient accumulation if remat overhead becomes too large.
    • Choose a point on the Pareto curve where memory savings allow your desired batch size or model size, and runtime overhead fits your training budget.

    Actionable Checklist Before Applying Remat In Production

    • Make the block pure and pass RNG keys explicitly.
    • Start by rematting the largest activation producer.
    • Use jax.jit around your grad function; use block_until_ready for timing.
    • Measure peak memory with nvidia-smi and timeline traces with jax.profiler.
    • Check gradient numerical closeness; test training for a few steps to surface hidden issues.
    • Watch for extra collectives and increase in cross-host latency.
    • If overhead is too high, reduce remat scope, add mixed precision, or use activation offload.

    Questions for You to Answer Before Tuning

    • Which layers produce the biggest activations in your architecture?
    • How much runtime overhead can you accept per training step?
    • Do you have the option to shard parameters or offload activations across host memory?

    Start Building with $10 in Free API Credits Today!

    Inference exposes OpenAI-compatible serverless inference APIs that let you call top open source LLM models with familiar request semantics. The serverless model handles autoscaling, container lifecycle, concurrency limits, and warm starts, so you avoid manual management of GPU clusters.

    You get predictable per-request latency and transparent cost tracking while the platform optimizes model placement and GPU utilization for you. Want lower cold start or lower cost per token, or both? You pick the model, batch size, and sampling settings, and the platform executes the tradeoffs.

    Specialized Batch Processing For Large-Scale Async Ai Workloads

    For async workloads, Inference offers batch pipelines that group requests, schedule jobs across GPUs, and run worker pools tuned for throughput. Batch scheduling supports request coalescing, controlled latency windows, and chunked processing so you can process millions of documents or queries with predictable GPU hours.

    The system exposes job queues and callbacks that integrate with existing async frameworks so you can scale without rewriting application logic. Consider tradeoffs between larger batch sizes that increase throughput and smaller batches that reduce tail latency.

    Document Extraction For Rag Applications And Embedding Pipelines

    Document extraction features focus on chunking, text cleaning, layout aware parsing and embedding generation for retrieval augmented generation workflows. Inference supports:

    • OCR
    • Language detection
    • Selectable chunk size and overlap
    • Vector embedding outputs
    • Metadata tagging

    So embedding quality aligns with your vector store strategy. You can stream extraction results, run parallel embedding jobs, and apply postprocessing rules for better retrieval recall. How you chunk and overlap affects both retrieval accuracy and computational cost.

Getting Started With $10 in Free API Credits and Cost-Efficient Models

    Start building with $10 in free API credits to test models and pipelines before you scale. Use smaller models or quantized variants to iterate cheaply. Compare token price, latency, and output quality across models with the free credits so you can make data-driven choices.

    Try mixed precision or int8 inference to cut memory and cost while monitoring output fidelity. Will you optimize for lower latency or lower price per request during early experiments?

    Operational hints for production inference

    Profile memory footprint per model, per batch size, and context length before you set production defaults. Run experiments that vary segment size for checkpointing and measure recompute overhead in milliseconds.

    Use telemetry to detect outlier latency caused by recompute spikes and tune batch windows or enable warm containers. Automate model selection so low-traffic routes use smaller quantized models and high-throughput routes use larger models with batch optimization.

    Questions to guide your choices

    • Do you need lower latency or higher throughput?
    • Is context length the limiter, or is batch size the constraint?
    • Would you rather pay for extra GPUs to avoid recompute or accept added compute for lower memory use?

    Answering these will determine whether rematerialization and activation checkpointing fit your production stack.

