How to Choose Batch Sizes to Minimize Cost per Token in LLM Serving

Running large language models is expensive. If you are paying for every token generated without optimizing your infrastructure, you are likely throwing money away. The biggest lever you can pull to reduce these costs isn't always switching to a cheaper model; it’s how you group your requests. This technique, known as batching, lets you squeeze more work out of your GPUs, potentially cutting API overhead costs by up to 90%.

But here is the catch: picking the wrong batch size can destroy your latency or crash your server with out-of-memory errors. You need to find the sweet spot where throughput is high, latency is acceptable, and the cost per token is minimized. Let’s break down exactly how to calculate that number for your specific use case.

The Economics of Batching: Why It Matters

When an LLM processes a single request, the GPU spends a lot of time idle while waiting for data to move in and out of memory. This is inefficient. By grouping multiple requests into a single batch, you keep the GPU busy: the model weights are read from memory once per step and reused for every request in the batch, so that fixed cost is amortized across many tokens. This improves resource utilization dramatically.

Consider the pricing evolution at OpenAI. Between March 2023 and September 2024, GPT-4 pricing dropped from $36 to $5 per million tokens. If you implement proper batch processing on top of that, you can drive costs down further, to around $2.50 per million tokens. That is an additional 50% savings just through operational efficiency. For enterprise deployments, this difference is massive. A 70B parameter model served on A100 GPUs can cost between $2,000 and $3,000 per day. Optimizing batch sizes directly determines whether that daily bill is manageable or catastrophic.
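
To make the link between throughput and cost concrete, here is a minimal sketch of the arithmetic for a self-hosted deployment. The function name and both throughput figures are hypothetical and chosen only for illustration; the $2,500 daily cost is simply the midpoint of the range quoted above.

```python
SECONDS_PER_DAY = 86_400


def cost_per_million_tokens(daily_gpu_cost_usd: float, tokens_per_second: float) -> float:
    """USD needed to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    return daily_gpu_cost_usd / tokens_per_day * 1_000_000


# Hypothetical sustained throughputs for the same $2,500/day deployment.
for label, tps in [("unbatched", 500.0), ("batched", 5_000.0)]:
    cost = cost_per_million_tokens(2_500.0, tps)
    print(f"{label}: ${cost:.2f} per million tokens")
```

Because the daily bill is fixed, cost per token falls in direct proportion to sustained throughput, which is why batching has such a large effect.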

Cost Impact of Batch Processing Strategies
Strategy	Estimated Cost Reduction	Latency Impact
Static Batching	Up to 50%	Moderate increase
Dynamic/Continuous Batching	Up to 87%	Minimal (optimized)
Model Cascading + Batching	Compound Savings	Variable

Determining the Optimal Batch Size for Your Task

There is no universal "best" batch size. The optimal number depends entirely on what your application is doing. Different tasks have different computational profiles, which changes how much memory and compute each request consumes.

Here are commonly cited starting ranges. Treat them as a first guess to refine against your own measurements:

Text Generation: Aim for batches of 10-50 requests. Decoding is bound by memory bandwidth rather than raw compute, and every request holds its KV cache for the whole generation, so moderate batching balances speed and capacity.
Classification Tasks: You can push larger batches, typically 100-500 requests. Since the output is short (a label or score), the GPU can process many inputs simultaneously without significant latency penalties.
Simple Q&A Systems: These perform best with 50-200 requests per batch. The context window is usually smaller than complex generation tasks, allowing for higher concurrency.

For example, a fintech engineering lead reported achieving a 58% cost reduction by moving from individual API calls to a batch size of 35 for customer support ticket classification. However, they noted it took three weeks of tuning to find that specific number. Start with these ranges, then monitor your metrics.

Understanding the Trade-offs: Latency vs. Throughput

The fundamental challenge in batch sizing is the tug-of-war between throughput (how many tokens you generate per second) and latency (how long a user waits for a response). As you increase the batch size, throughput generally increases because the GPU is utilized more efficiently. However, latency also increases because each request has to wait for its turn within the batch.

Benchmark studies show clear diminishing returns. For a 7B parameter model, the reported latency, which here is best read as time per request amortized across the batch, drops from 976 ms at a batch size of 1 to 126 ms at a batch size of 8. Beyond a batch size of 64, however, throughput plateaus while the wait experienced by each request continues to climb. GPU memory is often the limiting factor: larger batches need more KV cache memory, which can lead to out-of-memory errors if you exceed your hardware limits.
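
One way to turn this trade-off into a decision is to sweep a few batch sizes, record throughput and p95 latency for each, and pick the smallest batch size that sits on the throughput plateau while staying inside the latency budget. The sketch below assumes you have already collected those measurements; `SweepPoint`, `pick_batch_size`, the 5% plateau tolerance, and all the numbers in `SWEEP` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SweepPoint:
    batch_size: int
    tokens_per_second: float  # aggregate throughput for the whole batch
    p95_latency_ms: float     # end-to-end latency seen by a request


def pick_batch_size(points: Sequence[SweepPoint], p95_budget_ms: float,
                    plateau_tolerance: float = 0.05) -> Optional[SweepPoint]:
    """Smallest batch size within the budget whose throughput is near the best."""
    eligible = [p for p in points if p.p95_latency_ms <= p95_budget_ms]
    if not eligible:
        return None  # no measured batch size meets the latency budget
    best = max(p.tokens_per_second for p in eligible)
    near_best = [p for p in eligible
                 if p.tokens_per_second >= best * (1.0 - plateau_tolerance)]
    return min(near_best, key=lambda p: p.batch_size)


# Made-up measurements that plateau after a batch size of 64.
SWEEP = [
    SweepPoint(1, 40.0, 900.0),
    SweepPoint(8, 300.0, 1100.0),
    SweepPoint(32, 900.0, 1600.0),
    SweepPoint(64, 1200.0, 2400.0),
    SweepPoint(128, 1230.0, 4100.0),
]

choice = pick_batch_size(SWEEP, p95_budget_ms=5000.0)
print(choice.batch_size if choice else "no batch size fits the budget")
```

Preferring the smallest batch size on the plateau avoids paying extra latency and KV cache memory for a throughput gain that is within measurement noise.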

If your application is real-time, like a chatbot, you cannot afford high latency. In this case, dynamic batching is superior to static batching. Dynamic batching groups requests as they arrive, ensuring that no single request waits too long for a batch to fill up. For non-interactive applications, such as document summarization or data extraction, you can tolerate higher latency in exchange for maximum throughput and lower costs.
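
The core rule of dynamic batching can be sketched as a small simulation: a batch is dispatched either when it is full or when its oldest request has waited the maximum allowed time, whichever comes first. This is an illustrative model, not the scheduler of any particular serving framework; `dynamic_batches` and its parameters are hypothetical names, and times are integer milliseconds to keep the arithmetic exact.

```python
from typing import List, Sequence, Tuple


def dynamic_batches(arrivals_ms: Sequence[int], max_batch_size: int,
                    max_wait_ms: int) -> List[Tuple[int, List[int]]]:
    """Group sorted arrival times into (dispatch_time_ms, arrivals) batches."""
    batches: List[Tuple[int, List[int]]] = []
    current: List[int] = []
    deadline = 0
    for t in arrivals_ms:  # must be sorted in ascending order
        if current and t > deadline:
            # The oldest waiting request hit its deadline before t arrived.
            batches.append((deadline, current))
            current = []
        if not current:
            deadline = t + max_wait_ms  # clock starts with the first request
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))  # full batch dispatches immediately
            current = []
    if current:
        batches.append((deadline, current))  # flush the final partial batch
    return batches


arrivals = [0, 10, 20, 500, 510, 520, 530]
for dispatch_ms, batch in dynamic_batches(arrivals, max_batch_size=8, max_wait_ms=50):
    print(f"dispatch at {dispatch_ms} ms: {len(batch)} requests")
```

With this rule no request waits longer than `max_wait_ms` before its batch starts, which is what makes the approach usable for interactive traffic.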

Advanced Techniques: Continuous Batching and Model Cascading

To truly minimize cost per token, you need to look beyond simple static batching. Continuous batching is an advanced technique where new sequences are inserted into the batch as older ones complete generation, so no slot sits idle waiting for the longest sequence to finish. vLLM reports up to 24x higher throughput than the standard Hugging Face Transformers implementation using this method, and Anyscale reports up to 23x higher throughput with reduced p50 latency.
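
A toy simulation shows where the gain comes from. It counts decode steps only, assumes every active sequence produces one token per step, and ignores prefill cost and memory limits, so it illustrates the scheduling idea rather than predicting real speedups. The function names and output lengths are hypothetical.

```python
import heapq
from typing import Sequence


def static_steps(output_lens: Sequence[int], slots: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    return sum(max(output_lens[i:i + slots])
               for i in range(0, len(output_lens), slots))


def continuous_steps(output_lens: Sequence[int], slots: int) -> int:
    """A freed slot is refilled immediately by the next queued request."""
    free_at = [0] * slots  # step at which each slot becomes available
    heapq.heapify(free_at)
    finish = 0
    for n in output_lens:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + n
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Two long generations mixed with short ones, served with 4 slots.
lens = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:", static_steps(lens, slots=4), "steps")
print("continuous:", continuous_steps(lens, slots=4), "steps")
```

In the static case the short requests finish early but their slots stay empty until the long request completes; the continuous scheduler reuses those slots, so the second long request starts almost immediately.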

Another powerful strategy is model cascading. Instead of sending every query to a premium model like GPT-4, you route 90% of queries to a smaller, cheaper model like Mistral 7B (which costs approximately $0.00006 per 300 tokens). Only complex requests are escalated to the larger model. When combined with batch processing, this can yield compound cost reductions of up to 87%. Scribd, for instance, reported a 42% reduction in monthly LLM costs after implementing dynamic batching with optimal size parameters for their document processing workload.
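
The blended cost of a cascade is a weighted average, sketched below. Pairing the Mistral 7B figure with a $5 per million token premium price is an assumption made for illustration, as is the simplification that both models process the same number of tokens per query. `blended_cost_per_million` and its `cheap_always_runs` flag are hypothetical names.

```python
def blended_cost_per_million(cheap_usd: float, premium_usd: float,
                             cheap_share: float,
                             cheap_always_runs: bool = False) -> float:
    """Average USD per million tokens when cheap_share of traffic stays cheap."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    escalated_share = 1.0 - cheap_share
    # In a true cascade the small model sees every query before escalation;
    # with an upfront router it only sees the queries it ends up answering.
    cheap_fraction = 1.0 if cheap_always_runs else cheap_share
    return cheap_fraction * cheap_usd + escalated_share * premium_usd


# $0.00006 per 300 tokens, converted to USD per million tokens.
cheap = 0.00006 / 300 * 1_000_000
premium = 5.0  # assumed premium price per million tokens

blended = blended_cost_per_million(cheap, premium, cheap_share=0.9)
print(f"blended: ${blended:.2f} per million tokens")
print(f"saving vs premium only: {1 - blended / premium:.1%}")
```

Under these assumptions the premium model dominates the bill even at 10% of traffic, so lowering the escalation rate is worth more than shaving the small model's price.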

Hardware Constraints and Selection

Your choice of GPU dictates your maximum feasible batch size. Research published in February 2025 highlights that selecting the right GPU type can improve cost efficiency by up to 2.27x. Smaller models require less compute and memory, which makes consumer-grade GPUs attractive: their memory bandwidth per unit price is approximately 1.9x higher than that of A100 and H100 GPUs.

However, for large models like LLaMA2-70B, you need the massive memory capacity of enterprise GPUs. The Databricks engineering team notes that large batch sizes increase the KV cache size, which in turn may require more GPUs to serve the model. This creates a complex trade-off: increasing batch size reduces cost per token but may increase infrastructure costs if you need to add more hardware to handle the memory load. Always profile your specific model and input sequence lengths before scaling hardware.
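
A rough KV cache estimate helps with this kind of profiling. The sketch below uses the standard per-token formula (one key and one value vector per layer per KV head). The architecture figures are those commonly published for Llama 2 70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and should be checked against your model's config; the 20 GiB of free memory and the 4,096-token sequence length are hypothetical, as are the function names.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (dtype_bytes=2 means fp16/bf16)."""
    # Factor 2: each layer stores both a key and a value per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_batch_size(free_memory_bytes: int, tokens_per_sequence: int,
                   bytes_per_token: int) -> int:
    """Upper bound on concurrent sequences that fit in the free memory."""
    per_sequence = tokens_per_sequence * bytes_per_token
    if per_sequence <= 0:
        raise ValueError("sequence length and bytes per token must be positive")
    return free_memory_bytes // per_sequence


per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
free_bytes = 20 * 1024**3  # assumed memory left after loading the weights
print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print("max sequences at 4,096 tokens each:",
      max_batch_size(free_bytes, 4096, per_token))
```

This is an upper bound: it ignores activation memory and fragmentation, so the practical limit will be somewhat lower.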

Implementation Checklist and Pitfalls

Implementing batch processing is not just about changing a configuration flag. It requires careful architectural planning. Here is what you need to watch out for:

Input Sequence Length Variability: Longer sequences consume more memory. If your users send varying lengths of text, your effective batch size will be limited by the longest sequence in the batch. Consider padding strategies or truncation policies.
Streaming Overhead: Streaming responses typically cost 20-40% more than batch processing due to higher computational overhead and constant resource maintenance. Disable streaming for non-interactive tasks.
Early Stopping: Configure models to halt token generation when satisfactory completions are reached. This can reduce output tokens by 20-40% without affecting user experience, compounding your batch savings.
Caching: Use retrieval-augmented generation (RAG) caching for repetitive queries. This can further reduce costs by 15-25% by avoiding redundant inference entirely.
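
To illustrate the sequence length point above, the sketch below compares how many token positions a padded static batch processes when requests are batched in arrival order versus grouped by similar length. It applies to setups that pad each batch to its longest sequence; engines that avoid padding behave differently. The function name and the lengths are hypothetical.

```python
from typing import Sequence


def padded_tokens(lengths: Sequence[int], batch_size: int) -> int:
    """Token positions processed when each batch pads to its longest input."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [900, 40, 60, 850, 50, 30, 800, 45]  # made-up input lengths

useful = sum(lengths)
arrival_order = padded_tokens(lengths, batch_size=4)
length_sorted = padded_tokens(sorted(lengths), batch_size=4)

print("useful tokens:", useful)
print("padded, arrival order:", arrival_order)
print("padded, sorted by length:", length_sorted)
```

Grouping similar lengths cuts the padded work substantially in this example, at the cost of reordering requests, which is usually acceptable only for non-interactive workloads.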

Expect a learning curve. Most engineering teams spend 2-4 weeks tuning their batch sizes. Monitor your p50 and p95 latency metrics closely. If latency spikes beyond your service level agreements (SLAs), reduce the batch size. If latency is within the SLA but GPU utilization is below 70%, increase the batch size. The goal is sustained high utilization without breaking user experience.
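
The tuning rule can be written down as a small adjustment function to run after each monitoring window. The 70% utilization threshold comes from the guidance above; the 25% step, the size bounds, and the function name are assumptions made for this sketch.

```python
import math


def next_batch_size(current: int, p95_latency_ms: float, sla_ms: float,
                    gpu_utilization: float, step: float = 0.25,
                    min_size: int = 1, max_size: int = 512) -> int:
    """Propose the batch size for the next window from observed metrics."""
    if p95_latency_ms > sla_ms:
        # Latency takes priority: shrink by at least one request.
        proposed = min(current - 1, math.floor(current * (1.0 - step)))
        return max(min_size, proposed)
    if gpu_utilization < 0.70:
        # Headroom on the GPU and within the SLA: grow by at least one.
        proposed = max(current + 1, math.ceil(current * (1.0 + step)))
        return min(max_size, proposed)
    return current  # inside the SLA and well utilized: leave it alone


print(next_batch_size(32, p95_latency_ms=1500, sla_ms=1000, gpu_utilization=0.90))
print(next_batch_size(32, p95_latency_ms=500, sla_ms=1000, gpu_utilization=0.50))
print(next_batch_size(32, p95_latency_ms=500, sla_ms=1000, gpu_utilization=0.85))
```

Checking latency before utilization means an SLA breach always wins, even when the GPU looks underused.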

Future Trends in Batch Optimization

The landscape of LLM serving is evolving rapidly. Automated batch size optimization systems are emerging, which dynamically adjust batch sizes based on real-time workload characteristics and cost constraints. Anthropic’s roadmap indicates planned integration of such features in late 2025. Additionally, mixed-integer linear programming for scheduling is being explored to achieve up to 41% higher throughput within the same price budget. As the LLM market grows toward a projected $36.1 billion valuation by 2030, mastering these optimization techniques will remain a critical competitive advantage.

What is the ideal batch size for text generation?

For most text generation tasks, a batch size between 10 and 50 is a good starting point. This range balances the memory each request holds in the KV cache during decoding against acceptable latency. Larger batches may cause significant delays for users expecting real-time responses.

How does continuous batching differ from static batching?

Static batching waits for a fixed number of requests before processing them, which can introduce unnecessary latency if requests arrive sporadically. Continuous batching dynamically inserts new requests into the batch as previous ones finish, which keeps the GPU busy; reported throughput gains reach up to 24x over a standard Hugging Face baseline.

Can I use batch processing for real-time chatbots?

Yes, but you must use dynamic or continuous batching rather than static batching. Static batching can cause unacceptable delays. Ensure your infrastructure supports low-latency insertion of new requests into ongoing batches to maintain a smooth user experience.

Why is streaming more expensive than batch processing?

Streaming requires maintaining open connections and processing tokens individually as they are generated, which adds computational overhead. Batch processing aggregates requests, allowing the GPU to process them in parallel with fewer connection management costs, reducing expenses by 20-40%.

How do I determine my maximum batch size?

Your maximum batch size is constrained by your GPU’s memory capacity. Start with small batches and gradually increase until you hit out-of-memory errors or see latency degrade significantly. Factors like model size, input sequence length, and KV cache requirements all impact this limit.