We present StockSim, an open-source, multi-modal simulation platform for systematic evaluation of large language models (LLMs) in realistic financial decision-making scenarios. Unlike previous toolkits that offer limited scope, StockSim delivers a comprehensive system that fully models market dynamics and supports diverse simulation modes of varying granularity. It incorporates critical real-world factors, such as latency, slippage, and order-book microstructure, that were previously neglected, enabling more faithful and insightful assessment of LLM-based trading agents. An extensible, role-based agent framework supports heterogeneous trading strategies and multi-agent coordination, making StockSim a uniquely capable testbed for NLP research on reasoning under uncertainty and sequential decision-making.
StockSim employs a modular, asynchronous architecture designed around four core components that enable comprehensive LLM evaluation in realistic trading environments. The figure below illustrates the system's data flow and component interactions, highlighting the two execution mechanisms (order-level and candlestick-level execution), which are seamlessly integrated with shared modules for market data retrieval, indicator computation, news/fundamentals integration, and agent interactions. This design ensures consistency, flexibility, and scalability, supporting diverse experimental setups and facilitating reproducible experimentation on sequential decision-making in financial contexts. More details can be found in the paper (PDF).
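To make the agent-facing side of this design concrete, the sketch below shows how a role-based agent might consume shared market data and emit orders at each simulation step. The class and method names (`TradingAgent`, `MarketSnapshot`, `Order`, `on_market_data`) are illustrative assumptions, not StockSim's actual API.

```python
# Illustrative sketch only: names are hypothetical and do not reflect StockSim's API.
# It shows how a role-based agent could consume shared market data and emit orders.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MarketSnapshot:
    symbol: str
    close: float
    indicators: dict = field(default_factory=dict)  # e.g. {"sma_20": ..., "rsi_14": ...}
    news: list = field(default_factory=list)        # headlines / fundamentals passed to the agent


@dataclass
class Order:
    symbol: str
    side: str                              # "buy" or "sell"
    quantity: float
    limit_price: Optional[float] = None    # None -> market order


class TradingAgent(ABC):
    """Role-based agent interface: one decision per simulation step."""

    @abstractmethod
    def on_market_data(self, snapshot: MarketSnapshot) -> Optional[Order]:
        ...
```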
Scalability and Consistency
StockSim's scalability and consistency are evaluated through a series of controlled simulation tests using varying numbers of deterministic agents.
Each agent follows a predefined strategy, such as a moving-average crossover or buy-and-hold, allowing us to observe the simulation engine's behavior under repeatable conditions. To ensure that the evaluation reflects only the core behavior of the engine, we exclude LLMs, which introduce variability in latency, resource usage, and output consistency due to differences in deployment mode, reasoning strategy, and stochastic outputs.
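As an example of such a deterministic strategy, a moving-average-crossover agent could be written against the hypothetical interface from the previous sketch; the implementation below is purely illustrative and carries no randomness, so repeated runs produce identical decisions.

```python
class MovingAverageCrossoverAgent(TradingAgent):
    """Deterministic agent: buys when the short SMA crosses above the long SMA,
    sells on the opposite cross. No randomness, so runs are repeatable."""

    def __init__(self, symbol: str, short: int = 10, long: int = 50, qty: float = 100.0):
        self.symbol, self.short, self.long, self.qty = symbol, short, long, qty
        self.prices: list[float] = []
        self.position = 0.0

    def _sma(self, window: int) -> Optional[float]:
        if len(self.prices) < window:
            return None
        return sum(self.prices[-window:]) / window

    def on_market_data(self, snapshot: MarketSnapshot) -> Optional[Order]:
        self.prices.append(snapshot.close)
        short_sma, long_sma = self._sma(self.short), self._sma(self.long)
        if short_sma is None or long_sma is None:
            return None                                  # not enough history yet
        if short_sma > long_sma and self.position == 0:
            self.position = self.qty
            return Order(self.symbol, "buy", self.qty)
        if short_sma < long_sma and self.position > 0:
            qty, self.position = self.position, 0.0
            return Order(self.symbol, "sell", qty)
        return None
```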
The results confirm StockSim’s consistency: across all runs, simulation outputs (including order placements, executions, and performance metrics) remain identical. This repeatability empirically verifies the platform’s deterministic behavior and validates its correctness, since any deviation would indicate flaws in the design or execution logic.
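A lightweight way to check this kind of run-to-run determinism, independent of how StockSim performs its own validation, is to fingerprint the serialized outputs of each run and compare the hashes; the helper below is a minimal sketch (the `run_simulation` call in the trailing comment is hypothetical).

```python
import hashlib
import json


def run_fingerprint(orders: list, executions: list, metrics: dict) -> str:
    """Hash the full simulation output so two runs can be compared byte-for-byte."""
    payload = json.dumps(
        {"orders": orders, "executions": executions, "metrics": metrics},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# Identical fingerprints across repeated runs => deterministic behavior, e.g.:
# assert run_fingerprint(*run_simulation(config)) == run_fingerprint(*run_simulation(config))
```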
Scalability is assessed by monitoring system-level metrics during each run, including CPU utilization across all cores and memory usage (in MB) for both the simulation engine and RabbitMQ. Results across agent configurations are presented in the figure above, confirming that StockSim scales almost linearly up to ~150 agents: the simulation container's mean CPU load increases from 8% to 27%, while memory usage rises from 0.8 GB to 2 GB, both roughly proportional to the agent count. Beyond this point, resource usage grows super-linearly: at 300 and 500 agents, mean CPU usage surges to 123% and 418%, and memory climbs to 4.1 GB and 5.6 GB, respectively, with peak values reaching nearly four times the averages.
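Measurements of this kind can be gathered with standard tooling; the minimal sketch below uses the `psutil` library, with the monitored process names and sampling schedule as illustrative assumptions rather than the exact setup used here. Note that per-process CPU percentages can exceed 100% on multi-core machines, which is why figures such as 418% appear above.

```python
import time
import psutil


def sample_usage(pids: dict, interval: float = 1.0, samples: int = 60) -> dict:
    """Periodically record CPU (%) and RSS memory (MB) for monitored processes,
    e.g. {"simulation_engine": 1234, "rabbitmq": 5678} mapping names to PIDs."""
    procs = {name: psutil.Process(pid) for name, pid in pids.items()}
    history = {name: [] for name in procs}
    for _ in range(samples):
        for name, proc in procs.items():
            cpu = proc.cpu_percent(interval=None)          # % summed across cores
            mem_mb = proc.memory_info().rss / (1024 ** 2)  # resident memory in MB
            history[name].append((cpu, mem_mb))
        time.sleep(interval)
    return history
```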
Despite this growth, the resource demands of the simulation framework remain modest; even at maximum load, usage peaks at 5.6 GB of RAM and a few CPU cores. All experiments are conducted on a MacBook Pro with an Apple M3 Pro chip (11-core CPU) and 18 GB of unified memory, underscoring StockSim's efficiency. Running 500 LLM agents concurrently is practically infeasible on such hardware, whereas this analysis demonstrates that StockSim itself can handle such scale with ease.
LLM Trading Behavior
To demonstrate the ease with which insights about model behavior can be extracted using StockSim, we run a simulation for two LLMs, GPT-o4-mini and GPT-o3, using the same prompt on the NVIDIA stock over a two-month period, from April 28, 2025, to June 28, 2025. The simulation assumes daily trading, with orders placed before market open. The results, based on the performance metrics provided by StockSim, are presented in the table below, revealing distinct trading patterns and strategic behaviors between the two LLMs.
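For context, a run like this is specified through configuration rather than code; the dictionary below is a hypothetical illustration of the parameters involved. The key names do not necessarily match StockSim's configuration schema, and the initial capital is inferred from the reported ROI and final portfolio values.

```python
# Hypothetical experiment configuration (keys are illustrative, not StockSim's schema).
experiment = {
    "symbol": "NVDA",
    "start_date": "2025-04-28",
    "end_date": "2025-06-28",
    "execution_mode": "candlestick",   # daily candles; orders placed before market open
    "decision_frequency": "1d",
    "agents": [
        {"role": "llm_trader", "model": "gpt-o4-mini", "prompt": "shared_prompt.txt"},
        {"role": "llm_trader", "model": "gpt-o3", "prompt": "shared_prompt.txt"},
    ],
    "initial_capital": 100_000,        # inferred from reported ROI and final portfolio values
}
```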
| Metric | GPT-o4-mini | GPT-o3 |
|---|---|---|
| ROI (↑) | 0.0734 | 0.2956 |
| Sharpe Ratio (SR) (↑) | 0.1652 | 0.376 |
| Annualized SR (↑) | 2.6218 | 5.9682 |
| Sortino Ratio (↑) | 0.2868 | 1.0587 |
| Win Rate (↑) | 0.6667 | 1.0 |
| Profit Factor (↑) | 2.3691 | 999.0 |
| Max Drawdown (↓) | 0.0306 | 0.0323 |
| Num Trades | 31 | 9 |
| Num Closed Trades | 21 | 6 |
| Total Traded Volume | 931,416.775 | 368,306.25 |
| Average Trade Size | 30,045.70 | 40,922.92 |
| ROIC | 0.0151 | 0.1633 |
| Profit per Trade (↑) | 258.47 | 4,520.13 |
| Last Portfolio Value (↑) | 107,338.30 | 129,556.75 |
| Realized P&L | 5,427.80 | 27,120.75 |
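For reference, the risk-adjusted metrics above follow standard definitions; the sketch below gives typical formulas under the assumption of daily returns and a 252-day annualization factor (which is consistent with the reported SR and Annualized SR values), though it is not necessarily StockSim's exact implementation.

```python
import math
import statistics


def sharpe_ratio(returns: list, risk_free: float = 0.0) -> float:
    """Per-period Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def annualized_sharpe(returns: list, periods_per_year: int = 252) -> float:
    # Daily SR scaled by sqrt(252); matches the table (0.1652 * sqrt(252) ≈ 2.62).
    return sharpe_ratio(returns) * math.sqrt(periods_per_year)


def sortino_ratio(returns: list, target: float = 0.0) -> float:
    """Like Sharpe, but penalizes only downside deviation below the target return."""
    downside = [min(0.0, r - target) for r in returns]
    downside_dev = math.sqrt(sum(d * d for d in downside) / len(returns))
    if downside_dev == 0:
        return float("inf")
    return (statistics.mean(returns) - target) / downside_dev


def profit_factor(trade_pnls: list) -> float:
    """Gross profit divided by gross loss; with no losing trades the ratio is
    unbounded (the 999.0 reported for GPT-o3 suggests a cap in that case)."""
    gains = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return gains / losses if losses > 0 else float("inf")
```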
Metrics such as ROI, Profit per Trade, and Profit Factor highlight that GPT-o3 pursues a more selective trading strategy, characterized by fewer, larger positions taken with higher conviction, resulting in greater profitability and reduced downside risk, as demonstrated by its superior Sortino Ratio and perfect Win Rate. Conversely, GPT-o4-mini exhibits a more active trading style, evidenced by its higher number of trades and greater traded volume, indicating frequent market interactions but lower profit efficiency per transaction. The contrasting Sharpe Ratio and Annualized Sharpe Ratio further underscore GPT-o3's superior ability to maintain consistent, risk-adjusted returns over time, while GPT-o4-mini's lower metrics suggest that its strategy involves more frequent but less decisive market positions. Overall, StockSim's evaluation results effectively capture and distinguish the underlying strategic differences between the two LLMs, allowing clear interpretation of their respective trading behaviors. Importantly, we obtain these results without writing any code, paving the way for exploring more LLM-driven trading strategies.