We present StockSim, an open-source, multi-modal simulation platform for systematic evaluation of large language models (LLMs) in realistic financial decision-making scenarios. Unlike previous toolkits that offer limited scope, StockSim delivers a comprehensive system that fully models market dynamics and supports diverse simulation modes of varying granularity. It incorporates critical real-world factors, such as latency, slippage, and order-book microstructure, that were previously neglected, enabling more faithful and insightful assessment of LLM-based trading agents. An extensible, role-based agent framework supports heterogeneous trading strategies and multi-agent coordination, making StockSim a uniquely capable testbed for NLP research on reasoning under uncertainty and sequential decision-making.
StockSim employs a modular, asynchronous architecture designed around four core components that enable comprehensive LLM evaluation in realistic trading environments. The figure below illustrates the system's data flow and component interactions, highlighting the two execution mechanisms (order-level and candlestick-level execution), which are seamlessly integrated with shared modules for market data retrieval, indicator computation, news/fundamentals integration, and agent interactions. This design ensures consistency, flexibility, and scalability, supporting diverse experimental setups and facilitating reproducible experimentation on sequential decision-making in financial contexts. More details can be found in the paper (PDF).
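To make the agent-facing side of this design concrete, the sketch below shows how a role-based agent might consume shared market data and emit orders at each simulation step. The class and method names (`TradingAgent`, `MarketSnapshot`, `Order`, `on_market_data`) are illustrative assumptions, not StockSim's actual API.

```python
# Illustrative sketch only: names are hypothetical and do not reflect StockSim's API.
# It shows how a role-based agent could consume shared market data and emit orders.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MarketSnapshot:
    symbol: str
    close: float
    indicators: dict = field(default_factory=dict)  # e.g. {"sma_20": ..., "rsi_14": ...}
    news: list = field(default_factory=list)        # headlines / fundamentals passed to the agent


@dataclass
class Order:
    symbol: str
    side: str                              # "buy" or "sell"
    quantity: float
    limit_price: Optional[float] = None    # None -> market order


class TradingAgent(ABC):
    """Role-based agent interface: one decision per simulation step."""

    @abstractmethod
    def on_market_data(self, snapshot: MarketSnapshot) -> Optional[Order]:
        ...
```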
Scalability and Consistency
StockSim's scalability and consistency are evaluated through a series of controlled simulation tests using varying numbers of deterministic agents.
Each agent follows a predefined strategy, such as a moving-average crossover or buy-and-hold, allowing us to observe the simulation engine's behavior under repeatable conditions. To ensure that the evaluation reflects only the core behavior of the engine, we exclude LLMs, which introduce variability in latency, resource usage, and output consistency due to differences in deployment mode, reasoning strategy, and stochastic outputs.
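As an example of such a deterministic strategy, a moving-average-crossover agent could be written against the hypothetical interface from the previous sketch; the implementation below is purely illustrative and carries no randomness, so repeated runs produce identical decisions.

```python
class MovingAverageCrossoverAgent(TradingAgent):
    """Deterministic agent: buys when the short SMA crosses above the long SMA,
    sells on the opposite cross. No randomness, so runs are repeatable."""

    def __init__(self, symbol: str, short: int = 10, long: int = 50, qty: float = 100.0):
        self.symbol, self.short, self.long, self.qty = symbol, short, long, qty
        self.prices: list[float] = []
        self.position = 0.0

    def _sma(self, window: int) -> Optional[float]:
        if len(self.prices) < window:
            return None
        return sum(self.prices[-window:]) / window

    def on_market_data(self, snapshot: MarketSnapshot) -> Optional[Order]:
        self.prices.append(snapshot.close)
        short_sma, long_sma = self._sma(self.short), self._sma(self.long)
        if short_sma is None or long_sma is None:
            return None                                  # not enough history yet
        if short_sma > long_sma and self.position == 0:
            self.position = self.qty
            return Order(self.symbol, "buy", self.qty)
        if short_sma < long_sma and self.position > 0:
            qty, self.position = self.position, 0.0
            return Order(self.symbol, "sell", qty)
        return None
```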
The results confirm StockSim’s consistency: across all runs, simulation outputs (including order placements, executions, and performance metrics) remain identical. This repeatability empirically verifies the platform’s deterministic behavior and validates its correctness, since any deviation would indicate flaws in the design or execution logic.
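A lightweight way to check this kind of run-to-run determinism, independent of how StockSim performs its own validation, is to fingerprint the serialized outputs of each run and compare the hashes; the helper below is a minimal sketch (the `run_simulation` call in the trailing comment is hypothetical).

```python
import hashlib
import json


def run_fingerprint(orders: list, executions: list, metrics: dict) -> str:
    """Hash the full simulation output so two runs can be compared byte-for-byte."""
    payload = json.dumps(
        {"orders": orders, "executions": executions, "metrics": metrics},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# Identical fingerprints across repeated runs => deterministic behavior, e.g.:
# assert run_fingerprint(*run_simulation(config)) == run_fingerprint(*run_simulation(config))
```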
Scalability is assessed by monitoring system-level metrics during each run, including CPU utilization across all cores and memory usage (in MB) for both the simulation engine and RabbitMQ. Results across agent configurations are presented in the figure above, confirming that StockSim scales almost linearly up to ~150 agents: the simulation container's mean CPU load increases from 8% to 27%, while memory usage rises from 0.8 GB to 2 GB, both roughly proportional to the agent count. Beyond this point, resource usage grows super-linearly: at 300 and 500 agents, mean CPU usage surges to 123% and 418%, and memory climbs to 4.1 GB and 5.6 GB, respectively, with peak values reaching nearly four times the averages.
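Measurements of this kind can be gathered with standard tooling; the minimal sketch below uses the `psutil` library, with the monitored process names and sampling schedule as illustrative assumptions rather than the exact setup used here. Note that per-process CPU percentages can exceed 100% on multi-core machines, which is why figures such as 418% appear above.

```python
import time
import psutil


def sample_usage(pids: dict, interval: float = 1.0, samples: int = 60) -> dict:
    """Periodically record CPU (%) and RSS memory (MB) for monitored processes,
    e.g. {"simulation_engine": 1234, "rabbitmq": 5678} mapping names to PIDs."""
    procs = {name: psutil.Process(pid) for name, pid in pids.items()}
    history = {name: [] for name in procs}
    for _ in range(samples):
        for name, proc in procs.items():
            cpu = proc.cpu_percent(interval=None)          # % summed across cores
            mem_mb = proc.memory_info().rss / (1024 ** 2)  # resident memory in MB
            history[name].append((cpu, mem_mb))
        time.sleep(interval)
    return history
```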
Despite this growth, the resource demands of the simulation framework remain modest; even at maximum load, usage peaks at 5.6 GB of RAM and a few CPU cores. All experiments are conducted on a MacBook Pro with an Apple M3 Pro chip (11-core CPU) and 18 GB of unified memory, underscoring StockSim's efficiency. Running 500 LLM agents concurrently is practically infeasible on such hardware, whereas this analysis demonstrates that StockSim itself can handle such scale with ease.
LLM Trading Behavior
To demonstrate the ease with which insights about model behavior can be extracted using StockSim, we run a simulation for two LLMs, GPT-o4-mini and GPT-o3, using the same prompt on the NVIDIA stock over a two-month period, from April 28, 2025, to June 28, 2025. The simulation assumes daily trading, with orders placed before market open. The results, based on the performance metrics provided by StockSim, are presented in the table below, revealing distinct trading patterns and strategic behaviors between the two LLMs.
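For context, a run like this is specified through configuration rather than code; the dictionary below is a hypothetical illustration of the parameters involved. The key names do not necessarily match StockSim's configuration schema, and the initial capital is inferred from the reported ROI and final portfolio values.

```python
# Hypothetical experiment configuration (keys are illustrative, not StockSim's schema).
experiment = {
    "symbol": "NVDA",
    "start_date": "2025-04-28",
    "end_date": "2025-06-28",
    "execution_mode": "candlestick",   # daily candles; orders placed before market open
    "decision_frequency": "1d",
    "agents": [
        {"role": "llm_trader", "model": "gpt-o4-mini", "prompt": "shared_prompt.txt"},
        {"role": "llm_trader", "model": "gpt-o3", "prompt": "shared_prompt.txt"},
    ],
    "initial_capital": 100_000,        # inferred from reported ROI and final portfolio values
}
```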
| Metric | GPT-o4-mini | GPT-o3 |
|---|---|---|
| ROI (↑) | 0.0734 | 0.2956 |
| Sharpe Ratio (SR) (↑) | 0.1652 | 0.376 |
| Annualized SR (↑) | 2.6218 | 5.9682 |
| Sortino Ratio (↑) | 0.2868 | 1.0587 |
| Win Rate (↑) | 0.6667 | 1.0 |
| Profit Factor (↑) | 2.3691 | 999.0 |
| Max Drawdown (↓) | 0.0306 | 0.0323 |
| Num Trades | 31 | 9 |
| Num Closed Trades | 21 | 6 |
| Total Traded Volume | 931,416.775 | 368,306.25 |
| Average Trade Size | 30,045.70 | 40,922.92 |
| ROIC | 0.0151 | 0.1633 |
| Profit per Trade (↑) | 258.47 | 4,520.13 |
| Last Portfolio Value (↑) | 107,338.30 | 129,556.75 |
| Realized P&L | 5,427.80 | 27,120.75 |
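For reference, the risk-adjusted metrics above follow standard definitions; the sketch below gives typical formulas under the assumption of daily returns and a 252-day annualization factor (which is consistent with the reported SR and Annualized SR values), though it is not necessarily StockSim's exact implementation.

```python
import math
import statistics


def sharpe_ratio(returns: list, risk_free: float = 0.0) -> float:
    """Per-period Sharpe ratio: mean excess return over its standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


def annualized_sharpe(returns: list, periods_per_year: int = 252) -> float:
    # Daily SR scaled by sqrt(252); matches the table (0.1652 * sqrt(252) ≈ 2.62).
    return sharpe_ratio(returns) * math.sqrt(periods_per_year)


def sortino_ratio(returns: list, target: float = 0.0) -> float:
    """Like Sharpe, but penalizes only downside deviation below the target return."""
    downside = [min(0.0, r - target) for r in returns]
    downside_dev = math.sqrt(sum(d * d for d in downside) / len(returns))
    if downside_dev == 0:
        return float("inf")
    return (statistics.mean(returns) - target) / downside_dev


def profit_factor(trade_pnls: list) -> float:
    """Gross profit divided by gross loss; with no losing trades the ratio is
    unbounded (the 999.0 reported for GPT-o3 suggests a cap in that case)."""
    gains = sum(p for p in trade_pnls if p > 0)
    losses = -sum(p for p in trade_pnls if p < 0)
    return gains / losses if losses > 0 else float("inf")
```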
Metrics such as ROI, Profit per Trade, and Profit Factor highlight that GPT-o3 pursues a more selective trading strategy, characterized by fewer, larger positions taken with higher conviction, resulting in greater profitability and reduced downside risk, as demonstrated by its superior Sortino Ratio and perfect Win Rate. Conversely, GPT-o4-mini exhibits a more active trading style, evidenced by its higher number of trades and greater traded volume, indicating frequent market interactions but lower profit efficiency per transaction. The contrasting Sharpe Ratio and Annualized Sharpe Ratio further underscore GPT-o3's superior ability to maintain consistent, risk-adjusted returns over time, while GPT-o4-mini's lower metrics suggest that its strategy involves more frequent but less decisive market positions. Overall, StockSim's evaluation results effectively capture and distinguish the underlying strategic differences between the two LLMs, allowing clear interpretation of their respective trading behaviors. Importantly, we obtain these results without writing any code, paving the way for exploring more LLM-driven trading strategies.