Releasing software and datasets is part of the NET4AI project outcomes: it contributes to the dissemination of our work and supports further developments by providing reusable tools to the community.

The published code aims to facilitate experimentation, reproducibility, and validation of the proposed approaches. It also enables other researchers and practitioners to build upon our contributions and adapt them to their own use cases.

We report here the main material open-sourced during the project.

SimAI (Fork)

We release the open-source code of SimAI, available here: https://github.com/NetMeasurements-Team/SimAI

This repository is a fork of the original SimAI project and is maintained by our team. It addresses issues encountered in the original version and includes new features required to use the simulator within our project scope.

SimAI [1] is a unified simulator designed to precisely and efficiently model large-scale LLM training. As shown in Figure 1, its architecture combines three main components: the SimAI Workload Generator, which generates realistic workloads by hijacking mainstream training frameworks such as Megatron [2] and DeepSpeed [3]; the SimAI Computation Simulator, which simulates computation at the kernel level; and the SimAI Communication Simulator, which simulates collective communication by reproducing key NCCL behaviors.

Figure 1: The SimAI architecture
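
To make the split between the three components concrete, the sketch below mimics the pipeline at a purely conceptual level: a workload description is turned into a sequence of compute kernels and collective operations, whose durations are then accumulated into an end-to-end iteration time. All names, formulas, and numbers are hypothetical illustrations and do not correspond to the actual SimAI API.

```python
# Conceptual sketch of the SimAI pipeline (hypothetical names, not the real API).
from dataclasses import dataclass

@dataclass
class Op:
    kind: str        # "compute" or "collective"
    name: str        # e.g. "gemm", "all_reduce"
    size_bytes: int = 0
    flops: float = 0.0

def workload_generator(num_layers: int) -> list[Op]:
    """Stand-in for the Workload Generator: emit per-layer ops of a training step."""
    ops = []
    for _ in range(num_layers):
        ops.append(Op("compute", "gemm", flops=2e12))
        ops.append(Op("collective", "all_reduce", size_bytes=512 * 2**20))
    return ops

def computation_simulator(op: Op, gpu_flops: float = 300e12) -> float:
    """Stand-in for the Computation Simulator: kernel time from a simple throughput model."""
    return op.flops / gpu_flops

def communication_simulator(op: Op, bus_bw: float = 100e9, num_gpus: int = 8) -> float:
    """Stand-in for the Communication Simulator: ring all-reduce time estimate."""
    traffic = 2 * (num_gpus - 1) / num_gpus * op.size_bytes
    return traffic / bus_bw

def simulate_iteration(ops: list[Op]) -> float:
    """Accumulate compute and communication times over the generated workload."""
    total = 0.0
    for op in ops:
        if op.kind == "compute":
            total += computation_simulator(op)
        else:
            total += communication_simulator(op)
    return total

if __name__ == "__main__":
    ops = workload_generator(num_layers=32)
    print(f"Estimated iteration time: {simulate_iteration(ops) * 1e3:.1f} ms")
```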

For scalability, SimAI-CM uses UNISON (built directly on NS-3) to spread the network simulation across multiple CPU cores, combined with lock-free global-context sharing to avoid thread-synchronization overhead. This NS-3 foundation delivers a 23× speedup and supports accurate simulations of clusters with 1,000+ GPUs.
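
The toy sketch below illustrates the general idea behind this parallelization: the simulated topology is partitioned across workers that read a shared, immutable global context instead of synchronizing on locks. It is not UNISON or NS-3 code; every name and value is a made-up placeholder.

```python
# Toy illustration of partitioned parallel simulation over a read-only context
# (conceptual only, not UNISON/NS-3 code).
from concurrent.futures import ProcessPoolExecutor

# Read-only global context shared by all workers: no locks are needed because
# nothing here is mutated during the simulation.
GLOBAL_CONTEXT = {
    "link_bandwidth_gbps": 400,
    "switch_latency_us": 1.0,
}

def simulate_partition(node_ids: list[int]) -> float:
    """Simulate one partition of the topology and return its slowest-node time."""
    bw = GLOBAL_CONTEXT["link_bandwidth_gbps"] * 1e9 / 8   # bytes per second
    lat = GLOBAL_CONTEXT["switch_latency_us"] * 1e-6       # seconds
    # Hypothetical per-node traffic: 1 GiB each, plus one switch hop.
    return max((2**30 / bw) + lat for _ in node_ids)

def parallel_simulation(num_nodes: int, num_workers: int = 4) -> float:
    """Split nodes across workers; the slowest partition bounds the result."""
    partitions = [list(range(i, num_nodes, num_workers)) for i in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return max(pool.map(simulate_partition, partitions))

if __name__ == "__main__":
    print(f"Simulated completion time: {parallel_simulation(1024) * 1e3:.2f} ms")
```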

By relying on simulation, SimAI allows rapid experimentation without requiring access to large-scale computing infrastructure. This makes it a practical tool to prototype, validate, and compare approaches under controlled and reproducible conditions.

[1] Wang, Xizheng, et al. “SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.” 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), 2025.

[2] Shoeybi, Mohammad, et al. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv preprint arXiv:1909.08053 (2019).

[3] Rasley, Jeff, et al. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with over 100 Billion Parameters.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.