As of early 2023, emerging Generative Artificial Intelligence (GenAI) systems [65], such as ChatGPT and DALL-E, have reached more than 200 million users. Owing to new ‘transformer’ architectures, this family of AI models can generate original content starting from a given context (e.g., text, images, code). This remarkable capability stems from sophisticated pre-trained machine learning models, also known as foundation models. These models, trained on massive amounts of data, can then be fine-tuned on specific datasets to excel at a particular task. Unlike traditional machine learning models, GenAI models present considerable scaling and power challenges due to their growing complexity and size. Indeed, Large Language Models (LLMs), a particular type of GenAI model, necessitate expensive, highly customized computing facilities equipped with thousands of GPUs for both training and inference. For instance, industry giants such as Meta and Google are designing dedicated clusters [21, 36] to accelerate LLaMA and PaLM training, with the challenging objective of reducing training time from a month to a single day.

Amidst efforts to scale up these models in pursuit of performance improvements, DeepMind [20] has derived predictable relationships between model size, number of training tokens, and compute budget. Their study suggests that 1) for every doubling of model size, the number of training tokens should also be doubled, and 2) parameter count and number of training tokens should both grow with any increase in computational resources. Impressively, state-of-the-art models such as GPT-4 [64] and Claude-3 boast 220 and 500 billion parameters respectively, underpinning their training on an unprecedented scale of data tokens and computational effort; GPT-4 is said to have been trained on 13 trillion tokens over 200k Petaflop-days (PF-days) using 20k GPUs. Meanwhile, open-source counterparts such as LLaMA-2 [49] (and Mistral 7B [25]) also feature a considerable number of parameters: 70 billion (7 billion). LLaMA-2 was trained on 2 trillion tokens over 10k PF-days using 2k GPUs. According to HuggingFace, the 2024 industry target is to train a large model on more than 100 trillion tokens over 5 million PF-days using 110k GPUs. Although few architectural advances have been made since the inception of Transformers in 2017 [52], significant progress continues through scaling in model size, computational power, and data volume.
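To put these figures in perspective, the following back-of-the-envelope sketch relies on the commonly used approximation of roughly 6·N·D training FLOPs for a dense transformer with N parameters trained on D tokens; the rule of thumb is an approximation rather than an exact accounting of any particular run, and the script is purely illustrative.

```python
# Back-of-the-envelope training-compute estimate, assuming the widely used
# approximation C ≈ 6 * N * D FLOPs (N = parameters, D = training tokens).
# The model/token figures are taken from the text; the 6*N*D rule is an
# approximation, not an exact accounting of any particular training run.

PFLOP_DAY = 1e15 * 86_400  # FLOPs in one Petaflop-day (PF-day)


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens


def pf_days(flops: float) -> float:
    """Convert FLOPs to Petaflop-days."""
    return flops / PFLOP_DAY


if __name__ == "__main__":
    c = train_flops(70e9, 2e12)  # LLaMA-2-70B scale: 70B parameters, 2T tokens
    print(f"~{c:.2e} FLOPs, i.e. about {pf_days(c):,.0f} PF-days")

    # DeepMind-style compute-optimal scaling: doubling the parameter count
    # should be accompanied by doubling the token count, i.e. ~4x the compute.
    c2 = train_flops(2 * 70e9, 2 * 2e12)
    print(f"2x parameters and 2x tokens -> {c2 / c:.0f}x compute budget")
```

Under this approximation, the 70-billion-parameter, 2-trillion-token configuration comes out at roughly 10k PF-days, consistent with the figure quoted above for LLaMA-2.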

While large models have demonstrated exceptional capabilities, potentially catalyzing transformative changes across various businesses, they entail substantial construction and maintenance costs in terms of compute and energy [41, 47]. These costs significantly affect their efficiency, calling for research efforts into more resource-efficient models and infrastructures. A promising direction has been the creation of smaller models, such as Meta's LLaMA-2-13B and MistralAI's Mistral-7B, that match the performance of their larger counterparts on various benchmarks (e.g., translation, question answering). Despite this shift, traditional High-Performance Computing (HPC) and AI model training techniques (e.g., compute-intensive only or based solely on data parallelism) are no longer suitable for efficiently training and serving such models.

Hence, there is a notable shift towards implementing advanced model parallelism techniques [46], where such big models are split into smaller, independent chunks distributed across tens of thousands of GPUs or TPUs (Graphics or Tensor Processing Units). Such scaling brings unprecedented challenges. GPUs need to communicate efficiently to make progress, as LLM training is not embarrassingly parallel. The efficiency of the intense communication between GPUs significantly contributes to the Model FLOPs Utilization (MFU), i.e., the ratio of the observed throughput to the theoretical maximum throughput assuming 100% of peak FLOPs (the standard metric to evaluate training efficiency). As latency and congestion quickly build up in large clusters, the network fabric becomes a substantial bottleneck and necessitates novel network acceleration strategies to overcome such limitations. To organize network operations, dedicated collective communication “algorithms” are used to route data in the network and schedule the necessary computation (e.g., a sum in All-Reduce) while optimizing for the latency and bandwidth characteristics of each link in the network. Inefficiencies in collective communication algorithms cause poor network utilization, leaving GPUs idle until transfers complete, and thus reduce the overall efficiency of distributed training and inference.

A highly stable and reliable pipeline is also a must for cost-efficient and sustainable GenAI. As LLMs take a long time to train, exhibit high variability in execution time, and involve a large number of hardware and software resources, failures and stragglers become the norm rather than the exception. Failures are very costly, and it is essential to reduce recovery time and to avoid them. BLOOM [27] and Falcon reported 1-2 GPU failures per week with 400 GPUs and 1 per day on average with 2k GPUs, leading to interruptions and often manual interventions to resume training from checkpoints. A straggler not only affects its own work, but slows down the entire job involving tens of thousands of GPUs. Therefore, AI fabric awareness (e.g., performance, availability, misbehavior) is critical for workload orchestration in GenAI pipelines.
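To make the MFU metric defined above concrete, here is a minimal, framework-agnostic sketch that estimates MFU from an observed token throughput. It again assumes roughly 6·N model FLOPs per token for a dense transformer; the throughput, GPU count and per-GPU peak figures are illustrative placeholders, not measurements.

```python
# Minimal MFU estimate: useful model FLOP/s divided by aggregate peak FLOP/s.
# Assumes ~6 * N FLOPs per training token for a dense transformer (a common
# approximation); all numbers in the example are illustrative placeholders.


def model_flops_utilization(n_params: float,
                            tokens_per_second: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOP/s over the theoretical cluster peak."""
    achieved = 6.0 * n_params * tokens_per_second   # useful model FLOP/s
    peak = num_gpus * peak_flops_per_gpu            # cluster-wide peak FLOP/s
    return achieved / peak


if __name__ == "__main__":
    mfu = model_flops_utilization(
        n_params=70e9,               # 70B-parameter model
        tokens_per_second=350_000,   # observed end-to-end throughput (placeholder)
        num_gpus=2048,
        peak_flops_per_gpu=312e12,   # e.g., A100 BF16 peak, per vendor datasheet
    )
    print(f"MFU ≈ {mfu:.1%}")
```

In this toy configuration the ratio lands around 23%; communication stalls during collectives, stragglers and recovery from failures are precisely the factors that push MFU well below the hardware peak.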

Net4AI Objectives and scientific hypothesis

While model parallelism [62] or model architecture modifications can also improve performance, this project focuses on making the infrastructure used for training and operating GenAI models more efficient. Net4AI has the following scientific objectives:

  • Objective 1: High-accuracy AI fabric awareness. We will design advanced information retrieval and monitoring techniques to increase network awareness about AI workloads and workload awareness about the infrastructure (network and compute). The main objective is to feed traffic scheduling algorithms so that they make more informed decisions, and so that overall fault tolerance, energy efficiency and resource utilization are improved.
  • Objective 2: Optimized traffic scheduling algorithms. We will create efficient collective algorithms, tightly integrated with adaptive routing, traffic control, fast fail-over, and packet scheduling, which are key to handling the concurrent data flows generated by GenAI applications (see the cost-model sketch after this list). We will also develop network- and compute-aware orchestration algorithms for AI pipeline execution to improve completion times, stability, reliability and fairness.
  • Objective 3: Large-scale evaluation and experimentation. We will develop new performance evaluation tools for the research community that go beyond accuracy and training time, enabling end-to-end evaluation of AI workloads from the network and power efficiency points of view. We will use scalable simulation tools as well as cutting-edge experimental infrastructures (SLICES-RI).
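As announced in Objective 2, the sketch below illustrates the latency/bandwidth trade-off that collective algorithms must navigate, using the classical alpha-beta cost model for a ring All-Reduce; the link parameters are illustrative assumptions rather than measurements of any target fabric.

```python
# Ring All-Reduce completion-time estimate under the classical alpha-beta model:
#   T ≈ 2*(p-1)*alpha + (2*(p-1)/p) * (n / beta)
# with p ranks, n bytes to reduce, alpha the per-message latency and beta the
# link bandwidth in bytes/s. All parameter values below are illustrative.


def ring_allreduce_time(p: int, n_bytes: float,
                        alpha: float, beta: float) -> float:
    """Estimated seconds to All-Reduce n_bytes across p ranks on a ring."""
    latency_term = 2 * (p - 1) * alpha
    bandwidth_term = (2 * (p - 1) / p) * (n_bytes / beta)
    return latency_term + bandwidth_term


if __name__ == "__main__":
    # Example: reduce 1 GiB of gradients across 1024 GPUs, with 5 us per hop
    # and 400 Gb/s (~50 GB/s) links.
    t = ring_allreduce_time(p=1024, n_bytes=float(1 << 30),
                            alpha=5e-6, beta=50e9)
    print(f"estimated All-Reduce time ≈ {t * 1e3:.1f} ms")
```

At this scale the latency term alone already contributes about 10 ms per All-Reduce, which is why tighter integration of collectives with routing, congestion control and scheduling, as targeted above, directly translates into higher MFU.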

Contract nb: ANR-24-CE25-5120