Carbon Footprint Of Training GPT-3 And Large Language Models

Introduction – Large Language Models

Today we’re interested in the carbon footprint of training GPT-3. By now most of the world knows about GPT-3 and its cousin ChatGPT from OpenAI, the for-profit research lab that created them and made them available to any user. GPT-3 and ChatGPT are two examples of a general class of software architecture known as “large language models”.

Large language models are a type of natural language processing (NLP) model that uses machine learning to generate text with a high degree of fluency. They are trained on massive amounts of data and are used in many applications such as chatbots, machine translation, and text generation.

Transformer architecture at the heart of GPT-3 and other large language models (LLMs). The left side is the encoder; the right side is the decoder.

The training process involves feeding the model text and then adjusting the model’s weights based on the results. Training is computationally very expensive, and that expense is the reason we want to compute the carbon footprint of training GPT-3.

LLMs Need To Be Trained Computationally

LLMs are built on transformer architectures, which ingest massive amounts of text (e.g. Common Crawl, WebText2) and encode words as position-aware embeddings in order to learn hidden, high-order correlations between words.

During training, the LLM is given a large corpus of text broken into “inputs” and “outputs”, where each input is a piece of text and the corresponding output is the text that follows it. The LLM takes the inputs and masked pieces of the outputs and trains with standard neural network backpropagation, optimizing to generate the outputs when given the inputs.
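As a toy illustration of the input/output split described above, here is a minimal sketch of how an autoregressive model pairs each piece of preceding text with the token it should predict next. This uses whitespace splitting purely for illustration; real LLMs use subword tokenizers.

```python
# Toy illustration of "inputs" and "outputs" for autoregressive training:
# each position's target is simply the next token in the sequence.
text = "the cat sat on the mat"
tokens = text.split()  # real LLMs use subword tokenizers, not whitespace

inputs = tokens[:-1]   # text that precedes each prediction point
targets = tokens[1:]   # the token the model is trained to generate next

for ctx, nxt in zip(inputs, targets):
    print(f"given ...{ctx!r} -> predict {nxt!r}")
```

During training, backpropagation adjusts the weights so the model assigns high probability to each target given its context.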

Training GPT-3 means that 175 billion parameters must be “tuned” or “adjusted” so the output texts are what humans expect.

This is supposed to be done only once. But in reality, during testing and development of the software, that “once” may happen over and over again. End users also have the option to “fine-tune” the models, which means more computation to adjust the parameters.

After many rounds of training, the result is a transformer neural network with billions of optimized parameter values, able not only to “autocomplete” text, but to write entire paragraphs coherently and respond in depth to questions, context, and instructions.

GPT-3 is only the latest in a long line of language models, and the field continues to explode in novelty and in frantic attempts to capitalize on the technology.

GPT-3 and ChatGPT are certainly the most popular but are not the largest. For example, the open-source BLOOM and Google Research’s proprietary PaLM weigh in at 176 billion and 540 billion parameters, respectively.

Training An LLM Is Carbon Intensive

First, the data sets are massive. Common Crawl is one such training data set, a collection of text obtained by crawling the accessible internet. The October 2022 version contains 380 TiB (418 terabytes) spanning 3.15 billion web pages.

Second, the parameter count in the billions matters because the more parameters there are, the longer it takes to train the model on terabyte-scale data sets.

Combined, this means that hundreds of thousands of hours of compute power are needed, which is why the carbon footprint of training GPT-3 is such a hot topic. The compute takes place at large cloud data centers distributed throughout the world.

As readers of this site know, the electric grid carbon intensity differs from place to place depending on its mix of energy sources. Solar, wind, hydro and nuclear power are ultra low in carbon intensity, whereas fossil fuels from natural gas to coal to oil are high carbon intensity.

Studies Have Examined And Proposed Ways To Quantify CO2 Impact

At least four studies have been released or presented on the carbon footprint of training LLMs. We will mention two of them here. Bannour et al. 2021 reviewed six tools that purport to measure the CO2 impact of LLMs within a common framework. One of the tools they cite is from Lacoste et al. 2019, who two years earlier had published a software tool to compute the carbon impact of machine learning models.

Estimating GPT-3 CO2 Footprint Due To Training

We used the Lacoste et al. tool. We also found a reference for the amount of compute needed to train GPT-3, which Narayanan et al. 2021 report in their preprint:

Let us consider the GPT-3 model with P = 175 billion parameters as an example. This model was trained on T = 300 billion tokens. On n = 1024 A100 GPUs using batch size 1536, we achieve X = 140 teraFLOP/s per GPU. As a result, the time required to train this model is 34 days.

Narayanan et al 2021 arXiv
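The quoted 34 days follows from the paper’s back-of-envelope relation, where end-to-end training time is approximately 8TP/(nX). A quick check of the arithmetic:

```python
# Reproducing the Narayanan et al back-of-envelope training time:
# time ≈ 8 * T * P / (n * X), where the factor of 8 accounts for the
# forward, backward, and activation-recomputation FLOPs per parameter per token.
P = 175e9    # parameters
T = 300e9    # training tokens
n = 1024     # A100 GPUs
X = 140e12   # achieved FLOP/s per GPU

seconds = 8 * T * P / (n * X)
days = seconds / 86400
print(f"{days:.0f} days")  # 34 days, matching the quoted figure
```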

Bearing in mind that carbon footprints depend heavily on the local electrical grid, we will explore the footprint across more than one data center. First, there are two key numbers given above: n = 1024 GPUs over 34 days is equivalent to 835,584 hours of A100 GPU compute time. Second, we need to estimate the CO2 footprint for different cloud providers across their various data centers.
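The GPU-hours figure, and the way a tool like Lacoste et al.’s converts compute time into kg CO2, can be sketched as follows. The per-GPU power draw, PUE, and grid intensities below are illustrative assumptions for this sketch, not measured values for any specific data center.

```python
# GPU-hours from the Narayanan et al figures
gpus, days = 1024, 34
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")  # 835,584 GPU-hours

# Converting compute time to emissions (all constants below are assumptions):
A100_KW = 0.4   # assumed average draw per A100, in kW
PUE = 1.1       # assumed data-center power usage effectiveness
energy_kwh = gpu_hours * A100_KW * PUE

# Illustrative grid carbon intensities in kg CO2 per kWh
for region, intensity in {"hydro-heavy grid": 0.02, "coal-heavy grid": 0.8}.items():
    print(region, round(energy_kwh * intensity), "kg CO2")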

Carbon footprint of training GPT-3 in different server regions for 3 providers: Google (GCP), Amazon (AWS) and Microsoft (Azure).

Results Show 50x Differences

The differences are astonishing. At the far left are the lowest-carbon servers, which expend about 4,000 kg CO2 to train the entire model. Two of the lowest three are in Canadian regions where hydroelectric power is dominant, and one is in Switzerland, which has a strong carbon-neutrality initiative underway. At the far right, the highest three include two South African locations and one in India. Both countries are known to rely on carbon-intensive electric grids with a high proportion of coal and oil sources.

As an aside, as we’ve mentioned on this site, in some regions of the world where the grid is highly carbon intensive, total energy use is so low that these areas are not the problem for global carbon emissions. The big emitters are the highly industrialized, high-GDP countries with large populations: the US, China, and parts of Europe.

We present the numbers from the chart in the table below.

Tabular Form Shows Numbers More Clearly

Cloud | Region | kg CO2 (lowest to highest)
GCP | europe-west6 | 4,178
Azure | Canada East | 4,178
AWS | Canada (Central) | 4,178
GCP | northamerica-northeast1 | 6,267
AWS | EU (Stockholm) | 10,445
AWS | South America (São Paulo) | 10,445
Azure | Canada Central | 14,623
Azure | France Central | 20,890
Azure | France South | 20,890
AWS | EU (Paris) | 20,890
GCP | southamerica-east1 | 41,779
Azure | Brazil South | 41,779
GCP | europe-north1 | 43,868
GCP | us-west2 | 50,135
Azure | West US | 50,135
AWS | US West (North California) | 50,135
GCP | europe-west1 | 56,402
GCP | us-west3 | 56,402
GCP | us-west1 | 62,669
Azure | West Central US | 62,669
Azure | West US 2 | 62,669
AWS | AWS GovCloud (US) | 62,669
AWS | US West (Oregon) | 62,669
GCP | us-east1 | 77,292
GCP | us-east4 | 77,292
Azure | East US | 77,292
Azure | East US 2 | 77,292
AWS | US East (North Virginia) | 77,292
GCP | asia-southeast1 | 87,736
Azure | Southeast Asia | 87,736
AWS | Asia Pacific (Singapore) | 87,736
Azure | South Central US | 96,092
GCP | asia-northeast1 | 108,626
GCP | asia-northeast2 | 108,626
Azure | Japan East | 108,626
Azure | Japan West | 108,626
Azure | Korea Central | 108,626
Azure | Korea South | 108,626
AWS | Asia Pacific (Osaka-Local) | 108,626
AWS | Asia Pacific (Seoul) | 108,626
AWS | Asia Pacific (Tokyo) | 108,626
GCP | asia-east1 | 116,982
GCP | europe-west4 | 119,071
GCP | us-central1 | 119,071
Azure | North Central US | 119,071
Azure | West Europe | 119,071
AWS | AWS GovCloud (US-East) | 119,071
AWS | US East (Ohio) | 119,071
GCP | europe-west3 | 127,427
AWS | EU (Frankfurt) | 127,427
GCP | europe-west2 | 129,516
Azure | North Europe | 129,516
Azure | UK South | 129,516
Azure | UK West | 129,516
AWS | EU (Ireland) | 129,516
AWS | EU (London) | 129,516
AWS | China (Beijing) | 142,049
AWS | China (Ningxia) | 142,049
GCP | asia-east2 | 146,227
Azure | East Asia | 146,227
AWS | Asia Pacific (Hong Kong) | 146,227
Azure | Central US | 154,583
GCP | australia-southeast1 | 167,117
Azure | Australia East | 167,117
AWS | Asia Pacific (Sydney) | 167,117
Azure | Australia Southeast | 169,206
Azure | Australia Central | 188,006
Azure | Australia Central 2 | 188,006
GCP | asia-south1 | 192,184
Azure | Central India | 192,184
Azure | South India | 192,184
Azure | West India | 192,184
AWS | Asia Pacific (Mumbai) | 192,184
Azure | South Africa North | 210,985
Azure | South Africa West | 210,985
Carbon footprint of training GPT-3 and large language models, in kg CO2
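The roughly 50x spread in the section heading comes directly from the extremes of the table:

```python
# The ~50x spread between the cleanest and dirtiest regions in the table
lowest = 4_178     # kg CO2, e.g. GCP europe-west6
highest = 210_985  # kg CO2, e.g. Azure South Africa West
print(round(highest / lowest))  # 50
```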


Ok, now that we know GPT-3 and LLMs are very carbon intensive to train and set up, let’s put it into the right context. The average American emits about 15 tons of CO2 per year; the average person worldwide emits about 4–5 tons per year. The most carbon-intensive figures for training GPT-3 hover around 200,000 kg CO2, which translates to 200 tons of CO2, or 13 Americans’ emissions for one year, or 50 (non-American) people’s emissions for one year.

The least carbon-intensive figures are around 4,000 kg CO2. Training GPT-3 on a clean grid is therefore only about a quarter of one American’s, or one non-American person’s, worth of emissions for one year.

A car emits about 5 tons of CO2 per year’s worth of driving. Training GPT-3 once on a clean grid is therefore roughly equal to driving one car for a year, while training on a dirty grid is equal to driving 40 cars for a year.
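The equivalences used in this section reduce to a few lines of arithmetic (all figures in metric tons of CO2, taken from the estimates above):

```python
# Checking the per-person and per-car equivalences used in this section
worst_training = 200     # dirtiest-grid training run, ~200,000 kg CO2
best_training = 4        # cleanest-grid training run, ~4,000 kg CO2
american_per_year = 15   # tons CO2 per average American per year
world_avg_per_year = 4   # tons CO2 per average person per year
car_per_year = 5         # tons CO2 per car per year of driving

print(round(worst_training / american_per_year))   # ≈ 13 Americans' annual emissions
print(round(worst_training / world_avg_per_year))  # 50 people's annual emissions
print(round(worst_training / car_per_year))        # 40 cars driven for one year
print(round(best_training / american_per_year, 2)) # ≈ a quarter of one American's year
```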

Let’s summarize the comparison of various carbon emitters to GPT-3 once.

Emitter | Equivalent number to training GPT-3 once
Plane ride | 345 flights across the US
Car | 40 cars driven for one year
Person | 13 Americans’ annual emissions, or 50 non-Americans’ annual emissions
Equivalent emissions to training GPT-3 and LLMs once

Now, there are billions of people on the planet, and each person emitting around 4 tons of CO2 per year is what causes the entire planet to generate gigatons of emissions.

How many GPT-3 training runs would it take to reach the gigaton range? The answer is 5,000,000. Let’s estimate whether we are anywhere near that number. There might be 100 large language models out there, including BERT, GPT-2, PaLM, BLOOM, etc., and let’s say each one was trained 100 times to get to where it is today. That would be 10,000 training runs, or 2 million tons of CO2 emitted. That’s about 500 times short of the gigaton range.
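That order-of-magnitude estimate, which works out to a factor of about 500 below a gigaton, can be checked directly:

```python
# Order-of-magnitude check: how far are LLMs from the gigaton range?
GIGATON_KG = 1e12          # 1 gigaton = 1e9 tons = 1e12 kg
per_training_kg = 200_000  # worst-case GPT-3 training run

trainings_for_gigaton = GIGATON_KG / per_training_kg
print(f"{trainings_for_gigaton:,.0f} trainings")  # 5,000,000 trainings

# Rough current total: ~100 models x ~100 training runs each
current_kg = 100 * 100 * per_training_kg
print(f"{current_kg / 1e3:,.0f} tons")        # 2,000,000 tons
print(f"{GIGATON_KG / current_kg:.0f}x shy")  # 500x shy of a gigaton
```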

Basically, LLMs’ contribution to carbon emissions at the moment is hundreds of times smaller than that of the other big three contributors: food, transportation, and heating and cooling.

Things may change, however. If language models grow into the trillion-parameter range and training efficiency doesn’t improve, then we may see the world’s total LLM carbon emission cost slowly ramp up.

LLMs Are Really Good At Modeling Language

One question readers may have is: what use are these LLMs that warrants such expensive calculations? LLMs are already deployed for assistance with programming. GitHub Copilot works with programmers to generate code in an automated way, and an open-source version, GPT-Code-Clippy, carries out the same task. Beyond programming, LLMs are so generally useful that they can do the following:

1. Machine Translation: Large language models can be used to accurately translate text from one language to another.

2. Speech Recognition: Large language models can be used to accurately recognize and transcribe speech.

3. Chatbots: Large language models can be used to develop more sophisticated chatbots that are capable of understanding natural language and carrying out conversations with humans.

4. Text Generation: Large language models can be used to generate coherent text that matches the style and content of the original source text.

5. Question Answering: Large language models can be used to answer questions based on a given context.

6. Image Captioning: Large language models can be used to generate captions for images.

Conclusions – Carbon Footprint Of Training GPT-3

We’re not saying that the carbon footprint of training GPT-3 doesn’t matter. Rather, many things matter, and if we had to prioritize, it makes the most sense to start by reducing emissions from heating and cooling, food, and transportation. Starting with LLMs would not be productive. LLM training will indeed have an effect on carbon emissions, but for now it’s disproportionately small compared to the factors that dominate.

Staff Writer