Written by Gareth Simono, Founder and CEO of Agentik {OS}. Full-stack developer and AI architect with years of experience shipping production applications across SaaS, mobile, and enterprise platforms. Gareth orchestrates 267 specialized AI agents to deliver production software 10x faster than traditional development teams.
Training GPT-4 consumed the energy of 120 US households for a year. Inference now dominates. Here is what you can do about it and why incentives align.

Training GPT-4 consumed roughly the equivalent of what 120 US households use in electricity over an entire year. One training run. One model. Two years ago.
Numbers have gotten significantly bigger since then. More parameters. More training data. More compute. And now millions of applications running inference 24 hours a day, 7 days a week, across every industry.
The conversation around AI and the environment is dominated by two equally unhelpful positions. "AI is destroying the planet" and "technology will solve it, stop worrying." Both are lazy. The reality is more nuanced, more quantifiable, and significantly more actionable than either position suggests.
I am going to give you the real numbers, the honest tradeoffs, and the specific actions that actually move the needle. The good news: the sustainability case and the economic case point in the same direction. Green AI is also cheap AI.
Let me start with what we actually know, because a lot of AI sustainability discourse conflates different parts of the problem.
Training costs get most of the headlines. They are large and declining per unit of performance. Newer architectures achieve equivalent capabilities with significantly less compute. GPT-3 required roughly 1,287 MWh for training. More recent models of comparable capability require substantially less due to architectural improvements, better data curation, and improved training techniques.
But training is a one-time cost. The number that actually matters at scale is inference.
Inference dominates total consumption. A model that took 1,000 MWh to train might run 10 million inference requests per day. At 0.001 kWh per request (a rough average for a small model), that is 10,000 kWh per day, or roughly 3.65 GWh per year. After about 100 days of operation, inference has already consumed more energy than training did.
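The break-even arithmetic is simple enough to sanity-check directly. The request volume and per-request energy below are the illustrative figures from this example, not measured values:

```python
# Back-of-envelope: when does cumulative inference energy exceed training energy?
# Figures are the illustrative numbers from the text, not measurements.
TRAINING_KWH = 1_000 * 1_000     # 1,000 MWh expressed in kWh
REQUESTS_PER_DAY = 10_000_000    # 10 million inference requests per day
KWH_PER_REQUEST = 0.001          # rough average for a small model

daily_inference_kwh = REQUESTS_PER_DAY * KWH_PER_REQUEST  # 10,000 kWh/day
breakeven_days = TRAINING_KWH / daily_inference_kwh

print(f"Inference per day: {daily_inference_kwh:,.0f} kWh")
print(f"Inference passes training energy after ~{breakeven_days:.0f} days")
```

At these rates the annual inference bill is roughly 3.65 GWh, more than three times the training cost, every year.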
At the scale of the entire AI industry, inference consumption dwarfs training consumption by a significant multiple. Training gets the headlines because it is visible and measurable. Inference is distributed and largely invisible.
The implication: your organization's AI energy consumption is dominated by your inference footprint, not your training footprint. And inference is something you directly control.
The industry's sustainability attention is focused on training efficiency because that is what makes good press releases. The actual leverage is in inference optimization, which is quieter but more impactful for most organizations.
Before discussing solutions, let me make the problem concrete with rough calculations that apply to a typical mid-size AI application.
Assume you are running an AI-powered feature that processes 100,000 requests per day. Each request uses a mid-size language model (roughly 70B parameters, served on GPU infrastructure) with an average input of 2,000 tokens and output of 500 tokens.
Rough energy calculation:
The resulting annual footprint is comparable to the emissions of 3 to 10 passenger vehicles driven for a year. Not catastrophic at 100,000 daily requests. Scale to 10 million daily users and the numbers become a meaningful organizational impact.
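The arithmetic behind that comparison can be sketched directly. The per-request energy, grid-intensity, and vehicle-emissions figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope annual footprint for the workload above.
# All constants are illustrative assumptions, not measurements.
REQUESTS_PER_DAY = 100_000
KWH_PER_REQUEST = 0.003          # assumed for a ~70B model at ~2,500 tokens/request
KG_CO2_PER_VEHICLE_YEAR = 4_600  # rough EPA figure for one passenger vehicle

annual_kwh = REQUESTS_PER_DAY * KWH_PER_REQUEST * 365      # ~110,000 kWh/year

# Grid carbon intensity varies widely; bracket with a clean and an average grid.
vehicles_low = annual_kwh * 0.15 / KG_CO2_PER_VEHICLE_YEAR   # clean grid
vehicles_high = annual_kwh * 0.40 / KG_CO2_PER_VEHICLE_YEAR  # average grid

print(f"~{annual_kwh / 1000:.0f} MWh/yr, "
      f"{vehicles_low:.1f}-{vehicles_high:.1f} vehicle-equivalents")
```

The bracketing matters: the same workload lands at very different points in the 3-10 vehicle range depending on the grid it runs on, which is why region selection (below) is a real lever.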
For context:
These comparisons are rough. The exact numbers depend heavily on hardware efficiency, data center power usage effectiveness (PUE), and the carbon intensity of the local grid. But the order of magnitude is useful for proportionality.
The single most impactful sustainability action you can take is using the smallest model that meets your quality bar.
This sounds obvious. It is shockingly underimplemented.
I regularly encounter organizations using 70B or 175B parameter models for tasks that a 7B model handles with equivalent quality. They chose the large model because larger models are associated with quality, and they never systematically evaluated whether the large model's quality advantage was meaningful for their specific use case.
The energy difference is not marginal. A 70B parameter model uses roughly 10x the compute of a 7B model per request. If the 7B model delivers 95% of the quality for your use case, you are using 10x the energy for a 5% quality gain that your users may not even be able to perceive.
Evaluation approach:
For model selection in production, this framework also produces significant cost savings alongside environmental benefits. The sustainability and economics arguments align perfectly here.
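One way to operationalize right-sizing is a routing layer that sends each request to the smallest model whose quality has cleared the bar for that task type. A minimal sketch, where the model names, tiers, and the keyword classifier are illustrative placeholders (a production router would use an evaluated classifier and your own benchmark results):

```python
# Route each request to the smallest model cleared for its task type.
# Model names and the keyword classifier are illustrative placeholders.
TASK_MODEL_TIERS = {
    "classification": "small-7b",   # a 7B model handles label/extract tasks well
    "extraction": "small-7b",
    "summarization": "mid-13b",
    "reasoning": "large-70b",       # reserve the big model for genuinely hard tasks
}

def classify_task(prompt: str) -> str:
    """Toy keyword classifier; production systems would use an evaluated model."""
    p = prompt.lower()
    if "summarize" in p:
        return "summarization"
    if "classify" in p or "label" in p:
        return "classification"
    if "extract" in p:
        return "extraction"
    return "reasoning"  # default to the most capable tier

def select_model(prompt: str) -> str:
    return TASK_MODEL_TIERS[classify_task(prompt)]
```

If most traffic is classification, extraction, and summarization, most requests never touch the 70B model, and the ~10x per-request energy (and cost) gap compounds across every one of them.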
Quantization reduces model precision from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. The mathematical information in the model weights is compressed.
The quality impact is smaller than you would expect. On most benchmarks, a 4-bit quantized model scores 2-5% lower than the full-precision version. For most production applications, this difference is imperceptible to users.
The efficiency gains are substantial:
For a model that is already right-sized for your use case, quantization is a straightforward additional optimization. The tooling is mature. llama.cpp, GGUF format, and HuggingFace Transformers all support quantized inference with good documentation.
```python
# Example: Loading a quantized model with Transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for better compression
    bnb_4bit_quant_type="nf4",       # NormalFloat4 for better accuracy
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto",
)

# Energy: ~40% less than full precision equivalent
# Quality: ~2-4% lower on benchmarks, often imperceptible in practice
```

Caching is the highest-leverage optimization most organizations underinvest in.
Many AI applications answer the same questions repeatedly. Customer support chatbots field the same ten questions over and over. Code assistants provide the same documentation lookups. Search applications process similar queries constantly.
Every time you run the same input through the model again, you consume the same energy as the first time. Semantic caching lets you serve cached results for similar (not just identical) queries.
The implementation:
Cache hit rates vary dramatically by application. A support chatbot might see 60-80% cache hit rates, essentially eliminating those requests' energy cost. A creative writing assistant might see 5-10% hit rates. Know your application's cache potential before investing heavily in the infrastructure.
Not all AI inference is latency-sensitive. Batch processing, background jobs, and non-interactive workloads can be scheduled strategically.
Scheduling AI batch jobs during off-peak grid hours reduces carbon intensity even when total energy consumption is identical. The carbon impact of a kWh varies significantly based on when and where it is consumed. Grid carbon intensity is typically lower at night when renewable generation exceeds demand and higher during peak afternoon hours when fossil fuel peakers run.
For organizations with flexibility, scheduling batch AI workloads during low-carbon grid periods can reduce effective carbon footprint 20-40% with no change in total energy consumption.
Cloud providers increasingly offer tools for carbon-aware scheduling. Google Cloud's Carbon-Aware SDK, Azure's carbon-aware features, and similar tools in other clouds allow workloads to automatically run when and where grid carbon intensity is lowest.
```python
# Example: Carbon-aware batch scheduling
# (BatchJob and schedule_at are application-level placeholders)
from carbon_aware_sdk import CarbonAwareClient

async def schedule_carbon_aware_batch(batch_job: BatchJob) -> None:
    client = CarbonAwareClient()

    # Find lowest-carbon window in next 24 hours
    optimal_window = await client.get_optimal_window(
        region="us-central1",
        duration_minutes=60,
        window_hours=24,
    )

    # Schedule job to run in optimal window
    await schedule_at(batch_job, optimal_window.start_time)
```

The carbon intensity of AI inference depends heavily on where the computation happens.
Different cloud regions have dramatically different grid carbon intensities. Google Cloud's us-central1 (Iowa) runs primarily on renewable energy. us-east4 (Virginia) has higher carbon intensity from the regional grid mix. The difference can be 5-10x in carbon per kWh.
For latency-tolerant workloads, routing to low-carbon regions is straightforward. For latency-sensitive workloads, it requires more careful architecture but is still achievable through geo-replication that prioritizes low-carbon regions.
Hardware efficiency also matters. Newer GPU generations improve energy efficiency significantly. An H100 delivers roughly 3-4x more AI compute per watt than an A100. Provider upgrade cycles vary, but optimizing for efficient hardware is worth the operational complexity.
| Region (Google Cloud) | Approx. Carbon Intensity | Renewable Mix |
|---|---|---|
| us-central1 (Iowa) | Very Low | >90% renewable |
| europe-west4 (Netherlands) | Low | ~70% renewable |
| us-east4 (Virginia) | Medium | ~30% renewable |
| asia-southeast1 (Singapore) | High | ~5% renewable |
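For latency-tolerant batch work, the routing logic can be as simple as picking the allowed region with the lowest carbon intensity. A sketch using illustrative intensity figures (gCO2/kWh) roughly matching the table above; production systems would query a live source such as Electricity Maps or WattTime rather than hardcoding values:

```python
# Pick the lowest-carbon region a workload is allowed to run in.
# Intensity figures (gCO2/kWh) are illustrative, not live grid data.
REGION_CARBON_INTENSITY = {
    "us-central1": 50,        # Iowa, >90% renewable
    "europe-west4": 150,      # Netherlands, ~70% renewable
    "us-east4": 350,          # Virginia, ~30% renewable
    "asia-southeast1": 500,   # Singapore, ~5% renewable
}

def pick_region(allowed: list[str]) -> str:
    """Choose the allowed region with the lowest grid carbon intensity."""
    return min(allowed, key=REGION_CARBON_INTENSITY.__getitem__)
```

The `allowed` list is where latency, data residency, and compliance constraints enter; the carbon optimization only operates within them.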
You cannot improve what you do not measure.
Build carbon tracking into your AI observability stack alongside cost and performance metrics. When engineers see carbon cost alongside dollar cost, they make different decisions. Not because they are activists. Because visible waste bothers problem solvers.
The measurement approach:
Target metrics:
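As a starting point, per-request energy and carbon can be estimated from GPU power draw, request latency, PUE, and grid intensity, and emitted alongside latency and dollar-cost metrics. All constants here are illustrative assumptions; real deployments would pull power from NVML/DCGM and grid intensity from a live carbon-data API:

```python
# Estimate per-request energy and carbon from GPU power draw and latency.
# All defaults are illustrative assumptions, not measured values.
def request_carbon_grams(
    gpu_power_watts: float,            # average draw while serving the request
    latency_seconds: float,
    pue: float = 1.2,                  # data center power usage effectiveness
    grid_g_co2_per_kwh: float = 400.0, # grid carbon intensity
) -> float:
    # watts * seconds = joules; 3,600,000 J = 1 kWh; PUE adds facility overhead
    energy_kwh = gpu_power_watts * latency_seconds / 3_600_000 * pue
    return energy_kwh * grid_g_co2_per_kwh

# e.g. a 2-second request on a GPU drawing 400 W:
print(f"{request_carbon_grams(400, 2.0):.3f} gCO2e per request")
```

Emitting this as a per-request metric is what makes the carbon cost visible next to the dollar cost in dashboards, which is the behavioral lever the section describes.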
I want to end with this because it matters for how you make the case internally.
AI sustainability optimizations are not sacrifice. They are intelligent engineering.
Smaller models cost less per request. Quantization reduces infrastructure costs. Caching reduces API bills dramatically. Carbon-aware scheduling often uses cheaper off-peak compute rates. The same optimizations that reduce environmental impact also reduce costs.
This alignment means sustainability is not an ethical add-on that competes with business priorities. It is a business priority that happens to also be ethical.
Frame it that way. Build the monitoring that makes the cost and carbon savings visible. Let the numbers make the argument.
Q: What is the environmental impact of AI?
AI's environmental impact comes primarily from the energy required for model training and inference. Training a large language model can consume as much electricity as 100 US homes use in a year. Inference is far cheaper per request, but because deployed models serve millions of requests continuously, inference dominates total consumption at scale. AI applications can also reduce net environmental impact by optimizing energy use in other industries.
Q: How can businesses reduce AI's environmental footprint?
Reduce AI's footprint by choosing efficient models (smaller models for simple tasks), using cloud providers with renewable energy commitments, implementing caching to reduce redundant API calls, batching requests for efficiency, and selecting inference endpoints in low-carbon regions. Model tiering — using small models for most tasks — is the single biggest lever.
Q: Does AI help or hurt sustainability overall?
On balance, AI is likely net positive for sustainability. While AI infrastructure consumes energy, AI applications optimize energy use across industries — improving manufacturing efficiency, reducing waste in supply chains, optimizing building energy management, and enabling precision agriculture. The energy saved by AI applications typically exceeds the energy AI consumes.
