LLM ESG Detection: Optimizing Large Language Models for Financial Text Analysis

📌 Key Takeaways

  • Open-source LLMs outperform proprietary models: Llama2 7B achieved 84.9% F1 score, beating GPT-4o Mini by over 7 percentage points when fine-tuned with synthetic data.
  • Synthetic data is a game-changer: Augmenting 212 manually labeled examples to 1,272 instances improved detection accuracy by up to 14.8 percentage points across models.
  • Small models deliver big results: Models with just 2-7 billion parameters match or exceed larger proprietary solutions, making LLM ESG detection accessible to resource-constrained organizations.
  • Cost-effective compliance: Fine-tuning the best-performing model costs approximately $2.62, with inference at just 2.14 seconds per classification — enabling scalable EU taxonomy analysis.
  • Data privacy advantage: Open-source models run locally, addressing the critical need for confidentiality when processing sensitive financial disclosures and sustainability reports.

Why LLM ESG Detection Is Reshaping Financial Analysis

The intersection of artificial intelligence and sustainable finance has reached a critical inflection point. As regulatory bodies worldwide tighten Environmental, Social, and Governance reporting requirements, financial institutions face an unprecedented challenge: analyzing massive volumes of corporate disclosures to determine alignment with sustainability taxonomies. LLM ESG detection represents the most promising solution to this challenge, leveraging the natural language understanding capabilities of large language models to automate what has traditionally been an expensive, slow, and error-prone manual process.

A groundbreaking study published in 2025 by researchers from Sapienza University of Rome and the University of Modena introduces the ESG-Activities benchmark — a novel dataset of 1,325 labeled text segments designed to evaluate how effectively current-generation LLMs can identify text related to specific environmental activities defined in the EU ESG taxonomy. The research demonstrates that fine-tuned open-source models not only match but exceed the performance of expensive proprietary alternatives, achieving F1 scores as high as 84.9% through innovative synthetic data augmentation techniques.

This finding has profound implications for the financial industry. According to the European Commission’s sustainable finance framework, companies are increasingly required to disclose how their activities align with the EU taxonomy. The manual effort required to analyze these disclosures — known as Non-Financial Disclosures (NFDs) — creates a bottleneck that LLM ESG detection can effectively eliminate. For organizations exploring how AI transforms document analysis, understanding this research is essential to staying competitive in the rapidly evolving landscape of AI-powered financial tools.

Understanding the EU ESG Taxonomy Framework

Before diving into the technical details of LLM ESG detection, it is crucial to understand the regulatory framework that makes this technology necessary. The EU ESG taxonomy, established as part of the European Green Deal (its first climate delegated act was adopted in 2021), creates a unified classification system that defines which economic activities qualify as environmentally sustainable. The ambitious goal: a climate-neutral Europe by 2050, aligned with the United Nations Sustainable Development Goals.

The taxonomy organizes activities by industry sector using NACE codes — the European standard for classifying economic activities. For each sector, specific environmental objectives are defined, and companies must demonstrate that their operations contribute substantially to at least one objective without significantly harming others. The transport industry alone encompasses dozens of qualifying activities, from developing zero-emission vessel infrastructure with electric charging and hydrogen refueling capabilities to promoting low-carbon rail transport with CO2-free or bimodal trains.

The compliance challenge is staggering in scope. Financial investors, asset managers, and regulators must evaluate thousands of pages of corporate sustainability reports to determine whether reported activities genuinely align with taxonomy definitions. This process requires deep domain expertise in both the technical specifications of each activity and the nuanced language companies use in their disclosures. Traditional keyword-matching approaches fail because the same activity can be described in vastly different ways across organizations and reporting frameworks. A company might describe its transition to renewable energy using language that never explicitly mentions the taxonomy activity it corresponds to — making intelligent semantic understanding essential.

The European Securities and Markets Authority (ESMA) has emphasized that consistent, reliable ESG classification is fundamental to preventing greenwashing and maintaining investor confidence. This regulatory pressure, combined with the sheer volume of disclosures, makes automated LLM ESG detection not just useful but necessary for the financial ecosystem.

How LLM ESG Detection Works: Architecture and Methods

The core task in LLM ESG detection is binary classification: given a text segment from a Non-Financial Disclosure and an ESG activity description from the taxonomy, determine whether the text pertains to that activity. Formally, the system implements a function f(c, i) that outputs 1 if text segment c relates to activity i, and 0 otherwise. While this sounds straightforward, the semantic complexity of financial language makes it a surprisingly challenging NLP problem.
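
In practice, the decision function f(c, i) is wrapped around an instruction-tuned LLM. Below is a minimal, illustrative sketch; the prompt wording and the label-parsing heuristic are assumptions for demonstration, not the paper's exact implementation.

```python
def build_esg_prompt(chunk: str, activity: str) -> str:
    """Frame f(c, i) as a yes/no question for an instruction-tuned LLM.

    The prompt template here is illustrative; the paper's exact wording
    is not reproduced.
    """
    return (
        "You are an ESG compliance analyst.\n"
        f"EU taxonomy activity: {activity}\n"
        f"Disclosure segment: {chunk}\n"
        "Does the segment pertain to this activity? Answer 1 for yes, 0 for no."
    )


def parse_label(model_output: str) -> int:
    """Map the model's free-text answer back to the binary label f(c, i)."""
    return 1 if model_output.strip().startswith("1") else 0
```

The prompt pairs one text segment with one activity description, so a disclosure with many chunks and many candidate activities is scored pairwise.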

The research team developed ESGQuest, a complete system architecture built on a Retrieval-Augmented Generation (RAG) pipeline. The process works as follows: first, corporate disclosure documents are chunked into manageable text segments and stored in a Pinecone vector database. When evaluating a specific ESG activity, NACE codes filter relevant activities by industry sector. Then, similarity search retrieves the most relevant text chunks for each activity. Finally, a fine-tuned LLM assesses whether each chunk genuinely aligns with the activity description.

For fine-tuning, the researchers employed Low-Rank Adaptation (LoRA), a parameter-efficient technique from the PEFT library that updates only a tiny fraction of model parameters. With a LoRA rank of 8, alpha of 32, and dropout of 0.05, the method updates just 0.08% of parameters for a 3-billion-parameter model and 0.12% for a 7-billion-parameter model. This efficiency is what makes LLM ESG detection practical for organizations without massive compute infrastructure — you can fine-tune a state-of-the-art model on a single A100 GPU in minutes rather than hours.
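
To see where those small percentages come from, note how LoRA adds parameters: each adapted weight matrix W of shape d_out × d_in gains two low-rank factors A (r × d_in) and B (d_out × r). The sketch below does the arithmetic; the target-module shapes and layer count are illustrative assumptions, so the resulting fraction only matches the paper's figures in order of magnitude.

```python
def lora_trainable_params(rank: int, target_shapes: list[tuple[int, int]]) -> int:
    # Each adapted matrix of shape (d_out, d_in) adds rank * (d_in + d_out)
    # trainable parameters via its low-rank factors A and B.
    return sum(rank * (d_in + d_out) for d_out, d_in in target_shapes)


# Hypothetical setup: adapt q_proj and v_proj (both 4096 x 4096) in each of
# 32 decoder layers of a 7B model -- typical LoRA targets, assumed here.
shapes = [(4096, 4096)] * (32 * 2)
added = lora_trainable_params(rank=8, target_shapes=shapes)
print(added)                 # 4194304 adapter parameters
print(f"{added / 7e9:.4%}")  # roughly 0.06% of 7B -- same order as the paper's 0.12%
```

Adapting more modules (e.g., all attention and MLP projections) raises the count, which is one plausible reason the paper reports a somewhat higher fraction.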

The training protocol incorporated 10-fold cross-validation to mitigate overfitting, with validation performed every 10% of total optimization steps. Models were trained using the AdamW optimizer with a learning rate of 3e-4 and gradient accumulation steps of 5. This rigorous methodology ensures that performance metrics reflect genuine generalization capability rather than memorization of training examples — a critical concern when working with small, specialized datasets in the ESG domain.
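
The reported settings can be summarized in a framework-agnostic configuration sketch, together with the "validate every 10% of steps" rule; the dictionary keys are illustrative names, not tied to any particular training library.

```python
# Training hyperparameters as reported in the study (illustrative sketch).
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "gradient_accumulation_steps": 5,
    "lora_rank": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "cv_folds": 10,
}


def validation_interval(total_optimization_steps: int) -> int:
    """Validate every 10% of the total optimization steps."""
    return max(1, total_optimization_steps // 10)


print(validation_interval(450))  # 45: with 450 steps, validate every 45 steps
```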


Benchmarking Nine Models for ESG Activity Classification

One of the most valuable contributions of the ESG-Activities benchmark is its comprehensive evaluation of nine different models spanning the spectrum from lightweight open-source architectures to large proprietary systems. The models tested include three from the LLaMA family (Llama 3B, Llama2 7B, Llama3 8B), three from Google’s Gemma family (Gemma 2B, Gemma 7B, RecurrentGemma 2B), Mistral 7B, the proprietary GPT-4o Mini, and the domain-specific ESG-BERT.

The researchers evaluated each model across three configurations: zero-shot learning (no task-specific training), fine-tuning on original manually curated data only (212 instances), and fine-tuning on the combined original plus synthetic dataset (1,272 instances). This systematic comparison reveals how different architectures respond to increasing amounts of domain-specific training data.

In the zero-shot configuration, GPT-4o Mini achieved the highest F1 score of 72.7%, demonstrating the strength of large proprietary models when no task-specific training is available. Gemma 7B showed the highest precision at 83.6% but suffered from lower recall, while Llama3 8B and the smaller Llama 3B also performed competitively. Notably, domain-specific ESG-BERT — despite being explicitly designed for ESG text — achieved only 54.6% F1, underscoring that pre-training on ESG-related text alone does not guarantee strong classification performance on specific taxonomy activities.

The most revealing finding in the zero-shot evaluation was the poor performance of Mistral 7B (33.96% F1) and RecurrentGemma 2B (49.45% F1). These models demonstrated a strong bias toward predicting the negative class, suggesting that their pre-training distributions may not align well with ESG-specific language patterns without additional fine-tuning. This result has practical implications for organizations considering deploying models for ESG compliance: model selection matters as much as model size, and zero-shot deployment should always be validated against domain-specific benchmarks.

Synthetic Data Augmentation for LLM ESG Detection

The scarcity of labeled data is perhaps the single greatest challenge in developing LLM ESG detection systems. Creating high-quality training data for ESG classification requires domain experts — typically professors and researchers in sustainable finance or specific industry sectors — to manually evaluate whether text segments map to taxonomy activities. The ESG-Activities benchmark was validated by three independent experts (professors and postdocs in transport), with majority rule requiring at least two positive votes for each classification. This rigorous process yielded only 265 candidate mappings, of which 212 were allocated for training.

To address this data scarcity, the researchers developed an innovative synthetic data augmentation strategy. Using ChatGPT-4o, they generated five alternative formulations for each of the 212 original training sentences. Each synthetic version preserves the core meaning and ESG classification of the original text while varying the wording, sentence structure, and vocabulary. This expanded the training set from 212 to 1,272 instances — a six-fold increase that proved transformative for model performance.
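
The augmentation step reduces to a paraphrase prompt plus simple bookkeeping. The prompt text below is an assumption (the paper used ChatGPT-4o, but its exact instructions are not quoted here); the arithmetic reproduces the 212 → 1,272 expansion.

```python
def paraphrase_prompt(sentence: str, n_variants: int = 5) -> str:
    """Hypothetical instruction for generating label-preserving paraphrases."""
    return (
        f"Rewrite the following ESG disclosure sentence in {n_variants} "
        "different ways. Preserve its meaning and its ESG activity label; "
        "vary the wording, sentence structure, and vocabulary.\n"
        f"Sentence: {sentence}"
    )


def augmented_size(n_original: int, variants_per_example: int = 5) -> int:
    # Each original example is kept alongside its synthetic paraphrases.
    return n_original * (1 + variants_per_example)


print(augmented_size(212))  # 1272 -- the expanded training set size
```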

The impact of synthetic augmentation was dramatic and consistent across most models. Llama2 7B improved from 70.5% F1 (fine-tuned on original data only) to 84.9% F1 — a 14.4 percentage point increase. Gemma 7B jumped from 72.3% to 82.1%. Mistral 7B, one of the worst performers in zero-shot, surged from 58.4% to 78.4%. These improvements demonstrate that synthetic data augmentation is particularly effective for models that initially struggle with the task, providing the diverse training examples needed to learn robust ESG-specific language patterns.

Critically, the test set was held constant across all experiments — consisting exclusively of 53 human-curated instances never used in training. This ensures that performance improvements reflect genuine generalization rather than contamination from synthetic data leaking into evaluation. The approach provides a replicable template for other specialized NLP tasks in finance where labeled data is scarce, from NLP-based financial document processing to regulatory compliance screening.

Performance Results: Open-Source vs Proprietary Models

The headline finding of the ESG-Activities benchmark is both surprising and consequential: when properly fine-tuned with synthetic data augmentation, open-source models decisively outperform proprietary alternatives for LLM ESG detection. Llama2 7B achieved the overall best F1 score of 84.9%, with precision and recall both at 84.9% — a remarkably balanced performance profile. Llama 3B followed closely at 84.4%, and Gemma 7B reached 82.1%.

By contrast, GPT-4o Mini — the best zero-shot performer — actually declined from 80.5% to 77.1% F1 when fine-tuned with synthetic data. The researchers attribute this counterintuitive result to the constraints of Azure’s fine-tuning API, which uses default hyperparameters that may not be optimal for small, specialized datasets. This finding serves as a cautionary tale: proprietary model APIs often limit the degree of customization available to practitioners, and the convenience of cloud-based fine-tuning may come at the cost of optimal performance.

| Model       | Zero-Shot F1 | Fine-Tuned (Original) | Fine-Tuned (+ Synthetic) |
|-------------|--------------|-----------------------|--------------------------|
| Llama2 7B   | 58.3%        | 70.5%                 | 84.9%                    |
| Llama 3B    | 68.2%        | 78.9%                 | 84.4%                    |
| Gemma 7B    | 70.5%        | 72.3%                 | 82.1%                    |
| GPT-4o Mini | 72.7%        | 80.5%                 | 77.1%                    |
| ESG-BERT    | 54.6%        | 66.9%                 | 70.5%                    |

The performance of ESG-BERT deserves special attention. Despite being a domain-specific model pre-trained on ESG-related text, it consistently underperformed general-purpose LLMs. Its high precision (83.6% with synthetic data) was offset by persistently low recall (67.9%), meaning it reliably identified true positive cases but missed a large proportion of relevant text segments. This suggests that encoder-only architectures like BERT may be fundamentally less suited to the nuanced semantic matching required for taxonomy activity detection, where understanding the relationship between two text segments — the disclosure and the activity definition — is paramount.
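
The precision/recall imbalance matters because F1 is the harmonic mean of the two, which punishes whichever side is weaker. A quick illustration (note: the benchmark's headline F1 figures may use a different averaging scheme, so they need not equal the harmonic mean of the quoted per-class precision and recall):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Perfect precision cannot compensate for poor recall:
print(round(f1_score(1.0, 0.5), 3))    # 0.667
print(round(f1_score(0.75, 0.75), 3))  # 0.75 -- balance beats one-sided strength
```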

Another notable observation is the performance gap between similarly sized models. Llama2 7B significantly outperformed both Gemma 7B and Mistral 7B despite comparable parameter counts, indicating that architecture design and pre-training data composition play decisive roles in downstream task performance. For practitioners building sustainable finance AI applications, the choice of base model should be informed by domain-specific benchmarks rather than general-purpose leaderboard rankings.


Cost and Efficiency Analysis for ESG AI Deployment

Beyond accuracy, practical deployment of LLM ESG detection requires careful analysis of computational costs and inference speed. The research provides detailed cost and timing data that enables organizations to make informed decisions about model selection based on their specific constraints.

Training costs on an A100 GPU (approximately $9/hour) range from remarkably affordable to moderately expensive. Llama2 7B — the best-performing model — requires just $2.62 for full training on the combined dataset, completing in approximately 17 minutes. By comparison, Mistral 7B costs $14.72 and takes nearly 98 minutes for the same training. GPT-4o Mini’s fine-tuning through OpenAI’s API costs $0.86 for the full dataset but incurs additional deployment fees of $1.62/hour — costs that accumulate quickly in production environments processing large document collections.
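
These figures are consistent with simple GPU-time arithmetic (minutes of training × hourly rate); the small deviations from the reported dollar amounts presumably come from rounding in the reported training times.

```python
def training_cost_usd(minutes: float, gpu_hourly_rate_usd: float = 9.0) -> float:
    """Cost of a fine-tuning run billed by GPU time."""
    return round(minutes / 60 * gpu_hourly_rate_usd, 2)


print(training_cost_usd(17))  # 2.55 -- close to the reported $2.62 for Llama2 7B
print(training_cost_usd(98))  # 14.70 -- close to the reported $14.72 for Mistral 7B
```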

Inference speed is equally important for production systems that must process thousands of text segments across dozens of corporate disclosures. Llama2 7B demonstrates the fastest inference at 2.14 seconds per classification, closely followed by Gemma 7B at 2.18 seconds. ESG-BERT is the fastest overall at 1.11 seconds but achieves significantly lower accuracy. Mistral 7B is an outlier at 10.89 seconds per inference — nearly five times slower than Llama2 7B — making it impractical for high-throughput applications despite reasonable accuracy.

These economics make a compelling case for open-source deployment. An organization processing 10,000 text segments per month would spend approximately $2.62 on fine-tuning (one-time) and minimal ongoing compute costs for inference using Llama2 7B. The equivalent GPT-4o Mini deployment would incur recurring API fees that far exceed the infrastructure cost of running an open-source model on-premises or in a private cloud. For financial institutions where data privacy requirements often prohibit sending sensitive disclosures to third-party APIs, the cost advantage of open-source models is compounded by the elimination of data residency concerns.

Practical Applications of LLM ESG Detection in Finance

The ESGQuest system architecture demonstrates how LLM ESG detection can be deployed as a complete end-to-end solution for financial institutions. The practical workflow begins with document ingestion: Non-Financial Disclosures and sustainability reports are uploaded, chunked into manageable text segments, and indexed in a vector database. NACE codes associated with the reporting company’s industry sector filter the relevant ESG activities from the taxonomy, narrowing the classification scope.

In practice, this technology addresses several high-value use cases. Investment firms conducting ESG due diligence can automatically screen portfolio companies’ sustainability reports against taxonomy requirements, flagging disclosures that align with specific environmental objectives. Credit rating agencies incorporating ESG factors can standardize their assessment methodology by replacing subjective analyst judgment with consistent, reproducible LLM classifications. Regulatory bodies like the European Banking Authority (EBA) can deploy automated screening to identify potential greenwashing in corporate disclosures at scale.

The transport industry case study in the research — covering Ferrovie dello Stato, Autostrade per l’Italia, Maersk, and Mundys — illustrates the diversity of ESG activities that must be detected. Activities range from developing zero-emission vessel infrastructure to implementing smart mobility systems, from expanding cycling infrastructure to ensuring green port operations. Each activity requires the model to understand industry-specific terminology and distinguish between genuine alignment and superficial keyword overlap.

For organizations building their own ESG detection pipelines, the research provides a clear blueprint: start with Llama2 7B or Llama 3B as the base model, create a small expert-validated dataset of 200+ examples for your target taxonomy activities, generate synthetic augmentations to reach 1,000+ training instances, and fine-tune using LoRA. The entire process can be completed in a single day with minimal compute resources, producing a model that outperforms expensive proprietary alternatives.

Building an ESG Detection Pipeline with RAG

The Retrieval-Augmented Generation approach used in ESGQuest represents best practice for deploying LLM ESG detection in production environments. Rather than feeding entire documents to the language model — which would exceed context length limits and dilute classification accuracy — the RAG pipeline first retrieves the most relevant text segments using embedding-based similarity search.

The technical implementation involves several key components. Documents are processed using standard NLP preprocessing: tokenization, sentence splitting, and chunking into segments of optimal length (typically 256-512 tokens for classification tasks). These chunks are embedded using a pre-trained embedding model and stored in a vector database — the researchers used Pinecone, though alternatives like Weaviate, Qdrant, or Chroma would work equally well.
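
A minimal chunker illustrates the preprocessing step. It uses whitespace word counts as a rough stand-in for tokens and an assumed overlap; a production pipeline would use the embedding model's actual tokenizer.

```python
def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split a disclosure into overlapping segments.

    Word count stands in for token count here; real pipelines should
    measure length with the embedding model's tokenizer.
    """
    words = text.split()
    step = max_tokens - overlap
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]


chunks = chunk_text("a b c d e f g h i j", max_tokens=4, overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

The overlap ensures that a sentence straddling a chunk boundary is still retrievable from at least one segment.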

At query time, each ESG activity description is embedded and used to retrieve the top-k most similar text chunks from the company’s disclosure documents. These candidate chunks are then passed to the fine-tuned LLM for binary classification. The system outputs an annotated PDF highlighting which sections of the original document align with specific taxonomy activities — providing auditable, explainable results that compliance teams can review and validate.
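
The retrieve-then-classify loop reduces to a similarity ranking followed by LLM calls on the survivors. Here is a dependency-free sketch of the ranking step; a vector database like Pinecone performs the same computation at scale over an index.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_chunks(activity_vec: list[float],
                 chunk_vecs: list[list[float]],
                 k: int = 3) -> list[int]:
    """Indices of the k chunks most similar to the activity embedding."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(activity_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings: chunk 1 points the same way as the activity vector.
print(top_k_chunks([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], k=2))  # [1, 2]
```

Only the top-k chunks per activity are sent to the fine-tuned classifier, which is what keeps inference cost proportional to the number of activities rather than the full document length.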

This architecture offers several advantages over end-to-end approaches. First, the retrieval step dramatically reduces the number of LLM inference calls required, since only semantically relevant chunks are evaluated. Second, the modular design allows each component to be upgraded independently — a better embedding model or a newer LLM can be swapped in without redesigning the entire pipeline. Third, the vector database serves as a persistent knowledge store that can be queried for multiple purposes beyond ESG classification, such as thematic analysis or trend tracking across reporting periods.

For teams seeking to understand how AI-powered document analysis pipelines work in practice, exploring document intelligence solutions provides valuable context for the architectural decisions that make LLM ESG detection systems reliable and scalable.

Future Directions for AI-Powered ESG Compliance

The ESG-Activities benchmark opens several promising research directions that will shape the future of LLM ESG detection. The current study focuses on 12 transport-related activities from four companies — extending the approach to cover all six environmental objectives across all industry sectors would create a comprehensive taxonomy compliance tool. The researchers note that alternative data generation techniques, including human-in-the-loop approaches, could further improve synthetic data quality and model performance.

Beyond binary classification, the underlying LLM capabilities can be extended to related NLP tasks in sustainable finance. ESG sentiment analysis — determining whether a company’s language about sustainability activities is positive, negative, or neutral — could help investors distinguish between genuine commitment and performative greenwashing. ESG risk assessment, which evaluates the materiality and severity of sustainability-related risks disclosed in corporate filings, represents another high-value application where fine-tuned LLMs could replace or augment human analysts.

The success of lightweight models (2B-7B parameters) for this task is particularly encouraging for real-world deployment scenarios. As model distillation and quantization techniques continue to improve, it is plausible that ESG detection models could run on edge devices or standard commodity hardware, enabling real-time analysis during investor presentations, board meetings, or regulatory reviews. The combination of high accuracy, low cost, fast inference, and data privacy makes LLM ESG detection one of the most practical applications of AI in financial services today.

Looking ahead, the convergence of regulatory pressure, technological capability, and investor demand for transparent ESG data creates a powerful tailwind for adoption. Organizations that invest in building robust LLM ESG detection capabilities now will be well-positioned as reporting requirements expand and stakeholder expectations for AI-verified sustainability claims become the norm rather than the exception.


Frequently Asked Questions

What is LLM ESG detection and why does it matter for finance?

LLM ESG detection uses large language models to automatically identify Environmental, Social, and Governance activities in financial texts. It matters because it enables scalable, cost-effective compliance with EU ESG taxonomy requirements, replacing manual document review that is slow and error-prone.

Which LLM performs best for ESG activity detection in financial texts?

Research shows that Llama2 7B achieves the highest F1 score of 84.9% when fine-tuned with synthetic data augmentation, outperforming proprietary models like GPT-4o Mini and domain-specific models like ESG-BERT for ESG activity classification tasks.

How does synthetic data improve LLM ESG detection accuracy?

Synthetic data augmentation generates alternative formulations of labeled training examples, expanding small datasets from 212 to 1,272 instances. This technique improved F1 scores by up to 14.8 percentage points, helping models generalize better across diverse ESG text patterns.

Can open-source models replace proprietary AI for ESG compliance?

Yes. Fine-tuned open-source models like Llama2 7B and Llama 3B outperformed GPT-4o Mini when trained with synthetic data augmentation. Open-source models also offer data privacy advantages since they can run locally, which is critical for financial institutions handling sensitive disclosures.

What is the EU ESG taxonomy and how do LLMs help with compliance?

The EU ESG taxonomy is a classification system that defines which economic activities qualify as environmentally sustainable under the European Green Deal. LLMs automate the process of mapping company disclosures to specific taxonomy activities, reducing manual effort and improving accuracy for investors and regulators.

How much does it cost to fine-tune an LLM for ESG detection?

Fine-tuning costs range from $0.14 for GPT-4o Mini on original data to $14.72 for Mistral 7B on the full synthetic dataset. The best-performing model, Llama2 7B, costs approximately $2.62 for full training on an A100 GPU, making it highly cost-effective for production deployment.
