Cybersecurity LLM Fine-Tuning with RAG & MITRE ATT&CK
Table of Contents
- Domain-Specific LLMs for Cybersecurity Operations
- Instruction Tuning: Teaching LLMs Cybersecurity Reasoning
- The MITRE ATT&CK Framework as Training Foundation
- Synthetic Data Generation for Cybersecurity Training
- RAG Architecture for Real-Time Threat Intelligence
- GraphRAG and GNN Integration for Attack Analysis
- QLoRA Quantization for Resource-Efficient Deployment
- Benchmarking Results: RAG vs GraphRAG vs Graph+LLM
- Practical Deployment Challenges and Solutions
- Future of Domain-Adapted Cybersecurity LLMs
📌 Key Takeaways
- GraphRAG+GNN Leads Performance: Achieves highest LLM Judge score of 8.00/10 on MITRE ATT&CK queries, outperforming Pure RAG (7.87) and Graph+LLM (7.16).
- Small Models Can Compete: Fine-tuned Gemma-2B with QLoRA delivers meaningful cybersecurity analysis on a single NVIDIA RTX 4090 GPU with 24GB VRAM.
- Synthetic Data Enables Privacy: LLM-generated training data from 2,398 MITRE ATT&CK prompt-response pairs enables comprehensive coverage without exposing sensitive operational data.
- Context Window Matters: Effective token usage is limited to 200-400 tokens despite 2,048-token support, requiring careful prompt engineering for domain tasks.
- Hybrid Cloud-Local Pipeline: Cloud-hosted 175B+ models generate training data while local fine-tuning preserves data privacy and reduces inference costs.
Domain-Specific LLMs for Cybersecurity Operations
The cybersecurity industry faces a fundamental challenge: general-purpose large language models possess broad knowledge but lack the specialized understanding needed for accurate threat analysis, incident classification, and tactical response recommendations. While models like GPT-4 and Claude demonstrate impressive general capabilities, they often produce generic or occasionally inaccurate outputs when confronted with domain-specific cybersecurity queries that require precise mapping to established frameworks and taxonomies.
The CyberLLM-FINDS 2025 research project addresses this gap by developing a methodology for transforming general-purpose models into domain-specific cybersecurity assistants. Published by Iyer, Bobadilla, and Iyengar, the research demonstrates how instruction tuning, retrieval-augmented generation (RAG), and graph neural network integration can align a compact 2-billion parameter model with the MITRE ATT&CK framework—the industry’s most widely adopted knowledge base of adversary tactics, techniques, and procedures.
What makes this research particularly significant is its focus on resource efficiency. Rather than requiring massive computational infrastructure, the approach achieves competitive results using a single consumer-grade GPU, making advanced cybersecurity AI accessible to organizations that cannot afford enterprise-scale computing resources. This democratization of AI-driven cybersecurity capabilities represents a meaningful step toward closing the gap between large enterprises and smaller organizations in their defensive capabilities.
Instruction Tuning: Teaching LLMs Cybersecurity Reasoning
Instruction tuning represents a paradigm shift from traditional fine-tuning approaches by training models on structured prompt-response pairs that teach the model to follow specific task instructions. In the cybersecurity context, this means creating training examples that pair security analysis questions with expert-quality responses grounded in established frameworks and best practices.
The CyberLLM-FINDS methodology employs a multi-task instruction tuning approach where the model learns to handle diverse cybersecurity tasks simultaneously—from mapping suspicious PowerShell commands to MITRE ATT&CK techniques (such as T1059.001) to analyzing credential dumping indicators (T1003.001 – LSASS Memory) and identifying lateral movement patterns. Each training example follows a consistent format: a clear instruction, relevant context, and a detailed response that demonstrates the reasoning process.
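To make the instruction/context/response format concrete, here is a minimal sketch of what one such training pair might look like, assuming an Alpaca-style layout; the field names and prompt template are illustrative assumptions, not taken from the paper.

```python
# Illustrative instruction-tuning pair in an Alpaca-style layout.
# Field names and the "### ..." template are assumptions, not from the paper.
example = {
    "instruction": "Map the following command to a MITRE ATT&CK technique "
                   "and explain your reasoning.",
    "context": "powershell.exe -enc <base64 payload> spawned by winword.exe",
    "response": (
        "Technique: T1059.001 (Command and Scripting Interpreter: PowerShell). "
        "Reasoning: the -enc flag indicates an encoded PowerShell payload, and "
        "the winword.exe parent suggests execution via a malicious document."
    ),
}

def format_training_text(ex: dict) -> str:
    """Serialize one prompt-response pair into a single training string."""
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Context:\n{ex['context']}\n\n"
        f"### Response:\n{ex['response']}"
    )

print(format_training_text(example))
```

Serializing pairs into one flat string like this is what lets a causal LM learn the instruction-following behavior during fine-tuning.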
One-shot prompts with chain-of-thought reasoning proved most effective for the fine-tuned model. By providing a single example that demonstrates the desired analytical process, the model learns to replicate this reasoning pattern across novel queries. This finding has practical implications for deployment: security teams can provide a single reference analysis as context to significantly improve the model’s response quality for similar threat scenarios without additional fine-tuning.
The research also explored zero-shot and few-shot prompting strategies, finding that the optimal approach varies by task complexity. For straightforward classification tasks (mapping events to ATT&CK tactics), zero-shot performance was acceptable after fine-tuning. For complex analytical tasks requiring multi-step reasoning—such as reconstructing attack chains or correlating indicators across multiple data sources—one-shot prompting with explicit reasoning chains yielded substantially better results.
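A one-shot chain-of-thought prompt of the kind described above can be assembled mechanically; the sketch below is a hedged illustration (the wording of the system preamble and the exemplar are assumptions, not the paper's actual prompts).

```python
def build_one_shot_prompt(exemplar_q: str, exemplar_reasoning: str,
                          exemplar_answer: str, query: str) -> str:
    """Assemble a one-shot prompt with an explicit reasoning chain.

    The preamble and section labels are illustrative assumptions.
    """
    return (
        "You are a SOC analyst mapping events to MITRE ATT&CK.\n\n"
        f"Example question: {exemplar_q}\n"
        f"Example reasoning: {exemplar_reasoning}\n"
        f"Example answer: {exemplar_answer}\n\n"
        f"Question: {query}\n"
        "Reasoning:"
    )

prompt = build_one_shot_prompt(
    "A scheduled task named 'updater' runs at every logon. Which technique?",
    "Scheduled tasks used for recurring execution map to persistence; "
    "T1053.005 covers Windows Scheduled Task abuse.",
    "T1053.005 (Scheduled Task/Job: Scheduled Task)",
    "A new service was installed and started remotely. Which technique?",
)
print(prompt)
```

Ending the prompt at "Reasoning:" invites the model to reproduce the exemplar's step-by-step pattern before committing to an answer.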
The MITRE ATT&CK Framework as Training Foundation
The MITRE ATT&CK framework serves as the structural backbone for the CyberLLM-FINDS training pipeline, providing a comprehensive taxonomy of adversary behaviors observed in real-world cyber operations. This framework organizes attack knowledge into tactics (the adversary’s goals), techniques (how they achieve those goals), and sub-techniques (specific implementations), creating a hierarchical knowledge structure that maps naturally to LLM training objectives.
The training dataset comprises 2,398 MITRE ATT&CK prompt-response pairs covering the full spectrum of adversary behaviors—from initial access and execution through persistence, privilege escalation, defense evasion, credential access, discovery, lateral movement, collection, command and control, exfiltration, and impact. This comprehensive coverage ensures the fine-tuned model develops balanced knowledge across all phases of the cyber kill chain rather than becoming overspecialized in commonly discussed techniques.
STIX (Structured Threat Information Expression) objects complement the ATT&CK framework by providing a standardized format for threat intelligence data. The training pipeline incorporates STIX-formatted threat intelligence to teach the model how to parse, generate, and reason about structured threat data—a capability essential for integration with existing security information and event management (SIEM) systems and threat intelligence platforms that rely on STIX for data exchange.
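For readers unfamiliar with STIX, a minimal STIX 2.1 attack-pattern object in roughly the shape MITRE distributes ATT&CK content looks like the sketch below; the UUID and timestamps are placeholders, and only a small subset of properties is shown.

```python
import json

# Minimal STIX 2.1 attack-pattern object; the UUID and timestamps are
# placeholders, and real ATT&CK objects carry many more properties.
attack_pattern = {
    "type": "attack-pattern",
    "spec_version": "2.1",
    "id": "attack-pattern--00000000-0000-4000-8000-000000000000",
    "created": "2025-01-01T00:00:00.000Z",
    "modified": "2025-01-01T00:00:00.000Z",
    "name": "Command and Scripting Interpreter: PowerShell",
    "external_references": [
        {"source_name": "mitre-attack", "external_id": "T1059.001"},
    ],
}

print(json.dumps(attack_pattern, indent=2))
```

Training on structured objects like this is what lets the model emit SIEM-compatible output rather than free text alone.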
A key insight from the research is that the training data must reflect the distribution of real-world threats rather than simply providing equal coverage of all techniques. Techniques that are frequently observed in the wild—such as spearphishing attachments, PowerShell execution, and scheduled task persistence—receive proportionally more training examples, while rare but high-impact techniques are augmented with synthetic examples to prevent the model from developing blind spots.
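The augmentation logic for rare techniques can be sketched as a simple floor on per-technique example counts; the observed counts and the floor value below are illustrative, not figures from the paper.

```python
# Illustrative per-technique example counts (not from the paper): common
# techniques already have real examples; rare ones are topped up synthetically.
observed_counts = {"T1566.001": 120, "T1059.001": 90, "T1542.003": 3}
TARGET_MIN = 30  # assumed floor of training examples per technique

def synthetic_top_up(counts: dict[str, int], floor: int) -> dict[str, int]:
    """How many synthetic examples each technique needs to reach the floor."""
    return {t: max(0, floor - n) for t, n in counts.items()}

print(synthetic_top_up(observed_counts, TARGET_MIN))
```

Here only the rare bootkit technique (T1542.003) would receive synthetic augmentation, while the well-represented phishing and PowerShell techniques keep their real-world-proportional counts.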
Synthetic Data Generation for Cybersecurity Training
One of the most innovative aspects of the CyberLLM-FINDS methodology is its use of synthetic data generation to overcome the scarcity of labeled cybersecurity training data. Real-world cybersecurity data is notoriously difficult to obtain due to privacy concerns, classification restrictions, and the sensitive nature of incident information. Synthetic data generation addresses these challenges by using cloud-hosted large language models (175B+ parameters) to create realistic training examples that capture the diversity and complexity of real cyber threats.
The synthetic data pipeline follows a structured process: first, threat scenarios are defined based on ATT&CK techniques and real-world threat intelligence reports. Then, cloud-hosted models generate diverse variations of log entries, alert descriptions, incident reports, and analytical queries that reflect these scenarios. Each generated example undergoes validation against the ATT&CK taxonomy to ensure technical accuracy and consistency with established threat classification standards.
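The generate-then-validate loop above can be sketched as follows; `generate_variation` is a stand-in for the real call to a cloud-hosted 175B+ model, and the toy technique whitelist stands in for the full ATT&CK taxonomy.

```python
import re

# Toy subset of the taxonomy; the real pipeline validates against all of ATT&CK.
VALID_TECHNIQUES = {"T1059.001", "T1003.001", "T1566.001"}

def generate_variation(scenario: str) -> str:
    """Stand-in for a cloud-hosted generator model (assumption: the real
    pipeline sends `scenario` to a large-model API and gets text back)."""
    return f"Alert: encoded PowerShell observed. Scenario: {scenario} Technique: T1059.001"

def validate(example: str) -> bool:
    """Keep only examples whose cited technique IDs exist in the taxonomy."""
    ids = re.findall(r"T\d{4}(?:\.\d{3})?", example)
    return bool(ids) and all(i in VALID_TECHNIQUES for i in ids)

scenarios = ["macro-laden document delivery", "LSASS memory access"]
dataset = [ex for s in scenarios if validate(ex := generate_variation(s))]
print(len(dataset), "validated examples")
```

The validation gate is the important part: generated text that cites a technique ID outside the taxonomy is rejected rather than silently added to the training set.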
This approach enables privacy-preserving, balanced, and comprehensive coverage across all ATT&CK techniques, including rare and underrepresented attack behaviors that might have insufficient representation in available real-world datasets. For organizations bound by regulatory constraints or data protection requirements, synthetic data generation offers a practical path to building effective cybersecurity AI systems without exposing sensitive operational data or violating compliance obligations.
The research found that models under 2 billion parameters achieved below 20% baseline accuracy on instruction-following tasks, necessitating the use of synthetic data from cloud-hosted models. This hybrid approach—using larger models to generate training data that enables smaller models to perform effectively—represents a pragmatic strategy for deploying cybersecurity AI at scale while managing computational costs and maintaining data sovereignty.
RAG Architecture for Real-Time Threat Intelligence
Retrieval-Augmented Generation fundamentally changes how cybersecurity LLMs access and utilize knowledge. Rather than relying solely on information encoded in model weights during training, RAG systems dynamically retrieve relevant documents, threat intelligence, and framework documentation at inference time, ensuring responses reflect the most current threat landscape.
In the CyberLLM-FINDS architecture, the RAG component maintains a curated knowledge base of ATT&CK documentation, threat intelligence reports, and security best practices. When a query arrives, the system first retrieves the most relevant knowledge chunks using semantic similarity search, then presents these alongside the query to the fine-tuned model. This architecture significantly reduces hallucination—a critical concern in security applications where incorrect information could lead to missed threats or inappropriate response actions.
Pure RAG achieved the highest clarity scores in head-to-head evaluations, winning 3 of 5 comparisons against the more complex GraphRAG+GNN approach. This finding suggests that for straightforward threat analysis queries where speed and clarity are paramount, a well-implemented RAG system may be the optimal choice. The simplicity of Pure RAG also translates to easier deployment, maintenance, and troubleshooting in production security operations environments, making it a practical option for organizations looking to enhance their LLM-powered cybersecurity capabilities.
The RAG knowledge base requires careful curation to be effective. The researchers found that simply indexing raw ATT&CK documentation produced suboptimal results; the best performance came from preprocessing knowledge into focused chunks that align with the types of queries the model would encounter in production. Each chunk is enriched with metadata including relevant ATT&CK technique IDs, tactic classifications, and platform applicability, enabling more precise retrieval and reducing the noise that can degrade response quality.
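A toy version of retrieval over metadata-enriched chunks is sketched below; a real deployment would use a sentence-embedding model, for which the bag-of-words cosine similarity here is only a stand-in, and the chunk texts are invented examples.

```python
from collections import Counter
from math import sqrt

# Metadata-enriched chunks (texts invented for illustration).
chunks = [
    {"text": "PowerShell execution via encoded commands",
     "technique": "T1059.001", "tactic": "execution", "platforms": ["Windows"]},
    {"text": "Credential dumping from LSASS process memory",
     "technique": "T1003.001", "tactic": "credential-access", "platforms": ["Windows"]},
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (embedding-model stand-in)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[dict]:
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: similarity(query, c["text"]),
                  reverse=True)[:k]

top = retrieve("encoded powershell commands observed on a host")[0]
print(top["technique"], top["tactic"])
```

The technique ID and tactic attached to each chunk are what get passed along as structured context, which is the metadata enrichment the paragraph above describes.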
GraphRAG and GNN Integration for Attack Analysis
GraphRAG extends the traditional RAG paradigm by incorporating graph neural networks that capture the relational structure of cybersecurity knowledge. While standard RAG treats each retrieved document as an independent text chunk, GraphRAG leverages the connections between ATT&CK techniques, threat actors, tools, and attack campaigns to provide contextually richer responses that account for the interdependencies inherent in cyber operations.
The graph structure models relationships that are critical for comprehensive threat analysis: technique chains that represent common attack sequences, tool-technique mappings that indicate which adversary tools implement which techniques, and campaign-actor associations that connect observed activity to known threat groups. When a security analyst queries about a specific technique, GraphRAG can automatically surface related techniques that commonly co-occur, potential mitigation strategies, and known threat actors who employ similar tactics.
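A hand-built relation graph of the kind described can be traversed to surface co-occurring techniques, as in the sketch below; the edges and the actor ID are illustrative assumptions, and note that the actual system's GNN learns node representations, which this plain breadth-first expansion does not model.

```python
# Toy relation graph; edge labels, chains, and the actor ID are illustrative.
edges = {
    "T1566.001": [("followed-by", "T1059.001")],   # phishing -> PowerShell
    "T1059.001": [("followed-by", "T1003.001"),    # PowerShell -> cred dump
                  ("used-by", "G0000")],           # hypothetical actor link
    "T1003.001": [("enables", "T1021.002")],       # creds -> SMB lateral movement
}

def expand(technique: str, hops: int = 2) -> set[str]:
    """Collect nodes reachable within `hops` edges of the queried technique."""
    frontier, seen = {technique}, {technique}
    for _ in range(hops):
        frontier = {dst for node in frontier
                    for _, dst in edges.get(node, [])} - seen
        seen |= frontier
    return seen

print(sorted(expand("T1566.001")))
```

The hop limit matters: unconstrained expansion pulls in loosely related nodes, which is exactly the "broader but less focused" failure mode the Graph+LLM results below exhibit.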
The GraphRAG+GNN approach achieved the highest overall LLM Judge score of 8.00, with reported dimension scores of Relevance 8.0, Accuracy 8.2, Specificity 7.9, and Clarity 8.1, outperforming both Pure RAG (7.87) and Graph+LLM (7.16). The accuracy advantage is particularly noteworthy: by incorporating graph structure, the system can verify the consistency of its responses against known attack patterns, reducing the likelihood of generating plausible but factually incorrect analysis.
However, the Graph+LLM approach (using graph retrieval without the GNN component) produced broader but less focused responses, indicating that graph-only retrieval lacks the precision needed for actionable cybersecurity analysis. The GNN component’s ability to learn node representations that capture both local and global graph properties proves essential for translating graph structure into focused, relevant context that enhances rather than dilutes the model’s analytical capability.
QLoRA Quantization for Resource-Efficient Deployment
The computational demands of large language models present a significant barrier to adoption for many cybersecurity organizations. QLoRA (Quantized Low-Rank Adaptation) addresses this challenge by enabling fine-tuning and inference with dramatically reduced memory requirements. The CyberLLM-FINDS implementation uses 4-bit quantized weights with 16-bit arithmetic, allowing the entire Gemma-2B model to fit within the 24GB VRAM of a single NVIDIA RTX 4090 GPU.
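The headroom that 4-bit weights create can be seen with back-of-envelope arithmetic: for a ~2B-parameter model, weight storage alone drops well below the 24GB budget (activations, the KV cache, and optimizer state consume more on top, so this is a lower bound, not a serving requirement).

```python
def weight_memory_gib(params: float, bits: int) -> float:
    """Memory needed to hold `params` weights at `bits` bits each, in GiB."""
    return params * bits / 8 / 2**30

params = 2e9  # ~2B-parameter model such as Gemma-2B
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")
```

At 4 bits the weights occupy under 1 GiB versus roughly 3.7 GiB at 16 bits, which is what leaves room on a 24GB card for the frozen base model, the LoRA adapters, and training activations simultaneously.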
The quantization process maintains model quality by preserving the most information-dense components of the weight matrices while compressing less critical parameters. Low-rank adapters further reduce the number of trainable parameters during fine-tuning, focusing the training signal on domain-specific adaptations rather than modifying the entire model. This combination enables the research team to fine-tune on a batch size of 4 with a maximum sequence length of 397 tokens while maintaining competitive performance on cybersecurity tasks.
The QLoRA methodology, originally developed by Dettmers and colleagues, has proven particularly well-suited for domain adaptation scenarios where a relatively small amount of specialized knowledge must be efficiently integrated into a pre-trained general-purpose model. For cybersecurity applications, this means organizations can create specialized models that understand their specific threat landscape, compliance requirements, and operational context without the massive computational investment typically associated with LLM development.
Practical deployment considerations include inference speed and token generation limits. The research found that inference output should be capped at 200 tokens for optimal quality, with longer outputs showing degradation in coherence and accuracy. Parser errors occurred at default token lengths of 1,024–2,048 with batch size 4, necessitating careful configuration tuning for production deployments. These constraints inform deployment architectures that may chain multiple focused queries rather than attempting single long-form responses for complex analytical tasks.
Benchmarking Results: RAG vs GraphRAG vs Graph+LLM
The comprehensive evaluation of three retrieval architectures—Pure RAG, Graph+LLM, and GraphRAG+GNN—provides valuable guidance for organizations selecting cybersecurity AI architectures. Each approach exhibits distinct strengths and trade-offs that make them suitable for different operational contexts and use cases.
Pure RAG delivers the best balance of simplicity and performance for standard threat analysis queries. Its clarity advantage (winning 3 of 5 head-to-head evaluations) makes it ideal for environments where security analysts need quick, clear answers without the overhead of graph processing. Pure RAG is also the most straightforward to implement and maintain, requiring only a well-curated document store and semantic search infrastructure. Organizations with limited AI engineering resources may find this the most practical starting point for their cybersecurity AI journey.
GraphRAG+GNN excels in scenarios requiring deep contextual analysis, where understanding the relationships between techniques, actors, and campaigns is essential. Its highest overall score of 8.00 reflects superior performance on complex queries that span multiple ATT&CK tactics and require reasoning about attack chain dependencies. Security operations centers handling advanced persistent threats (APTs) or conducting strategic threat assessments would benefit most from this architecture, despite its higher implementation complexity.
The Graph+LLM approach serves as a cautionary example of how adding architectural complexity without proper integration can actually degrade performance. Its lowest score of 7.16 demonstrates that simply providing graph-based context without the GNN’s learned representations can overwhelm the model with loosely relevant information, reducing response focus and precision. This finding underscores the importance of thoughtful architecture design rather than simply combining trendy AI components and hoping for emergent improvement.
Practical Deployment Challenges and Solutions
Deploying domain-specific cybersecurity LLMs in production environments presents challenges that extend beyond model training and evaluation. The CyberLLM-FINDS research identifies several critical deployment considerations that organizations must address to realize the practical benefits of cybersecurity AI.
Context window limitations represent perhaps the most significant practical constraint. Despite the Gemma-2B model supporting a 2,048-token context window, effective utilization during domain-specific tasks is limited to 200–400 tokens due to uneven prompt length distributions in the training data. This constraint requires organizations to design interaction patterns that decompose complex analytical tasks into focused sub-queries, each within the effective context window, and synthesize results through an orchestration layer.
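The decompose-and-synthesize pattern described above can be sketched as a thin orchestration layer; `ask_model` is a placeholder for the real inference call and simply echoes here, and the whitespace token count is a crude proxy for a real tokenizer.

```python
# Orchestration sketch: split a broad task into sub-queries sized for the
# model's effective context window, then stitch the answers together.
EFFECTIVE_BUDGET_TOKENS = 300  # within the observed 200-400 token sweet spot

def ask_model(prompt: str) -> str:
    """Placeholder for the real inference call; echoes for illustration."""
    return f"[answer to: {prompt}]"

def rough_token_count(text: str) -> int:
    """Crude whitespace proxy for a tokenizer."""
    return len(text.split())

def answer_complex_task(sub_queries: list[str]) -> str:
    parts = []
    for q in sub_queries:
        assert rough_token_count(q) <= EFFECTIVE_BUDGET_TOKENS, "sub-query too long"
        parts.append(ask_model(q))
    return "\n".join(parts)

report = answer_complex_task([
    "Which ATT&CK tactic does encoded PowerShell execution map to?",
    "List common mitigations for T1059.001.",
])
print(report)
```

Enforcing the budget per sub-query, rather than per conversation, is what keeps each call inside the model's effective window even when the overall analysis is long.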
The dependency on cloud-hosted models for synthetic data generation introduces both cost and security considerations. Organizations must carefully evaluate the trade-off between the quality improvements enabled by large-model-generated training data and the risks of transmitting cybersecurity-relevant prompts to external APIs. Strategies for mitigating this risk include using anonymized threat scenarios, abstracting organization-specific details, and implementing data loss prevention controls on outbound API traffic.
Integration with existing security infrastructure—SIEM systems, SOAR platforms, threat intelligence platforms, and ticketing systems—requires careful API design and workflow orchestration. The model’s outputs must be formatted to match the expectations of downstream systems, whether that means generating STIX-formatted threat intelligence objects, structured incident reports compatible with the organization’s ticketing system, or detection rules in platform-specific query languages like Kusto (KQL) or Splunk SPL.
Continuous model maintenance represents an ongoing operational requirement. As new ATT&CK techniques are added, threat actor behaviors evolve, and organizational security postures change, the fine-tuned model and its RAG knowledge base must be updated accordingly. Establishing automated pipelines for data collection, preprocessing, fine-tuning, evaluation, and deployment ensures that the model remains current without imposing unsustainable manual effort on the security team.
Future of Domain-Adapted Cybersecurity LLMs
The CyberLLM-FINDS research opens several promising avenues for the future development of domain-adapted cybersecurity language models. Perhaps the most impactful direction is the development of standardized benchmarks that enable rigorous comparison across cybersecurity AI systems. The current evaluation scope—five MITRE queries with an LLM Judge—provides initial validation but falls short of the comprehensive benchmarking needed to guide production adoption decisions.
Multi-model architectures that combine specialized cybersecurity LLMs with general-purpose models offer another promising direction. Rather than expecting a single model to handle all cybersecurity tasks, organizations could deploy ensembles where a cybersecurity-specialized model handles domain-specific analysis while a general-purpose model handles natural language interaction, report generation, and cross-domain reasoning. This approach mirrors the specialization seen in human security teams, where different analysts bring different expertise to collaborative threat analysis.
The convergence of cybersecurity LLMs with automated response systems represents both an opportunity and a risk. As models become more accurate and reliable, the temptation to automate not just analysis but also response actions grows. However, the consequences of incorrect automated responses in cybersecurity—from blocking legitimate traffic to executing inappropriate containment actions—demand that autonomous response capabilities be deployed with robust safeguards, human oversight, and comprehensive audit trails.
Advances in model efficiency will continue to drive accessibility. Techniques like QLoRA quantization and instruction tuning demonstrate that meaningful security analysis can be achieved with increasingly compact models when combined with proper domain adaptation. Future work on sub-billion parameter models, hardware-specific optimizations, and edge deployment architectures could bring cybersecurity AI capabilities to even the most resource-constrained environments, including operational technology networks and remote facilities where cloud connectivity is limited or prohibited.
The integration of multi-modal capabilities—combining text analysis with network traffic visualization, code graph analysis, and binary pattern recognition—represents the next frontier for cybersecurity LLMs. As the threat landscape continues to evolve in complexity and sophistication, the models that defend against these threats must evolve correspondingly, incorporating diverse data modalities and reasoning capabilities to maintain effective detection and response capabilities.
Frequently Asked Questions
What is instruction tuning for cybersecurity LLMs?
Instruction tuning is a fine-tuning technique that trains language models on structured prompt-response pairs derived from cybersecurity frameworks like MITRE ATT&CK. It teaches the model to follow specific security analysis instructions, map threats to attack techniques, and generate domain-appropriate responses for tasks like log analysis, incident triage, and threat classification.
How does RAG improve cybersecurity LLM performance?
Retrieval-Augmented Generation enhances cybersecurity LLMs by supplementing the model’s parametric knowledge with real-time retrieval from threat intelligence databases, vulnerability repositories, and MITRE ATT&CK documentation. This reduces hallucinations, provides up-to-date threat context, and improves accuracy for domain-specific queries without requiring full model retraining.
Can small LLMs perform cybersecurity tasks effectively?
Small LLMs like Gemma-2B can perform cybersecurity tasks effectively when fine-tuned with domain-specific data and augmented with techniques like QLoRA quantization and RAG. Research shows that GraphRAG+GNN achieved an LLM Judge score of 8.00 out of 10 on MITRE ATT&CK queries, demonstrating that resource-constrained models can deliver meaningful security analysis when properly adapted.
What is the MITRE ATT&CK framework and why does it matter for AI?
MITRE ATT&CK is a globally recognized knowledge base of adversary tactics, techniques, and procedures based on real-world observations. For AI, it provides a structured taxonomy that enables LLMs to classify threats systematically, generate standardized threat intelligence reports, and map security events to known attack patterns, making automated threat analysis more consistent and actionable.
What is GraphRAG and how does it differ from standard RAG?
GraphRAG combines traditional retrieval-augmented generation with graph neural networks to capture relationships between entities in cybersecurity knowledge bases. While standard RAG retrieves relevant text chunks, GraphRAG also leverages entity relationships, attack chain connections, and technique dependencies from graph structures, achieving higher accuracy scores (8.00 vs 7.87) on complex threat intelligence queries.