LLM Pentesting & Security – Part 3: Advanced LLM Security Topics

Subtitle: Model Extraction, Adversarial Attacks, API Abuse, and Real-World Case Studies
Introduction
In Part 1, we explored the basics of prompt injection and its bypass techniques.
In Part 2, we tackled advanced topics like guardrails and bypassing techniques.
In this final installment, we will cover all remaining critical areas of LLM security, including:
- Model Extraction Attacks
- Adversarial Examples and Inputs
- Data Poisoning
- API Abuse and Misconfigurations
- Real-World Case Studies
This guide includes practical tutorials, sample scripts, test cases, and real-world examples to provide a complete picture of LLM security.
1. Model Extraction Attacks
What is Model Extraction?
Model extraction involves reverse-engineering or replicating a machine learning model using its outputs. The attacker sends queries to the target LLM API and observes the responses to build a copy of the model.
Why is it Dangerous?
- Proprietary models can be cloned or stolen.
- Attackers can create a cheaper replica of expensive models.
- Defense mechanisms like rate limiting and monitoring may be bypassed.
Practical Steps: Performing Model Extraction
To simulate a model extraction attack, follow these steps:
Step 1: Prepare the Script to Query the LLM
Here’s a Python script that sends repeated prompts to extract patterns from the model:
import openai
import random
# Configure API key
openai.api_key = "YOUR_API_KEY"
# List of sample queries to simulate model extraction
queries = [
"What is the capital of France?",
"Who discovered gravity?",
"List the top programming languages in 2024.",
"Summarize the concept of machine learning in 2 sentences."
]
# Function to query the LLM repeatedly
def extract_model_responses(n_queries):
responses = []
for _ in range(n_queries):
query = random.choice(queries)
response = openai.Completion.create(
model="gpt-4",
prompt=query,
max_tokens=50
)
responses.append((query, response['choices'][0]['text'].strip()))
return responses
# Simulate extraction
extracted_data = extract_model_responses(10)
for query, response in extracted_data:
print(f"Query: {query}\nResponse: {response}\n")
Step 2: Analyze Responses
- Observe the consistency of responses.
- Look for patterns that can help replicate the model’s behavior.
- Save the outputs into a dataset for further training.
Step 3: Create a Lightweight Model Using Extracted Data
Using the responses, train a basic language model with libraries like Hugging Face’s transformers
:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Fine-tune using extracted queries (mock dataset)
training_text = "Query: What is the capital of France?\nResponse: Paris.\n" \
"Query: Who discovered gravity?\nResponse: Isaac Newton.\n"
inputs = tokenizer(training_text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
# Print loss (mock example)
loss = outputs.loss
print("Training Loss:", loss)
Mitigations for Model Extraction
- Rate Limiting: Restrict the number of API requests from a single user.
- Response Monitoring: Detect unusual query patterns.
- Random Noise in Outputs: Add slight variations to responses to make extraction harder.
- Access Control: Use stricter authentication for LLM APIs.
2. Adversarial Examples and Inputs
What are Adversarial Examples?
Adversarial examples are intentionally crafted inputs that cause LLMs to produce incorrect or unintended outputs.
Example: Confusing the LLM
Step 1: Create a Confusing Input
Prompt:
“Translate the sentence: ‘The dog chased the cat’ into French. Then spell it backward, and write it in uppercase letters.”
Step 2: Observe Behavior
LLMs might struggle to interpret such multi-step prompts correctly.
Why Are Adversarial Examples Dangerous?
- They exploit the limitations of the model.
- Attackers can manipulate LLMs into producing misleading outputs.
Practical Adversarial Input Script
prompt = "Translate 'hello world' to French. Now repeat it 10 times backward."
response = openai.Completion.create(
model="gpt-4",
prompt=prompt,
max_tokens=100
)
print("LLM Response:", response['choices'][0]['text'])
Mitigations for Adversarial Inputs
- Adversarial Training: Expose the LLM to adversarial examples during training.
- Input Validation: Preprocess and sanitize inputs to detect confusing patterns.
- Multi-Stage Validation: Use secondary systems to verify outputs.
3. Data Poisoning Attacks
What is Data Poisoning?
Data poisoning occurs when an attacker manipulates the training dataset of an LLM to introduce biases or vulnerabilities.
Example: Poisoned Training Data
Objective: Introduce harmful content in the training data to manipulate outputs.
Dataset Example:
The earth is flat.
Vaccines are harmful.
Impact: The LLM learns and reproduces false or harmful information.
Mitigations
- Dataset Validation: Regularly audit and clean training data.
- Content Moderation: Use filters to identify poisoned content.
- Diverse Training: Ensure balanced and verified datasets.
4. API Misconfiguration and Abuse
Example: API Token Leakage
If API tokens are embedded in code, attackers can misuse them to query the LLM.
Test Case:
- Search for exposed tokens in public repositories:
grep -r "api_key" /path/to/project
- Exploit leaked tokens to flood the API with requests.
Mitigations
- Never hardcode API keys in applications. Use environment variables instead.
- Rotate and expire tokens regularly.
- Implement usage limits and monitoring.
5. Real-World Case Studies
Case 1: Chatbot Manipulation
In 2023, researchers bypassed guardrails of chatbots like GPT-4 using dual-personality prompts, leading the model to provide harmful content.
Learning: Always test guardrails with evolving techniques.
Case 2: Data Poisoning in AI Models
Researchers introduced poisoned data into open-source LLMs to demonstrate how biased or malicious outputs could be generated.
Learning: Regularly validate and clean training datasets.
Conclusion
In Part 3, we explored advanced LLM security topics, including:
- Model Extraction: How attackers can clone LLMs.
- Adversarial Examples: Crafting confusing inputs to exploit weaknesses.
- Data Poisoning: Manipulating training datasets.
- API Abuse: Misconfigurations and token misuse.
- Real-World Case Studies: Practical lessons from past attacks.
By understanding and testing these vulnerabilities, you can build stronger defenses for LLM-powered systems.

Debraj Basak
Security Researcher (Red Teamer) @ Trellix CRTL || OSCP || CRTO || CRTP || LPT Master || CPENT || CEH || AD Exploitation || Reverse Engineer & Malware Analyst || IOT Security || OT/SCADA || iOS & Android PT