LLM Pentesting & Security – Part 3: Advanced LLM Security Topics


Subtitle: Model Extraction, Adversarial Attacks, API Abuse, and Real-World Case Studies


Introduction

In Part 1, we explored the basics of prompt injection and common bypass techniques.
In Part 2, we covered guardrails and the methods attackers use to evade them.

In this final installment, we will cover all remaining critical areas of LLM security, including:

  1. Model Extraction Attacks
  2. Adversarial Examples and Inputs
  3. Data Poisoning
  4. API Abuse and Misconfigurations
  5. Real-World Case Studies

This guide includes practical tutorials, sample scripts, test cases, and real-world examples to provide a complete picture of LLM security.


1. Model Extraction Attacks

What is Model Extraction?

Model extraction involves reverse-engineering or replicating a machine learning model using its outputs. The attacker sends queries to the target LLM API and observes the responses to build a copy of the model.

Why is it Dangerous?

  • Proprietary models can be cloned or stolen.
  • Attackers can create a cheaper replica of expensive models.
  • Defense mechanisms like rate limiting and monitoring may be bypassed.

Practical Steps: Performing Model Extraction

To simulate a model extraction attack, follow these steps:

Step 1: Prepare the Script to Query the LLM

Here’s a Python script that sends repeated prompts to extract patterns from the model:

import random
from openai import OpenAI

# Configure the client (in practice, load the key from an environment variable)
client = OpenAI(api_key="YOUR_API_KEY")

# List of sample queries to simulate model extraction
queries = [
    "What is the capital of France?",
    "Who discovered gravity?",
    "List the top programming languages in 2024.",
    "Summarize the concept of machine learning in 2 sentences."
]

# Function to query the LLM repeatedly and collect (query, response) pairs
def extract_model_responses(n_queries):
    responses = []
    for _ in range(n_queries):
        query = random.choice(queries)
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": query}],
            max_tokens=50
        )
        responses.append((query, completion.choices[0].message.content.strip()))
    return responses

# Simulate extraction
extracted_data = extract_model_responses(10)
for query, response in extracted_data:
    print(f"Query: {query}\nResponse: {response}\n")

Step 2: Analyze Responses

  • Observe the consistency of responses.
  • Look for patterns that can help replicate the model’s behavior.
  • Save the outputs into a dataset for further training.
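For that last step, a minimal sketch (reusing the extracted_data list from the script above) might simply persist the query/response pairs as JSONL; the file name extracted_dataset.jsonl is an arbitrary choice:

import json

# Minimal sketch: persist the (query, response) pairs collected above as a
# JSONL dataset for later fine-tuning. The file name is arbitrary.
def save_extracted_dataset(extracted_data, path="extracted_dataset.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for query, response in extracted_data:
            f.write(json.dumps({"prompt": query, "completion": response}) + "\n")

save_extracted_dataset(extracted_data)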

Step 3: Create a Lightweight Model Using Extracted Data

Using the responses, train a basic language model with libraries like Hugging Face’s transformers:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Mock dataset built from the extracted query/response pairs
training_text = "Query: What is the capital of France?\nResponse: Paris.\n" \
                "Query: Who discovered gravity?\nResponse: Isaac Newton.\n"

# Tokenize and run a single fine-tuning step (a real attack would loop over
# thousands of extracted pairs for several epochs)
inputs = tokenizer(training_text, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
loss.backward()
optimizer.step()

print("Training Loss:", loss.item())

Mitigations for Model Extraction

  1. Rate Limiting: Restrict the number of API requests from a single user (a sketch follows this list).
  2. Response Monitoring: Detect unusual query patterns.
  3. Random Noise in Outputs: Add slight variations to responses to make extraction harder.
  4. Access Control: Use stricter authentication for LLM APIs.
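To make the first mitigation concrete, here is a minimal sketch of a per-user sliding-window rate limiter; the window size and request cap are placeholder values, and a real deployment would keep this state in a shared store rather than in process memory:

import time
from collections import defaultdict, deque

# Illustrative limits, not recommendations
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 20
_request_log = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    now = time.time()
    log = _request_log[user_id]
    # Drop timestamps that have fallen out of the sliding window
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS_PER_WINDOW:
        return False
    log.append(now)
    return True

# Gate every LLM call behind the limiter
if allow_request("user-123"):
    print("Forward query to the LLM")
else:
    print("429: rate limit exceeded")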

2. Adversarial Examples and Inputs

What are Adversarial Examples?

Adversarial examples are intentionally crafted inputs that cause LLMs to produce incorrect or unintended outputs.


Example: Confusing the LLM

Step 1: Create a Confusing Input

Prompt:

“Translate the sentence: ‘The dog chased the cat’ into French. Then spell it backward, and write it in uppercase letters.”


Step 2: Observe Behavior

LLMs might struggle to interpret such multi-step prompts correctly.


Why Are Adversarial Examples Dangerous?

  • They exploit the limitations of the model.
  • Attackers can manipulate LLMs into producing misleading outputs.

Practical Adversarial Input Script

prompt = "Translate 'hello world' to French. Now repeat it 10 times backward."

response = openai.Completion.create(
    model="gpt-4",
    prompt=prompt,
    max_tokens=100
)

print("LLM Response:", response['choices'][0]['text'])

Mitigations for Adversarial Inputs

  1. Adversarial Training: Expose the LLM to adversarial examples during training.
  2. Input Validation: Preprocess and sanitize inputs to detect confusing patterns (see the sketch after this list).
  3. Multi-Stage Validation: Use secondary systems to verify outputs.
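As a rough illustration of input validation, the sketch below counts instruction-chaining keywords before a prompt reaches the model; the marker list and threshold are illustrative assumptions, not a tested filter:

import re

# Crude heuristic: flag prompts that chain many imperative steps
SUSPICIOUS_MARKERS = ["then", "now", "afterwards", "backward", "ignore", "repeat"]

def looks_adversarial(prompt: str, max_markers: int = 2) -> bool:
    words = re.findall(r"[a-z']+", prompt.lower())
    hits = sum(words.count(marker) for marker in SUSPICIOUS_MARKERS)
    return hits > max_markers

prompt = "Translate 'hello world' to French. Now repeat it 10 times backward."
if looks_adversarial(prompt):
    print("Prompt flagged for review before reaching the LLM")
else:
    print("Prompt passed validation")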

3. Data Poisoning Attacks

What is Data Poisoning?

Data poisoning occurs when an attacker manipulates the training dataset of an LLM to introduce biases or vulnerabilities.


Example: Poisoned Training Data

Objective: Introduce harmful content in the training data to manipulate outputs.

Dataset Example:

The earth is flat.
Vaccines are harmful.

Impact: The LLM learns and reproduces false or harmful information.


Mitigations

  1. Dataset Validation: Regularly audit and clean training data (a sketch follows this list).
  2. Content Moderation: Use filters to identify poisoned content.
  3. Diverse Training: Ensure balanced and verified datasets.
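A minimal sketch of the first mitigation might drop training samples that match a blocklist of known-false claims; real pipelines would pair this with classifiers and human review rather than a hardcoded list:

# Blocklist of known-false claims (illustrative only)
BLOCKLIST = ["the earth is flat", "vaccines are harmful"]

def clean_dataset(samples):
    cleaned = []
    for sample in samples:
        text = sample.lower()
        if any(claim in text for claim in BLOCKLIST):
            continue  # discard the poisoned sample
        cleaned.append(sample)
    return cleaned

raw_samples = [
    "The earth is flat.",
    "Paris is the capital of France.",
    "Vaccines are harmful.",
]
print(clean_dataset(raw_samples))  # only the factual sample remains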

4. API Misconfiguration and Abuse

Example: API Token Leakage

If API tokens are embedded in code, attackers can misuse them to query the LLM.

Test Case:

  1. Search for exposed tokens in public repositories: grep -r "api_key" /path/to/project
  2. Exploit leaked tokens to flood the API with requests.

Mitigations

  1. Never hardcode API keys in applications. Use environment variables instead (see the sketch after this list).
  2. Rotate and expire tokens regularly.
  3. Implement usage limits and monitoring.
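As a minimal sketch of the first mitigation, load the key from an environment variable (OPENAI_API_KEY is assumed here) and fail fast if it is missing, so the secret never lands in source control:

import os
from openai import OpenAI

# Read the key from the environment instead of hardcoding it
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; refusing to start.")

client = OpenAI(api_key=api_key)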

5. Real-World Case Studies

Case 1: Chatbot Manipulation

In 2023, researchers bypassed the guardrails of chatbots built on models such as GPT-4 using dual-personality prompts, tricking the models into producing harmful content.

Learning: Always test guardrails with evolving techniques.


Case 2: Data Poisoning in AI Models

Researchers introduced poisoned data into open-source LLMs to demonstrate how biased or malicious outputs could be generated.

Learning: Regularly validate and clean training datasets.


Conclusion

In Part 3, we explored advanced LLM security topics, including:

  1. Model Extraction: How attackers can clone LLMs.
  2. Adversarial Examples: Crafting confusing inputs to exploit weaknesses.
  3. Data Poisoning: Manipulating training datasets.
  4. API Abuse: Misconfigurations and token misuse.
  5. Real-World Case Studies: Practical lessons from past attacks.

By understanding and testing these vulnerabilities, you can build stronger defenses for LLM-powered systems.


Debraj Basak

Security Researcher (Red Teamer) @ Trellix CRTL || OSCP || CRTO || CRTP || LPT Master || CPENT || CEH || AD Exploitation || Reverse Engineer & Malware Analyst || IOT Security || OT/SCADA || iOS & Android PT
