Steering Experiment¶
This document describes how to apply steering vectors to language models for text generation using the psyctl steering command.
Overview¶
The steering experiment applies pre-extracted steering vectors to language models during text generation. This influences the model's personality or behavior according to the training data used during vector extraction.
The steering process involves:
- Loading a model and its tokenizer
- Loading steering vectors from a safetensors file
- Registering forward hooks on target layers
- Applying steering vectors during text generation
- Decoding and returning the steered output
Usage¶
CLI Usage¶
Basic Command¶
Apply a steering vector to generate text:
psyctl steering \
--model "google/gemma-2-2b-it" \
--steering-vector "./steering_vector/out.safetensors" \
--input-text "Tell me about yourself"
With Custom Strength¶
Adjust the steering strength multiplier:
psyctl steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--steering-vector "./steering_vector/out.safetensors" \
--input-text "hello world" \
--strength 1.5
Using Orthogonalized Addition¶
Apply steering with the orthogonalized addition method:
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./steering_vector/out.safetensors" \
--input-text "hello" \
--orthogonal \
--strength 2.0
Command-Line Options¶
- --model: Model name or HuggingFace identifier (required)
- --steering-vector: Path to steering vector file (.safetensors) (required)
- --input-text: Input text for generation (required)
- --strength: Steering strength multiplier (default: 1.0)
- --max-tokens: Maximum number of tokens to generate (default: 200)
- --temperature: Sampling temperature, 0 for greedy (default: 1.0)
- --top-p: Top-p (nucleus) sampling parameter (default: 0.9)
- --top-k: Top-k sampling parameter (default: 50)
- --orthogonal: Use orthogonalized addition method
- --verbose: Log full prompt after chat template application
Python Code Usage¶
You can use the SteeringApplier class directly in Python code with flexible input options.
Basic Example (Using model_name)¶
from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier
# Initialize applier
applier = SteeringApplier()
# Apply steering with model_name
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="Tell me about yourself",
    strength=1.5
)
print(result)
Using Persistent Steering (Python API Only - Most Efficient for Multiple Generations)¶
The get_steering_applied_model() method returns a model with steering hooks already attached. This is the most efficient way to generate multiple outputs with the same steering configuration:
from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier
# Initialize applier
applier = SteeringApplier()
# Get model with steering hooks attached
model, tokenizer = applier.get_steering_applied_model(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    strength=2.0,
    orthogonal=True
)
# Use the model multiple times - hooks remain active
test_inputs = ["Hello", "How are you?", "What's your opinion?"]
for prompt in test_inputs:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=False)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Input: {prompt}")
    print(f"Output: {result}\n")
# Remove steering hooks when done
model.remove_steering()
Using Pre-loaded Model (Efficient for Multiple Generations)¶
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from psyctl.core.steering_applier import SteeringApplier
# Load model and tokenizer once
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
# Initialize applier
applier = SteeringApplier()
# Apply steering multiple times with different inputs/strengths
# No need to reload the model each time!
test_inputs = [
    "Hello, how are you?",
    "Tell me about yourself",
    "What is your opinion on AI?"
]
for input_text in test_inputs:
    result = applier.apply_steering(
        model=model,
        tokenizer=tokenizer,
        steering_vector_path=Path("./steering_vector/out.safetensors"),
        input_text=input_text,
        strength=1.5
    )
    print(f"Input: {input_text}")
    print(f"Output: {result}\n")
Experimenting with Different Strengths¶
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from psyctl.core.steering_applier import SteeringApplier
# Load model once
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
applier = SteeringApplier()
input_text = "Hello, how are you?"
# Test different steering strengths efficiently
for strength in [0.5, 1.0, 1.5, 2.0, 2.5]:
    result = applier.apply_steering(
        model=model,
        tokenizer=tokenizer,
        steering_vector_path=Path("./steering_vector/rudeness.safetensors"),
        input_text=input_text,
        strength=strength
    )
    print(f"Strength {strength}: {result}\n")
Using Orthogonalized Addition in Python¶
from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier
applier = SteeringApplier()
# Apply with orthogonalized addition method
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="What is your personality like?",
    strength=2.0,
    orthogonal=True,  # Enable orthogonalized addition
    temperature=0.7
)
print(result)
Using Verbose Logging¶
Enable verbose logging to see the full prompt after chat template application:
from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier
applier = SteeringApplier()
# Enable verbose to log the full formatted prompt
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="Hello",
    strength=1.5,
    verbose=True  # Logs full prompt with chat template
)
Using Per-Layer Strength (Python API Only)¶
Control steering strength individually for each layer:
from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier
applier = SteeringApplier()
# Apply different strengths to different layers
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/multi_layer.safetensors"),
    input_text="Tell me about yourself",
    strength={
        "model.layers[10].mlp.down_proj": 1.0,
        "model.layers[13].mlp.down_proj": 2.5,
        "model.layers[16].mlp.down_proj": 1.5,
        # Layers not specified will use the default strength of 1.0
    }
)
print(result)
You can also use per-layer strength with get_steering_applied_model():
# Get model with per-layer steering
model, tokenizer = applier.get_steering_applied_model(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/multi_layer.safetensors"),
    strength={
        "model.layers[13].mlp.down_proj": 3.0,  # Strong on this layer
        # Other layers use the default 1.0
    },
    orthogonal=True
)
# Generate multiple outputs with this configuration
for prompt in ["Hello", "How are you?"]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
model.remove_steering()
Steering Parameters¶
Strength¶
The strength parameter controls how strongly the steering vector affects the model.
Uniform Strength (float):
Apply the same strength to all layers:
- 0.0: No steering (baseline model behavior)
- 1.0: Default steering strength
- 1.5-2.0: Strong steering (recommended for subtle personalities)
- >2.0: Very strong steering (may produce extreme outputs)
CLI Example:
# Subtle steering
psyctl steering --model "google/gemma-3-270m-it" \
--steering-vector "./vector.safetensors" \
--input-text "What is your opinion?" \
--strength 0.5
# Strong steering
psyctl steering --model "google/gemma-3-270m-it" \
--steering-vector "./vector.safetensors" \
--input-text "What is your opinion?" \
--strength 2.5
Per-Layer Strength (Dict[str, float] - Python API Only):
Control strength for each layer individually:
# Dictionary mapping layer names to strength values
strength = {
    "model.layers[10].mlp.down_proj": 1.0,  # Mild steering
    "model.layers[13].mlp.down_proj": 2.5,  # Strong steering
    "model.layers[16].mlp.down_proj": 1.5,  # Moderate steering
    # Layers not in the dict will use the default strength of 1.0
}
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./vector.safetensors"),
    input_text="What is your opinion?",
    strength=strength
)
Benefits of per-layer strength:
- Fine-grained control over steering behavior
- Can emphasize or de-emphasize specific layers
- Useful for experimenting with layer-specific effects
- Layers not specified in the dict automatically use the default strength (1.0)
Temperature¶
Controls randomness in text generation:
- 0.0: Greedy decoding (deterministic)
- 0.5-0.8: More focused and coherent
- 1.0: Balanced sampling (default)
- >1.0: More creative and diverse
Example:
# Deterministic output
psyctl steering --model "google/gemma-3-270m-it" \
--steering-vector "./vector.safetensors" \
--input-text "hello" \
--temperature 0.0
# Creative output
psyctl steering --model "google/gemma-3-270m-it" \
--steering-vector "./vector.safetensors" \
--input-text "hello" \
--temperature 1.5
Top-p and Top-k¶
Fine-tune sampling behavior:
- --top-p: Nucleus sampling threshold (0.0-1.0)
- --top-k: Number of top tokens to consider
Example:
psyctl steering --model "google/gemma-3-270m-it" \
--steering-vector "./vector.safetensors" \
--input-text "hello" \
--top-p 0.95 \
--top-k 100
Steering Methods¶
Simple Addition (Default)¶
The default method adds the steering vector to the model's activations at token positions after the prompt. This is the standard CAA (Contrastive Activation Addition) method.
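In tensor terms, simple addition shifts the post-prompt activations by strength times the vector. The sketch below uses made-up shapes for illustration and is not psyctl's actual code:

```python
import torch

hidden = torch.zeros(1, 6, 4)              # (batch, seq_len, hidden) activations
vec = torch.tensor([1.0, 0.0, -1.0, 2.0])  # extracted steering vector
strength, prompt_len = 1.5, 3

# CAA-style simple addition: shift only positions after the prompt
hidden[:, prompt_len:, :] += strength * vec
```

Prompt positions are left untouched; every generated position is shifted by the same scaled vector.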
Orthogonalized Addition¶
The --orthogonal flag enables the orthogonalized addition method:
- Calculate projection of output onto steering vector direction
- Remove the existing component along that direction
- Add scaled steering vector
norm_steer = steering_vector / ||steering_vector||
proj = (output · norm_steer) * norm_steer
output[prompt_length:] = (output[prompt_length:] - proj) + strength * steering_vector
This method orthogonalizes the output with respect to the steering direction before applying the steering vector, providing more controlled modification of model behavior.
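The formula above can be checked numerically: after subtracting the projection, the residual has no component along the steering direction. This is an illustrative sketch with arbitrary shapes, not psyctl's implementation:

```python
import torch

torch.manual_seed(0)
output = torch.randn(5, 8)   # activations for 5 generated tokens
steer = torch.randn(8)       # steering vector
strength = 2.0

norm_steer = steer / steer.norm()
# Projection of each token's activation onto the steering direction
proj = (output @ norm_steer).unsqueeze(-1) * norm_steer
result = (output - proj) + strength * steer

# Dot product of the residual with the steering direction: ~0 for every token
residual_dot = (output - proj) @ norm_steer
```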
When to use:
- When steering effects are too strong or unpredictable with simple addition
- When you want more precise control over steering magnitude
- When combining multiple steering vectors, to avoid interference
- When fine-tuning steering strength for subtle personality changes
Example:
psyctl steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--steering-vector "./vector.safetensors" \
--input-text "Describe your personality" \
--orthogonal \
--strength 1.5
Examples¶
Example 1: Extroversion Steering¶
# Extract extroversion steering vector (prerequisite)
psyctl extract.steering \
--model "google/gemma-3-270m-it" \
--layer "model.layers[13].mlp.down_proj" \
--dataset "./dataset/extroversion" \
--output "./vectors/extroversion.safetensors"
# Apply with moderate strength
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/extroversion.safetensors" \
--input-text "Tell me about your weekend plans" \
--strength 1.2
Example 2: Multiple Personalities¶
# Extract multi-layer steering vector
psyctl extract.steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--layers "model.layers[13].mlp.down_proj,model.layers[14].mlp.down_proj" \
--dataset "./dataset/agreeableness" \
--output "./vectors/agreeable_multi.safetensors"
# Apply with orthogonalized addition
psyctl steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--steering-vector "./vectors/agreeable_multi.safetensors" \
--input-text "What do you think about helping others?" \
--orthogonal \
--strength 2.0
Example 3: Comparing Strengths¶
Test different steering strengths on the same input:
# No steering (baseline)
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/neuroticism.safetensors" \
--input-text "I got a bad grade on my test" \
--strength 0.0
# Mild steering
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/neuroticism.safetensors" \
--input-text "I got a bad grade on my test" \
--strength 0.8
# Strong steering
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/neuroticism.safetensors" \
--input-text "I got a bad grade on my test" \
--strength 2.0
Advanced Usage¶
Chat Template Handling¶
The steering command automatically detects and applies chat templates for instruction-tuned models:
# For models with chat templates (Llama, Gemma, etc.)
# Input is automatically formatted as:
# <bos><start_of_turn>user
# Your input text<end_of_turn>
# <start_of_turn>model
psyctl steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--steering-vector "./vectors/out.safetensors" \
--input-text "hello"
For base models without chat templates, the raw input text is used.
Multi-Layer Steering¶
When a steering vector file contains multiple layers, all layers are automatically applied:
# Extract from multiple layers
psyctl extract.steering \
--model "google/gemma-3-270m-it" \
--layers "model.layers[10].mlp.down_proj,model.layers[13].mlp.down_proj,model.layers[16].mlp.down_proj" \
--dataset "./dataset/caa" \
--output "./vectors/multi.safetensors"
# Apply to all layers at once
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/multi.safetensors" \
--input-text "hello" \
--strength 1.5
Greedy vs Sampling¶
For reproducible results, use greedy decoding:
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/out.safetensors" \
--input-text "Tell me a story" \
--temperature 0.0
For creative outputs, use higher temperature with sampling:
psyctl steering \
--model "google/gemma-3-270m-it" \
--steering-vector "./vectors/out.safetensors" \
--input-text "Tell me a story" \
--temperature 1.2 \
--top-p 0.95
Technical Details¶
Hook Implementation¶
The steering mechanism uses PyTorch forward hooks registered on target layers. The hook function:
- Receives the layer output of shape (batch_size, sequence_length, hidden_dim)
- Applies steering only to tokens after the prompt
- Returns the modified output in the same format
Code reference: src/psyctl/core/steering_applier.py:_make_steering_hook()
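A toy version of such a hook, using a plain nn.Linear as a stand-in for the target layer. This is a sketch under simplifying assumptions, not the psyctl implementation:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4, bias=False)  # stand-in for e.g. a down_proj module
vec = torch.ones(4)                  # steering vector
prompt_length = 2

def steering_hook(module, inputs, output):
    # output has shape (batch, seq_len, hidden); steer only past the prompt
    output = output.clone()
    output[:, prompt_length:, :] += vec
    return output

handle = layer.register_forward_hook(steering_hook)
out = layer(torch.zeros(1, 5, 4))    # zero input -> zero baseline output
handle.remove()
```

Because the hook returns a tensor, PyTorch substitutes it for the layer's original output; the prompt positions stay at the baseline while generated positions are shifted by the vector.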
Prompt Length Tracking¶
The system tracks prompt length to ensure steering is only applied to generated tokens, not the input prompt. This prevents distorting the input context.
Special case: Setting prompt_length=0 internally applies steering to all tokens (BiPO-style), though this is not exposed via the CLI.
Memory Management¶
Hooks are automatically cleaned up after generation using try/finally blocks to prevent memory leaks.
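The cleanup pattern can be sketched as follows, with a toy module standing in for the model:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
handle = layer.register_forward_hook(lambda mod, inp, out: out)

try:
    _ = layer(torch.zeros(1, 4))  # generation would happen here
finally:
    handle.remove()               # hook is removed even if generation raises
```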
Troubleshooting¶
Issue: Steering has no effect¶
Solution:
- Increase --strength parameter
- Try --orthogonal flag for orthogonalized addition method
- Verify steering vector was extracted from the same model
- Check that layer paths match between extraction and application
Issue: Output is too extreme¶
Solution:
- Decrease --strength parameter (try 0.5-1.0)
- Use --orthogonal flag for more controlled steering
- Lower --temperature for more focused output
Issue: Model uses too much memory¶
Solution:
- Use a smaller model (e.g., gemma-3-270m-it instead of gemma-3-27b-it)
- Reduce --max-tokens parameter
- Note that the steering process uses use_cache=False, which itself increases memory usage during generation
See Also¶
- Extract Steering Vectors - How to create steering vectors
- Build Steering Dataset - How to prepare training data