Contributing to PSYCTL¶
This guide covers how to extend PSYCTL with new features, particularly adding new steering vector extraction methods.
Development Setup¶
Environment Setup¶
# Development environment setup
& .\scripts\install-dev.ps1
# Virtual environment activation
& .\.venv\Scripts\Activate.ps1
Development Workflow¶
# Format code
& .\scripts\format.ps1
# Run tests with coverage
& .\scripts\test.ps1
# Complete build process (format + lint + test + install)
& .\scripts\build.ps1
Adding New Extraction Methods¶
To implement a new steering vector extraction method, follow these steps:
1. Create Extractor Class¶
Create a new file in src/psyctl/core/extractors/:
# src/psyctl/core/extractors/my_method_extractor.py
from typing import Dict
from pathlib import Path
import torch
from torch import nn
from transformers import AutoTokenizer
from psyctl.core.extractors.base import BaseVectorExtractor
from psyctl.core.logger import get_logger
class MyMethodExtractor(BaseVectorExtractor):
"""
Extract steering vectors using My Custom Method.
Description of your method and algorithm here.
"""
def __init__(self):
self.logger = get_logger("my_method_extractor")
def extract(
self,
model: nn.Module,
tokenizer: AutoTokenizer,
layers: list[str],
dataset_path: Path,
**kwargs
) -> Dict[str, torch.Tensor]:
"""
Extract steering vectors from specified layers.
Args:
model: Loaded language model
tokenizer: Model tokenizer
layers: List of layer paths to extract from
dataset_path: Path to dataset
**kwargs: Method-specific parameters
Returns:
Dictionary mapping layer names to steering vectors
"""
self.logger.info(f"Extracting with MyMethod from {len(layers)} layers")
# Your extraction logic here
steering_vectors = {}
for layer_path in layers:
# 1. Access the layer
# 2. Collect activations
# 3. Compute steering vector
# 4. Store in dictionary
pass
return steering_vectors
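For orientation, here is a minimal sketch of what a mean-difference-style extract() body could look like, built on the LayerAccessor and ActivationHookManager helpers described under Implementation Details below. The build_contrastive_prompts helper is a hypothetical stand-in for your own dataset loading, not part of the PSYCTL API:

```python
from pathlib import Path

import torch
from torch import nn

from psyctl.core.hook_manager import ActivationHookManager
from psyctl.core.layer_accessor import LayerAccessor


def extract_mean_diff_sketch(
    model: nn.Module,
    tokenizer,
    layers: list[str],
    dataset_path: Path,
) -> dict[str, torch.Tensor]:
    """Illustrative only; adapt the numbered steps to your method's algorithm."""
    accessor = LayerAccessor()
    # 1. Access the layers: resolve each path string to the actual module.
    layer_modules = {name: accessor.get_layer(model, name) for name in layers}

    # Hypothetical helper: read the JSONL dataset and return two lists of
    # tokenized batches (positive prompts and neutral prompts).
    positive_batches, neutral_batches = build_contrastive_prompts(tokenizer, dataset_path)

    # 2. Collect activations for one set of prompts via forward hooks.
    def mean_activations(batches):
        hook_manager = ActivationHookManager()
        hook_manager.register_hooks(layer_modules)
        with torch.inference_mode():
            for batch in batches:
                model(**batch)
        means = hook_manager.get_mean_activations()
        hook_manager.remove_all_hooks()
        return means

    positive_means = mean_activations(positive_batches)
    neutral_means = mean_activations(neutral_batches)

    # 3. and 4. Compute and store one steering vector per layer:
    #    mean positive activation minus mean neutral activation.
    return {name: positive_means[name] - neutral_means[name] for name in layers}
```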
2. Register Extractor¶
Update src/psyctl/core/steering_extractor.py to register your method:
from psyctl.core.extractors.my_method_extractor import MyMethodExtractor
class SteeringExtractor:
EXTRACTORS = {
'mean_diff': MeanDifferenceActivationVectorExtractor,
'bipo': BiPOVectorExtractor,
'my_method': MyMethodExtractor, # Add your extractor
}
def extract(self, method: str = 'mean_diff', **kwargs):
extractor_class = self.EXTRACTORS.get(method)
if extractor_class is None:
raise ValueError(f"Unknown extraction method: {method}")
extractor = extractor_class()
return extractor.extract(**kwargs)
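Once registered, the method is selectable by name. A minimal programmatic usage sketch, assuming SteeringExtractor takes no constructor arguments as in the snippet above and that the model and tokenizer are already loaded:

```python
from pathlib import Path

from psyctl.core.steering_extractor import SteeringExtractor

extractor = SteeringExtractor()
vectors = extractor.extract(
    method="my_method",
    model=model,          # already-loaded language model
    tokenizer=tokenizer,  # matching tokenizer
    layers=["model.layers[13].mlp.down_proj"],
    dataset_path=Path("./dataset/caa"),
)
```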
3. Update CLI¶
Add method selection to CLI command in src/psyctl/commands/extract.py:
@click.command()
@click.option("--model", required=True)
@click.option("--layer", multiple=True)
@click.option("--dataset", required=True, type=click.Path())
@click.option("--output", required=True, type=click.Path())
@click.option("--method", default="mean_diff",
help="Extraction method: mean_diff, bipo, my_method")
@click.option("--lr", type=float, default=5e-4, help="Learning rate for BiPO")
@click.option("--beta", type=float, default=0.1, help="Beta parameter for BiPO")
@click.option("--epochs", type=int, default=10, help="Number of epochs for BiPO")
def steering(model: str, layer: tuple, dataset: str, output: str, method: str,
lr: float, beta: float, epochs: int):
# ...
method_params = {}
if method == "bipo":
method_params = {"lr": lr, "beta": beta, "epochs": epochs}
extractor.extract(method=method, **method_params)
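If your method takes its own hyperparameters, mirror the BiPO pattern: declare an option, accept it in the function signature, and forward it through method_params. A sketch using a hypothetical --alpha parameter (not an existing PSYCTL flag):

```python
# Declared alongside the existing options, with `alpha: float` added to the signature:
# @click.option("--alpha", type=float, default=1.0, help="Alpha parameter for my_method")

method_params = {}
if method == "bipo":
    method_params = {"lr": lr, "beta": beta, "epochs": epochs}
elif method == "my_method":
    method_params = {"alpha": alpha}
extractor.extract(method=method, **method_params)
```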
4. Add Tests¶
Create tests in tests/core/extractors/test_my_method_extractor.py:
import pytest
from psyctl.core.extractors.my_method_extractor import MyMethodExtractor
def test_my_method_basic():
extractor = MyMethodExtractor()
# Test basic functionality
pass
def test_my_method_multi_layer():
extractor = MyMethodExtractor()
# Test multi-layer extraction
pass
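Beyond the stubs, the registration from step 2 can be tested without loading any model. A sketch assuming SteeringExtractor can be constructed without arguments, as shown in its snippet:

```python
import pytest

from psyctl.core.extractors.my_method_extractor import MyMethodExtractor
from psyctl.core.steering_extractor import SteeringExtractor


def test_my_method_is_registered():
    # The registry should expose the new extractor under its method name.
    assert SteeringExtractor.EXTRACTORS["my_method"] is MyMethodExtractor


def test_unknown_method_raises():
    # Unknown names should fail fast before any model work happens.
    with pytest.raises(ValueError):
        SteeringExtractor().extract(method="does_not_exist")
```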
5. Document Your Method¶
Add documentation to docs/EXTRACT.STEERING.md under the "Extraction Methods" section:
### MyMethodName
Brief description of the method.
**Algorithm:**
1. Step 1
2. Step 2
3. Step 3
**Key Features:**
- Feature 1
- Feature 2
**When to use:**
- Use case 1
- Use case 2
**Parameters:**
- `param1`: Description
- `param2`: Description
**Example:**
```bash
psyctl extract.steering \
--model "meta-llama/Llama-3.2-3B-Instruct" \
--layer "model.layers[13].mlp.down_proj" \
--dataset "./dataset/caa" \
--output "./steering_vector/out.safetensors" \
--method my_method
```
Implementation Details¶
Layer Access¶
The LayerAccessor class handles dynamic layer access:
from psyctl.core.layer_accessor import LayerAccessor
accessor = LayerAccessor()
layer_module = accessor.get_layer(model, "model.layers[13].mlp.down_proj")
Layer Path Format:
Layer paths use dot notation with bracket indexing:
- model.layers[13].mlp.down_proj - MLP output projection (recommended)
- model.layers[0].self_attn.o_proj - Attention output projection
- model.language_model.layers[10].mlp.act_fn - After activation function
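Conceptually, resolving such a path is plain attribute and index traversal. The supported way is LayerAccessor; the small parser below is only an illustration of what the bracket notation means:

```python
import re

from torch import nn


def resolve_layer_path(model: nn.Module, path: str) -> nn.Module:
    """Resolve e.g. 'model.layers[13].mlp.down_proj' by walking attributes and indices."""
    module = model
    for part in path.split("."):
        match = re.fullmatch(r"(\w+)\[(\d+)\]", part)
        if match:
            # 'layers[13]' -> attribute 'layers', then index 13
            module = getattr(module, match.group(1))[int(match.group(2))]
        else:
            module = getattr(module, part)
    return module
```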
Activation Collection¶
The ActivationHookManager manages forward hooks for collecting activations:
from psyctl.core.hook_manager import ActivationHookManager
hook_manager = ActivationHookManager()
layer_modules = {"layer_13": model.model.layers[13].mlp.down_proj}
hook_manager.register_hooks(layer_modules)
# Run inference
with torch.inference_mode():
outputs = model(**inputs)
# Get collected activations
activations = hook_manager.get_mean_activations()
hook_manager.remove_all_hooks()
Dataset Format¶
Steering datasets are JSONL files with this structure:
{
"question": "[Situation]\n...\n[Question]\n...\n1. Answer option 1\n2. Answer option 2\n[Answer]",
"positive": "(1",
"neutral": "(2",
"positive_text": "Full text of personality answer...",
"neutral_text": "Full text of neutral answer..."
}
Version 2+ datasets include positive_text and neutral_text fields for full answer content.
The loader automatically combines question with positive/neutral to create full prompts.
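As an illustration of that combination step, here is a minimal sketch of reading the JSONL and building the contrastive prompt pair for each entry. It mirrors the behavior described above but is not the actual PSYCTL loader:

```python
import json
from pathlib import Path


def load_contrastive_prompts(dataset_path: Path) -> list[tuple[str, str]]:
    """Return (positive_prompt, neutral_prompt) pairs from a steering JSONL dataset."""
    pairs = []
    with dataset_path.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            # Prefer the full-text fields when present (version 2+ datasets),
            # otherwise fall back to the short answer markers.
            positive = entry.get("positive_text", entry["positive"])
            neutral = entry.get("neutral_text", entry["neutral"])
            pairs.append((entry["question"] + positive, entry["question"] + neutral))
    return pairs
```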
Output Format¶
Steering vectors should be saved in safetensors format with embedded metadata:
# File structure
{
"model.layers[13].mlp.down_proj": torch.Tensor, # First layer's steering vector
"model.layers[14].mlp.down_proj": torch.Tensor, # Second layer's steering vector
# ... more layers
"__metadata__": {
"model": "meta-llama/Llama-3.2-3B-Instruct",
"method": "mean_diff", # or "bipo"
"layers": ["model.layers[13].mlp.down_proj", "model.layers[14].mlp.down_proj"],
"dataset_path": "./dataset/caa",
"dataset_samples": 20000,
"num_layers": 2,
"normalized": false
}
}
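Saving vectors: a minimal sketch using the standard safetensors API. A safetensors header only accepts a flat string-to-string metadata mapping, so structured fields are serialized as strings here; the exact serialization PSYCTL uses is an assumption:

```python
import json

import torch
from safetensors.torch import save_file

steering_vectors = {
    "model.layers[13].mlp.down_proj": torch.randn(3072),
    "model.layers[14].mlp.down_proj": torch.randn(3072),
}

metadata = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "method": "my_method",
    "layers": json.dumps(list(steering_vectors.keys())),
    "num_layers": str(len(steering_vectors)),
    "normalized": "false",
}

# Tensors go in the file body, metadata goes in the safetensors header.
save_file(steering_vectors, "steering_vector.safetensors", metadata=metadata)
```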
Loading vectors:
from safetensors import safe_open
from safetensors.torch import load_file
# Per-layer tensors
data = load_file("steering_vector.safetensors")
layer_13_vector = data["model.layers[13].mlp.down_proj"]
# Metadata lives in the safetensors header and is read separately
with safe_open("steering_vector.safetensors", framework="pt") as f:
    metadata = f.metadata()
Testing Guidelines¶
Local Testing Standards¶
- Always use the "gemma-3-270m-it" model for local testing
- Output files go to ./results folder
- Hugging Face cache goes to ./temp folder
- No emoji usage in code or output
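One way to wire these conventions into the test suite is a shared conftest. A sketch assuming the standard HF_HOME environment variable controls the Hugging Face cache location; the fixture name and exact model id are illustrative:

```python
# tests/conftest.py (illustrative)
import os
from pathlib import Path

import pytest

LOCAL_TEST_MODEL = "google/gemma-3-270m-it"  # assumed Hugging Face id for the local test model

# Keep Hugging Face downloads inside ./temp, per the local standards above.
os.environ.setdefault("HF_HOME", str(Path("./temp").resolve()))


@pytest.fixture
def results_dir() -> Path:
    """Directory for test output files (./results)."""
    out = Path("./results")
    out.mkdir(parents=True, exist_ok=True)
    return out
```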
Test Coverage¶
- Unit tests for all public methods
- Integration tests for end-to-end workflows
- Mock-based testing for external dependencies (HuggingFace models)
- Coverage target: >80%
Code Quality¶
- Google-style docstrings for all public functions
- Type hints required
- Snake_case for functions/variables, PascalCase for classes
- Import organization with isort
- Code formatting: Black + isort
- Linting: flake8 + mypy
Git Workflow¶
- Feature branches: feature/issue-number-description
- Bug fixes: fix/issue-number-description
- Semantic versioning (MAJOR.MINOR.PATCH)
Common Implementation Patterns¶
Memory-Efficient Activation Collection¶
Use incremental mean computation for large datasets:
mean_activation = None
count = 0
for batch in dataset:
activations = get_activations(batch)
if mean_activation is None:
mean_activation = torch.zeros_like(activations[0])
for act in activations:
count += 1
mean_activation += (act - mean_activation) / count
Batch Processing¶
Process data in batches for GPU efficiency:
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
for batch in dataloader:
with torch.inference_mode():
outputs = model(**batch)
# Collect activations
Checkpoint Support¶
Save intermediate results for long-running operations:
if checkpoint_interval and (idx + 1) % checkpoint_interval == 0:
checkpoint_path = output_path.with_suffix('.checkpoint')
save_file(intermediate_vectors, checkpoint_path)
self.logger.info(f"Checkpoint saved: {checkpoint_path}")
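The natural counterpart is resuming: at startup, check for an existing checkpoint and reload whatever was already computed. This resume logic is a sketch, not an existing PSYCTL feature:

```python
from pathlib import Path

from safetensors.torch import load_file

output_path = Path("./results/out.safetensors")
checkpoint_path = output_path.with_suffix(".checkpoint")

# Resume from a previous run if a checkpoint exists.
intermediate_vectors = {}
if checkpoint_path.exists():
    intermediate_vectors = load_file(checkpoint_path)
```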