AI Agent Alignment Checking
Alignment ensures agents behave according to human values and defined constraints.
Task
Build AlignmentChecker that evaluates agent outputs across 5 dimensions:
- Helpfulness: Output addresses user's actual need.
- Harmlessness: No harmful, dangerous, or offensive content.
- Honesty: No false certainty, hallucinated citations, or deception.
- Instruction Following: Adheres to user's explicit request.
- Scope Adherence: Stays within defined topic/capability boundaries.
Non-Functional Requirements
- Harmlessness threshold must be ≥ 0.9 (block on violation).
- Report must include specific issues, not just scores.
- LLM judge used only when rule-based confidence < 0.7.
- Full alignment check < 200ms for rule-based path.
Constraints
should_block = Trueif ANY harmlessness violation OR overall_score < 0.5.- Issues must be actionable strings.
- Scores weighted: harmlessness × 0.4 + honesty × 0.3 + others × 0.1 each.
Examples
Example 1:
Input:
checker.check('How do I make pasta?', 'Here is a recipe...', {})Output:
AlignmentReport(overall_score=0.92, should_block=False, violations=[])Explanation: Safe, helpful, honest response to benign cooking question.
Starter Code
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, field
from enum import Enum
class AlignmentDimension(Enum):
HELPFULNESS = 'helpfulness' # Does output help the user?
HARMLESSNESS = 'harmlessness' # Does output avoid harm?
HONESTY = 'honesty' # Is output truthful?
INSTRUCTION_FOLLOWING = 'instruction_following' # Follows user intent?
SCOPE_ADHERENCE = 'scope_adherence' # Stays within defined scope?
@dataclass
class AlignmentScore:
dimension: AlignmentDimension
score: float # 0.0 to 1.0
issues: List[str]
severity: str # ok|warning|violation
@dataclass
class AlignmentReport:
run_id: str
overall_score: float
scores: List[AlignmentScore]
violations: List[str]
recommended_actions: List[str]
should_block: bool
class AlignmentChecker:
def __init__(self, llm_fn: callable = None, rules: Dict = None):
self.llm_fn = llm_fn
self.rules = rules or {}
self.violation_log: List[Dict] = []
self.thresholds = {
AlignmentDimension.HELPFULNESS: 0.6,
AlignmentDimension.HARMLESSNESS: 0.9, # High bar
AlignmentDimension.HONESTY: 0.8,
AlignmentDimension.INSTRUCTION_FOLLOWING: 0.7,
AlignmentDimension.SCOPE_ADHERENCE: 0.85,
}
def check(self, user_input: str, agent_output: str, context: Dict) -> AlignmentReport:
# TODO: Run all alignment checks
pass
def _check_harmlessness(self, output: str) -> AlignmentScore:
# TODO: Detect harmful content patterns
pass
def _check_honesty(self, output: str, context: Dict) -> AlignmentScore:
# TODO: Detect overconfidence, false certainty, made-up citations
pass
def _check_instruction_following(self, user_input: str, output: str) -> AlignmentScore:
# TODO: Verify output addresses user's actual request
pass
def _check_scope_adherence(self, output: str, allowed_scope: List[str]) -> AlignmentScore:
# TODO: Verify output stays within allowed topics/actions
pass
def aggregate(self, scores: List[AlignmentScore]) -> AlignmentReport:
pass
Python3
ReadyLines: 1Characters: 0
Ready