Implement Agent Alignment and Value Alignment Checker

Hard
Agents

AI Agent Alignment Checking

Alignment ensures agents behave according to human values and defined constraints.

Task

Build an AlignmentChecker class that evaluates agent outputs across five dimensions:

  1. Helpfulness: Output addresses user's actual need.
  2. Harmlessness: No harmful, dangerous, or offensive content.
  3. Honesty: No false certainty, hallucinated citations, or deception.
  4. Instruction Following: Adheres to user's explicit request.
  5. Scope Adherence: Stays within defined topic/capability boundaries.

Non-Functional Requirements

  • Harmlessness threshold must be ≥ 0.9 (block on violation).
  • Report must include specific issues, not just scores.
  • LLM judge used only when rule-based confidence < 0.7.
  • Full alignment check < 200ms for rule-based path.
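The escalation rule in the third bullet can be sketched as follows. The regex patterns and confidence values here are illustrative assumptions, and llm_judge stands in for any LLM-backed scorer; this is a sketch of the hybrid path, not a complete harm filter:

```python
import re

# Illustrative harm patterns; a real checker would use a much richer set.
HARM_PATTERNS = [r"\bhow to make a bomb\b", r"\bkill\b"]

def rule_based_harmlessness(output: str) -> tuple:
    """Return (score, confidence); confidence reflects how sure the rules are."""
    hits = [p for p in HARM_PATTERNS if re.search(p, output, re.IGNORECASE)]
    if hits:
        return 0.0, 0.95   # clear pattern match: confident violation
    return 1.0, 0.75       # no match: confident enough to skip the judge

def check_harmlessness(output: str, llm_judge=None) -> float:
    score, confidence = rule_based_harmlessness(output)
    if confidence < 0.7 and llm_judge is not None:
        score = llm_judge(output)  # escalate only when rules are unsure
    return score
```

Keeping the LLM judge off the hot path is also what makes the < 200ms rule-based latency budget achievable.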

Constraints

  • should_block = True if ANY harmlessness violation OR overall_score < 0.5.
  • Issues must be actionable strings.
  • Scores weighted: harmlessness × 0.4 + honesty × 0.3 + others × 0.1 each.
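Under these constraints, the scoring and blocking logic reduces to a few lines. This sketch uses plain dicts keyed by dimension name for brevity rather than the AlignmentScore dataclass from the starter code:

```python
# Weights from the constraint above; they sum to 1.0.
WEIGHTS = {
    "harmlessness": 0.4,
    "honesty": 0.3,
    "helpfulness": 0.1,
    "instruction_following": 0.1,
    "scope_adherence": 0.1,
}

def overall_score(scores: dict) -> float:
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

def should_block(scores: dict) -> bool:
    # Any harmlessness score below the 0.9 bar is a violation and blocks,
    # as does a weighted overall score below 0.5.
    return scores["harmlessness"] < 0.9 or overall_score(scores) < 0.5
```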

Examples

Example 1:
Input: checker.check('How do I make pasta?', 'Here is a recipe...', {})
Output: AlignmentReport(overall_score=0.92, should_block=False, violations=[])
Explanation: Safe, helpful, honest response to benign cooking question.

Starter Code

from typing import Dict, List, Any, Optional, Tuple, Callable
from dataclasses import dataclass, field
from enum import Enum

class AlignmentDimension(Enum):
    HELPFULNESS = 'helpfulness'       # Does output help the user?
    HARMLESSNESS = 'harmlessness'     # Does output avoid harm?
    HONESTY = 'honesty'               # Is output truthful?
    INSTRUCTION_FOLLOWING = 'instruction_following'  # Follows user intent?
    SCOPE_ADHERENCE = 'scope_adherence'  # Stays within defined scope?

@dataclass
class AlignmentScore:
    dimension: AlignmentDimension
    score: float  # 0.0 to 1.0
    issues: List[str]
    severity: str  # ok|warning|violation

@dataclass
class AlignmentReport:
    run_id: str
    overall_score: float
    scores: List[AlignmentScore]
    violations: List[str]
    recommended_actions: List[str]
    should_block: bool

class AlignmentChecker:
    def __init__(self, llm_fn: Optional[Callable] = None, rules: Optional[Dict] = None):
        self.llm_fn = llm_fn
        self.rules = rules or {}
        self.violation_log: List[Dict] = []
        self.thresholds = {
            AlignmentDimension.HELPFULNESS: 0.6,
            AlignmentDimension.HARMLESSNESS: 0.9,  # High bar
            AlignmentDimension.HONESTY: 0.8,
            AlignmentDimension.INSTRUCTION_FOLLOWING: 0.7,
            AlignmentDimension.SCOPE_ADHERENCE: 0.85,
        }

    def check(self, user_input: str, agent_output: str, context: Dict) -> AlignmentReport:
        # TODO: Run all alignment checks
        pass

    def _check_helpfulness(self, user_input: str, output: str) -> AlignmentScore:
        # TODO: Verify output addresses the user's actual need
        pass

    def _check_harmlessness(self, output: str) -> AlignmentScore:
        # TODO: Detect harmful content patterns
        pass

    def _check_honesty(self, output: str, context: Dict) -> AlignmentScore:
        # TODO: Detect overconfidence, false certainty, made-up citations
        pass

    def _check_instruction_following(self, user_input: str, output: str) -> AlignmentScore:
        # TODO: Verify output addresses user's actual request
        pass

    def _check_scope_adherence(self, output: str, allowed_scope: List[str]) -> AlignmentScore:
        # TODO: Verify output stays within allowed topics/actions
        pass

    def aggregate(self, scores: List[AlignmentScore]) -> AlignmentReport:
        pass
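As a rough starting point for one of the TODOs, a rule-based honesty check might look like the sketch below. The overconfidence phrases, the citation regex, and the 0.2-per-issue penalty are all illustrative assumptions, not part of the problem spec:

```python
import re

# Assumed indicators: phrases signalling false certainty, and a simple
# "(Author et al., YYYY)" citation pattern checked against known sources.
OVERCONFIDENT_PHRASES = ["definitely", "guaranteed", "100% certain"]
CITATION_RE = re.compile(r"\((?:[A-Z][a-z]+ et al\.,? \d{4})\)")

def honesty_score(output: str, known_sources: set):
    issues = []
    lowered = output.lower()
    for phrase in OVERCONFIDENT_PHRASES:
        if phrase in lowered:
            issues.append(f"Overconfident phrasing: '{phrase}'")
    for cite in CITATION_RE.findall(output):
        if cite not in known_sources:
            issues.append(f"Citation not found in known sources: {cite}")
    # Illustrative penalty: 0.2 per issue, floored at 0.0.
    score = max(0.0, 1.0 - 0.2 * len(issues))
    return score, issues
```

Note that the issue strings are specific and actionable, as the requirements demand, rather than bare numeric scores.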
The AI Interview - Master AI/ML Interviews