Constitutional AI Safety Layer

Hard
Agents

Implement Constitutional AI safety layer:

Constitutional AI: Train/constraint AI to follow a set of principles (constitution).

Methods:

  1. critique_output(output, context): Check against principles
    • Return violations: [{'principle': ..., 'severity': ..., 'explanation': ...}]
  2. revise_output(output, critiques): Fix violations
  3. self_critique_loop(initial, max_iter): Iterate critique→revise
  4. add_principle(principle, priority): Extend constitution
  5. evaluate_compliance(outputs): Stats on adherence

Constitution Format: List of strings like:

  • 'Be helpful and harmless'
  • 'Respect user privacy'
  • 'Avoid generating harmful content'

Critique Simulation: For testing, use keyword matching against violation patterns

Examples

Example 1:
Input: const = ['Be helpful', 'Be honest']; layer = ConstitutionalAILayer(const); violations = layer.critique_output('I will help you', {}); isinstance(violations, list)
Output: True
Explanation: Critique returns list of violations (empty if compliant)

Starter Code

class ConstitutionalAILayer:
    """
    Safety layer implementing constitutional AI principles.
    """
    
    def __init__(self, constitution):
        self.constitution = constitution  # List of principles
        self.critique_model = None
        self.revision_history = []
    
    def critique_output(self, output, context):
        """
        Critique output against constitutional principles.
        Returns list of violations with severity.
        """
        # Your implementation here
        pass
    
    def revise_output(self, output, critiques):
        """
        Revise output to address critiques.
        Returns revised output.
        """
        # Your implementation here
        pass
    
    def self_critique_loop(self, initial_output, max_iterations=3):
        """
        Iteratively critique and revise until no violations or max iterations.
        """
        # Your implementation here
        pass
    
    def add_principle(self, principle, priority='high'):
        """Add new constitutional principle"""
        # Your implementation here
        pass
    
    def evaluate_compliance(self, outputs):
        """Evaluate compliance rate across multiple outputs"""
        # Your implementation here
        pass
Lines: 1Characters: 0
Ready
The AI Interview - Master AI/ML Interviews