Advanced Failure Recovery with Checkpointing

Hard
Agents

Implement advanced checkpoint-based recovery:

Checkpointing:

  1. checkpoint(agent_id, state, step_number): Save state periodically
    • Only save if step_number % interval == 0
    • Serialize state (use pickle or json)
    • Return checkpoint_id
  2. restore(agent_id, checkpoint_id): Load state
  3. replay_from_checkpoint(agent_id, checkpoint_id, actions): Replay
  4. prune_old_checkpoints(agent_id, keep_last): Cleanup
  5. get_recovery_point(agent_id, target_step): Find nearest checkpoint

Checkpoint Format: {'checkpoint_id': ..., 'step': ..., 'timestamp': ..., 'state': ...}

Recovery Strategy:

  • Find nearest checkpoint before failure
  • Restore state
  • Replay actions from checkpoint to failure point

Examples

Example 1:
Input: cr = CheckpointRecovery(checkpoint_interval=5); cid = cr.checkpoint('agent1', {'data': 'state'}, 5); cid is not None
Output: True
Explanation: Checkpoint saved at step 5 (multiple of interval)

Starter Code

import pickle
import time

class CheckpointRecovery:
    """
    Advanced failure recovery with state checkpointing and replay.
    """
    
    def __init__(self, checkpoint_interval=5):
        self.checkpoint_interval = checkpoint_interval
        self.checkpoints = {}  # agent_id -> list of checkpoints
        self.current_states = {}
    
    def checkpoint(self, agent_id, state, step_number):
        """
        Save checkpoint if step_number % interval == 0
        Returns checkpoint_id or None
        """
        # Your implementation here
        pass
    
    def restore(self, agent_id, checkpoint_id=None):
        """
        Restore agent to checkpoint.
        If checkpoint_id is None, restore to latest.
        """
        # Your implementation here
        pass
    
    def replay_from_checkpoint(self, agent_id, checkpoint_id, actions):
        """
        Replay actions from checkpoint to recover state.
        """
        # Your implementation here
        pass
    
    def prune_old_checkpoints(self, agent_id, keep_last=3):
        """Keep only last N checkpoints to save space"""
        # Your implementation here
        pass
    
    def get_recovery_point(self, agent_id, target_step):
        """Find nearest checkpoint before target_step"""
        # Your implementation here
        pass
Lines: 1Characters: 0
Ready
The AI Interview - Master AI/ML Interviews