AWS EC2 Instance Recovery Guide: When SSH and Serial Console Fail

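This guide shows how to recover access to an EC2 instance when both SSH and the EC2 Serial Console are unavailable. The approach is to detach the instance's root volume, mount it on a second "rescue" instance, repair the configuration there, and then reattach it. We'll use a Proxmox instance as an example, but the method works for any Linux-based EC2 instance.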

Prerequisites

AWS Console access with permission to manage EC2 instances, volumes, and snapshots
Basic understanding of Linux commands
A working EC2 instance to use as a rescue system (Amazon Linux 2 recommended)

Recovery Steps

1. Create a Snapshot of the Affected Volume

Navigate to the EC2 Dashboard
Go to Volumes and select the volume of the affected instance
Actions → Create Snapshot
Add a descriptive name such as "Pre-rescue-backup-[date]"
Click Create Snapshot
Wait for the snapshot to reach 100% completion
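
The same snapshot can be taken from the command line. This is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that may create snapshots. The function name snapshot_volume and the volume ID in the usage line are placeholders, not part of the original procedure.

```shell
# Hypothetical helper: snapshot a volume and block until the snapshot completes.
# Usage: snapshot_volume vol-0123456789abcdef0
snapshot_volume() {
    local volume_id="$1"
    local description snapshot_id
    description="Pre-rescue-backup-$(date +%Y-%m-%d)"

    # create-snapshot returns immediately; capture the new snapshot ID.
    snapshot_id=$(aws ec2 create-snapshot \
        --volume-id "$volume_id" \
        --description "$description" \
        --query 'SnapshotId' --output text) || return 1

    # Poll until the snapshot state is "completed".
    aws ec2 wait snapshot-completed --snapshot-ids "$snapshot_id" || return 1

    echo "$snapshot_id"
}
```

The waiter stops polling after a fixed number of attempts, so for a large volume the wait command may need to be run again.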

2. Stop the Affected Instance

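Navigate to the EC2 Dashboard
Select the affected instance
Actions → Instance State → Stop
Wait until the instance state shows "stopped". Note that stopping an instance discards data on instance store volumes and releases its public IPv4 address unless an Elastic IP is attached.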
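
For reference, a command-line sketch of the stop step, assuming a configured AWS CLI. The function name stop_and_wait and the instance ID are illustrative placeholders.

```shell
# Hypothetical helper: stop an instance, wait for it, and print its final state.
# Usage: stop_and_wait i-0123456789abcdef0
stop_and_wait() {
    local instance_id="$1"

    aws ec2 stop-instances --instance-ids "$instance_id" >/dev/null || return 1

    # Blocks until the instance reports the "stopped" state.
    aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1

    # Print the state so it can be confirmed before detaching anything.
    aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].State.Name' --output text
}
```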

3. Detach the Root Volume

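Select the stopped instance
Open the 'Storage' tab
Note the volume ID and the device name of the root volume (e.g., /dev/sda1 or /dev/xvda); the device name is needed to reattach the volume later
Select the volume, then Actions → Detach Volume
Confirm the detach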
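
The lookup and detach can also be scripted. The sketch below assumes a configured AWS CLI; detach_root_volume is a hypothetical helper name. It prints the volume ID and root device name so both can be recorded for the later reattach.

```shell
# Hypothetical helper: find the root volume of a stopped instance and detach it.
# Prints "<volume-id> <root-device-name>" on success.
# Usage: detach_root_volume i-0123456789abcdef0
detach_root_volume() {
    local instance_id="$1"
    local root_device volume_id

    # Root device name as registered on the instance, e.g. /dev/sda1 or /dev/xvda.
    root_device=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].RootDeviceName' \
        --output text) || return 1

    # The volume attached to this instance at that device name.
    volume_id=$(aws ec2 describe-volumes \
        --filters "Name=attachment.instance-id,Values=$instance_id" \
                  "Name=attachment.device,Values=$root_device" \
        --query 'Volumes[0].VolumeId' --output text) || return 1

    # With --output text an empty result is printed as "None".
    [ "$volume_id" != "None" ] || return 1

    aws ec2 detach-volume --volume-id "$volume_id" >/dev/null || return 1
    aws ec2 wait volume-available --volume-ids "$volume_id" || return 1

    echo "$volume_id $root_device"
}
```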

4. Launch a Rescue Instance

Launch a new EC2 instance
Use an Amazon Linux 2 AMI
Place it in the same Availability Zone as the affected volume (an EBS volume can only be attached to instances in its own zone)
Configure the security group to allow SSH access
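
Because the volume and the rescue instance must share an Availability Zone, it can help to compare them before attempting the attach. A small sketch, assuming a configured AWS CLI; same_az is a hypothetical helper and both IDs are placeholders.

```shell
# Hypothetical helper: succeed only if a volume and an instance share a zone.
# Usage: same_az vol-0123456789abcdef0 i-0123456789abcdef0
same_az() {
    local volume_id="$1" instance_id="$2"
    local vol_az inst_az

    vol_az=$(aws ec2 describe-volumes --volume-ids "$volume_id" \
        --query 'Volumes[0].AvailabilityZone' --output text) || return 2

    inst_az=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' \
        --output text) || return 2

    echo "volume: $vol_az, instance: $inst_az"

    # Exit status 0 when the zones match, 1 when they differ.
    [ "$vol_az" = "$inst_az" ]
}
```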

5. Attach Problem Volume to Rescue Instance

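Select the detached volume
Actions → Attach Volume
Select the rescue instance
Note the device name (e.g., /dev/sdb or /dev/xvdb); on Nitro-based instance types the volume appears inside the OS as an NVMe device such as /dev/nvme1n1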
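
A command-line sketch of the attach step, assuming a configured AWS CLI. The helper name attach_to_rescue and the default device name /dev/sdf are assumptions chosen for illustration; any free device name on the rescue instance should work.

```shell
# Hypothetical helper: attach a detached volume to the rescue instance
# as a secondary (non-root) disk and wait until the attachment is active.
# Usage: attach_to_rescue vol-0123456789abcdef0 i-0fedcba9876543210 [/dev/sdf]
attach_to_rescue() {
    local volume_id="$1" rescue_id="$2"
    local device="${3:-/dev/sdf}"

    aws ec2 attach-volume \
        --volume-id "$volume_id" \
        --instance-id "$rescue_id" \
        --device "$device" >/dev/null || return 1

    # Blocks until the volume state is "in-use".
    aws ec2 wait volume-in-use --volume-ids "$volume_id"
}
```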

6. Access and Mount the Volume

# Connect to rescue instance
ssh -i your-key.pem ec2-user@rescue-instance-ip

# List available disks to find attached volume
sudo fdisk -l
# or
lsblk

# Create mount point
sudo mkdir -p /mnt/rescue

# Mount the root partition
sudo mount /dev/xvdb1 /mnt/rescue  # Adjust device name as needed (e.g., /dev/nvme1n1p1 on Nitro instances)
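
One case where the plain mount command fails: if the affected volume is XFS and was built from the same image as the rescue instance, both filesystems carry the same UUID and XFS refuses the second mount. The sketch below, with the hypothetical helper name mount_rescue, checks the filesystem type and adds the nouuid option only for XFS.

```shell
# Hypothetical helper: mount a partition from the affected volume,
# adding "nouuid" when the filesystem is XFS.
# Usage: mount_rescue /dev/xvdb1 [/mnt/rescue]
mount_rescue() {
    local part="$1"
    local target="${2:-/mnt/rescue}"
    local fstype opts

    # Filesystem type of the partition, e.g. ext4 or xfs.
    fstype=$(lsblk -no FSTYPE "$part")

    opts="rw"
    if [ "$fstype" = "xfs" ]; then
        # Allow mounting even if the UUID duplicates an already mounted one.
        opts="rw,nouuid"
    fi

    sudo mkdir -p "$target"
    sudo mount -o "$opts" "$part" "$target"
}
```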

7. Troubleshoot and Fix Issues

Common File Locations

# Network Configuration
sudo nano /mnt/rescue/etc/network/interfaces    # Debian/Ubuntu/Proxmox
sudo nano /mnt/rescue/etc/sysconfig/network-scripts/ifcfg-eth0  # RHEL/CentOS

# SSH Configuration
sudo nano /mnt/rescue/etc/ssh/sshd_config

# System Logs
sudo less /mnt/rescue/var/log/syslog    # Debian/Ubuntu
sudo less /mnt/rescue/var/log/messages  # RHEL/CentOS
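
When reviewing the SSH configuration, it is often quicker to print only the active settings that affect logins instead of paging through the whole file. A minimal sketch; the helper name show_sshd_settings and the list of keywords are illustrative choices, not an exhaustive set.

```shell
# Hypothetical helper: print uncommented login-related sshd settings
# from the mounted volume.
# Usage: show_sshd_settings [/mnt/rescue]
show_sshd_settings() {
    local root="${1:-/mnt/rescue}"

    # Match the keyword at the start of a line, ignoring case and leading
    # whitespace; commented-out lines begin with "#" and are skipped.
    grep -Ei '^[[:space:]]*(Port|PermitRootLogin|PasswordAuthentication|PubkeyAuthentication|AllowUsers|AuthorizedKeysFile)[[:space:]]' \
        "$root/etc/ssh/sshd_config"
}
```

Some distributions may also read drop-in files from /etc/ssh/sshd_config.d/, so that directory on the mounted volume is worth a look as well.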

Example: Fixing Proxmox Network Configuration

# View current network config
sudo cat /mnt/rescue/etc/network/interfaces

# Edit if needed
sudo nano /mnt/rescue/etc/network/interfaces

# Example of a working basic config (contents of /etc/network/interfaces):
auto lo
iface lo inet loopback

iface eth0 inet manual

auto vmbr0
iface vmbr0 inet dhcp
        bridge-ports eth0
        bridge-stp off
        bridge-fd 0

8. Cleanup and Restore

# Unmount volume
cd ~  # Ensure you're not in mounted directory
sudo umount /mnt/rescue

After unmounting:

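Detach the volume from the rescue instance in the AWS Console
Reattach it to the original instance as the root volume, using the instance's root device name (shown in the instance details, e.g., /dev/sda1 or /dev/xvda)
Start the original instance
Test connectivity over SSH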
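
The restore sequence can be sketched with the AWS CLI as follows, assuming the volume is already unmounted and the CLI is configured. The helper name restore_root_volume is hypothetical, and all four arguments are placeholders to fill in with your own IDs and root device name.

```shell
# Hypothetical helper: move the repaired volume back and boot the instance.
# Usage: restore_root_volume <volume-id> <rescue-id> <original-id> <root-device>
restore_root_volume() {
    local volume_id="$1" rescue_id="$2" original_id="$3" root_device="$4"

    # Detach from the rescue instance and wait until the volume is free.
    aws ec2 detach-volume --volume-id "$volume_id" \
        --instance-id "$rescue_id" >/dev/null || return 1
    aws ec2 wait volume-available --volume-ids "$volume_id" || return 1

    # Reattach at the original root device name so the instance boots from it.
    aws ec2 attach-volume --volume-id "$volume_id" \
        --instance-id "$original_id" --device "$root_device" >/dev/null || return 1
    aws ec2 wait volume-in-use --volume-ids "$volume_id" || return 1

    # Start the instance and wait for its status checks to pass.
    aws ec2 start-instances --instance-ids "$original_id" >/dev/null || return 1
    aws ec2 wait instance-status-ok --instance-ids "$original_id"
}
```

Passing status checks only shows that the instance booted and is reachable at the network level; an SSH login is still the real test.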

Common Issues and Solutions

Network Configuration Issues

Check for correct interface names (eth0, ens5, etc.)
Verify gateway configuration
Ensure no conflicting network bridges
Check for valid IP addressing
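
To check interface names quickly, the sketch below lists every interface that a Debian-style interfaces file refers to, including bridge ports. The helper name list_configured_ifaces is hypothetical. The names it prints have to match what the affected instance actually exposes at boot, which on some instance types may be a name like ens5 instead of eth0.

```shell
# Hypothetical helper: list interface names referenced in an
# /etc/network/interfaces file (iface stanzas and bridge ports).
# Usage: list_configured_ifaces [/mnt/rescue/etc/network/interfaces]
list_configured_ifaces() {
    local file="${1:-/mnt/rescue/etc/network/interfaces}"

    awk '
        # "iface <name> inet ..." declares an interface.
        $1 == "iface" { print $2 }
        # "bridge-ports a b c" lists the physical ports of a bridge.
        $1 == "bridge-ports" || $1 == "bridge_ports" {
            for (i = 2; i <= NF; i++) if ($i != "none") print $i
        }
    ' "$file" | sort -u
}
```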

Boot Issues

Check that the /boot partition isn't full
Verify that fstab entries are correct (an entry for a missing device can stall the boot unless it is marked nofail)
Check for kernel issues in the grub configuration
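
One way to verify fstab from the rescue instance is to confirm that every UUID it references exists on an attached disk. This is a sketch under the assumption that entries use the UUID= form; check_fstab_uuids is a hypothetical helper, and it should be run as root so that blkid can probe the devices.

```shell
# Hypothetical helper: report fstab UUIDs that no attached device carries.
# Returns 0 if all UUIDs resolve, 1 otherwise.
# Usage: check_fstab_uuids [/mnt/rescue/etc/fstab]
check_fstab_uuids() {
    local fstab="${1:-/mnt/rescue/etc/fstab}"
    local status=0 uuid dev

    # Take field 1 of each line and keep only UUID=... entries;
    # comment lines start with "#" and never match.
    for uuid in $(awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab"); do
        # blkid -U prints the device holding that UUID, or fails if none does.
        if dev=$(blkid -U "$uuid"); then
            echo "ok      $uuid -> $dev"
        else
            echo "MISSING $uuid"
            status=1
        fi
    done

    return "$status"
}
```

Free space on the boot partition can be checked with df -h /mnt/rescue/boot once the volume is mounted.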

Permission Issues

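Verify SSH key permissions (by convention ~/.ssh is mode 700 and authorized_keys is mode 600, both owned by the login user)
Check SELinux/AppArmor settings
Validate the root access configuration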
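
The SSH permission check can be run against the mounted volume with a short script. This sketch assumes the conventional modes 700 for .ssh and 600 for authorized_keys and relies on GNU stat, which Amazon Linux provides. The helper name check_ssh_perms and the example home directory are placeholders.

```shell
# Hypothetical helper: compare .ssh permissions under a home directory
# on the mounted volume against the conventional modes.
# Usage: check_ssh_perms /mnt/rescue/root
check_ssh_perms() {
    local home_dir="$1"
    local status=0 entry want path have

    for entry in "700 $home_dir/.ssh" "600 $home_dir/.ssh/authorized_keys"; do
        want=${entry%% *}   # expected octal mode
        path=${entry#* }    # file or directory to inspect

        # %a prints the permission bits in octal, e.g. 700.
        have=$(stat -c '%a' "$path" 2>/dev/null) || have="missing"

        if [ "$have" = "$want" ]; then
            echo "ok  $path"
        else
            echo "FIX $path: mode $have, expected $want"
            status=1
        fi
    done

    return "$status"
}
```

A reported mismatch can be corrected with sudo chmod on the path shown. Ownership is worth checking too with ls -ln, since user names on the rescue instance may not map to the same numeric IDs as on the affected system.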

Prevention Tips

Always maintain a snapshot of a working system
Document the working network configuration
Use AWS Systems Manager Session Manager as a backup access method
Keep serial console access enabled
Document network changes before implementing them
Test changes in a staging environment first

Remember: Always maintain current backups and document your system configuration to make recovery easier when needed.