How RecoveryPulse Works

RecoveryPulse monitors your websites and attempts to recover them automatically when it detects a problem. The process has three stages: continuous monitoring, incident detection, and automated recovery.

1. Continuous Monitoring

RecoveryPulse checks your websites at configurable intervals (default: 60 seconds). Each check verifies:

  • HTTP Status: Ensures the expected status code (usually 200) is returned
  • Response Time: Measures how long the response takes
  • SSL Certificate: Validates your SSL certificate and checks expiry
  • Content Match: Optionally verifies specific text appears on the page
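
The four checks above can be pictured as one function that turns a single probe's observations into a list of failure reasons. This is only an illustrative sketch, not RecoveryPulse's actual code: the CheckResult structure, the evaluate_check name, and the response-time and certificate thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class CheckResult:
    """Observations from one probe of a site (hypothetical structure)."""
    status_code: int
    response_seconds: float
    cert_expires_at: datetime  # timezone-aware expiry of the SSL certificate
    body: str


def evaluate_check(
    result: CheckResult,
    expected_status: int = 200,
    max_response_seconds: float = 5.0,    # assumed threshold, not a documented default
    min_cert_days_left: int = 14,         # assumed threshold, not a documented default
    required_text: Optional[str] = None,  # content match is optional
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the reasons a check failed; an empty list means it passed."""
    now = now or datetime.now(timezone.utc)
    failures: List[str] = []
    if result.status_code != expected_status:
        failures.append(f"status {result.status_code}, expected {expected_status}")
    if result.response_seconds > max_response_seconds:
        failures.append(f"response took {result.response_seconds:.2f}s")
    if result.cert_expires_at - now < timedelta(days=min_cert_days_left):
        failures.append("SSL certificate has expired or expires soon")
    if required_text is not None and required_text not in result.body:
        failures.append("required text not found on page")
    return failures
```

Returning every failure reason, rather than stopping at the first, keeps enough detail to describe an incident fully if one is later opened.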

2. Incident Detection

When a check fails, RecoveryPulse waits for a second failure to confirm the issue isn't transient. After two consecutive failures:

  • An incident is created with full details
  • The site status is marked as "down"
  • Notifications are sent (if configured)
  • Auto-recovery begins (if enabled)
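
The two-failure confirmation rule can be sketched as a small counter. The FailureConfirmer class below is hypothetical and only illustrates the idea. In particular, treating a single passing check as closing the incident is an assumption of this sketch, not documented behavior.

```python
class FailureConfirmer:
    """Opens an incident only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.incident_open = False

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True only when a new incident opens."""
        if check_passed:
            # Assumption: one passing check resets the counter and closes the incident.
            self.consecutive_failures = 0
            self.incident_open = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.incident_open:
            self.incident_open = True
            return True
        return False
```

A single failed check followed by a passing one never opens an incident, which is the point of waiting for confirmation: transient glitches do not trigger notifications or recovery.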

3. Automated Recovery

RecoveryPulse connects to your server via SSH and executes recovery actions in order. After each action, it checks whether the site is back online before proceeding to the next.

Typical Recovery Flow:
  1. Restart application service
  2. Wait 30 seconds, check site
  3. If still down: restart web server
  4. Wait 30 seconds, check site
  5. If still down: restart database
  6. Continue until recovered or max attempts reached
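
The flow above amounts to a loop: run an action, wait, check the site, and stop when the site is up or the attempt limit is reached. The sketch below is a minimal illustration under stated assumptions. The run_recovery function and its parameters are hypothetical, the callables stand in for the real SSH execution and site check, and "max attempts" is interpreted here as the total number of actions executed.

```python
import time
from typing import Callable, List, Sequence, Tuple


def run_recovery(
    actions: Sequence[str],
    run_action: Callable[[str], None],   # stand-in for executing an action over SSH
    site_is_up: Callable[[], bool],      # stand-in for a monitoring check
    wait_seconds: float = 30.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, List[str]]:
    """Run actions in order until the site recovers or the limit is hit.

    Returns (recovered, actions_attempted).
    """
    attempted: List[str] = []
    for action in actions[:max_attempts]:
        run_action(action)
        attempted.append(action)
        sleep(wait_seconds)          # give the service time to restart
        if site_is_up():
            return True, attempted   # recovered: skip the remaining actions
    return False, attempted          # limit reached or actions exhausted
```

Passing the sleep function as a parameter is a convenience of the sketch so that the loop can be exercised without real 30-second waits.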

Available Recovery Actions

Action                 Description                                    When to Use
---------------------  ---------------------------------------------  ----------------------------
restart_app            Restarts your application service via systemd  First action for most issues
restart_nginx          Restarts the nginx web server                  502/504 errors, proxy issues
restart_apache         Restarts Apache web server                     Apache-based setups
restart_mysql          Restarts MySQL database                        Database connection errors
restart_postgresql     Restarts PostgreSQL database                   PostgreSQL setups
restart_php_fpm        Restarts PHP-FPM service                       PHP applications
clear_nginx_cache      Clears nginx cache and reloads                 Stale cache issues
rollback_nginx_config  Restores nginx.conf from backup                After config changes
reboot_server          Full server reboot                             Last resort
custom_script          Run any custom command                         Special recovery needs

Best Practices

Rule Order

Start with the least disruptive actions (app restart) and escalate to more drastic measures (server reboot) only if needed.
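
One way to sanity-check rule order is to give each action a rough disruptiveness rank and flag any place where a more disruptive action comes before a less disruptive one. The ranks and the out_of_order helper below are assumptions made for illustration and are not part of RecoveryPulse. custom_script is left unranked because its impact depends on the command it runs.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed ranks (lower = less disruptive); illustrative only.
DISRUPTION_RANK: Dict[str, int] = {
    "restart_app": 1,
    "restart_php_fpm": 1,
    "clear_nginx_cache": 1,
    "restart_nginx": 2,
    "restart_apache": 2,
    "rollback_nginx_config": 2,
    "restart_mysql": 3,
    "restart_postgresql": 3,
    "reboot_server": 4,
}


def out_of_order(actions: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (earlier, later) pairs where the earlier action is more disruptive."""
    # Unranked actions such as custom_script are skipped.
    ranked = [a for a in actions if a in DISRUPTION_RANK]
    return [
        (earlier, later)
        for earlier, later in zip(ranked, ranked[1:])
        if DISRUPTION_RANK[earlier] > DISRUPTION_RANK[later]
    ]
```

For example, a plan that lists reboot_server before restart_app would be flagged, while restart_app, restart_nginx, reboot_server would pass.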

Wait Times

Give services enough time to fully restart before checking. 30 seconds is sufficient for most services.

Max Attempts

Set a limit of 5 to 10 attempts to prevent infinite recovery loops. Some issues cannot be fixed automatically and need manual intervention.

SSH Security

Use a dedicated SSH key for RecoveryPulse with limited sudo permissions for only the commands you need.