Thinkcentre Watchdog

A Docker-based monitoring solution for detecting and auto-rebooting hung Kubernetes machines via Home Assistant integration.

Overview

This watchdog polls a target service URL and treats a 502 Bad Gateway response as the sign of a hung machine. When the service starts returning 502:

  1. A 5-minute grace period begins (allowing for deployment recoveries)
  2. If the service recovers within 5 minutes, the error is cleared (normal deployment scenario)
  3. If still failing after 5 minutes, an automatic power-cycle is triggered via Home Assistant
  4. The machine powers off for 10 seconds, then powers back on
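
The power-cycle in steps 3 and 4 could be sketched with Home Assistant's REST API, which exposes services under /api/services/<domain>/<service>. This is a minimal sketch, not necessarily what thinkcenter_monitor.sh does: the function names ha_switch and power_cycle and the OFF_SECONDS variable are hypothetical, while HA_URL, HA_TOKEN, and HA_ENTITY are the values from .env.

```shell
# Hypothetical helpers; names are illustrative, not taken from thinkcenter_monitor.sh.

# Call switch.turn_on or switch.turn_off for HA_ENTITY.
# Succeeds only if Home Assistant answers with HTTP 200.
ha_switch() {
  local service="$1" code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -X POST \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"${HA_ENTITY}\"}" \
    "${HA_URL}/api/services/switch/${service}") || code=000
  [ "$code" = "200" ]
}

# Power the machine off, wait, then power it back on.
power_cycle() {
  local off_seconds="${OFF_SECONDS:-10}"   # assumed variable; the README states 10 seconds
  ha_switch turn_off || return 1
  sleep "$off_seconds"
  ha_switch turn_on || return 1
}
```

Calling power_cycle with the .env values exported would switch the entity off and back on. If the first call fails, the function returns early without attempting the second one.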

All activity is logged with timestamps for monitoring and troubleshooting.
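
A timestamped logger of the kind described here can be very small. The sketch below is an assumption about the shape of such a helper, not a copy of the script: the function name log, the LOG_FILE variable, and the timestamp format are hypothetical, and the default path is the one this README names later.

```shell
# Hypothetical logging helper: prefix each message with a timestamp,
# print it to stdout (so docker compose logs shows it) and append it to the log file.
log() {
  printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" \
    | tee -a "${LOG_FILE:-/var/log/thinkcenter_monitor.log}"
}
```

Writing to both stdout and a file is one way to serve the two log views described under View Logs.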

Prerequisites

  • Docker and Docker Compose installed
  • Home Assistant instance running with network access
  • A power switch entity configured in Home Assistant
  • Long-lived access token from Home Assistant

Installation

1. Download/Organize Files

Clone or download this repository to your machine:

git clone <repository-url>
cd Thinkcentre-watchdog

The directory should contain:

  • Dockerfile - Container definition
  • thinkcenter_monitor.sh - Monitoring script
  • docker-compose.yml - Docker Compose configuration
  • .env.example - Environment variable template
  • README.md - This file

2. Create Configuration File

Copy the example environment file and edit it with your actual values:

cp .env.example .env

Edit .env and configure:

# Your target service URL
TARGET_URL=http://your-kubernetes-service:8080

# Home Assistant configuration
HA_URL=http://homeassistant:8123
HA_TOKEN=your_long_lived_access_token_here
HA_ENTITY=switch.your_power_switch_entity

# Optional: Adjust timing if needed
GRACE_PERIOD=300      # 5 minutes
CHECK_INTERVAL=30     # Check every 30 seconds
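
Before starting the container, it may help to confirm that the required values are filled in. The following is a hypothetical pre-flight check (the function name check_env_file is not part of the repository). It only looks for empty values and for the token placeholder shown above.

```shell
# Hypothetical pre-flight check: report variables that are missing
# or still hold the template placeholder. Returns non-zero if any are found.
check_env_file() {
  local env_file="${1:-.env}" missing=0 name value
  for name in TARGET_URL HA_URL HA_TOKEN HA_ENTITY; do
    # Take the last assignment of the variable and keep everything after the first '='.
    value=$(grep -E "^${name}=" "$env_file" | tail -n 1 | cut -d= -f2-)
    if [ -z "$value" ] || [ "$value" = "your_long_lived_access_token_here" ]; then
      echo "missing or placeholder: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_env_file .env prints one line per problem and exits with status 0 only when all four variables have values.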

3. Generate Home Assistant Token

  1. Open Home Assistant web interface
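  2. Open your user profile (click your username at the bottom of the sidebar) and scroll to the Long-Lived Access Tokens section (under the Security tab in recent versions)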
  3. Click Create Token
  4. Name it (e.g., "Thinkcentre Watchdog")
  5. Copy the token and paste it in your .env file as HA_TOKEN
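
To confirm that the token works before relying on it, one option is to call the REST API root, which Home Assistant answers with HTTP 200 for a valid token and 401 for a rejected one. The helper name ha_token_ok is hypothetical. HA_URL and HA_TOKEN are assumed to be exported from .env.

```shell
# Hypothetical check: succeeds only if Home Assistant accepts the token.
ha_token_ok() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    "${HA_URL}/api/") || code=000   # 000 means curl could not connect at all
  [ "$code" = "200" ]
}
```

For example: ha_token_ok && echo "token accepted" || echo "token rejected or Home Assistant unreachable".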

4. Configure Power Switch in Home Assistant

Ensure you have a switch entity in Home Assistant that controls the machine's power. Common options:

  • Smart Outlet/Relay: If using a smart power outlet
  • IPMI/Redfish: For datacenter machines
  • Smart Plug: Like Tasmota, Zigbee, or Z-Wave devices

Configure the entity ID in your .env as HA_ENTITY (e.g., switch.thinkcentre_power)
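
A quick way to check that the entity ID is spelled correctly is to ask Home Assistant for its state: the REST endpoint /api/states/<entity_id> returns 404 for an unknown entity. The sketch below assumes HA_URL, HA_TOKEN, and HA_ENTITY are exported. The function name ha_entity_exists is hypothetical.

```shell
# Hypothetical check: report whether HA_ENTITY is known to Home Assistant.
ha_entity_exists() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    "${HA_URL}/api/states/${HA_ENTITY}") || code=000
  case "$code" in
    200) echo "found: ${HA_ENTITY}" ;;
    404) echo "not found: ${HA_ENTITY}"; return 1 ;;
    401) echo "token rejected"; return 1 ;;
    *)   echo "unexpected HTTP status: ${code}"; return 1 ;;
  esac
}
```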

5. Build and Run

Start the monitoring container:

docker compose up -d

The container will:

  • Build from the Dockerfile
  • Start with restart: unless-stopped policy
  • Mount logs to a named volume
  • Apply resource limits (0.1 CPU, 64MB memory)

6. View Logs

Monitor real-time logs:

docker compose logs -f thinkcenter-monitor

Or view persistent logs from the volume:

docker volume inspect thinkcenter_logs
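# Look at the Mountpoint directory. If the volume is not found, list volumes with
# docker volume ls; Compose may prefix the name with the project name.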
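
The mountpoint lookup can be scripted with the --format option of docker volume inspect. This sketch assumes the volume is mounted at the container's /var/log, so that the log file sits at the top of the volume. That layout, the function name tail_volume_log, and the default volume name are assumptions.

```shell
# Hypothetical helper: print the last lines of the monitor log stored in the volume.
# Reading the mountpoint usually requires root on the Docker host.
tail_volume_log() {
  local volume="${1:-thinkcenter_logs}" lines="${2:-20}" mountpoint
  mountpoint=$(docker volume inspect --format '{{ .Mountpoint }}' "$volume") || return 1
  tail -n "$lines" "${mountpoint}/thinkcenter_monitor.log"
}
```

For example, sudo could be used to run tail_volume_log thinkcenter_logs 50 from a root shell on the host.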

7. Stop or Restart

Stop the container:

docker compose down

Restart the container:

docker compose restart thinkcenter-monitor

Deploying Multiple Instances

To monitor multiple machines:

For Machine 2:

Create a separate directory:

mkdir thinkcentre-watchdog-machine2
cd thinkcentre-watchdog-machine2

# Copy files (the * glob skips dotfiles, so copy .env.example explicitly)
cp /path/to/original/* /path/to/original/.env.example .

# Create unique .env
cp .env.example .env

# Edit .env for machine 2
nano .env
# Change: HA_ENTITY=switch.machine2_power
# Change: TARGET_URL to machine 2's service URL

Then run:

docker compose up -d
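
If several machines need their own .env, the two per-machine values can be substituted into a copy of the template instead of editing by hand. This is a hypothetical helper (make_machine_env is not part of the repository). It assumes the template contains TARGET_URL= and HA_ENTITY= lines as shown earlier, and that the new values contain no | or & characters, which sed would interpret.

```shell
# Hypothetical helper: derive a per-machine .env from the template.
# Arguments: template file, output file, target URL, switch entity ID.
make_machine_env() {
  local template="$1" out="$2" target_url="$3" entity="$4"
  sed -e "s|^TARGET_URL=.*|TARGET_URL=${target_url}|" \
      -e "s|^HA_ENTITY=.*|HA_ENTITY=${entity}|" \
      "$template" > "$out"
}
```

For example: make_machine_env .env.example .env http://machine2-service:8080 switch.machine2_power. HA_TOKEN is copied unchanged from the template, so it still has to be filled in afterwards.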

Using Namespace (Alternative)

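Or manage everything from one directory by adding an override file that defines a second service with a unique name. The file docker-compose.machine2.yml is not among the files listed under Installation, so you need to create it yourself: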

docker compose -f docker-compose.yml -f docker-compose.machine2.yml up -d

Configuration Variables

Variable        Default                    Description
TARGET_URL      http://localhost:8080      Service URL to monitor
HA_URL          http://homeassistant:8123  Home Assistant base URL
HA_TOKEN        (required)                 Home Assistant long-lived access token
HA_ENTITY       switch.thinkcentre_power   Home Assistant switch entity ID
GRACE_PERIOD    300                        Seconds to wait before power-cycling (5 minutes)
CHECK_INTERVAL  30                         Seconds between health checks
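
In a shell script, defaults like these are commonly applied with the ${VAR:-default} parameter expansion. The sketch below shows that pattern for the values in the table. The function name load_config is hypothetical and the real script may handle this differently.

```shell
# Hypothetical config loader: keep values from the environment,
# fall back to the documented defaults, and insist on HA_TOKEN.
load_config() {
  TARGET_URL="${TARGET_URL:-http://localhost:8080}"
  HA_URL="${HA_URL:-http://homeassistant:8123}"
  HA_ENTITY="${HA_ENTITY:-switch.thinkcentre_power}"
  GRACE_PERIOD="${GRACE_PERIOD:-300}"
  CHECK_INTERVAL="${CHECK_INTERVAL:-30}"
  if [ -z "${HA_TOKEN:-}" ]; then
    echo "HA_TOKEN is required" >&2
    return 1
  fi
}
```

With only HA_TOKEN and GRACE_PERIOD=120 set in the environment, load_config would keep 120 for the grace period and use 30 for CHECK_INTERVAL.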

Troubleshooting

Container won't start

Check whether HA_TOKEN is set (note that this prints the token to the terminal):

docker compose config | grep HA_TOKEN

No logs appearing

Check the volume mount:

docker volume ls | grep thinkcenter_logs
docker volume inspect thinkcenter_logs

Power-cycle not triggering

  1. Verify HA_TOKEN is valid (check Home Assistant logs)
  2. Confirm HA_ENTITY exists in Home Assistant
  3. Check network connectivity: docker compose exec thinkcenter-monitor curl -v http://homeassistant:8123

Service not responding correctly

Test the target URL directly:

docker compose exec thinkcenter-monitor curl -v http://your-service:8080
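
When only the status code matters, curl's --write-out option can print it on its own, which is also a plausible basis for the health check itself. The function name probe_status and the 10-second timeout are assumptions, not values taken from the script.

```shell
# Hypothetical probe: print only the HTTP status code of the target.
# curl reports 000 when no HTTP response was received (timeout, refused connection).
probe_status() {
  local url="${1:-$TARGET_URL}" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
  echo "$code"
}
```

A result of 000 is different from 502: it means the request never got an HTTP answer, whereas this README treats only 502 as the failure signal.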

How It Works

  1. Health Check: Every CHECK_INTERVAL seconds, the HTTP response code of TARGET_URL is checked
  2. Grace Period: The first 502 error starts the GRACE_PERIOD window (5 minutes by default) for recovery
  3. Recovery Detection: If the service returns a non-502 response during the grace period, the error state resets
  4. Power Cycle: If 502s continue after the grace period expires, a power cycle is triggered:
    • Send turn_off to HA switch entity
    • Wait 10 seconds
    • Send turn_on to HA switch entity
  5. Logging: All events timestamped and logged to /var/log/thinkcenter_monitor.log
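
The decision made on each check can be written as a small pure function, which makes the grace-period rule easy to test in isolation. This is a sketch of the logic described above, not the script's actual code. The function name next_action and the action names it prints are hypothetical.

```shell
# Hypothetical decision function for one health check.
# Arguments: status code, epoch of first failure (0 if none),
#            current epoch, grace period in seconds.
# Prints one of: ok, start_grace, waiting, power_cycle.
next_action() {
  local code="$1" first_fail="$2" now="$3" grace="$4"
  if [ "$code" != "502" ]; then
    echo ok            # healthy or recovered: caller resets first_fail to 0
  elif [ "$first_fail" -eq 0 ]; then
    echo start_grace   # first 502: caller records now as first_fail
  elif [ $((now - first_fail)) -ge "$grace" ]; then
    echo power_cycle   # still failing after the grace period
  else
    echo waiting       # failing, but still inside the grace period
  fi
}
```

A monitoring loop would call this every CHECK_INTERVAL seconds with the latest status code and the current time from date +%s, then act on the printed result.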

Resource Limits

  • CPU: 0.1 cores (limited to prevent resource hogging)
  • Memory: 64MB (minimal requirements for bash + curl)
  • Logging: JSON file driver, max 10MB per file, keeps 3 files (30MB total)

Debugging

Show the 50 most recent log lines:

docker compose logs --tail 50 thinkcenter-monitor

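To test the script locally (without Docker), export the variables from .env into the shell first, since the container normally supplies them:

set -a; . ./.env; set +a
bash thinkcenter_monitor.sh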

License

Monitoring solution for Thinkcentre machines.

Support

For issues or improvements, check the logs first and verify all environment variables are correctly set in your .env file.
