# Thinkcentre Watchdog

A Docker-based monitoring solution for detecting and auto-rebooting hung Kubernetes machines via Home Assistant integration.

## Overview

This watchdog monitors a target service URL for 502 Bad Gateway errors (indicating a hung machine). When the service fails:

1. A 5-minute grace period begins (allowing for deployment recoveries)
2. If the service recovers within 5 minutes, the error is cleared (normal deployment scenario)
3. If it is still failing after 5 minutes, an automatic power-cycle is triggered via Home Assistant
4. The machine powers off for 10 seconds, then powers back on

All activity is logged with timestamps for monitoring and troubleshooting.

## Prerequisites

- Docker and Docker Compose installed
- A running Home Assistant instance reachable over the network from the watchdog container
- A power switch entity configured in Home Assistant
- A long-lived access token from Home Assistant

## Installation

### 1. Download/Organize Files

Clone or download this repository to your machine:

```bash
git clone <repository-url>
cd Thinkcentre-watchdog
```

The directory should contain:

- `Dockerfile` - Container definition
- `thinkcenter_monitor.sh` - Monitoring script
- `docker-compose.yml` - Docker Compose configuration
- `.env.example` - Environment variable template
- `README.md` - This file

### 2. Create Configuration File

Copy the example environment file and edit it with your actual values:

```bash
cp .env.example .env
```

Edit `.env` and configure:

```bash
# Your target service URL
TARGET_URL=http://your-kubernetes-service:8080

# Home Assistant configuration
HA_URL=http://homeassistant:8123
HA_TOKEN=your_long_lived_access_token_here
HA_ENTITY=switch.your_power_switch_entity

# Optional: Adjust timing if needed
# Grace period in seconds (300 = 5 minutes)
GRACE_PERIOD=300
# Seconds between health checks
CHECK_INTERVAL=30
```

### 3. Generate Home Assistant Token

1. Open the Home Assistant web interface
2. Click your user name at the bottom of the sidebar to open your profile, then open the **Security** tab (on older versions the section is at the bottom of the profile page)
3. Under **Long-lived access tokens**, click **Create Token**
4. Name it (e.g., "Thinkcentre Watchdog")
5. Copy the token and paste it in your `.env` file as `HA_TOKEN`

### 4. Configure Power Switch in Home Assistant

Ensure you have a switch entity in Home Assistant that controls the machine's power. Common options:

- **Smart Outlet/Relay**: If using a smart power outlet
- **IPMI/Redfish**: For datacenter machines
- **Smart Plug**: Like Tasmota, Zigbee, or Z-Wave devices

Configure the entity ID in your `.env` as `HA_ENTITY` (e.g., `switch.thinkcentre_power`).

### 5. Build and Run

Start the monitoring container:

```bash
docker compose up -d
```

The container will:

- Build from the Dockerfile
- Start with the `restart: unless-stopped` policy
- Mount logs to a named volume
- Apply resource limits (0.1 CPU, 64MB memory)

### 6. View Logs

Monitor real-time logs:

```bash
docker compose logs -f thinkcenter-monitor
```

Or view persistent logs from the volume:

```bash
docker volume inspect thinkcenter_logs
# Look at the Mountpoint directory
```

If Compose has prefixed the volume name with the project name, `docker volume ls` shows the full name to pass to `docker volume inspect`.

### 7. Stop or Restart

Stop the container:

```bash
docker compose down
```

Restart the container:

```bash
docker compose restart thinkcenter-monitor
```

## Deploying Multiple Instances

To monitor multiple machines:

### For Machine 2:

Create a separate directory:

```bash
mkdir thinkcentre-watchdog-machine2
cd thinkcentre-watchdog-machine2

# Copy files (the * glob skips dotfiles, so copy .env.example explicitly)
cp /path/to/original/* /path/to/original/.env.example .

# Create unique .env
cp .env.example .env

# Edit .env for machine 2
nano .env
# Change: HA_ENTITY=switch.machine2_power
# Change: TARGET_URL to machine 2's service URL
```

Then run:

```bash
docker compose up -d
```

### Using an Override File (Alternative)

Or manage everything from one directory by adding a second Compose file that defines a uniquely named service for machine 2. `docker-compose.machine2.yml` is not among the files listed above, so you need to create it yourself:

```bash
docker compose -f docker-compose.yml -f docker-compose.machine2.yml up -d
```

## Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TARGET_URL` | `http://localhost:8080` | Service URL to monitor |
| `HA_URL` | `http://homeassistant:8123` | Home Assistant base URL |
| `HA_TOKEN` | (required) | Home Assistant long-lived access token |
| `HA_ENTITY` | `switch.thinkcentre_power` | Home Assistant switch entity ID |
| `GRACE_PERIOD` | `300` | Seconds to wait before power-cycling (5 minutes) |
| `CHECK_INTERVAL` | `30` | Seconds between health checks |

## Troubleshooting

### Container won't start

Check if `HA_TOKEN` is set:

```bash
docker compose config | grep HA_TOKEN
```

### No logs appearing

Check the volume mount:

```bash
docker volume ls | grep thinkcenter_logs
docker volume inspect thinkcenter_logs
```

### Power-cycle not triggering

1. Verify `HA_TOKEN` is valid (check the Home Assistant logs for authentication failures)
2. Confirm `HA_ENTITY` exists in Home Assistant
3. Check network connectivity: `docker compose exec thinkcenter-monitor curl -v http://homeassistant:8123`

### Service not responding correctly

Test the target URL directly:

```bash
docker compose exec thinkcenter-monitor curl -v http://your-service:8080
```

## How It Works

1. **Health Check**: Every `CHECK_INTERVAL` seconds, the HTTP response code of `TARGET_URL` is checked
2. **Grace Period**: The first 502 error starts a recovery window of `GRACE_PERIOD` seconds (5 minutes by default)
3. **Recovery Detection**: If the service returns a non-502 response during the grace period, the error state resets
4. **Power Cycle**: If the service is still returning 502 when the grace period expires, a power cycle is triggered:
   - Send `turn_off` to the HA switch entity
   - Wait 10 seconds
   - Send `turn_on` to the HA switch entity
5. **Logging**: All events are timestamped and logged to `/var/log/thinkcenter_monitor.log`

## Resource Limits

- CPU: 0.1 cores (limited to prevent resource hogging)
- Memory: 64MB (minimal requirements for bash + curl)
- Logging: JSON file driver, max 10MB per file, keeps 3 files (30MB total)

## Debugging

View the most recent log output:

```bash
docker compose logs --tail 50 thinkcenter-monitor
```

To test the script locally (without Docker), export the variables from your `.env` in your shell and run:

```bash
bash thinkcenter_monitor.sh
```

## License

Monitoring solution for Thinkcentre machines.

## Support

For issues or improvements, check the logs first and verify all environment variables are correctly set in your `.env` file.
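
## Appendix: Monitoring Loop Sketch

To make the flow in "How It Works" concrete, here is a minimal bash sketch of the grace-period logic and the Home Assistant calls. It is not the contents of `thinkcenter_monitor.sh`, which may be structured differently. The function names `decide`, `ha_switch`, `log`, and `monitor_loop`, and the `run` argument, are hypothetical and exist only for this illustration. The sketch assumes the switch is driven through Home Assistant's REST endpoints `/api/services/switch/turn_off` and `/api/services/switch/turn_on` with a bearer token.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real thinkcenter_monitor.sh may differ.

TARGET_URL="${TARGET_URL:-http://localhost:8080}"
HA_URL="${HA_URL:-http://homeassistant:8123}"
HA_TOKEN="${HA_TOKEN:-}"
HA_ENTITY="${HA_ENTITY:-switch.thinkcentre_power}"
GRACE_PERIOD="${GRACE_PERIOD:-300}"
CHECK_INTERVAL="${CHECK_INTERVAL:-30}"

# decide STATUS FIRST_FAIL NOW GRACE  (hypothetical helper)
# Prints the next action: ok, start_grace, wait, or power_cycle.
# FIRST_FAIL is the epoch second of the first 502, or 0 if none is pending.
decide() {
  local status="$1" first_fail="$2" now="$3" grace="$4"
  if [ "$status" != "502" ]; then
    echo ok                      # any non-502 response clears the error
  elif [ "$first_fail" -eq 0 ]; then
    echo start_grace             # first 502: open the grace window
  elif [ $((now - first_fail)) -ge "$grace" ]; then
    echo power_cycle             # still 502 after the grace period
  else
    echo wait                    # still 502, grace period not yet over
  fi
}

# Call a Home Assistant switch service ($1 is turn_off or turn_on).
ha_switch() {
  curl -s -o /dev/null -X POST \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"${HA_ENTITY}\"}" \
    "${HA_URL}/api/services/switch/$1"
}

log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $*"; }

monitor_loop() {
  local first_fail=0 status now
  while true; do
    # -w '%{http_code}' prints only the status code (000 if the request failed)
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$TARGET_URL")"
    now="$(date +%s)"
    case "$(decide "$status" "$first_fail" "$now" "$GRACE_PERIOD")" in
      ok)
        [ "$first_fail" -ne 0 ] && log "Recovered (HTTP $status)"
        first_fail=0 ;;
      start_grace)
        first_fail="$now"
        log "502 detected, grace period started" ;;
      wait)
        log "Still 502, $((now - first_fail))s into grace period" ;;
      power_cycle)
        log "Grace period expired, power-cycling"
        ha_switch turn_off
        sleep 10
        ha_switch turn_on
        first_fail=0 ;;
    esac
    sleep "$CHECK_INTERVAL"
  done
}

# Start the loop only when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
  monitor_loop
fi
```

Keeping the decision in a small function with no side effects makes the timing rules easy to check by hand, for example `decide 502 700 1000 300` for a failure that started exactly 300 seconds ago. Passing `run` as the first argument starts the loop.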