# Thinkcentre Watchdog
A Docker-based monitoring solution for detecting and auto-rebooting hung Kubernetes machines via Home Assistant integration.
## Overview
This watchdog monitors a target service URL for 502 Bad Gateway errors, which it treats as the sign of a hung machine. When the service starts failing:
1. A 5-minute grace period begins (allowing for deployment recoveries)
2. If the service recovers within 5 minutes, the error is cleared (normal deployment scenario)
3. If still failing after 5 minutes, an automatic power-cycle is triggered via Home Assistant
4. The machine powers off for 10 seconds, then powers back on
All activity is logged with timestamps for monitoring and troubleshooting.
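The decision rule in the steps above can be sketched as a small shell function. This is an illustration rather than the code in `thinkcenter_monitor.sh`; the function name `decide_action`, its arguments, and the returned labels are assumptions made for the example.

```bash
# Hypothetical helper: decide what the watchdog should do after one check.
# Arguments: HTTP status, current epoch seconds, epoch of the first failure
# (0 if there is no pending failure), grace period in seconds.
decide_action() {
  local status="$1" now="$2" first_failure="$3" grace="$4"

  if [ "$status" != "502" ]; then
    # Any non-502 response counts as healthy; report a recovery if a
    # failure was pending.
    if [ "$first_failure" -gt 0 ]; then echo "recovered"; else echo "ok"; fi
  elif [ "$first_failure" -eq 0 ]; then
    echo "start_grace"   # first 502: the grace period begins
  elif [ $((now - first_failure)) -ge "$grace" ]; then
    echo "power_cycle"   # still failing once the grace period has elapsed
  else
    echo "waiting"       # failing, but still inside the grace period
  fi
}
```

For example, `decide_action 502 1300 1000 300` prints `power_cycle`, because 300 seconds have passed since the first failure.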
## Prerequisites
- Docker and Docker Compose installed
- Home Assistant instance running with network access
- A power switch entity configured in Home Assistant
- Long-lived access token from Home Assistant
## Installation
### 1. Download/Organize Files
Clone or download this repository to your machine:
```bash
git clone <repository-url>
cd Thinkcentre-watchdog
```
The directory should contain:
- `Dockerfile` - Container definition
- `thinkcenter_monitor.sh` - Monitoring script
- `docker-compose.yml` - Docker Compose configuration
- `.env.example` - Environment variable template
- `README.md` - This file
### 2. Create Configuration File
Copy the example environment file and edit it with your actual values:
```bash
cp .env.example .env
```
Edit `.env` and configure:
```bash
# Your target service URL
TARGET_URL=http://your-kubernetes-service:8080
# Home Assistant configuration
HA_URL=http://homeassistant:8123
HA_TOKEN=your_long_lived_access_token_here
HA_ENTITY=switch.your_power_switch_entity
# Optional: Adjust timing if needed
GRACE_PERIOD=300 # 5 minutes
CHECK_INTERVAL=30 # Check every 30 seconds
```
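Before starting the container, it may help to sanity-check these values. The following is a minimal sketch, not part of the repository; `check_seconds` and `validate_config` are hypothetical names, and the rule that timing values must be positive integers is an assumption based on the defaults shown above.

```bash
# Hypothetical check for a timing value.
# $1 = variable name, $2 = its value; an empty value means "use the default".
check_seconds() {
  case "$2" in
    '') return 0 ;;
    *[!0-9]*|0*) echo "$1 must be a positive integer, got '$2'"; return 1 ;;
  esac
}

# Prints one line per problem and returns non-zero if any were found.
validate_config() {
  local problems=0

  # HA_TOKEN has no default, so it must be set explicitly.
  if [ -z "${HA_TOKEN:-}" ]; then
    echo "HA_TOKEN is not set"
    problems=$((problems + 1))
  fi

  check_seconds GRACE_PERIOD "${GRACE_PERIOD:-}" || problems=$((problems + 1))
  check_seconds CHECK_INTERVAL "${CHECK_INTERVAL:-}" || problems=$((problems + 1))

  [ "$problems" -eq 0 ]
}
```

One way to use it is to load the file into the current shell first, for example `set -a; . ./.env; set +a; validate_config`.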
### 3. Generate Home Assistant Token
1. Open Home Assistant web interface
2. Open your user profile (click your username at the bottom of the sidebar) and scroll to **Long-Lived Access Tokens** (on the **Security** tab in recent releases)
3. Click **Create Token**
4. Name it (e.g., "Thinkcentre Watchdog")
5. Copy the token and paste it in your `.env` file as `HA_TOKEN`
### 4. Configure Power Switch in Home Assistant
Ensure you have a switch entity in Home Assistant that controls the machine's power. Common options:
- **Smart Outlet/Relay**: If using a smart power outlet
- **IPMI/Redfish**: For datacenter machines
- **Smart Plug**: Like Tasmota, Zigbee, or Z-Wave devices
Configure the entity ID in your `.env` as `HA_ENTITY` (e.g., `switch.thinkcentre_power`).
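One way to confirm that both the token and the entity ID are usable is to ask Home Assistant's REST API for the entity's state. The sketch below assumes `HA_URL`, `HA_TOKEN`, and `HA_ENTITY` are set in the current shell; `ha_entity_status` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```bash
# Hypothetical helper: print the HTTP status code Home Assistant returns
# when asked for the state of HA_ENTITY.
ha_entity_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    "${HA_URL:-http://homeassistant:8123}/api/states/${HA_ENTITY:-switch.thinkcentre_power}"
}
```

A result of `200` means the entity exists and the token was accepted, `401` means the token was rejected, and `404` means Home Assistant does not know the entity ID.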
### 5. Build and Run
Start the monitoring container:
```bash
docker compose up -d
```
The container will:
- Build from the Dockerfile
- Start with `restart: unless-stopped` policy
- Mount logs to a named volume
- Apply resource limits (0.1 CPU, 64MB memory)
### 6. View Logs
Monitor real-time logs:
```bash
docker compose logs -f thinkcenter-monitor
```
Or view persistent logs from the volume:
```bash
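docker volume inspect thinkcenter_logs
# Look at the Mountpoint directory
# If the volume is not found, Compose may have prefixed it with the project name; check `docker volume ls`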
```
### 7. Stop or Restart
Stop the container:
```bash
docker compose down
```
Restart the container:
```bash
docker compose restart thinkcenter-monitor
```
## Deploying Multiple Instances
To monitor multiple machines:
### For Machine 2:
Create a separate directory:
```bash
mkdir thinkcentre-watchdog-machine2
cd thinkcentre-watchdog-machine2
# Copy files (the * glob skips dotfiles, so copy .env.example explicitly)
cp /path/to/original/* /path/to/original/.env.example .
# Create unique .env
cp .env.example .env
# Edit .env for machine 2
nano .env
# Change: HA_ENTITY=switch.machine2_power
# Change: TARGET_URL to machine 2's service URL
```
Then run:
```bash
docker compose up -d
```
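### Using an Override File (Alternative)
Alternatively, manage every instance from one directory by layering a second Compose file that defines a uniquely named service. `docker-compose.machine2.yml` is not among the files listed above, so create it first: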
```bash
docker compose -f docker-compose.yml -f docker-compose.machine2.yml up -d
```
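Another option, sketched here under assumptions, is to keep a single Compose file and start it once per machine with its own project name and env file. The `-p` and `--env-file` flags are standard `docker compose` options; the wrapper name `watchdog_up` and the `.env.<machine>` file naming are hypothetical.

```bash
# Hypothetical wrapper: start one Compose project per machine from a single
# directory. Assumes an env file per machine, e.g. .env.machine2.
watchdog_up() {
  local machine="$1"
  docker compose -p "watchdog-${machine}" --env-file ".env.${machine}" up -d
}
```

Calling `watchdog_up machine2` would start a project named `watchdog-machine2`. Note that `--env-file` only feeds variable substitution in the Compose file, so this approach works only if `docker-compose.yml` passes the values through with `${...}` references, and it would collide if the file sets a fixed `container_name`.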
## Configuration Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `TARGET_URL` | `http://localhost:8080` | Service URL to monitor |
| `HA_URL` | `http://homeassistant:8123` | Home Assistant base URL |
| `HA_TOKEN` | (required) | Home Assistant long-lived access token |
| `HA_ENTITY` | `switch.thinkcentre_power` | Home Assistant switch entity ID |
| `GRACE_PERIOD` | `300` | Seconds to wait before power-cycling (5 minutes) |
| `CHECK_INTERVAL` | `30` | Seconds between health checks |
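Defaults like these are commonly applied in shell with `${VAR:=default}` parameter expansion. The sketch below shows the idea using the values from the table; it is an assumption about how such defaults could be set, not an excerpt from `thinkcenter_monitor.sh`, and `load_defaults` is a hypothetical name.

```bash
# Sketch: assign the documented default when a variable is unset or empty.
# HA_TOKEN is left out on purpose because it has no default.
load_defaults() {
  : "${TARGET_URL:=http://localhost:8080}"
  : "${HA_URL:=http://homeassistant:8123}"
  : "${HA_ENTITY:=switch.thinkcentre_power}"
  : "${GRACE_PERIOD:=300}"
  : "${CHECK_INTERVAL:=30}"
}
```

Values that are already set are left alone, so a `GRACE_PERIOD=600` in `.env` would still take precedence.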
## Troubleshooting
### Container won't start
Check if `HA_TOKEN` is set:
```bash
docker compose config | grep HA_TOKEN
```
### No logs appearing
Check the volume mount:
```bash
docker volume ls | grep thinkcenter_logs
docker volume inspect thinkcenter_logs
```
### Power-cycle not triggering
1. Verify `HA_TOKEN` is valid (check Home Assistant logs)
2. Confirm `HA_ENTITY` exists in Home Assistant
3. Check network connectivity: `docker compose exec thinkcenter-monitor curl -v http://homeassistant:8123`
### Service not responding correctly
Test the target URL directly:
```bash
docker compose exec thinkcenter-monitor curl -v http://your-service:8080
```
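Since the watchdog reacts only to the status code, it can be useful to print just that code. A minimal sketch, with `http_status` as a hypothetical helper name and a 10-second timeout chosen arbitrarily:

```bash
# Hypothetical probe: print only the HTTP status code for the given URL.
# The response body is discarded; -w '%{http_code}' prints the code.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}
```

Run it as `http_status "$TARGET_URL"`. curl reports `000` when it receives no HTTP response at all (for example on a timeout or refused connection), which is a different condition from the 502 this watchdog looks for.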
## How It Works
1. **Health Check**: Every `CHECK_INTERVAL` seconds, the HTTP response code of `TARGET_URL` is checked
2. **Grace Period**: The first 502 error starts a recovery window of `GRACE_PERIOD` seconds (5 minutes by default)
3. **Recovery Detection**: If the service returns anything other than 502 during the grace period, the failure state resets
4. **Power Cycle**: If the grace period expires while the service still returns 502, a power cycle is triggered:
   - Send `turn_off` to the HA switch entity
   - Wait 10 seconds
   - Send `turn_on` to the HA switch entity
5. **Logging**: All events are timestamped and logged to `/var/log/thinkcenter_monitor.log`
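Step 4 maps onto two calls to Home Assistant's REST API, `POST /api/services/switch/turn_off` followed by `POST /api/services/switch/turn_on`. The following sketch shows one way to write it; `ha_switch`, `power_cycle`, and `POWER_OFF_SECONDS` are hypothetical names, and the actual script may differ.

```bash
# Hypothetical helper: call a Home Assistant switch service.
# $1 is the service name, either turn_off or turn_on.
ha_switch() {
  curl -sf --max-time 10 -X POST \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"${HA_ENTITY}\"}" \
    "${HA_URL}/api/services/switch/$1" >/dev/null
}

# Power-cycle: switch off, pause, switch on. The 10-second default matches
# the description above; POWER_OFF_SECONDS is an assumed name, not a
# documented variable.
power_cycle() {
  ha_switch turn_off || return 1
  sleep "${POWER_OFF_SECONDS:-10}"
  ha_switch turn_on
}
```

Because `curl -f` exits non-zero on HTTP errors, `power_cycle` stops after a failed `turn_off` instead of sending `turn_on` to a switch it could not reach.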
## Resource Limits
- CPU: 0.1 cores (limited to prevent resource hogging)
- Memory: 64MB (minimal requirements for bash + curl)
- Logging: JSON file driver, max 10MB per file, keeps 3 files (30MB total)
## Debugging
Review the most recent log entries with:
```bash
docker compose logs --tail 50 thinkcenter-monitor
```
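To test the script locally (without Docker), export the variables from `.env` first so the script sees the same configuration:
```bash
set -a; . ./.env; set +a  # export every variable defined in .env
bash thinkcenter_monitor.sh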
```
## License
Monitoring solution for Thinkcentre machines.
## Support
For issues or improvements, check the logs first and verify all environment variables are correctly set in your `.env` file.