# Thinkcentre Watchdog

A Docker-based monitoring solution that detects hung Kubernetes machines and automatically power-cycles them through Home Assistant.

## Overview

This watchdog monitors a target service URL for 502 Bad Gateway errors, which indicate a hung machine. When the service starts failing:

1. A 5-minute grace period begins (allowing for deployment recoveries)
2. If the service recovers within 5 minutes, the error is cleared (normal deployment scenario)
3. If the service is still failing after 5 minutes, an automatic power-cycle is triggered via Home Assistant
4. The machine powers off for 10 seconds, then powers back on

All activity is logged with timestamps for monitoring and troubleshooting.

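The grace-period rule can be sketched as a small decision function. This is only an illustration of the logic described above, not the shipped `thinkcenter_monitor.sh`; the function name `next_action` and its return values are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch of the grace-period decision only; not the shipped script.
# next_action is a hypothetical helper name.
#
# Arguments:
#   $1  HTTP status code from the latest health check
#   $2  epoch second of the first 502 in the current streak (0 = no streak)
#   $3  current epoch second
#   $4  grace period in seconds
# Prints one of: ok, recovered, start_grace, wait, power_cycle
next_action() {
    local status=$1 first_failure=$2 now=$3 grace=$4

    if [ "$status" != "502" ]; then
        # Any non-502 response ends the failure streak.
        if [ "$first_failure" -eq 0 ]; then echo ok; else echo recovered; fi
    elif [ "$first_failure" -eq 0 ]; then
        # First 502 seen: start the grace period.
        echo start_grace
    elif [ $(( now - first_failure )) -ge "$grace" ]; then
        # Still failing once the grace period has elapsed.
        echo power_cycle
    else
        echo wait
    fi
}
```

With a 300-second grace period, a failure streak that began at second 1000 yields `wait` at second 1100 and `power_cycle` at second 1300.
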
## Prerequisites

- Docker and Docker Compose installed
- A running Home Assistant instance reachable over the network from the Docker host
- A power switch entity configured in Home Assistant
- A long-lived access token from Home Assistant

## Installation

### 1. Download/Organize Files

Clone or download this repository to your machine:

```bash
git clone <repository-url>
cd Thinkcentre-watchdog
```

The directory should contain:
- `Dockerfile` - Container definition
- `thinkcenter_monitor.sh` - Monitoring script
- `docker-compose.yml` - Docker Compose configuration
- `.env.example` - Environment variable template
- `README.md` - This file

### 2. Create Configuration File

Copy the example environment file and edit it with your actual values:

```bash
cp .env.example .env
```

Edit `.env` and configure:

```bash
# Your target service URL
TARGET_URL=http://your-kubernetes-service:8080

# Home Assistant configuration
HA_URL=http://homeassistant:8123
HA_TOKEN=your_long_lived_access_token_here
HA_ENTITY=switch.your_power_switch_entity

# Optional: Adjust timing if needed
GRACE_PERIOD=300 # 5 minutes
CHECK_INTERVAL=30 # Check every 30 seconds
```

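### 3. Generate Home Assistant Token

1. Open the Home Assistant web interface
2. Click your user name at the bottom of the sidebar to open your profile, then open the **Security** tab and scroll to **Long-lived access tokens** (on older releases the section sits at the bottom of the profile page itself)
3. Click **Create Token**
4. Name it (e.g., "Thinkcentre Watchdog")
5. Copy the token and paste it into your `.env` file as `HA_TOKEN`; Home Assistant shows the token only once
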
### 4. Configure Power Switch in Home Assistant

Ensure you have a switch entity in Home Assistant that controls the machine's power. Common options:

- **Smart Outlet/Relay**: If using a smart power outlet
- **IPMI/Redfish**: For datacenter machines
- **Smart Plug**: Like Tasmota, Zigbee, or Z-Wave devices

Set the entity ID in your `.env` as `HA_ENTITY` (e.g., `switch.thinkcentre_power`).

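Before starting the watchdog, it can help to confirm that the token and entity ID are accepted by Home Assistant. The sketch below queries the entity through the Home Assistant REST API (`GET /api/states/<entity_id>`); the helper name `check_entity` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Sketch: query the switch state via the Home Assistant REST API.
# check_entity is a hypothetical helper; it is not part of the repository.
#   $1 Home Assistant base URL   $2 long-lived access token   $3 entity ID
check_entity() {
    local ha_url=$1 token=$2 entity=$3
    # GET /api/states/<entity_id> returns the entity as JSON.
    # Home Assistant answers 401 for a bad token and 404 for an unknown entity.
    curl -sS --max-time 10 \
        -H "Authorization: Bearer ${token}" \
        -H "Content-Type: application/json" \
        "${ha_url%/}/api/states/${entity}"
}
```

After loading the values from `.env` into your shell, call it as `check_entity "$HA_URL" "$HA_TOKEN" "$HA_ENTITY"`. A JSON object with a `state` field suggests both the token and the entity ID are usable.
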
### 5. Build and Run

Start the monitoring container:

```bash
docker compose up -d
```

This will:
- Build the image from the `Dockerfile`
- Start the container with the `restart: unless-stopped` policy
- Mount a named volume for logs
- Apply resource limits (0.1 CPU, 64MB memory)

### 6. View Logs

Monitor real-time logs:

```bash
docker compose logs -f thinkcenter-monitor
```

Or view persistent logs from the volume:

```bash
# Compose may prefix the volume name with the project name;
# run `docker volume ls` to find the exact name if this one is not found.
docker volume inspect thinkcenter_logs
# Look at the Mountpoint directory
```

### 7. Stop or Restart

Stop and remove the container:

```bash
docker compose down
```

Restart the container:

```bash
docker compose restart thinkcenter-monitor
```

## Deploying Multiple Instances

To monitor multiple machines, run one watchdog instance per machine.

### For Machine 2:

Create a separate directory:

```bash
mkdir thinkcentre-watchdog-machine2
cd thinkcentre-watchdog-machine2

# Copy files (the * glob skips dotfiles, so copy .env.example explicitly)
cp /path/to/original/* .
cp /path/to/original/.env.example .

# Create unique .env
cp .env.example .env

# Edit .env for machine 2
nano .env
# Change: HA_ENTITY=switch.machine2_power
# Change: TARGET_URL to machine 2's service URL
```

Then run:

```bash
docker compose up -d
```

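If you add machines regularly, the manual steps above could be wrapped in a small helper. The following is a sketch under assumptions: `new_instance` is a hypothetical function, and it assumes `.env.example` contains `HA_ENTITY=` and `TARGET_URL=` lines and that the new values contain no `|` or `&` characters.

```bash
#!/usr/bin/env bash
# Sketch: create a per-machine copy of the watchdog directory.
# new_instance and its arguments are hypothetical, not part of the repository.
#   $1 source directory   $2 new directory
#   $3 HA_ENTITY value    $4 TARGET_URL value
new_instance() {
    local src=$1 dest=$2 entity=$3 url=$4

    mkdir -p "$dest" || return 1
    # "src/." copies hidden files such as .env.example as well.
    cp -R "$src"/. "$dest"/ || return 1

    # Rewrite the two per-machine settings; keep every other line as-is.
    # '|' is the sed delimiter because URLs contain '/'.
    sed -e "s|^HA_ENTITY=.*|HA_ENTITY=${entity}|" \
        -e "s|^TARGET_URL=.*|TARGET_URL=${url}|" \
        "$dest/.env.example" > "$dest/.env"
}
```

For example, `new_instance /path/to/original thinkcentre-watchdog-machine2 switch.machine2_power http://machine2-service:8080` (the URL is a placeholder) prepares the directory; you still need to set `HA_TOKEN` in the new `.env` before running `docker compose up -d` there.
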
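### Using Namespace (Alternative)

Alternatively, manage everything from one directory by adding a second Compose file (here `docker-compose.machine2.yml`, which you create yourself) that defines a uniquely named service for machine 2:

```bash
docker compose -f docker-compose.yml -f docker-compose.machine2.yml up -d
```
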
## Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TARGET_URL` | `http://localhost:8080` | Service URL to monitor |
| `HA_URL` | `http://homeassistant:8123` | Home Assistant base URL |
| `HA_TOKEN` | (required) | Home Assistant long-lived access token |
| `HA_ENTITY` | `switch.thinkcentre_power` | Home Assistant switch entity ID |
| `GRACE_PERIOD` | `300` | Seconds to wait before power-cycling (5 minutes) |
| `CHECK_INTERVAL` | `30` | Seconds between health checks |

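One common way to apply defaults like those in the table is Bash parameter expansion. The sketch below is illustrative only; `load_config` is a hypothetical name, and the shipped script may handle its configuration differently.

```bash
#!/usr/bin/env bash
# Sketch: apply the documented defaults and require HA_TOKEN.
# load_config is a hypothetical helper; the shipped script may do this differently.
load_config() {
    # ${VAR:=default} assigns the default only when VAR is unset or empty.
    : "${TARGET_URL:=http://localhost:8080}"
    : "${HA_URL:=http://homeassistant:8123}"
    : "${HA_ENTITY:=switch.thinkcentre_power}"
    : "${GRACE_PERIOD:=300}"
    : "${CHECK_INTERVAL:=30}"

    # HA_TOKEN has no default, so fail early when it is missing.
    if [ -z "${HA_TOKEN:-}" ]; then
        echo "HA_TOKEN is required" >&2
        return 1
    fi
}
```

Values already present in the environment win over the defaults, which matches how the optional timing settings in `.env` are meant to be used.
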
## Troubleshooting

### Container won't start

Check if `HA_TOKEN` is set:
```bash
docker compose config | grep HA_TOKEN
```

### No logs appearing

Check the volume mount:
```bash
docker volume ls | grep thinkcenter_logs
docker volume inspect thinkcenter_logs
```

### Power-cycle not triggering

1. Verify `HA_TOKEN` is valid (check Home Assistant logs)
2. Confirm `HA_ENTITY` exists in Home Assistant
3. Check network connectivity: `docker compose exec thinkcenter-monitor curl -v http://homeassistant:8123`

### Service not responding correctly

Test the target URL directly:
```bash
docker compose exec thinkcenter-monitor curl -v http://your-service:8080
```

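Since the watchdog reacts to the HTTP status code, it can be easier to look at the code alone instead of the full `curl -v` output. A minimal sketch, assuming `curl` is available wherever you run it; `http_status` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: print only the HTTP status code of a URL.
# http_status is a hypothetical helper name.
http_status() {
    # -s silences progress output, -o /dev/null discards the body,
    # -w '%{http_code}' prints the status code; curl prints 000 when
    # no HTTP response was received (for example, connection refused).
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}
```

Calling `http_status http://your-service:8080` prints a code such as `200` or `502`; a `502` is the response the watchdog treats as a failure.
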
## How It Works

1. **Health Check**: Every `CHECK_INTERVAL` seconds, the HTTP response code of `TARGET_URL` is checked
2. **Grace Period**: The first 502 error starts a recovery window of `GRACE_PERIOD` seconds (5 minutes by default)
3. **Recovery Detection**: If the service returns a non-502 response during the grace period, the error state resets
4. **Power Cycle**: If the service is still returning 502 when the grace period expires, a power cycle is triggered:
   - Send `turn_off` to the HA switch entity
   - Wait 10 seconds
   - Send `turn_on` to the HA switch entity
5. **Logging**: All events are timestamped and logged to `/var/log/thinkcenter_monitor.log`

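The power-cycle step maps onto two calls to the Home Assistant REST API (`POST /api/services/switch/turn_off` and `turn_on`). The sketch below shows one way this could look; the function names `log`, `ha_switch`, and `power_cycle` are assumptions for illustration, and the shipped script may be structured differently.

```bash
#!/usr/bin/env bash
# Sketch of the power-cycle step via the Home Assistant REST API.
# Function names are hypothetical; the shipped script may differ.

# Print a timestamped log line.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

# Call switch.turn_on or switch.turn_off for HA_ENTITY.
#   $1  service name: turn_on or turn_off
ha_switch() {
    # -f makes curl return non-zero on HTTP errors such as 401.
    curl -sS -f --max-time 10 -o /dev/null -X POST \
        -H "Authorization: Bearer ${HA_TOKEN}" \
        -H "Content-Type: application/json" \
        -d "{\"entity_id\": \"${HA_ENTITY}\"}" \
        "${HA_URL%/}/api/services/switch/$1"
}

# Power off, wait 10 seconds, power back on.
power_cycle() {
    log "Powering off ${HA_ENTITY}"
    ha_switch turn_off || { log "turn_off failed"; return 1; }
    sleep 10
    log "Powering on ${HA_ENTITY}"
    ha_switch turn_on || { log "turn_on failed"; return 1; }
}
```

The sketch expects `HA_URL`, `HA_TOKEN`, and `HA_ENTITY` to be set in the environment, as they are in `.env`.
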
## Resource Limits

- CPU: 0.1 cores (limited to prevent resource hogging)
- Memory: 64MB (minimal requirements for bash + curl)
- Logging: JSON file driver, max 10MB per file, keeps 3 files (30MB total)

## Debugging

View the most recent log lines:

```bash
docker compose logs --tail 50 thinkcenter-monitor
```

To test the script locally (without Docker), export the same environment variables the container uses (at least `HA_TOKEN`) in your shell, then run:

```bash
bash thinkcenter_monitor.sh
```

## License

Monitoring solution for Thinkcentre machines.

## Support

For issues or improvements, check the logs first and verify all environment variables are correctly set in your `.env` file.