# Thinkcentre Watchdog

A Docker-based watchdog that detects hung Kubernetes machines and power-cycles them automatically through Home Assistant.

## Overview

This watchdog polls a target service URL and treats a 502 Bad Gateway response as a sign that the machine behind it has hung. When the service starts failing:

1. A 5-minute grace period begins (allowing for deployment recoveries)
2. If the service recovers within 5 minutes, the error is cleared (normal deployment scenario)
3. If still failing after 5 minutes, an automatic power-cycle is triggered via Home Assistant
4. The machine powers off for 10 seconds, then powers back on
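
The steps above amount to a small state machine. The following is a minimal sketch of that decision logic, not an excerpt from `thinkcenter_monitor.sh`. The function name `decide_action` and the action names it prints are hypothetical, chosen only for illustration.

```bash
# Hypothetical helper: decide what the watchdog should do next.
# Args: <http_status> <first_failure_epoch or ""> <now_epoch> <grace_seconds>
# Prints one of: ok, recovered, start_grace, wait, power_cycle
decide_action() {
  local status="$1" first_failure="$2" now="$3" grace="$4"

  if [ "$status" != "502" ]; then
    # Any non-502 response counts as healthy; report a recovery if a failure was pending.
    if [ -n "$first_failure" ]; then echo "recovered"; else echo "ok"; fi
    return
  fi

  if [ -z "$first_failure" ]; then
    echo "start_grace"   # first 502: the grace period begins now
  elif [ $(( now - first_failure )) -ge "$grace" ]; then
    echo "power_cycle"   # still failing once the grace period has elapsed
  else
    echo "wait"          # failing, but still inside the grace period
  fi
}
```

A caller would remember the timestamp of the first 502, pass it back in on each check, and clear it whenever the function prints `recovered`.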

All activity is logged with timestamps for monitoring and troubleshooting.

## Prerequisites

- Docker and Docker Compose installed
- A running Home Assistant instance reachable over the network
- A power switch entity configured in Home Assistant
- A long-lived access token from Home Assistant

## Installation

### 1. Download/Organize Files

Clone or download this repository to your machine:

```bash
git clone <repository-url>
cd Thinkcentre-watchdog
```

The directory should contain:
- `Dockerfile` - Container definition
- `thinkcenter_monitor.sh` - Monitoring script
- `docker-compose.yml` - Docker Compose configuration
- `.env.example` - Environment variable template
- `README.md` - This file

### 2. Create Configuration File

Copy the example environment file and edit it with your actual values:

```bash
cp .env.example .env
```

Edit `.env` and configure:

```bash
# Your target service URL
TARGET_URL=http://your-kubernetes-service:8080

# Home Assistant configuration
HA_URL=http://homeassistant:8123
HA_TOKEN=your_long_lived_access_token_here
HA_ENTITY=switch.your_power_switch_entity

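# Optional: adjust timing if needed
# (comments are kept on their own lines because not every .env parser strips inline comments)
# Grace period in seconds (5 minutes)
GRACE_PERIOD=300
# Seconds between health checks
CHECK_INTERVAL=30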
```

### 3. Generate Home Assistant Token

1. Open Home Assistant web interface
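2. Click your user name at the bottom of the sidebar to open your profile, then locate **Long-lived access tokens** (under the **Security** tab in recent Home Assistant versions)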
3. Click **Create Token**
4. Name it (e.g., "Thinkcentre Watchdog")
5. Copy the token and paste it in your `.env` file as `HA_TOKEN`
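
Before starting the container, it can help to confirm that Home Assistant accepts the token. The sketch below calls the Home Assistant REST API root (`/api/`) with the token as a bearer header and prints only the HTTP status code. The function name `ha_token_status` is hypothetical and is not part of this repository.

```bash
# Hypothetical helper: print the HTTP status Home Assistant returns for a token.
# 200 means the token was accepted; 401 means it was rejected.
ha_token_status() {
  local ha_url="$1" token="$2"
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${token}" \
    -H "Content-Type: application/json" \
    "${ha_url}/api/"
}
```

For example, `ha_token_status http://homeassistant:8123 "$HA_TOKEN"` should print `200` when the token is valid and Home Assistant is reachable.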

### 4. Configure Power Switch in Home Assistant

Ensure you have a switch entity in Home Assistant that controls the machine's power. Common options:

- **Smart Outlet/Relay**: If using a smart power outlet
- **IPMI/Redfish**: For datacenter machines
- **Smart Plug**: Like Tasmota, Zigbee, or Z-Wave devices

Configure the entity ID in your `.env` as `HA_ENTITY` (e.g., `switch.thinkcentre_power`)

### 5. Build and Run

Start the monitoring container:

```bash
docker compose up -d
```

Docker Compose will:
- Build the image from the Dockerfile
- Start the container with the `restart: unless-stopped` policy
- Mount the log directory to a named volume
- Apply resource limits (0.1 CPU, 64MB memory)

### 6. View Logs

Monitor real-time logs:

```bash
docker compose logs -f thinkcenter-monitor
```

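Or locate the persistent logs in the named volume:

```bash
# Compose may prefix the volume name with the project name; list volumes to confirm it
docker volume ls | grep thinkcenter_logs
# Print the host directory (Mountpoint) that holds the log files
docker volume inspect --format '{{ .Mountpoint }}' thinkcenter_logs
```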

### 7. Stop or Restart

Stop the container:

```bash
docker compose down
```

Restart the container:

```bash
docker compose restart thinkcenter-monitor
```

## Deploying Multiple Instances

To monitor multiple machines:

### For Machine 2:

Create a separate directory:

```bash
mkdir thinkcentre-watchdog-machine2
cd thinkcentre-watchdog-machine2

# Copy files, including the hidden .env.example that a bare * glob skips
cp /path/to/original/* /path/to/original/.env.example .

# Create unique .env
cp .env.example .env

# Edit .env for machine 2
nano .env
# Change: HA_ENTITY=switch.machine2_power
# Change: TARGET_URL to machine 2's service URL
```

Then run:

```bash
docker compose up -d
```

### Using Namespace (Alternative)

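Alternatively, manage every instance from one directory by adding an override file that defines a uniquely named service for the second machine. The `docker-compose.machine2.yml` file used here is not among the files listed above, so you need to create it yourself: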

```bash
docker compose -f docker-compose.yml -f docker-compose.machine2.yml up -d
```

## Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `TARGET_URL` | `http://localhost:8080` | Service URL to monitor |
| `HA_URL` | `http://homeassistant:8123` | Home Assistant base URL |
| `HA_TOKEN` | (required) | Home Assistant long-lived access token |
| `HA_ENTITY` | `switch.thinkcentre_power` | Home Assistant switch entity ID |
| `GRACE_PERIOD` | `300` | Seconds to wait before power-cycling (5 minutes) |
| `CHECK_INTERVAL` | `30` | Seconds between health checks |
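
One way a script could apply the defaults in this table is with shell parameter expansion. This is a sketch of the pattern under that assumption, not an excerpt from `thinkcenter_monitor.sh`, and the function name `load_config` is hypothetical.

```bash
# Hypothetical helper: apply the documented defaults and require HA_TOKEN.
load_config() {
  TARGET_URL="${TARGET_URL:-http://localhost:8080}"
  HA_URL="${HA_URL:-http://homeassistant:8123}"
  HA_ENTITY="${HA_ENTITY:-switch.thinkcentre_power}"
  GRACE_PERIOD="${GRACE_PERIOD:-300}"
  CHECK_INTERVAL="${CHECK_INTERVAL:-30}"

  # HA_TOKEN has no default; fail early if it is missing or empty.
  if [ -z "${HA_TOKEN:-}" ]; then
    echo "HA_TOKEN is required" >&2
    return 1
  fi
}
```

Values already present in the environment win over the defaults, so only the variables you set in `.env` change the behavior.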

## Troubleshooting

### Container won't start

Check if `HA_TOKEN` is set:
```bash
docker compose config | grep HA_TOKEN
```

### No logs appearing

Check the volume mount:
```bash
docker volume ls | grep thinkcenter_logs
docker volume inspect thinkcenter_logs
```

### Power-cycle not triggering

1. Verify `HA_TOKEN` is valid (check Home Assistant logs)
2. Confirm `HA_ENTITY` exists in Home Assistant
3. Check network connectivity from inside the container: `docker compose exec thinkcenter-monitor curl -v http://homeassistant:8123`
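
To confirm that the entity exists, you can query the Home Assistant REST API for its state. The sketch below prints only the HTTP status code from `/api/states/<entity_id>`. The function name `ha_entity_status` is hypothetical and is not part of this repository.

```bash
# Hypothetical helper: print the HTTP status for an entity lookup.
# 200 means the entity exists; 404 means Home Assistant does not know it.
ha_entity_status() {
  local ha_url="$1" token="$2" entity="$3"
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer ${token}" \
    "${ha_url}/api/states/${entity}"
}
```

A `401` here points back at the token rather than the entity ID.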

### Service not responding correctly

Test the target URL directly:
```bash
docker compose exec thinkcenter-monitor curl -v http://your-service:8080
```
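
Because the watchdog reacts to the HTTP status code, it can be easier to print just that code instead of reading verbose output. This is a minimal sketch; the function name `http_status` and the 10-second timeout are assumptions, not values taken from the monitoring script.

```bash
# Hypothetical helper: print only the HTTP status code for a URL.
# curl prints 000 when no HTTP response was received (connection refused, timeout).
http_status() {
  local url="$1"
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url"
}
```

If this prints `000` rather than `502` while the machine is hung, the request is not reaching anything that can answer with a 502, and the watchdog as described would not treat it as a failure.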

## How It Works

1. **Health Check**: Every `CHECK_INTERVAL` seconds, the script requests `TARGET_URL` and checks the HTTP response code
2. **Grace Period**: The first 502 starts a recovery window of `GRACE_PERIOD` seconds (5 minutes by default)
3. **Recovery Detection**: If the service returns a non-502 response during the grace period, the failure state resets
4. **Power Cycle**: If 502s continue past the grace period, a power cycle is triggered:
   - Send turn_off to HA switch entity
   - Wait 10 seconds
   - Send turn_on to HA switch entity
5. **Logging**: All events are timestamped and logged to `/var/log/thinkcenter_monitor.log`
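
The power-cycle step can be sketched with the Home Assistant REST API, which exposes services at `/api/services/<domain>/<service>`. This is an illustration under stated assumptions, not the contents of `thinkcenter_monitor.sh`: the function names `log`, `ha_switch`, and `power_cycle` and the `OFF_SECONDS` variable are hypothetical, and the sketch logs to standard output rather than to the log file.

```bash
# Hypothetical helper: timestamped log line on standard output.
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

# Hypothetical helper: call a Home Assistant switch service (turn_off or turn_on).
# Reads HA_URL, HA_TOKEN, and HA_ENTITY from the environment.
ha_switch() {
  local service="$1"
  curl -s -o /dev/null -X POST \
    -H "Authorization: Bearer ${HA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"${HA_ENTITY}\"}" \
    "${HA_URL}/api/services/switch/${service}"
}

# Power-cycle: switch off, pause, switch on.
# OFF_SECONDS is an assumed name for the 10-second pause described above.
power_cycle() {
  log "Powering off ${HA_ENTITY}"
  ha_switch turn_off
  sleep "${OFF_SECONDS:-10}"
  log "Powering on ${HA_ENTITY}"
  ha_switch turn_on
}
```

Keeping the service call in one small function means the off and on requests differ only in the service name, which makes the sequence easy to verify by hand with the same `curl` command.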

## Resource Limits

- CPU: 0.1 cores (limited to prevent resource hogging)
- Memory: 64MB (minimal requirements for bash + curl)
- Logging: JSON file driver, max 10MB per file, keeps 3 files (30MB total)

## Debugging

Review the most recent log output (the last 50 lines):

```bash
docker compose logs --tail 50 thinkcenter-monitor
```

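To test the script locally (without Docker), export the variables from `.env` into your shell first. The script may also need write access to `/var/log/thinkcenter_monitor.log`:

```bash
set -a; . ./.env; set +a
bash thinkcenter_monitor.sh
```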

## License

Monitoring solution for Thinkcentre machines.

## Support

For issues or improvements, check the logs first and verify all environment variables are correctly set in your `.env` file.