diff --git a/docs/server-runtime.md b/docs/server-runtime.md
new file mode 100644
index 0000000..9dcce40
--- /dev/null
+++ b/docs/server-runtime.md
@@ -0,0 +1,424 @@
+# Server runtime audit
+
+Engineer-to-engineer writeup of what the VPS `mt2.jakubkadlec.dev` is actually
+running as of 2026-04-14. Existing docs under `docs/` describe the intended
+layout (`debian-runtime.md`, `database-bootstrap.md`, `config-and-secrets.md`);
+this document is a ground-truth snapshot from a live recon session, with PIDs,
+paths, versions and surprises.
+
+Companion: `docs/server-topology.md` for the ASCII diagram and port table.
+
+## TL;DR
+
+- Only one metin binary is alive right now: the **`db`** helper on port `9000`
+  (PID `1788997` at audit time, cwd
+  `/home/mt2.jakubkadlec.dev/metin/runtime/server/channels/db`).
+- **`game_auth` and all `channel*_core*` processes are NOT running.** The listing
+  in the original prompt (auth `:11000/12000`, channel1 cores `:11011/12011`
+  etc.) reflects *intended* state from the systemd units, not the current live
+  process table. `ss -tlnp` only shows `0.0.0.0:9000` for m2.
+- The game/auth binaries are **not present on disk either**. Only
+  `share/bin/db` exists; there is no `share/bin/game_auth` and no
+  `share/bin/channel*_core*`. Those channels cannot start even if requested.
+- The `db` unit is currently **flapping / crash-looping**. `systemctl` reports
+  `deactivating (stop-sigterm)`; syserr.log shows repeated
+  `Connection reset by peer` from client peers (auth/game trying to reconnect
+  is the usual culprit, but here nobody is connecting — cause needs
+  verification). Two fresh `core.<pid>` files (97 MB each) sit in the db
+  channel dir from 13:24 and 13:25 today.
+- Orchestration is **pure systemd**, not the upstream `start.py` / tmux setup.
+  The README still documents `start.py`, so the README is stale for the Debian
+  VPS; `deploy/systemd/` + `docs/debian-runtime.md` are authoritative.
+- MariaDB 11.8.6 is the backing store on `127.0.0.1:3306`. The DB user the
+  stack is configured to use is `bootstrap` (from `share/conf/db.txt` /
+  `game.txt`). The actual password is injected via `/etc/metin/metin.env`,
+  which is `root:root 600` and intentionally unreadable both by the runtime
+  user and by the inspector account used for this audit.
+
+## Host
+
+- Hostname: `vmi3229987` (Contabo), public name `mt2.jakubkadlec.dev`.
+- OS: Debian 13 (trixie).
+- MariaDB: `mariadbd` 11.8.6, PID `103624`, listening on `127.0.0.1:3306`.
+- All metin services run as the unprivileged user
+  `mt2.jakubkadlec.dev:mt2.jakubkadlec.dev`.
+- Runtime root: `/home/mt2.jakubkadlec.dev/metin/runtime/server` (755 MB across
+  `channels/`, 123 MB across `share/`, total metin workspace on the box
+  ~1.7 GB).
+
+## Processes currently alive
+
+From `ps auxf` + `ss -tlnp` at audit time:
+
+```
+mysql    103624   /usr/sbin/mariadbd         — 127.0.0.1:3306
+mt2.j+  1788997   /home/.../channels/db/db   — 0.0.0.0:9000
+```
+
+No other m2 binaries show up. `ps` has **zero** matches for `game_auth`,
+`channel1_core1`, `channel1_core2`, `channel1_core3`, `channel99_core1`.
+
+Per-process inspection:
+
+| PID | cwd | exe (resolved) | fds of interest |
+| ------- | ----------------------------------------------- | ------------------------------------------------- | --------------- |
+| 1788997 | `.../runtime/server/channels/db` | `.../share/bin/db` (via `./db` symlink) | fd 3→syslog.log, fd 4→syserr.log, fd 11 TCP `*:9000`, fd 17 `[eventpoll]` (epoll fdwatch) |
+
+The `db` symlink inside the channel dir resolves to `../../share/bin/db`,
+which is an `ELF 64-bit LSB pie executable, x86-64, dynamically linked,
+BuildID fc049d0f..., not stripped`. Build identifier from
+`channels/db/VERSION.txt`: **`db revision: b2b037f-dirty`** — the dirty tag is
+a red flag: the build wasn't from a clean checkout of `m2dev-server-src`.
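The per-process table above can be re-derived at any time from `/proc`. A minimal sketch — the `inspect_pid` helper is ours, not a tool installed on the host, and it defaults to the current shell so it runs anywhere on Linux:

```shell
# Re-check cwd/exe/fds for a PID via /proc. Helper name is hypothetical;
# on the VPS, point it at the db process: inspect_pid 1788997
inspect_pid() {
  local pid="${1:-$$}"
  echo "cwd : $(readlink "/proc/${pid}/cwd")"
  echo "exe : $(readlink "/proc/${pid}/exe")"
  # enumerate fds and their targets (log files, sockets, eventpoll)
  local fd
  for fd in "/proc/${pid}/fd"/*; do
    echo "fd $(basename "$fd") -> $(readlink "$fd")"
  done
}

inspect_pid "$$"
```

Run against PID `1788997` this should reproduce the syslog/syserr/socket fd rows in the table.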
+
+The `usage.txt` in the same directory shows hourly heartbeat rows with
+`| 0 | 0 |` since 2026-04-13 21:00 (the "sessions / active" columns are
+stuck at zero — consistent with no game channels being connected).
+
+## Binaries actually present on disk
+
+```
+/home/mt2.jakubkadlec.dev/metin/runtime/server/share/bin/
+├── db    ← present, used
+└── game  ← present (shared game binary, but not launched under any
+             instance name that the systemd generator expects)
+```
+
+What is NOT present:
+
+- `share/bin/game_auth`
+- `share/bin/channel1_core1`, `channel1_core2`, `channel1_core3`
+- `share/bin/channel99_core1`
+
+The `metin-game-instance-start` helper (`/usr/local/libexec/...`) is a bash
+wrapper that `cd`s into `channels/<channel>/<core>/` and execs `./<instance>`,
+e.g. `./channel1_core1`. Those per-instance binaries don't exist yet. The
+channel dirs themselves (`channel1/core1/`, etc.) already contain the
+scaffolding (`CONFIG`, `conf`, `data`, `log`, `mark`, `package`,
+`p2p_packet_info.txt`, `packet_info.txt`, `syserr.log`, `syslog.log`,
+`version.txt`), but `version.txt` says `game revision: unknown` and the
+per-instance executable file is missing. The log directory has a single
+stale `syslog_2026-04-13.log`.
+
+Interpretation: the deploy pipeline that builds `m2dev-server-src` and drops
+instance binaries into `share/bin/` has not yet been run (or has not been
+re-run since the tree was laid out on 2026-04-13). Once Jakub's
+`debian-foundation` build produces per-instance symlinked/hardlinked
+binaries, the `metin-game@*` units should come up automatically on the next
+`systemctl restart metin-server`.
+
+## How things are started
+
+All orchestration goes through systemd units under `/etc/systemd/system/`,
+installed from `deploy/systemd/` via `deploy/systemd/install_systemd.py`.
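For orientation, `metin-db.service` plausibly has the following shape. This is a reconstruction from the unit table below, not a copy from the host — every field value is an assumption until checked with `systemctl cat metin-db.service`:

```ini
# Hypothetical reconstruction of metin-db.service — values inferred from the
# unit table below, not verbatim from /etc/systemd/system/.
[Unit]
Description=metin2 db cache/connector
Requires=mariadb.service
After=mariadb.service
PartOf=metin-server.service

[Service]
Type=simple
User=mt2.jakubkadlec.dev
Group=mt2.jakubkadlec.dev
WorkingDirectory=/home/mt2.jakubkadlec.dev/metin/runtime/server/channels/db
EnvironmentFile=-/etc/metin/metin.env
ExecStart=/home/mt2.jakubkadlec.dev/metin/runtime/server/channels/db/db
Restart=on-failure
LimitNOFILE=65535
LimitCORE=infinity
```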
+
+Unit list and roles:
+
+| Unit | Type | Role |
+| ----------------------------------------- | -------- | -------------------------------------------- |
+| `metin-server.service` | oneshot | top-level grouping, `Requires=mariadb.service`. `ExecStart=/bin/true`, `RemainAfterExit=yes`. All sub-units are `PartOf=metin-server.service` so restarting `metin-server` cycles everything. |
+| `metin-db.service` | simple | launches `.../channels/db/db` as runtime user, `Restart=on-failure`, `LimitCORE=infinity`, env file `/etc/metin/metin.env`. |
+| `metin-db-ready.service` | oneshot | runs `/usr/local/libexec/metin-wait-port 127.0.0.1 9000 30` — gate that blocks auth+game until the DB socket is listening. |
+| `metin-auth.service` | simple | launches `.../channels/auth/game_auth`. Requires db-ready. |
+| `metin-game@channel1_core1..3.service` | template | each runs `/usr/local/libexec/metin-game-instance-start <instance>`, which execs `./<instance>` in that channel dir. |
+| `metin-game@channel99_core1.service` | template | same, for channel 99. |
+
+Dependency chain:
+
+```
+mariadb.service
+      │
+      ▼
+metin-db.service ──► metin-db-ready.service ──► metin-auth.service
+                                            └► metin-game@*.service
+      │
+      ▼
+metin-server.service (oneshot umbrella)
+```
+
+All units have `PartOf=metin-server.service`, `Restart=on-failure`,
+`LimitNOFILE=65535`, `LimitCORE=infinity`. None run in Docker. None use tmux,
+screen or the upstream `start.py`. **The upstream `start.py` / `stop.py` in
+the repo are NOT wired up on this host** and should be treated as FreeBSD-era
+legacy.
+
+The per-instance launcher `/usr/local/libexec/metin-game-instance-start`
+(installed by `install_systemd.py`) is:
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+instance="${1:?missing instance name}"
+root_dir="/home/mt2.jakubkadlec.dev/metin/runtime/server/channels"
+channel_dir="${instance%_*}"   # e.g. channel1 from channel1_core2
+core_dir="${instance##*_}"     # e.g. core2
+workdir="${root_dir}/${channel_dir}/${core_dir}"
+cd "$workdir"
+exec "./${instance}"
+```
+
+Notes:
+
+- the `%_*` / `##*_` parse is brittle — an instance name with more than one
+  underscore would misbehave. For current naming (`channelN_coreM`) it works.
+- the helper does not redirect stdout/stderr; both go to the journal via
+  systemd.
+
+## Config files the binaries actually read
+
+All m2 config files referenced by the running/installed stack, resolved to
+their real path on disk:
+
+| Config file | Read by | Purpose |
+| ------------------------------------------------------------------------ | ------------- | --------------------------------------------------- |
+| `share/conf/db.txt` | `db` | SQL hosts, BIND_PORT=9000, item id range, hotbackup |
+| `share/conf/game.txt` | game cores | DB_ADDR=127.0.0.1, DB_PORT=9000, SQL creds, flags |
+| `share/conf/CMD` | game cores | in-game command ACL (notice, warp, item, …) |
+| `share/conf/item_proto.txt`, `mob_proto.txt`, `item_names*.txt`, `mob_names*.txt` | both db and game | static content tables |
+| `channels/db/conf` (symlink → `share/conf`) | `db` | every db channel looks into this flat conf tree |
+| `channels/db/data` (symlink → `share/data`) | `db`/`game` | mob/pc/dungeon/spawn data |
+| `channels/db/locale` (symlink → `share/locale`) | all | locale assets |
+| `channels/auth/CONFIG` | `game_auth` | `HOSTNAME: auth`, `CHANNEL: 1`, `PORT: 11000`, `P2P_PORT: 12000`, `AUTH_SERVER: master` |
+| `channels/channel1/core1/CONFIG` | core1 | `HOSTNAME: channel1_1`, `CHANNEL: 1`, `PORT: 11011`, `P2P_PORT: 12011`, `MAP_ALLOW: 1 4 5 6 3 23 43 112 107 67 68 72 208 302 304` |
+| `channels/channel1/core2/CONFIG` | core2 | `PORT: 11012`, `P2P_PORT: 12012` |
+| `channels/channel1/core3/CONFIG` | core3 | `PORT: 11013`, `P2P_PORT: 12013` |
+| `channels/channel99/core1/CONFIG` | ch99 core1 | `HOSTNAME: channel99_1`, `CHANNEL: 99`, `PORT: 11991`, `P2P_PORT: 12991`, `MAP_ALLOW: 113 81 100 101 103 105 110 111 114 118 119 120 121 122 123 124 125 126 127 128 181 182 183 200` |
+| `/etc/metin/metin.env` | all systemd units, via `EnvironmentFile=-/etc/metin/metin.env` (the leading `-` makes the file optional) | host-local secrets/overrides, root:root mode 600. Contents not readable during this audit. |
+
+Flat `share/conf/db.txt` (verbatim, with bootstrap secrets):
+
+```
+WELCOME_MSG = "Database connector is running..."
+SQL_ACCOUNT = "127.0.0.1 account bootstrap change-me 0"
+SQL_PLAYER = "127.0.0.1 player bootstrap change-me 0"
+SQL_COMMON = "127.0.0.1 common bootstrap change-me 0"
+SQL_HOTBACKUP= "127.0.0.1 hotbackup bootstrap change-me 0"
+TABLE_POSTFIX = ""
+BIND_PORT = 9000
+CLIENT_HEART_FPS = 60
+HASH_PLAYER_LIFE_SEC = 600
+BACKUP_LIMIT_SEC = 3600
+PLAYER_ID_START = 100
+PLAYER_DELETE_LEVEL_LIMIT = 70
+PLAYER_DELETE_CHECK_SIMPLE = 1
+ITEM_ID_RANGE = 2000000000 2100000000
+MIN_LENGTH_OF_SOCIAL_ID = 6
+SIMPLE_SOCIALID = 1
+```
+
+The `bootstrap` / `change-me` values are git-tracked placeholders.
+`config-and-secrets.md` explicitly says these are templates, and real values
+are expected to come from `/etc/metin/metin.env`. This only works if the
+server source actually re-reads credentials from the environment when they
+are injected; verify by grepping `m2dev-server-src` for the SQL env var
+names used by `db`/`game`. (**Open question**: confirm which env var names
+override the in-file creds; the audit session couldn't read `metin.env`
+directly.)
+
+## Database
+
+- Engine: **MariaDB 11.8.6** (`mariadb --version`).
+- PID: 103624, listening on `127.0.0.1:3306` only. No external TCP
+  exposure; the unix socket (likely `/run/mysqld/mysqld.sock`) was not
+  checked.
+- Expected databases from `docs/database-bootstrap.md`: `account`, `player`,
+  `common`, `log`, `hotbackup`.
+- Stack-side DB user: `bootstrap` (placeholder in git, real password in
+  `/etc/metin/metin.env`).
+- Could not enumerate actual tables during the audit — both `mysql -uroot`
+  and `sudo -u mt2.jakubkadlec.dev mariadb` failed (Access denied), since
+  root uses unix-socket auth for `root@localhost` and the runtime user has
+  no CLI credentials outside the systemd environment.
+- **To inspect the DB read-only:** either run as root with
+  `sudo mariadb` (unix socket auth — needs confirmation it's enabled), or
+  open `/etc/metin/metin.env` as root, grab the `bootstrap` password, then
+  `mariadb -ubootstrap -p account` etc. Do not attempt writes.
+
+## Logging
+
+Every m2 process writes two files in its channel dir, via fd 3 / fd 4:
+
+- `syslog.log` — verbose info stream (rotated by date in some dirs:
+  `channel1/core1/log/syslog_2026-04-13.log`).
+- `syserr.log` — error stream. Look here first on crash.
+
+The `db` channel's `syslog.log` is already at 36 MB today (rotation appears
+to be manual — there is a `log/` dir with a daily file, but the current
+`syslog.log` sits at the top level), and the channel dir collects
+`core.<pid>` ELF cores on SIGSEGV/SIGABRT because `LimitCORE=infinity` is
+set.
+
+The systemd journal captures stdout/stderr as well, so `journalctl -u metin-db
+--since '1 hour ago'` is the fastest way to see startup banners and
+`systemd`-observed restarts. Example from this audit:
+
+```
+Apr 14 13:26:40 vmi3229987 db[1788997]: Real Server
+Apr 14 13:26:40 vmi3229987 db[1788997]: Success ACCOUNT
+Apr 14 13:26:40 vmi3229987 db[1788997]: Success COMMON
+Apr 14 13:26:40 vmi3229987 db[1788997]: Success HOTBACKUP
+Apr 14 13:26:40 vmi3229987 db[1788997]: mysql_real_connect: Lost connection
+  to server at 'sending authentication information', system error: 104
+```
+
+On every start, `db` opens *more than a dozen* AsyncSQL pools ("AsyncSQL:
+connected to 127.0.0.1 (reconnect 1)" repeated ~12 times), suggesting a large
+per-instance pool size. Worth checking whether that needs tuning.
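When arguing about which error dominates a `syserr.log`, it helps to tally rather than eyeball. A small sketch — the `tally_errors` helper name is ours, and the `[error]` prefix is assumed from this fork's log format:

```shell
# Illustrative triage helper (not an existing tool on the host): count
# [error] lines by message so the dominant failure mode is obvious.
# Usage: tally_errors channels/db/syserr.log
tally_errors() {
  # drop everything up to the "[error] " tag, then count distinct messages
  sed -n 's/.*\[error\] //p' "$1" | sort | uniq -c | sort -rn | head
}
```

The top line of its output is the message to chase first.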
+
+The current `syserr.log` in `channels/db/` is dominated by:
+
+```
+[error] [int CPeerBase::Recv()()] socket_read failed Connection reset by peer
+[error] [int CClientManager::Process()()] Recv failed
+```
+
+which is the peer disconnect path. Since no auth/game peers should be
+connecting right now, this is either a leftover from an earlier start or
+something else (maybe a healthcheck probe) is touching 9000 and aborting.
+See open questions.
+
+## Ports
+
+Live `ss -tlnp` on the VPS (m2-relevant lines only):
+
+| Addr:port | Who | Exposure |
+| ---------------- | ------------ | -------------- |
+| `0.0.0.0:9000` | `db` | **INADDR_ANY** — listens on all interfaces. Look at this. |
+| `127.0.0.1:3306` | `mariadbd` | localhost only |
+
+Not currently listening (would be if auth/game were up):
+
+- `11000` / `12000` — auth client + p2p
+- `11011..11013` / `12011..12013` — channel1 cores + p2p
+- `11991` / `12991` — channel99 core1 + p2p
+
+Other listeners on the host (not m2): `:22`, `:2222` (gitea ssh), `:25`
+(postfix loopback), `:80/:443` (Caddy), `:3000` (Gitea), `:2019` (Caddy
+admin), `:33891` (unknown loopback), `:5355` / `:53` (resolver).
+
+**Firewalling note:** `db` binding to `0.0.0.0:9000` is a concern. In the
+normal m2 architecture, `db` only talks to auth/game cores on the same host
+and should bind to `127.0.0.1` only. Current binding is set by the
+`BIND_PORT = 9000` line in `share/conf/db.txt`, which in this server fork
+apparently defaults to `INADDR_ANY`. If the Contabo firewall or iptables/nft
+rules don't block 9000 from the outside, this is exposed.
**Open question: +verify iptables/nftables on the host, or move `db` to `127.0.0.1` explicitly +in source / config.** + +## Data directory layout + +All under `/home/mt2.jakubkadlec.dev/metin/runtime/server/share/`: + +``` +share/ +├── bin/ ← compiled binaries (only db + game present today) +├── conf/ ← db.txt, game.txt, CMD, item_proto.txt, mob_proto.txt, +│ item_names_*.txt, mob_names_*.txt (17 locales each) +├── data/ ← DTA/, dungeon/, easterevent/, mob_spawn/, monster/, +│ pc/, pc2/ (27 MB total) +├── locale/ ← 86 MB, per-locale strings + binary quest outputs +├── mark/ +└── package/ +``` + +Per-channel scaffolding under `channels/` symlinks `conf`, `data`, `locale` +back into `share/`, so each channel reads from a single canonical content +tree. + +## Disk usage footprint + +``` +/home/mt2.jakubkadlec.dev/metin/ 1.7 G (total metin workspace) + runtime/server/share/ 123 M + runtime/server/share/data/ 27 M + runtime/server/share/locale/ 86 M + runtime/server/channels/ 755 M + channels/db/core.178508{2,8} ~194 M (two 97 MB coredumps) + channels/db/syslog.log 36 M (grows fast) +``` + +Core dumps dominate the channel dir footprint right now. Cleaning up old +`core.*` files is safe when the db is not actively crashing (and only after +Jakub has looked at them). + +## How to restart channel1_core2 cleanly + +Pre-flight checklist: + +1. Confirm `share/bin/channel1_core2` actually exists on disk — right now it + does **not**, so the instance cannot start. Skip straight to the + "rebuild / redeploy" section in Jakub's `docs/deploy-workflow.md` + before trying. +2. Confirm `metin-db.service` and `metin-auth.service` are `active (running)` + (`systemctl is-active metin-db metin-auth`). If not, fix upstream first — + a clean restart of core2 requires a healthy auth + db. +3. Check that no player is currently online on that core. With `usage.txt` + at 0/0 this is trivially true today, but in prod do + `cat channels/channel1/core2/usage.txt` first. +4. 
Look at recent logs so you have a baseline: + `journalctl -u metin-game@channel1_core2 -n 50 --no-pager` + +Clean restart: + +```bash +# on the VPS as root or with sudo +systemctl restart metin-game@channel1_core2.service +systemctl status metin-game@channel1_core2.service --no-pager +journalctl -u metin-game@channel1_core2.service -n 100 --no-pager -f +``` + +Because the unit is `Type=simple` with `Restart=on-failure`, `systemctl +restart` sends SIGTERM, waits up to `TimeoutStopSec=60`, then brings the +process back up. The binary's own `hupsig()` handler logs the SIGTERM into +`syserr.log` and shuts down gracefully. + +Post-restart verification: + +```bash +ss -tlnp | grep -E ':(11012|12012)\b' # expect both ports listening +tail -n 30 /home/mt2.jakubkadlec.dev/metin/runtime/server/channels/channel1/core2/syserr.log +``` + +If the process refuses to stay up (`Restart=on-failure` loops it), **do not** +just bump `RestartSec`; grab the last 200 journal lines and the last 200 +syserr lines and open an issue in `metin-server/m2dev-server-src` against +Jakub. Do not edit the unit file ad-hoc on the host. + +## Open questions + +These are things the audit could not determine without making changes or +getting more access. They need a human operator to resolve. + +1. **Who produces the per-instance binaries** (`channel1_core1`, + `channel1_core2`, `channel1_core3`, `channel99_core1`, `game_auth`)? + The deploy flow expects them in `share/bin/` and channel dirs but they + are missing. Is this still hand-built, or is there a make target that + hardlinks `share/bin/game` into each `channel*/core*/` name? +2. **Why is `db` currently flapping** (`deactivating (stop-sigterm)` in + systemctl, plus two fresh core dumps on 2026-04-14 13:24/13:25 and + dozens of `CPeerBase::Recv()` errors)? Nothing should be connecting to + port 9000 right now. +3. 
**What the real `metin.env` contains** — specifically, the actual + `bootstrap` DB password, and whether there is a separate admin-page + password override. Audit did not touch `/etc/metin/metin.env`. +4. **Exact override-variable contract** between `share/conf/db.txt` + placeholders and the env file. We need to verify which env var names + the `db`/`game` source actually reads so we know whether the + `change-me` literal is ever used at runtime. +5. **Is `db` intended to bind `0.0.0.0:9000`?** From a defense-in-depth + standpoint it should be `127.0.0.1`. Needs either a source fix or a + host firewall rule. Check current nftables state. +6. **`VERSION.txt` says `db revision: b2b037f-dirty`.** Which tree was this + built from and why "dirty"? Point back at the `m2dev-server-src` + commit and confirm the build artefact is reproducible. +7. **Log rotation**: `channels/db/syslog.log` is already 36 MB today with + nothing connected. There is a `channels/channel1/core1/log/` dated + subdir convention that suggests daily rotation, but `db`'s own syslog + is not rotating. Confirm whether `logrotate` or an in-process rotator + is expected to own this. +8. **Hourly heartbeat in `usage.txt`** comes from where? Every ~1 h a row + is appended — this is probably the `db` backup tick, but confirm it's + not some cron job. +9. **`mysqld`'s live databases**: could not enumerate table names without + credentials. `docs/database-bootstrap.md` lists the expected set; + someone with `metin.env` access should confirm `account`, `player`, + `common`, `log`, `hotbackup` are all present and populated. +10. **Stale README**: top-level `README.md` still documents FreeBSD + + `start.py`. Not urgent, but worth a `docs:` sweep to point readers at + `docs/debian-runtime.md` as the canonical layout.
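As a concrete starting point for open question 4, enumerate `getenv()` call sites in the server source. A sketch — the checkout path is illustrative, and a fork-local wrapper around `getenv` would evade this pattern and need a second pass:

```shell
# Sketch for open question 4: list distinct getenv() call sites in a source
# tree, so the env-var override contract can be pinned to real names.
list_env_reads() {
  grep -rhoE 'getenv\("[A-Za-z0-9_]+"\)' "$1" 2>/dev/null | sort -u
}
# e.g.: list_env_reads ~/src/m2dev-server-src
```

Any `SQL`-looking names it prints are the candidates to cross-check against `/etc/metin/metin.env`.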