m2dev-server/docs/server-runtime.md
2026-04-14 13:36:54 +02:00

Server runtime audit

Engineer-to-engineer writeup of what the VPS mt2.jakubkadlec.dev is actually running as of 2026-04-14. Existing docs under docs/ describe the intended layout (debian-runtime.md, database-bootstrap.md, config-and-secrets.md); this document is a ground-truth snapshot from a live recon session, with PIDs, paths, versions and surprises.

Companion: docs/server-topology.md for the ASCII diagram and port table.

TL;DR

  • Only one metin binary is alive right now: the db helper on port 9000 (PID 1788997 at audit time, cwd /home/mt2.jakubkadlec.dev/metin/runtime/server/channels/db).
  • game_auth and all channel*_core* processes are NOT running. The port listing in the existing docs (auth :11000/12000, channel1 cores :11011/12011 etc.) reflects intended state from the systemd units, not the current live process table. ss -tlnp only shows 0.0.0.0:9000 for m2.
  • The per-instance binaries are not present on disk either. share/bin/ holds only db and a shared game binary; there is no share/bin/game_auth and no share/bin/channel*_core*. Auth and the game channels cannot start even if requested.
  • The db unit is currently flapping / crash-looping. systemctl reports deactivating (stop-sigterm); syserr.log shows repeated Connection reset by peer from client peers (auth/game trying to reconnect is the usual culprit, but here nobody is connecting — cause needs verification). Two fresh core.<pid> files (97 MB each) sit in the db channel dir from 13:24 and 13:25 today.
  • Orchestration is pure systemd, not the upstream start.py / tmux setup. The README still documents start.py, so the README is stale for the Debian VPS; deploy/systemd/ + docs/debian-runtime.md are authoritative.
  • MariaDB 11.8.6 is the backing store on 127.0.0.1:3306. The DB user the stack is configured to use is bootstrap (from share/conf/db.txt / game.txt). The actual password is injected via /etc/metin/metin.env, which is root:root 600 and intentionally unreadable by the runtime user inspector account.

Host

  • Hostname: vmi3229987 (Contabo), public name mt2.jakubkadlec.dev.
  • OS: Debian 13 (trixie).
  • MariaDB: mariadbd 11.8.6, PID 103624, listening on 127.0.0.1:3306.
  • All metin services run as the unprivileged user mt2.jakubkadlec.dev:mt2.jakubkadlec.dev.
  • Runtime root: /home/mt2.jakubkadlec.dev/metin/runtime/server (755 MB across channels/, 123 MB across share/, total metin workspace on the box ~1.7 GB).

Processes currently alive

From ps auxf + ss -tlnp at audit time:

mysql    103624  /usr/sbin/mariadbd                     — 127.0.0.1:3306
mt2.j+  1788997  /home/.../channels/db/db               — 0.0.0.0:9000

No other m2 binaries show up. ps has zero matches for game_auth, channel1_core1, channel1_core2, channel1_core3, channel99_core1.

Per-process inspection:

| PID | cwd | exe (resolved) | fds of interest |
|---|---|---|---|
| 1788997 | .../runtime/server/channels/db | .../share/bin/db (via ./db symlink) | fd 3 → syslog.log, fd 4 → syserr.log, fd 11 → TCP *:9000, fd 17 → [eventpoll] (epoll fdwatch) |

The db symlink inside the channel dir resolves to ../../share/bin/db, which is an ELF 64-bit LSB pie executable, x86-64, dynamically linked, BuildID fc049d0f..., not stripped. Build identifier from channels/db/VERSION.txt: db revision: b2b037f-dirty — the dirty tag is a red flag: the build wasn't made from a clean checkout of m2dev-server-src.

The usage.txt in the same directory shows hourly heartbeat rows with | 0 | 0 | since 2026-04-13 21:00 (the "sessions / active" columns are stuck at zero — consistent with no game channels being connected).

Binaries actually present on disk

/home/mt2.jakubkadlec.dev/metin/runtime/server/share/bin/
├── db        ← present, used
└── game      ← present (shared game binary, but not launched under any
                instance name that the systemd generator expects)

What is NOT present:

  • share/bin/game_auth
  • share/bin/channel1_core1, channel1_core2, channel1_core3
  • share/bin/channel99_core1

The metin-game-instance-start helper (/usr/local/libexec/...) is a bash wrapper that cds into channels/<channel>/<core>/ and execs ./<instance>, e.g. ./channel1_core1. Those per-instance binaries don't exist yet. The channel dirs themselves (channel1/core1/, etc.) already contain the scaffolding (CONFIG, conf, data, log, mark, package, p2p_packet_info.txt, packet_info.txt, syserr.log, syslog.log, version.txt), but version.txt says game revision: unknown and the per-instance executable file is missing. The log directory has a single stale syslog_2026-04-13.log.

Interpretation: the deploy pipeline that builds m2dev-server-src and drops instance binaries into share/bin/ has not yet been run (or has not been re-run since the tree was laid out on 2026-04-13). Once Jakub's debian-foundation build produces per-instance symlinked/hardlinked binaries, the metin-game@* units should come up automatically on the next systemctl restart metin-server.
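A quick existence check for the expected set. The instance list below is read off the systemd units described later in this doc; `BIN_DIR` is parameterised so the sketch can be dry-run off-host:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Report which expected m2 binaries exist (and are executable) in a bin dir.
# Parameterised so the check can be exercised anywhere, not just the VPS.
check_bins() {
  local bin_dir="$1"; shift
  local name
  for name in "$@"; do
    if [[ -x "${bin_dir}/${name}" ]]; then
      printf 'present: %s\n' "$name"
    else
      printf 'MISSING: %s\n' "$name"
    fi
  done
}

# Expected set, inferred from the systemd units on this host.
check_bins "${BIN_DIR:-/home/mt2.jakubkadlec.dev/metin/runtime/server/share/bin}" \
  db game game_auth channel1_core1 channel1_core2 channel1_core3 channel99_core1
```

Run on the VPS today, this should report `present` only for `db` and `game`.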

How things are started

All orchestration goes through systemd units under /etc/systemd/system/, installed from deploy/systemd/ via deploy/systemd/install_systemd.py.

Unit list and roles:

| Unit | Type | Role |
|---|---|---|
| metin-server.service | oneshot | top-level grouping, Requires=mariadb.service. ExecStart=/bin/true, RemainAfterExit=yes. All sub-units are PartOf=metin-server.service so restarting metin-server cycles everything. |
| metin-db.service | simple | launches .../channels/db/db as the runtime user, Restart=on-failure, LimitCORE=infinity, env file /etc/metin/metin.env. |
| metin-db-ready.service | oneshot | runs /usr/local/libexec/metin-wait-port 127.0.0.1 9000 30 — gate that blocks auth+game until the DB socket is listening. |
| metin-auth.service | simple | launches .../channels/auth/game_auth. Requires db-ready. |
| metin-game@channel1_core1..3.service | template | each runs /usr/local/libexec/metin-game-instance-start <instance>, which execs ./<instance> in that channel dir. |
| metin-game@channel99_core1.service | template | same, for channel 99. |

Dependency chain:

mariadb.service
      │
      ▼
metin-db.service ──► metin-db-ready.service ──► metin-auth.service
                                             └► metin-game@*.service
                                                    │
                                                    ▼
                                             metin-server.service  (oneshot umbrella)

All units have PartOf=metin-server.service, Restart=on-failure, LimitNOFILE=65535, LimitCORE=infinity. None run in Docker. None use tmux, screen or the upstream start.py. The upstream start.py / stop.py in the repo are NOT wired up on this host and should be treated as FreeBSD-era legacy.
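metin-wait-port itself was not read during this audit; a plausible minimal equivalent of what metin-db-ready.service needs, sketched here with bash's /dev/tcp pseudo-device (an assumption — the real helper may use nc or python instead):

```shell
#!/usr/bin/env bash
# Sketch of a metin-wait-port HOST PORT TIMEOUT helper. Not the installed
# script: it polls once per second until a TCP connect succeeds, then exits 0.
set -u

wait_port() {
  local host="$1" port="$2" timeout="${3:-30}" i
  for ((i = 0; i < timeout; i++)); do
    # ':' opens the pseudo-device read-only, i.e. attempts one TCP connect.
    if (: <"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Only act as a CLI when given arguments (keeps sourcing side-effect free).
if (( $# )); then
  wait_port "$@"
fi
```

The nonzero exit on timeout is what lets the oneshot unit fail and hold back the auth/game dependents.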

The per-instance launcher /usr/local/libexec/metin-game-instance-start (installed by install_systemd.py) is:

#!/usr/bin/env bash
set -euo pipefail
instance="${1:?missing instance name}"
root_dir="/home/mt2.jakubkadlec.dev/metin/runtime/server/channels"
channel_dir="${instance%_*}"           # e.g. channel1 from channel1_core2
core_dir="${instance##*_}"             # e.g. core2
workdir="${root_dir}/${channel_dir}/${core_dir}"
cd "$workdir"
exec "./${instance}"

Notes:

  • the %_* / ##*_ parse is brittle — an instance name with more than one underscore would misbehave. For current naming (channelN_coreM) it works.
  • the helper does not redirect stdout/stderr; both go to the journal via systemd.
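A stricter variant of that split — validating the name before trusting the expansions — could look like this (sketch only, not what is installed):

```shell
#!/usr/bin/env bash
# Validate-then-split for instance names: anything that is not channelN_coreM
# fails loudly instead of producing a bogus workdir. Not the installed helper.
set -euo pipefail

split_instance() {
  local instance="$1"
  if [[ "$instance" =~ ^(channel[0-9]+)_(core[0-9]+)$ ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    printf 'unrecognised instance name: %s\n' "$instance" >&2
    return 1
  fi
}
```

With this, a hypothetical future name like `channel1_test_core2` is rejected instead of silently resolving to `channels/channel1_test/core2`.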

Config files the binaries actually read

All m2 config files referenced by the running/installed stack, resolved to their real path on disk:

| Config file | Read by | Purpose |
|---|---|---|
| share/conf/db.txt | db | SQL hosts, BIND_PORT=9000, item id range, hotbackup |
| share/conf/game.txt | game cores | DB_ADDR=127.0.0.1, DB_PORT=9000, SQL creds, flags |
| share/conf/CMD | game cores | in-game command ACL (notice, warp, item, …) |
| share/conf/item_proto.txt, mob_proto.txt, item_names*.txt, mob_names*.txt | both db and game | static content tables |
| channels/db/conf (symlink → share/conf) | db | every db channel looks into this flat conf tree |
| channels/db/data (symlink → share/data) | db/game | mob/pc/dungeon/spawn data |
| channels/db/locale (symlink → share/locale) | all | locale assets |
| channels/auth/CONFIG | game_auth | HOSTNAME: auth, CHANNEL: 1, PORT: 11000, P2P_PORT: 12000, AUTH_SERVER: master |
| channels/channel1/core1/CONFIG | core1 | HOSTNAME: channel1_1, CHANNEL: 1, PORT: 11011, P2P_PORT: 12011, MAP_ALLOW: 1 4 5 6 3 23 43 112 107 67 68 72 208 302 304 |
| channels/channel1/core2/CONFIG | core2 | PORT: 11012, P2P_PORT: 12012 |
| channels/channel1/core3/CONFIG | core3 | PORT: 11013, P2P_PORT: 12013 |
| channels/channel99/core1/CONFIG | ch99 core1 | HOSTNAME: channel99_1, CHANNEL: 99, PORT: 11991, P2P_PORT: 12991, MAP_ALLOW: 113 81 100 101 103 105 110 111 114 118 119 120 121 122 123 124 125 126 127 128 181 182 183 200 |
| /etc/metin/metin.env | all systemd units via EnvironmentFile=- | host-local secrets/overrides, root:root mode 600. Contents not readable during this audit. |
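The core CONFIG rows above follow a visible numbering convention (auth at 11000/12000 is the one exception). A helper that derives the expected ports from an instance name — the pattern is inferred from these four rows only, and the CONFIG files stay authoritative:

```shell
#!/usr/bin/env bash
# Derive expected PORT / P2P_PORT for a core instance from its name.
# Pattern inferred from the audited CONFIG files: PORT = 11000 + 10*channel
# + core, P2P_PORT = PORT + 1000. Auth (11000/12000) does not follow it.
set -euo pipefail

instance_ports() {
  local instance="$1" channel core port
  [[ "$instance" =~ ^channel([0-9]+)_core([0-9]+)$ ]] || return 1
  channel="${BASH_REMATCH[1]}"; core="${BASH_REMATCH[2]}"
  port=$((11000 + 10 * channel + core))
  printf 'PORT=%d P2P_PORT=%d\n' "$port" "$((port + 1000))"
}
```

Useful for post-restart port checks without opening the CONFIG file each time.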

Flat share/conf/db.txt (verbatim, with bootstrap secrets):

WELCOME_MSG  = "Database connector is running..."
SQL_ACCOUNT  = "127.0.0.1 account bootstrap change-me 0"
SQL_PLAYER   = "127.0.0.1 player bootstrap change-me 0"
SQL_COMMON   = "127.0.0.1 common bootstrap change-me 0"
SQL_HOTBACKUP= "127.0.0.1 hotbackup bootstrap change-me 0"
TABLE_POSTFIX = ""
BIND_PORT               = 9000
CLIENT_HEART_FPS        = 60
HASH_PLAYER_LIFE_SEC    = 600
BACKUP_LIMIT_SEC        = 3600
PLAYER_ID_START         = 100
PLAYER_DELETE_LEVEL_LIMIT = 70
PLAYER_DELETE_CHECK_SIMPLE = 1
ITEM_ID_RANGE           = 2000000000 2100000000
MIN_LENGTH_OF_SOCIAL_ID = 6
SIMPLE_SOCIALID         = 1

The bootstrap / change-me values are git-tracked placeholders. config-and-secrets.md explicitly says these are templates, and real values are expected to come from /etc/metin/metin.env. That only works if the server source overrides the in-file credentials from the environment; verify by grepping m2dev-server-src for the SQL env var names used by db/game. (Open question: confirm which env var names override the in-file creds; the audit session couldn't read metin.env directly.)
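For tooling around these files, the quoted quintuple splits into fields as below. The field order is read directly off the file above; the trailing numeric field is assumed (unconfirmed against source) to be a port, with 0 meaning the library default:

```shell
#!/usr/bin/env bash
# Split a db.txt SQL_* line into named fields. Order read off the audited
# file: host, schema, user, password, then a trailing number assumed
# (unconfirmed) to be a port, 0 meaning the default.
set -euo pipefail

parse_sql_line() {
  local line="$1" value host schema user password port
  value="${line#*\"}"     # drop everything up to the opening quote
  value="${value%\"*}"    # drop the closing quote
  read -r host schema user password port <<<"$value"
  printf 'host=%s schema=%s user=%s password=%s port=%s\n' \
    "$host" "$schema" "$user" "$password" "$port"
}
```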

Database

  • Engine: MariaDB 11.8.6 (mariadb --version).
  • PID: 103624, listening on 127.0.0.1:3306 only. No external TCP exposure; the unix socket was not checked (likely /run/mysqld/mysqld.sock).
  • Expected databases from docs/database-bootstrap.md: account, player, common, log, hotbackup.
  • Stack-side DB user: bootstrap (placeholder in git, real password in /etc/metin/metin.env).
  • Could not enumerate actual tables during the audit — both mysql -uroot and sudo -u mt2.jakubkadlec.dev mariadb failed (Access denied), since root uses unix-socket auth for root@localhost and the runtime user has no CLI credentials outside the systemd environment.
  • To inspect the DB read-only: either run as root with sudo mariadb (unix socket auth — needs confirmation it's enabled), or open /etc/metin/metin.env as root, grab the bootstrap password, then mariadb -ubootstrap -p account etc. Do not attempt writes.

Logging

Every m2 process writes two files in its channel dir, via fd 3 / fd 4:

  • syslog.log — verbose info stream (rotated by date in some dirs: channel1/core1/log/syslog_2026-04-13.log).
  • syserr.log — error stream. Look here first on crash.

The db channel's syslog.log is already 36 MB today, and rotation appears to be manual — there is a log/ dir with a daily file, but the current syslog.log sits at the top level. The db also drops core.<pid> ELF cores into the channel dir on SIGSEGV/SIGABRT, because LimitCORE=infinity is set.

systemd journal captures stdout/stderr as well, so journalctl -u metin-db --since '1 hour ago' is the fastest way to see startup banners and systemd-observed restarts. Example from this audit:

Apr 14 13:26:40 vmi3229987 db[1788997]: Real Server
Apr 14 13:26:40 vmi3229987 db[1788997]: Success ACCOUNT
Apr 14 13:26:40 vmi3229987 db[1788997]: Success COMMON
Apr 14 13:26:40 vmi3229987 db[1788997]: Success HOTBACKUP
Apr 14 13:26:40 vmi3229987 db[1788997]: mysql_real_connect: Lost connection
     to server at 'sending authentication information', system error: 104

On every db start it opens around a dozen AsyncSQL pools ("AsyncSQL: connected to 127.0.0.1 (reconnect 1)" repeated ~12 times), suggesting a large per-instance pool size. Worth checking whether that needs tuning.
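To pin down the actual pool count rather than eyeballing it, counting the banner lines in the journal is enough; a trivial wrapper for piped input:

```shell
#!/usr/bin/env bash
# Count AsyncSQL connection banners in piped journal output.
# On the host: journalctl -u metin-db -b | count_async_pools
set -euo pipefail

count_async_pools() {
  grep -c 'AsyncSQL: connected to' || true   # grep exits 1 on zero matches
}
```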

The current syserr.log in channels/db/ is dominated by:

[error] [int CPeerBase::Recv()()] socket_read failed Connection reset by peer
[error] [int CClientManager::Process()()] Recv failed

which is the peer disconnect path. Since no auth/game peers should be connecting right now, this is either a leftover from an earlier start or something else (maybe a healthcheck probe) is touching 9000 and aborting. See open questions.

Ports

Live ss -tlnp on the VPS (m2-relevant lines only):

| Local address:port | Who | Exposure |
|---|---|---|
| 0.0.0.0:9000 | db | INADDR_ANY — listens on all interfaces. Look at this. |
| 127.0.0.1:3306 | mariadbd | localhost only |

Not currently listening (would be if auth/game were up):

  • 11000 / 12000 — auth client + p2p
  • 11011..11013 / 12011..12013 — channel1 cores + p2p
  • 11991 / 12991 — channel99 core1 + p2p

Other listeners on the host (not m2): :22, :2222 (gitea ssh), :25 (postfix loopback), :80/:443 (Caddy), :3000 (Gitea), :2019 (Caddy admin), :33891 (unknown loopback), :5355 / :53 (resolver).

Firewalling note: db binding to 0.0.0.0:9000 is a concern. In the normal m2 architecture, db only talks to auth/game cores on the same host and should bind to 127.0.0.1 only. Current binding is set by the BIND_PORT = 9000 line in share/conf/db.txt, which in this server fork apparently defaults to INADDR_ANY. If the Contabo firewall or iptables/nft rules don't block 9000 from the outside, this is exposed. Open question: verify iptables/nftables on the host, or move db to 127.0.0.1 explicitly in source / config.
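A small triage helper for the exposure question — feed it `ss -tln` output and it prints only wildcard binds (column layout assumed to be current iproute2 ss, with the local address in field 4):

```shell
#!/usr/bin/env bash
# Print sockets bound to all interfaces from `ss -tln` output.
# On the host: ss -tln | flag_wildcard_listeners
# Assumes the current iproute2 ss column layout (local addr:port = field 4).
set -euo pipefail

flag_wildcard_listeners() {
  awk 'NR > 1 && ($4 ~ /^0\.0\.0\.0:/ || $4 ~ /^\*:/ || $4 ~ /^\[::\]:/) { print $4 }'
}
```

On this host it should print exactly 0.0.0.0:9000; pair it with `nft list ruleset` (as root) to see whether anything actually filters that port from the outside.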

Data directory layout

All under /home/mt2.jakubkadlec.dev/metin/runtime/server/share/:

share/
├── bin/          ← compiled binaries (only db + game present today)
├── conf/         ← db.txt, game.txt, CMD, item_proto.txt, mob_proto.txt,
│                   item_names_*.txt, mob_names_*.txt (17 locales each)
├── data/         ← DTA/, dungeon/, easterevent/, mob_spawn/, monster/,
│                   pc/, pc2/ (27 MB total)
├── locale/       ← 86 MB, per-locale strings + binary quest outputs
├── mark/
└── package/

Per-channel scaffolding under channels/ symlinks conf, data, locale back into share/, so each channel reads from a single canonical content tree.

Disk usage footprint

/home/mt2.jakubkadlec.dev/metin/             1.7 G   (total metin workspace)
    runtime/server/share/                    123 M
        runtime/server/share/data/            27 M
        runtime/server/share/locale/          86 M
    runtime/server/channels/                  755 M
        channels/db/core.178508{2,8}        ~194 M   (two 97 MB coredumps)
        channels/db/syslog.log                36 M   (grows fast)

Core dumps dominate the channel dir footprint right now. Cleaning up old core.* files is safe when the db is not actively crashing (and only after Jakub has looked at them).
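When cleanup does get the green light, listing candidates first keeps it deliberate; a sketch that only prints core files older than a cutoff (deletion stays a manual step after review):

```shell
#!/usr/bin/env bash
# List (never delete) core files older than a cutoff, so cleanup remains a
# deliberate, reviewed action. DIR is a parameter for off-host dry-runs.
set -euo pipefail

list_old_cores() {
  local dir="$1" days="${2:-7}"
  find "$dir" -maxdepth 1 -name 'core.*' -type f -mtime "+${days}" -print
}
```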

How to restart channel1_core2 cleanly

Pre-flight checklist:

  1. Confirm share/bin/channel1_core2 actually exists on disk — right now it does not, so the instance cannot start. Skip straight to the "rebuild / redeploy" section in Jakub's docs/deploy-workflow.md before trying.
  2. Confirm metin-db.service and metin-auth.service are active (running) (systemctl is-active metin-db metin-auth). If not, fix upstream first — a clean restart of core2 requires a healthy auth + db.
  3. Check that no player is currently online on that core. With usage.txt at 0/0 this is trivially true today, but in prod do cat channels/channel1/core2/usage.txt first.
  4. Look at recent logs so you have a baseline: journalctl -u metin-game@channel1_core2 -n 50 --no-pager

Clean restart:

# on the VPS as root or with sudo
systemctl restart metin-game@channel1_core2.service
systemctl status  metin-game@channel1_core2.service --no-pager
journalctl -u metin-game@channel1_core2.service -n 100 --no-pager -f

Because the unit is Type=simple with Restart=on-failure, systemctl restart sends SIGTERM, waits up to TimeoutStopSec=60, then brings the process back up. The binary's own hupsig() handler logs the SIGTERM into syserr.log and shuts down gracefully.

Post-restart verification:

ss -tlnp | grep -E ':(11012|12012)\b'       # expect both ports listening
tail -n 30 /home/mt2.jakubkadlec.dev/metin/runtime/server/channels/channel1/core2/syserr.log

If the process refuses to stay up (Restart=on-failure loops it), do not just bump RestartSec; grab the last 200 journal lines and the last 200 syserr lines and open an issue in metin-server/m2dev-server-src against Jakub. Do not edit the unit file ad-hoc on the host.

Open questions

These are things the audit could not determine without making changes or getting more access. They need a human operator to resolve.

  1. Who produces the per-instance binaries (channel1_core1, channel1_core2, channel1_core3, channel99_core1, game_auth)? The deploy flow expects them in share/bin/ and channel dirs but they are missing. Is this still hand-built, or is there a make target that hardlinks share/bin/game into each channel*/core*/<instance> name?
  2. Why is db currently flapping (deactivating (stop-sigterm) in systemctl, plus two fresh core dumps on 2026-04-14 13:24/13:25 and dozens of CPeerBase::Recv() errors)? Nothing should be connecting to port 9000 right now.
  3. What the real metin.env contains — specifically, the actual bootstrap DB password, and whether there is a separate admin-page password override. Audit did not touch /etc/metin/metin.env.
  4. Exact override-variable contract between share/conf/db.txt placeholders and the env file. We need to verify which env var names the db/game source actually reads so we know whether the change-me literal is ever used at runtime.
  5. Is db intended to bind 0.0.0.0:9000? From a defense-in-depth standpoint it should be 127.0.0.1. Needs either a source fix or a host firewall rule. Check current nftables state.
  6. VERSION.txt says db revision: b2b037f-dirty. Which tree was this built from and why "dirty"? Point back at the m2dev-server-src commit and confirm the build artefact is reproducible.
  7. Log rotation: channels/db/syslog.log is already 36 MB today with nothing connected. There is a channels/channel1/core1/log/ dated subdir convention that suggests daily rotation, but db's own syslog is not rotating. Confirm whether logrotate or an in-process rotator is expected to own this.
  8. Where does the hourly heartbeat in usage.txt come from? A row is appended roughly every hour — probably the db backup tick, but confirm it's not some cron job.
  9. mysqld's live databases: could not enumerate table names without credentials. docs/database-bootstrap.md lists the expected set; someone with metin.env access should confirm account, player, common, log, hotbackup are all present and populated.
  10. Stale README: top-level README.md still documents FreeBSD + start.py. Not urgent, but worth a docs: sweep to point readers at docs/debian-runtime.md as the canonical layout.