CI: Postgres service container state-bleeds across runs, enabling #259-style breakages on stale PRs #277

Closed
opened 2026-06-24 12:53:53 +00:00 by james · 1 comment
Owner

Context

PR #270's Postgres-leg failures were diagnosed as a stale-base + #259 KYSELY_TABLES omission. Rebasing on main resolves the immediate problem (the merge of #269 brought in the new migration AND the matching drop-list entries). But the investigation surfaced a deeper infra problem that's the actual enabler of every #259-shape failure: the Postgres service container in .forgejo/workflows/pr.yml is being reused across CI runs on the same self-hosted runner.

Evidence: PR #269 (Organizations) merged at 02:46Z and created organization_key_people. PR #270's CI started at 02:55Z (9 min later, on a stale base without migration 019). The job log shows cannot drop table people because other objects depend on it … constraint organization_key_people_person_id_fkey on table organization_key_people depends on table people — meaning organization_key_people was physically present in the postgres data volume at the START of PR #270's test run, even though #270's migrator never created it. That table can only have come from #269's CI run reusing the same data directory.

Each job's services: block declares image: postgres:16 and act_runner logs "Cleaning up services for job" at job end — but evidently the underlying postgres data volume is preserved on the host, so the fresh container comes up with leftover tables. The reuse is probably tied to Forgejo act_runner's host-network service-container behaviour.

This is why #259-shape bugs deterministically bite PRs that are stale across a table-adding migration: the orphan table from the previous PR's run is still in the data dir; the stale PR's KYSELY_TABLES doesn't know about it; the FK from the orphan blocks the people drop; all 15 Postgres test suites fail at _engines.ts:75.

Scope

Two independent fixes, both desirable; pick one or do both:

  1. Stop reusing the Postgres data between runs. Make each CI job get a fresh, empty Postgres. Options:

    • Add a --tmpfs /var/lib/postgresql/data mount to the service container so the data lives in RAM and dies with the container.
    • Or wipe the data dir explicitly in the Set up DB step (docker exec carol-postgres psql -c 'DROP DATABASE …; CREATE DATABASE …;' before migrate).
    • Or switch the service container to a unique container name per run so act_runner can't reuse it (need to check whether name reuse is actually the mechanism).
  2. Make KYSELY_TABLES derive from information_schema.tables instead of being hardcoded. Already flagged in #259 as the defence-in-depth fix. With this in place, a fresh-then-orphaned database still tears down cleanly because the drop list discovers tables at runtime in FK order. Closes the trap on the test side even if the infra side stays unfixed.

Doing #1 fixes the visible symptom (CI green on stale PRs); doing #2 prevents the next variant of this from biting (e.g. an act_runner upgrade that changes service-container semantics).

Acceptance criteria

  • A PR that is stale across a recently-merged table-adding migration passes Postgres tests in CI without needing a rebase. (Reproduce by branching off main~2, opening a PR, watching it go green.)
  • No KYSELY_TABLES modification required when a new table-creating migration lands; or, alternatively, a CI assertion that fails if KYSELY_TABLES and createTable calls in apps/api/db/migrations/ go out of sync (#259's other path).

Investigation pointers

  • .forgejo/workflows/pr.yml lines 224–242 (per agent investigation) — the services: block declaring the postgres container.
  • apps/api/tests/db/_engines.ts:75 — the dropTable(name).ifExists().cascade?().execute() line that's been the failure point on PR #249, #250, and #270.
  • Forgejo act_runner docs on service-container lifecycle: https://forgejo.org/docs/latest/admin/actions/.
  • Recent stale-base failure on PR #270: head SHA 3616ce3c, run timestamp 02:55Z 2026-06-24.

Out of scope

  • Switching CI off self-hosted runners.
  • Replacing Forgejo Actions.
  • Changing the dual-engine test strategy (CLAUDE.md mandates SQLite + Postgres parity).

Composes with / part of

  • #259 — same root cause family; this ticket extends the scope to the CI infra side.
  • Defends against: any future "PR is 1 commit stale and CI starts failing across 15 unrelated test files" incident.
## Context PR #270's Postgres-leg failures were diagnosed as a stale-base + #259 KYSELY_TABLES omission. Rebasing on main resolves the immediate problem (the merge of #269 brought in the new migration AND the matching drop-list entries). But the investigation surfaced a deeper infra problem that's the actual enabler of every #259-shape failure: **the Postgres service container in `.forgejo/workflows/pr.yml` is being reused across CI runs on the same self-hosted runner.** Evidence: PR #269 (Organizations) merged at 02:46Z and created `organization_key_people`. PR #270's CI started at 02:55Z (9 min later, on a stale base without migration 019). The job log shows `cannot drop table people because other objects depend on it … constraint organization_key_people_person_id_fkey on table organization_key_people depends on table people` — meaning `organization_key_people` was physically present in the postgres data volume at the START of PR #270's test run, even though #270's migrator never created it. That table can only have come from #269's CI run reusing the same data directory. Each job's `services:` block declares `image: postgres:16` and `act_runner` logs "Cleaning up services for job" at job end — but evidently the underlying postgres data volume is preserved on the host, so the fresh container comes up with leftover tables. The reuse is probably tied to Forgejo act_runner's host-network service-container behaviour. This is why #259-shape bugs deterministically bite PRs that are stale across a table-adding migration: the orphan table from the previous PR's run is still in the data dir; the stale PR's `KYSELY_TABLES` doesn't know about it; the FK from the orphan blocks the `people` drop; all 15 Postgres test suites fail at `_engines.ts:75`. ## Scope Two independent fixes, both desirable; pick one or do both: 1. **Stop reusing the Postgres data between runs.** Make each CI job get a fresh, empty Postgres. Options: - Add a `--tmpfs /var/lib/postgresql/data` mount to the service container so the data lives in RAM and dies with the container. - Or wipe the data dir explicitly in the `Set up DB` step (`docker exec carol-postgres psql -c 'DROP DATABASE …; CREATE DATABASE …;'` before migrate). - Or switch the service container to a unique container name per run so act_runner can't reuse it (need to check whether name reuse is actually the mechanism). 2. **Make `KYSELY_TABLES` derive from `information_schema.tables` instead of being hardcoded.** Already flagged in [#259](https://forge.wynning.tech/james/carol/issues/259) as the defence-in-depth fix. With this in place, a fresh-then-orphaned database still tears down cleanly because the drop list discovers tables at runtime in FK order. Closes the trap on the test side even if the infra side stays unfixed. Doing #1 fixes the *visible* symptom (CI green on stale PRs); doing #2 prevents the next variant of this from biting (e.g. an `act_runner` upgrade that changes service-container semantics). ## Acceptance criteria - [ ] A PR that is stale across a recently-merged table-adding migration passes Postgres tests in CI without needing a rebase. (Reproduce by branching off `main~2`, opening a PR, watching it go green.) - [ ] No `KYSELY_TABLES` modification required when a new table-creating migration lands; or, alternatively, a CI assertion that fails if `KYSELY_TABLES` and `createTable` calls in `apps/api/db/migrations/` go out of sync (#259's other path). ## Investigation pointers - `.forgejo/workflows/pr.yml` lines 224–242 (per agent investigation) — the `services:` block declaring the postgres container. - `apps/api/tests/db/_engines.ts:75` — the `dropTable(name).ifExists().cascade?().execute()` line that's been the failure point on PR #249, #250, and #270. - Forgejo `act_runner` docs on service-container lifecycle: <https://forgejo.org/docs/latest/admin/actions/>. - Recent stale-base failure on PR #270: head SHA `3616ce3c`, run timestamp 02:55Z 2026-06-24. ## Out of scope - Switching CI off self-hosted runners. - Replacing Forgejo Actions. - Changing the dual-engine test strategy (CLAUDE.md mandates SQLite + Postgres parity). ## Composes with / part of - [#259](https://forge.wynning.tech/james/carol/issues/259) — same root cause family; this ticket extends the scope to the CI infra side. - Defends against: any future "PR is 1 commit stale and CI starts failing across 15 unrelated test files" incident.
Author
Owner

Do both fixes 1 and 2. The tempfs solution sounds simplest.

Do both fixes 1 and 2. The tempfs solution sounds simplest.
james closed this issue 2026-06-26 23:04:11 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
james/carol#277
No description provided.