CI: Postgres service container state-bleeds across runs, enabling #259-style breakages on stale PRs #277
Context
PR #270's Postgres-leg failures were diagnosed as a stale-base + #259 `KYSELY_TABLES` omission. Rebasing on main resolves the immediate problem (the merge of #269 brought in the new migration AND the matching drop-list entries). But the investigation surfaced a deeper infra problem that's the actual enabler of every #259-shape failure: the Postgres service container in `.forgejo/workflows/pr.yml` is being reused across CI runs on the same self-hosted runner.

Evidence: PR #269 (Organizations) merged at 02:46Z and created `organization_key_people`. PR #270's CI started at 02:55Z (9 min later, on a stale base without migration 019). The job log shows `cannot drop table people because other objects depend on it … constraint organization_key_people_person_id_fkey on table organization_key_people depends on table people`, meaning `organization_key_people` was physically present in the postgres data volume at the START of PR #270's test run, even though #270's migrator never created it. That table can only have come from #269's CI run reusing the same data directory.

Each job's `services:` block declares `image: postgres:16` and `act_runner` logs "Cleaning up services for job" at job end, but evidently the underlying postgres data volume is preserved on the host, so the fresh container comes up with leftover tables. The reuse is probably tied to Forgejo act_runner's host-network service-container behaviour.

This is why #259-shape bugs deterministically bite PRs that are stale across a table-adding migration: the orphan table from the previous PR's run is still in the data dir; the stale PR's `KYSELY_TABLES` doesn't know about it; the FK from the orphan blocks the `people` drop; all 15 Postgres test suites fail at `_engines.ts:75`.

Scope
Two independent fixes, both desirable; pick one or do both:

1. Stop reusing the Postgres data between runs. Make each CI job get a fresh, empty Postgres. Options:
   - Add a `--tmpfs /var/lib/postgresql/data` mount to the service container so the data lives in RAM and dies with the container.
   - Recreate the database in the `Set up DB` step (`docker exec carol-postgres psql -c 'DROP DATABASE …; CREATE DATABASE …;'` before migrate).
2. Make `KYSELY_TABLES` derive from `information_schema.tables` instead of being hardcoded. Already flagged in #259 as the defence-in-depth fix. With this in place, a fresh-then-orphaned database still tears down cleanly because the drop list discovers tables at runtime in FK order. Closes the trap on the test side even if the infra side stays unfixed.

Doing fix 1 fixes the visible symptom (CI green on stale PRs); doing fix 2 prevents the next variant of this from biting (e.g. an `act_runner` upgrade that changes service-container semantics).

Acceptance criteria
- … `main~2`, opening a PR, watching it go green.)
- No `KYSELY_TABLES` modification required when a new table-creating migration lands; or, alternatively, a CI assertion that fails if `KYSELY_TABLES` and `createTable` calls in `apps/api/db/migrations/` go out of sync (#259's other path).

Investigation pointers
- `.forgejo/workflows/pr.yml` lines 224–242 (per agent investigation): the `services:` block declaring the postgres container.
- `apps/api/tests/db/_engines.ts:75`: the `dropTable(name).ifExists().cascade?().execute()` line that's been the failure point on PR #249, #250, and #270.
- `act_runner` docs on service-container lifecycle: https://forgejo.org/docs/latest/admin/actions/.
- … `3616ce3c`, run timestamp 02:55Z 2026-06-24.

Out of scope
Composes with / part of
Do both fixes 1 and 2. The tmpfs solution sounds simplest.
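
For fix 2 (deriving the drop list from `information_schema.tables` in FK order), a rough sketch of what the runtime discovery might look like. Everything named here is hypothetical and not taken from the repo: `dropOrder`, `discoverDropOrder`, the `RunQuery` callback, and both SQL strings, which assume Postgres and the `public` schema. The ordering is a plain topological sort (Kahn's algorithm) so that a referencing table such as `organization_key_people` is dropped before `people`.

```typescript
// One foreign-key edge: `child` holds the FK column, `parent` is the referenced table.
interface ForeignKey {
  child: string;
  parent: string;
}

// Assumed discovery queries (Postgres, `public` schema). Not taken from the repo.
const TABLES_SQL = `
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'public' AND table_type = 'BASE TABLE'`;

const FOREIGN_KEYS_SQL = `
  SELECT tc.table_name AS child, ccu.table_name AS parent
  FROM information_schema.table_constraints tc
  JOIN information_schema.constraint_column_usage ccu
    ON ccu.constraint_name = tc.constraint_name
   AND ccu.constraint_schema = tc.constraint_schema
  WHERE tc.constraint_type = 'FOREIGN KEY' AND tc.table_schema = 'public'`;

// Returns the tables ordered so that every referencing (child) table
// comes before the table it references.
function dropOrder(tables: string[], fks: ForeignKey[]): string[] {
  const known = new Set(tables);
  const pending = new Map<string, number>(); // table -> undropped tables referencing it
  const parentsOf = new Map<string, string[]>();
  for (const t of tables) {
    pending.set(t, 0);
    parentsOf.set(t, []);
  }
  const seen = new Set<string>();
  for (const { child, parent } of fks) {
    // Skip self-references, edges to unknown tables, and duplicate edges
    // (a multi-column FK yields one row per column).
    const key = `${child}\u0000${parent}`;
    if (child === parent || !known.has(child) || !known.has(parent) || seen.has(key)) continue;
    seen.add(key);
    parentsOf.get(child)!.push(parent);
    pending.set(parent, pending.get(parent)! + 1);
  }
  // Kahn's algorithm: start with the tables nothing references.
  const ready = tables.filter((t) => pending.get(t) === 0).sort();
  const order: string[] = [];
  while (ready.length > 0) {
    const t = ready.shift()!;
    order.push(t);
    for (const p of parentsOf.get(t)!) {
      const left = pending.get(p)! - 1;
      pending.set(p, left);
      if (left === 0) ready.push(p);
    }
  }
  // Tables in (or referenced from) an FK cycle never reach zero; append them
  // last so the caller can fall back to CASCADE for those.
  for (const t of tables) if (!order.includes(t)) order.push(t);
  return order;
}

// Hypothetical glue: `run` executes raw SQL and resolves to the result rows.
type RunQuery = (sql: string) => Promise<Array<Record<string, string>>>;

async function discoverDropOrder(run: RunQuery): Promise<string[]> {
  const tables = (await run(TABLES_SQL)).map((r) => r.table_name);
  const fks = (await run(FOREIGN_KEYS_SQL)).map((r) => ({ child: r.child, parent: r.parent }));
  return dropOrder(tables, fks);
}
```

With the orphan from #269 present, `dropOrder(["people", "organization_key_people"], [{ child: "organization_key_people", parent: "people" }])` puts `organization_key_people` first, so the `people` drop is no longer blocked. Whether the migration bookkeeping table should be filtered out of the list is left open here.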
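
For the alternative acceptance path (a CI assertion that fails when `KYSELY_TABLES` and the `createTable` calls in `apps/api/db/migrations/` drift apart), a minimal sketch of the comparison. `createdTables` and `missingFromDropList` are hypothetical names, and the regex assumes migrations pass the table name to `createTable` as a plain string literal. Reading the migration files and importing the real `KYSELY_TABLES` is left to the caller.

```typescript
// Matches createTable('name'), createTable("name") and createTable(`name`).
const CREATE_TABLE_CALL = /\bcreateTable\(\s*(['"`])([A-Za-z_][A-Za-z0-9_]*)\1/g;

// Extracts every table name created in one migration file's source text.
function createdTables(migrationSource: string): string[] {
  return Array.from(migrationSource.matchAll(CREATE_TABLE_CALL), (m) => m[2]);
}

// Returns the sorted table names that some migration creates
// but the hardcoded drop list does not mention.
function missingFromDropList(
  migrationSources: string[],
  dropList: readonly string[],
): string[] {
  const listed = new Set(dropList);
  const missing = new Set<string>();
  for (const source of migrationSources) {
    for (const table of createdTables(source)) {
      if (!listed.has(table)) missing.add(table);
    }
  }
  return Array.from(missing).sort();
}
```

A CI step would fail the job whenever `missingFromDropList(...)` returns a non-empty array. Note that this simple version would also flag a table that a later migration drops again, so it may need an allow-list if that ever happens.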