Monitoring
Monitoring is the post-live operations page for external observation. It answers a different question than Health: not "is production fit to serve right now?", but "is the tagging service being watched by the systems that should alert humans when it stops behaving normally?"
In v1, Monitoring verifies readiness. It reads Cloud Monitoring resources in your bound GCP project, stores a snapshot, and shows whether the tagging service has an uptime check, an alert policy, and notification channels attached. GSS does not create those resources for you yet.
This page vs. the Go Live stage
There are two related "monitoring" surfaces in GSS, and they are easy to confuse. This page (/monitoring, in the sidebar) is where you set up and then operate your uptime check, alert policy, notification channels, and metrics dashboard once the project is live. You create the resources themselves in Cloud Monitoring; this page re-reads Cloud Monitoring on demand and shows the current readiness state. Monitoring and Cost Controls is the Go Live step that points here: the /go-live/monitoring URL redirects to this page after Go Live finalizes (and to Cutover before then), so this page is where that setup is actually tracked.
Pre-live vs. post-live
- Before cutover — the page shows the Monitoring hero with "Activates after Go Live" and an empty-state card ("Monitoring activates after Go Live") with a "Continue Go Live setup →" link to the cutover page. The Re-check button is hidden.
- After cutover — the hero adds the current overall state badge, a summary, "Last checked" time, and a "Re-check monitoring readiness" button. Below the hero, the page renders the three sections described below.
What the page shows
- Readiness — Uptime, Alert policy, and Notification channels. These are detected rows, not manual checklist items. Each row carries a state badge, a summary, optional resource details (display name, expected host, path, cadence), a channel list when relevant, and an "Open in Cloud Monitoring →" fix link.
- Watcher — the daily GSS health pass cadence, last run time, and a 14-cell strip showing one cell per UTC calendar day for the last 14 days (pass/fail/no run). Use Health to run an on-demand health check.
- External — links into Cloud Monitoring, Cloud Run metrics, and Log Explorer in Google Cloud Console.
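The Watcher strip can be sketched as a simple bucketing of health-pass runs by UTC calendar day. This is a minimal illustration, not GSS's implementation: the function name `build_strip`, the `(timestamp, passed)` input shape, and the rule that the latest run of a day decides that day's cell are all assumptions.

```python
from datetime import date, datetime, timedelta, timezone


def build_strip(runs, today, days=14):
    """Return one cell per UTC calendar day, oldest first.

    runs:  iterable of (timestamp, passed) pairs with timezone-aware timestamps.
    today: the current UTC calendar day (the last cell in the strip).
    """
    by_day = {}
    # Process runs oldest to newest so the latest run of a day wins (assumed rule).
    for ts, passed in sorted(runs, key=lambda run: run[0]):
        utc_day = ts.astimezone(timezone.utc).date()
        by_day[utc_day] = "pass" if passed else "fail"

    first_day = today - timedelta(days=days - 1)
    # Days without any recorded run fall back to "no run".
    return [by_day.get(first_day + timedelta(days=i), "no run") for i in range(days)]
```

Converting each timestamp to UTC before taking the date matters: a run late in the evening in a western time zone belongs to the next UTC day.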
Page banners
Above the Readiness section, the page can show one of three banners depending on snapshot state:
- Re-check needed — readiness was captured under a previous cutover; the rows are hidden until you re-check.
- Last check failed — the previous re-check call to Cloud Monitoring errored; the banner shows the short summary.
- Stale — the last readiness snapshot is more than 36 hours old; the rows render but the banner asks you to refresh.
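As a rough sketch, banner selection can be expressed as a check over the stored snapshot. The `Snapshot` fields, the `pick_banner` name, and the precedence order (re-check needed, then failed, then stale) are assumptions made for illustration; the page does not document how it resolves overlapping conditions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(hours=36)  # freshness window stated on this page


@dataclass
class Snapshot:
    checked_at: datetime       # when the snapshot was captured
    cutover_id: str            # hypothetical identifier of the cutover it belongs to
    failed: bool = False       # True when the re-check call errored
    failure_summary: str = ""  # short summary shown in the banner


def pick_banner(snap: Snapshot, current_cutover_id: str, now: datetime) -> Optional[str]:
    """Return the banner text to show, or None when no banner applies."""
    if snap.cutover_id != current_cutover_id:
        return "Re-check needed"  # captured under a previous cutover
    if snap.failed:
        return f"Last check failed: {snap.failure_summary}"
    if now - snap.checked_at > STALE_AFTER:
        return "Stale"  # strictly more than 36 hours old
    return None
```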
Readiness rows
| Row | What GSS looks for |
|---|---|
| Uptime | A Cloud Monitoring uptime check whose URL target matches the project's expected tagging host and whose path starts with /healthy. |
| Alert policy | A Cloud Monitoring alert policy with a condition that references the matched uptime check. Disabled policies are shown as Disabled, not Configured. |
| Notification channels | Verified, enabled notification channels attached to the matched alert policy. Channel labels are redacted before GSS stores the readiness snapshot. |
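The matching rules in the table can be sketched over plain records. This is an illustration under stated assumptions, not the GSS detector: the `UptimeCheck`, `AlertPolicy`, and `Channel` shapes are hypothetical stand-ins for Cloud Monitoring resources, and the case-insensitive host comparison is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UptimeCheck:
    name: str
    host: str
    path: str


@dataclass
class Channel:
    name: str
    verified: bool
    enabled: bool


@dataclass
class AlertPolicy:
    name: str
    enabled: bool
    uptime_check_refs: List[str]  # uptime check names referenced by conditions
    channels: List[Channel] = field(default_factory=list)


def find_uptime_check(checks, expected_host) -> Optional[UptimeCheck]:
    """First check whose host matches and whose path starts with /healthy."""
    for check in checks:
        if check.host.lower() == expected_host.lower() and check.path.startswith("/healthy"):
            return check
    return None


def find_alert_policy(policies, check) -> Optional[AlertPolicy]:
    """First policy with a condition referencing the matched uptime check."""
    if check is None:
        return None
    for policy in policies:
        if check.name in policy.uptime_check_refs:
            return policy
    return None


def usable_channels(policy) -> List[Channel]:
    """Channels on the matched policy that are both verified and enabled."""
    return [c for c in policy.channels if c.verified and c.enabled]
```

Each step depends on the previous one: the alert policy is only matched against an uptime check that already passed the host and path test, and channels are only read from the matched policy.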
Status vocabulary
- Configured — GSS detected the expected resource for that row.
- Missing — no matching resource was found.
- Incomplete — the resource exists but is missing a required part, such as an alert policy with no channel attached.
- Disabled — the policy or channel exists but is turned off.
- Waiting — GSS cannot evaluate the row until a prerequisite exists. Notification channels wait when the alert policy is missing or unknown.
- Unknown — the readiness check could not read Cloud Monitoring, usually because of credentials, permissions, or API availability.
Why channels can say Waiting
Notification channels depend on the alert policy. If no matching alert policy exists, GSS does not also mark channels as missing. It shows Waiting so the page points at one underlying fix: create or repair the alert policy first, then re-check monitoring readiness.
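The Waiting rule can be illustrated with a small state function. Only the first branch reflects what this page states; the other branches are a simplified assumption about how the remaining states could be derived, and the function name and input shape are hypothetical.

```python
def channel_row_state(policy_state, channels):
    """Derive the Notification channels row state.

    policy_state: status word of the Alert policy row.
    channels:     list of (verified, enabled) pairs attached to the matched policy.
    """
    # Documented rule: channels wait when the alert policy is missing or unknown.
    if policy_state in ("Missing", "Unknown"):
        return "Waiting"
    # Assumed simplification: one verified, enabled channel is enough.
    if any(verified and enabled for verified, enabled in channels):
        return "Configured"
    # Assumed simplification: anything else needs attention on the channel side.
    return "Incomplete"
```

Returning Waiting instead of Missing keeps the page pointing at a single fix: once the alert policy exists, a re-check can evaluate the channels on their own terms.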
Re-check and freshness
Monitoring stores a readiness snapshot. Health reads that snapshot; it does not call Cloud Monitoring APIs during page render.
- The Re-check monitoring readiness button reads Cloud Monitoring, writes a fresh snapshot, and refreshes the page body.
- The daily watcher also refreshes the snapshot after its normal Health pass.
- A snapshot is fresh for 36 hours. After that, the rows keep their captured state but the page asks you to re-check.
- If a refresh fails, GSS stores an Unknown snapshot with a short failure summary so the page shows what happened instead of pretending no check has run.
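The failure behavior in the last bullet can be sketched as a refresh wrapper that always writes a snapshot. The dictionary layout, the `refresh_snapshot` name, the `read_monitoring` callable, and the 120-character summary limit are assumptions for illustration only.

```python
from datetime import datetime, timezone

ROWS = ("uptime", "alert_policy", "notification_channels")


def refresh_snapshot(read_monitoring, now):
    """Read Cloud Monitoring via the supplied callable and build a snapshot.

    read_monitoring: hypothetical callable returning {row: state}; may raise.
    now:             timezone-aware capture time.
    """
    try:
        rows = read_monitoring()
    except Exception as exc:
        # A failed read still produces a snapshot, so the page can show
        # what happened instead of looking as if no check has run.
        return {
            "checked_at": now,
            "failed": True,
            "rows": {row: "Unknown" for row in ROWS},
            "summary": str(exc)[:120],  # keep the failure summary short
        }
    return {"checked_at": now, "failed": False, "rows": rows, "summary": ""}
```

Both the Re-check button and the daily watcher could share a wrapper like this, since each one reads Cloud Monitoring and writes a fresh snapshot.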
How Monitoring relates to Health
Health is the current production verdict. Monitoring is the external observation posture. The two surfaces deliberately stay separate:
- Use Health when you want to know whether the server-side path is currently fit to serve.
- Use Monitoring when you want to know whether Cloud Monitoring will alert someone if the tagging service becomes unreachable.
- Monitoring readiness can add an action recommendation on Health, but it does not turn Health from Healthy to Degraded or Down.
How to fix readiness states
| State | What to do |
|---|---|
| Uptime missing | Create an uptime check in Cloud Monitoring targeting https://<tagging-host>/healthy. Re-check after Cloud Monitoring saves it. |
| Alert missing | Create an alert policy whose condition references the uptime check for the tagging service. Attach a notification channel before saving. |
| Channels incomplete | Verify and enable the channel, or attach a verified channel to the matched alert policy. |
| Unknown | Reconnect Google credentials if needed. Confirm the Cloud Monitoring API is enabled and your account can list uptime checks, alert policies, and notification channels in the bound GCP project. |
Metrics dashboard
GSS no longer treats a Cloud Monitoring dashboard as a setup gate, but a dashboard is still useful operationally. A practical dashboard for the tagging service includes request count, 5xx count or rate, p95 latency, instance count, CPU, memory, and billable instance time. The Monitoring page links to Cloud Run metrics and Log Explorer so you can build or inspect those views in Google Cloud Console.