The Problem with Bad Alerts
Alerting done poorly is almost as bad as no alerting at all. Too many alerts and your team starts ignoring them, a phenomenon called alert fatigue. Too few and you miss real incidents. Alerts that fire at 3 a.m. for non-critical issues train people to silence their phones.
Good alerting is a discipline. It requires intentional design: what to alert on, when to alert, who to notify, and through which channel. This guide walks through the principles that make alerting actually useful.
Alert on Symptoms, Not Causes
One of the most important principles in alerting: alert on user-facing symptoms, not internal causes.
Cause-based alert: "CPU usage is above 80%"
Symptom-based alert: "Response time is above 2 seconds"
High CPU might not affect users at all. Slow response times definitely do. By alerting on symptoms, you ensure every alert represents an actual user experience problem, not a technical curiosity.
This doesn't mean you never track CPU or memory — those metrics are useful for diagnosis. But they shouldn't be your primary alert triggers.
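As a rough illustration, a symptom-based check can look at the response times users actually experienced over a recent window. The sketch below is a minimal Python example. The function names, the choice of the 95th percentile, and the sample data are assumptions made for illustration, not part of any PandaStack API; only the 2-second threshold comes from the example above.

```python
def p95(samples):
    """Return the 95th percentile of samples using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest rank is ceil(0.95 * n); integer math avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(latencies_s, threshold_s=2.0):
    """Alert only when the slow tail of real requests exceeds the threshold."""
    if not latencies_s:
        return False  # empty window: nothing to evaluate in this check
    return p95(latencies_s) > threshold_s


# Hypothetical window: 18 fast requests and 2 slow ones.
window = [0.2] * 18 + [3.0, 3.0]
print(should_alert(window))  # True
```

Using a percentile rather than a single slow request keeps one outlier from triggering the alert, while still reflecting what a meaningful share of users saw.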
Alert Severity Levels
Not all alerts require the same response. Define severity levels and make sure everyone on your team knows what each means:
| Severity | Meaning | Response |
|---|---|---|
| Critical | Service is down or severely degraded | Immediate response, any hour |
| High | Significant degradation, users affected | Response within 30 minutes |
| Medium | Partial degradation, workaround exists | Response within business hours |
| Low | Minor issue, informational | Review at next opportunity |
Map these severity levels to notification channels. Critical alerts should page someone. Low-severity alerts should go to a Slack channel for async review.
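One way to keep this mapping explicit is to store it as data rather than scattering it across individual alert rules. The following is a minimal sketch; the `Severity` enum, the channel names, and the routing for the high and medium levels are illustrative assumptions. Only "critical pages someone" and "low goes to Slack" come from the text above.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed routing policy. The high and medium rows are illustrative choices.
CHANNELS = {
    Severity.CRITICAL: ["pager"],
    Severity.HIGH: ["pager", "slack"],
    Severity.MEDIUM: ["slack"],
    Severity.LOW: ["slack"],
}


def channels_for(severity):
    """Look up where an alert of the given severity should be delivered."""
    return CHANNELS[severity]


print(channels_for(Severity.CRITICAL))  # ['pager']
```

Keeping the policy in one table makes it easy to review with the whole team and to check that no severity level is left without a destination.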
Choosing the Right Notification Channel
PandaStack supports three alert delivery channels: email, Slack, and webhooks. Here's how to use each effectively:
Email works well for:
- Non-urgent alerts and daily digests
- Alerts that need a paper trail for compliance
- Reaching people who aren't in your Slack workspace
Slack works well for:
- Team-visible alerts where anyone can pick up the incident
- Medium-priority alerts during business hours
- Alerts that benefit from threaded discussion
Webhooks work well for:
- Integrating with incident management systems
- Triggering automated remediation workflows
- Custom routing logic based on alert content
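To make the custom routing idea concrete, here is a small sketch of a function a webhook receiver could call on each incoming alert. The payload fields (`severity`, `service`) and the destination names are hypothetical; the actual webhook payload format is not described here, so check it before relying on any field.

```python
import json


def route_alert(raw_body):
    """Pick a destination for a webhook alert based on its content.

    The "severity" and "service" fields are assumed for illustration only.
    """
    alert = json.loads(raw_body)
    severity = alert.get("severity", "low")
    service = alert.get("service", "")

    if severity == "critical":
        return "incident-management"
    if service.startswith("db-"):
        return "database-team"
    return "default-queue"


body = json.dumps({"severity": "high", "service": "db-orders"})
print(route_alert(body))  # database-team
```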
Reducing Alert Fatigue
Alert fatigue is the enemy of reliability. When engineers learn to ignore alerts, you lose your early warning system. Here's how to fight it:
Require consecutive failures. Don't alert on the first failed check — alert after two or three consecutive failures. This eliminates false positives from transient network blips.
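A consecutive-failure rule can be sketched as a small counter that resets on any success. The class name and the default of three failures are assumptions for illustration; this is not how any particular monitoring product implements it.

```python
class ConsecutiveFailureGate:
    """Signal an alert only after `required` failed checks in a row."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0

    def record(self, check_passed):
        """Record one check result; return True exactly when the alert should fire."""
        if check_passed:
            self.streak = 0  # any success resets the streak
            return False
        self.streak += 1
        # Fire once, when the streak reaches the limit, not on every later failure.
        return self.streak == self.required


gate = ConsecutiveFailureGate(required=3)
results = [False, True, False, False, False, False]
print([gate.record(ok) for ok in results])
# [False, False, False, False, True, False]
```

The single failure at the start is absorbed by the following success, and the alert fires only on the third failure in a row.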
Set meaningful thresholds. "Alert if error rate > 0.1%" might be appropriate for a payment service. For a static marketing site, you might not care until it's completely down. Tune thresholds to your actual tolerance.
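Per-service tolerances like these can be written down as a simple lookup. In this sketch the service names and the 0.99 value for the marketing site are assumptions; only the 0.1% figure for the payment service comes from the example above.

```python
# Hypothetical per-service tolerances, as a fraction of failed requests.
ERROR_RATE_THRESHOLDS = {
    "payments": 0.001,       # 0.1%, as in the payment service example
    "marketing-site": 0.99,  # effectively "only when it is completely down"
}


def breaches_threshold(service, failed, total):
    """Return True when the observed error rate exceeds the service's tolerance."""
    if total == 0:
        return False  # no requests observed, so no error rate to compare
    return failed / total > ERROR_RATE_THRESHOLDS[service]


print(breaches_threshold("payments", failed=5, total=1000))        # True
print(breaches_threshold("marketing-site", failed=5, total=1000))  # False
```

The same 0.5% error rate is an incident for one service and noise for the other, which is the point of tuning thresholds to actual tolerance.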
Remove stale alerts. Regularly audit your alert rules. Alerts for services you've retired, and alerts that never fire (or always fire) because their thresholds are poorly chosen, should be updated or removed.
Group related alerts. If five services all fail simultaneously because of a shared database outage, you should get one alert about the root cause, not five separate notifications.
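Grouping by root cause can be approximated with a dependency map: suppress any failing component whose direct dependency is also failing. This is a minimal sketch under stated assumptions. The function name and the topology are hypothetical, and the logic assumes the dependency map has no cycles.

```python
def root_causes(failing, depends_on):
    """Collapse a set of failing components down to the ones worth alerting on.

    A component is suppressed when something it directly depends on is also
    failing. Assumes the dependency map has no cycles.
    """
    failing = set(failing)
    roots = [
        name
        for name in failing
        if not any(dep in failing for dep in depends_on.get(name, ()))
    ]
    return sorted(roots)


# Hypothetical topology: five services sharing one database.
services = ["api", "auth", "billing", "search", "worker"]
depends_on = {name: ["shared-db"] for name in services}
failing = services + ["shared-db"]
print(root_causes(failing, depends_on))  # ['shared-db']
```

If a service fails while its database is healthy, it is still reported on its own, so grouping does not hide independent problems.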
Alerting for Different Deployment Types
Different types of applications have different alert priorities:
Static sites: Alert on availability (is the site returning 200?) and, if relevant, CDN cache hit rates.
Docker containers: Alert on response time, error rate, and restart frequency (a container that keeps restarting is probably stuck in a crash loop).
Databases (PostgreSQL, MySQL, Redis, MongoDB): Alert on connection availability and query latency. A database that goes down takes down every service that depends on it.
Cronjobs: Alert on job failure and missed executions. If a job is scheduled to run every hour and it hasn't run in two hours, something is wrong.
Edge functions: Alert on invocation errors and timeout rates.
PandaStack supports monitoring and alerting across all these deployment types from a single dashboard at [dashboard.pandastack.io](https://dashboard.pandastack.io).
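The missed-execution check for cronjobs described above comes down to comparing the time since the last run against the schedule interval. Here is a minimal sketch; the function name and the grace period are assumptions for illustration, not a PandaStack setting.

```python
from datetime import datetime, timedelta


def is_overdue(last_run, interval, now, grace=timedelta(0)):
    """Return True when a scheduled job has gone too long without running."""
    return now - last_run > interval + grace


now = datetime(2024, 1, 1, 12, 0)
last_run = datetime(2024, 1, 1, 10, 0)  # two hours ago
print(is_overdue(last_run, timedelta(hours=1), now, grace=timedelta(minutes=10)))  # True
```

A small grace period avoids alerting on a job that simply started a few minutes late.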
Runbooks: Making Alerts Actionable
Every alert should link to a runbook — a short document that explains:
- What this alert means
- How to diagnose the root cause
- Steps to remediate
- Who to escalate to if the standard fix doesn't work
An alert without a runbook asks the on-call engineer to figure out the diagnosis and fix under pressure. A runbook turns that into a procedure.
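If alert rules are kept as data, a small check can flag any rule that lacks a runbook link. The `AlertRule` shape and the example URL below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    runbook_url: str = ""


def rules_missing_runbooks(rules):
    """Return the names of alert rules that have no runbook link."""
    return [rule.name for rule in rules if not rule.runbook_url.strip()]


rules = [
    AlertRule("api-latency", "https://wiki.example.com/runbooks/api-latency"),
    AlertRule("db-connections"),
]
print(rules_missing_runbooks(rules))  # ['db-connections']
```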
Testing Your Alerts
Alerts that have never fired might not work when you need them. Test your alert setup periodically:
- Intentionally take down a non-production service to verify the alert fires
- Confirm the notification reaches the right channel
- Verify the alert resolves when the service recovers
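The three checks above can also be rehearsed in code before touching a real service. The sketch below simulates an outage and recovery against a recording notifier. Both classes are hypothetical test helpers, not part of any monitoring product.

```python
class RecordingNotifier:
    """Test double that records notifications instead of sending them."""

    def __init__(self):
        self.sent = []

    def notify(self, message):
        self.sent.append(message)


class Alert:
    """Minimal alert that notifies on each transition between healthy and failing."""

    def __init__(self, name, notifier):
        self.name = name
        self.notifier = notifier
        self.firing = False

    def observe(self, healthy):
        if not healthy and not self.firing:
            self.firing = True
            self.notifier.notify(f"FIRING: {self.name}")
        elif healthy and self.firing:
            self.firing = False
            self.notifier.notify(f"RESOLVED: {self.name}")


notifier = RecordingNotifier()
alert = Alert("staging-api-down", notifier)
for healthy in [True, False, False, True]:  # simulate an outage and recovery
    alert.observe(healthy)
print(notifier.sent)
# ['FIRING: staging-api-down', 'RESOLVED: staging-api-down']
```

A simulation like this does not replace an end-to-end test against a non-production service, since it cannot show whether the real notification was delivered.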
Conclusion
Smart alerting is a force multiplier for your engineering team. It means fewer incidents go undetected, and the incidents that do get caught are resolved faster. Start with symptom-based alerts, pick the right channels for each severity level, and keep your alert configuration clean. Configure your alerts at [dashboard.pandastack.io](https://dashboard.pandastack.io) and see [docs.pandastack.io](https://docs.pandastack.io) for detailed setup guidance.