Mental model
A junior engineer hears “the internet is broken” and starts typing show commands at random — pings, traceroutes, interface stats — hoping one of them surfaces the cause.
A senior engineer asks two questions first:
- What changed? Nothing breaks itself spontaneously. If it worked yesterday and doesn’t today, something changed. Find that change first; the rest may be unnecessary.
- What’s the scope? One user, one VLAN, one site, or everyone? Scope tells you which layer to start at and which path to investigate.
Only after those two answers do you reach for a CLI. The CLI is for confirming a hypothesis, not for generating one.
This topic is the discipline behind that.
The seven-step Cisco methodology
This is Cisco’s official troubleshooting model. Memorize it for the exam; use a streamlined version in practice.
- Define the problem: Be specific. “Internet broken” is useless. “User X cannot reach gmail.com from VLAN 10 since 9:00am today” is actionable.
- Gather information: Logs, recent changes, scope, error messages, user observations.
- Analyze information: List the candidate causes that fit the evidence. “DNS is failing. Or the default route is gone. Or the upstream firewall dropped the policy.”
- Eliminate possible causes: Run tests that distinguish between the candidates. Work bottom-up, top-down, or divide-and-conquer.
- Propose a hypothesis: Pick the most likely remaining cause based on the tests so far.
- Test the hypothesis: Apply a fix or a further test. If it works, the hypothesis is confirmed. If not, go back to step 3 (Analyze information) with the new information.
- Solve the problem and document: Apply the permanent fix. Document it so the next outage with the same symptoms is solved in 5 minutes.
In real life this collapses into something like: “Define + Gather + Hypothesize + Test.” But knowing the long version helps when an outage gets messy.
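The hypothesize-and-test part of that loop can be sketched in Python. This is a minimal illustration, not part of Cisco’s model: `Hypothesis` and `troubleshoot` are hypothetical names, and each `confirmed_by` callable stands in for whatever real test you would run.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    cause: str                          # e.g. "DNS is failing"
    confirmed_by: Callable[[], bool]    # returns True if the test confirms this cause


def troubleshoot(hypotheses: List[Hypothesis]) -> Tuple[Optional[str], List[str]]:
    """Test hypotheses in the given order (most likely first).

    Returns (confirmed cause or None, causes eliminated along the way).
    """
    eliminated: List[str] = []
    for hypothesis in hypotheses:
        if hypothesis.confirmed_by():
            return hypothesis.cause, eliminated
        eliminated.append(hypothesis.cause)  # keep a record so nobody re-tests it
    # Nothing confirmed: go back, gather more information, and re-analyze.
    return None, eliminated


# Usage with stand-in tests (real ones would run pings, lookups, and so on).
candidates = [
    Hypothesis("DNS is failing", lambda: False),
    Hypothesis("default route is gone", lambda: True),
    Hypothesis("upstream firewall dropped the policy", lambda: False),
]
cause, ruled_out = troubleshoot(candidates)
print(cause, ruled_out)
```

The list of eliminated causes is returned on purpose: it is the raw material for the documentation step.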
Three approaches — pick by symptom
Bottom-up — start at Layer 1
Walk the OSI stack from the bottom: cable → port → MAC → IP → TCP → app.
Use when:
- Hardware failures suspected.
- “It worked yesterday” + no recent config change.
- New install where physical didn’t get verified.
Typical Layer-1 questions: cable plugged in? LEDs lit? Duplex/speed match? Port not err-disabled? Patch correct?
SW1# show interfaces Gi1/0/1
SW1# show interfaces Gi1/0/1 status
SW1# show interfaces Gi1/0/1 counters errors
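If you save that output to a file, a few lines of Python can flag the usual Layer-1 suspects. Treat this as a sketch: `layer1_findings` is a hypothetical helper, and `SAMPLE` is a hand-written, heavily abbreviated imitation of `show interfaces` output. Exact wording varies by platform and IOS version, so the patterns are assumptions to adjust.

```python
import re
from typing import List

# Hand-written, abbreviated sample; real output has many more lines.
SAMPLE = """\
GigabitEthernet1/0/1 is up, line protocol is up (connected)
  Half-duplex, 100Mb/s, media type is 10/100/1000BaseTX
     1532 input errors, 1498 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 311 collisions, 2 interface resets
"""


def layer1_findings(text: str) -> List[str]:
    """Return human-readable Layer-1 warnings found in `show interfaces` text."""
    findings: List[str] = []
    status = re.search(r"^(\S+) is ([\w ]+?), line protocol is (\w+)", text, re.M)
    if status and (status.group(2) != "up" or status.group(3) != "up"):
        findings.append(f"{status.group(1)} is {status.group(2)}/{status.group(3)}")
    if "err-disabled" in text:
        findings.append("port is err-disabled")
    if re.search(r"\bHalf-duplex\b", text):
        findings.append("half duplex: check for a duplex mismatch")
    # Any non-zero value on these counters deserves a look.
    for counter in ("input errors", "CRC", "output errors", "collisions"):
        match = re.search(rf"(\d+) {counter}\b", text)
        if match and int(match.group(1)) > 0:
            findings.append(f"{match.group(1)} {counter}")
    return findings


for finding in layer1_findings(SAMPLE):
    print(finding)
```

Counters are cumulative since the last clear, so a non-zero value only matters if it is still climbing. Compare two captures a few minutes apart before blaming the cable.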
Top-down — start at the application
Walk the OSI stack from the top: app → presentation → session → transport → network → data link → physical.
Use when:
- Single user / single application failure.
- “Browser shows certificate error” — start with the browser, not the cable.
- Application teams have already verified server-side health.
Typical top-down: browser → DNS lookup → TCP connect → TLS handshake → HTTP response → server. Each step gives you a stop-and-isolate point.
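The stop-and-isolate idea can be sketched with Python’s standard `socket` and `ssl` modules. This is an illustration under assumptions: `web_stages` and `first_failing_stage` are hypothetical names, the chain stops at the TLS handshake (no HTTP request is sent), and the timeout is arbitrary.

```python
import socket
import ssl
from typing import Callable, List, Optional, Tuple

Stage = Tuple[str, Callable[[], None]]


def web_stages(host: str, port: int = 443, timeout: float = 3.0) -> List[Stage]:
    """Build the ordered checks: DNS lookup, TCP connect, TLS handshake."""

    def dns() -> None:
        socket.getaddrinfo(host, port)  # raises socket.gaierror on failure

    def tcp() -> None:
        socket.create_connection((host, port), timeout=timeout).close()

    def tls() -> None:
        context = ssl.create_default_context()  # verifies certificate and hostname
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                pass  # handshake completed if no exception was raised

    return [("dns", dns), ("tcp", tcp), ("tls", tls)]


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run stages in order; return the name of the first that fails, else None."""
    for name, check in stages:
        try:
            check()
        except OSError:  # gaierror, timeouts, refusals, and ssl.SSLError all derive from it
            return name
    return None


# Usage (needs network access, so it is left commented out):
# print(first_failing_stage(web_stages("internal-app.example.com")))
```

A result of `"tls"` with DNS and TCP passing is the certificate-error case: the network path is fine, so leave the cable alone.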
Divide-and-conquer — split the path in half
Pick a midpoint along the path and test reachability from both ends.
Use when:
- Symptom is “can’t reach X from Y” and you have a long path.
- You can ping/SSH into devices along the way.
Example: user at branch can’t reach server at HQ.
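- Test 1: ping the HQ firewall from the branch. If it works, the branch and the WAN are fine and the problem is on the HQ side.
- Test 2: ping the HQ firewall from the HQ access switch. If it works, the HQ LAN path to the firewall is fine and the problem is the firewall itself or further inside.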
Each test halves the search space. Binary search applied to networks.
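Taken literally, that is a binary search over an ordered list of hops. The sketch below assumes a clean break: every hop before the fault answers and every hop from the fault onward does not. `first_unreachable` and the hop names are hypothetical, and `reachable` stands in for a real ping.

```python
from typing import Callable, List, Optional


def first_unreachable(path: List[str], reachable: Callable[[str], bool]) -> Optional[str]:
    """Binary-search an ordered path for the first hop that does not answer.

    Returns None when every hop answers.
    """
    low, high = 0, len(path)
    while low < high:
        mid = (low + high) // 2
        if reachable(path[mid]):
            low = mid + 1   # fault is further along the path
        else:
            high = mid      # fault is here or earlier
    return path[low] if low < len(path) else None


# Usage with a simulated fault at "server-sw".
path = ["branch-gw", "branch-wan", "hq-fw", "hq-core", "server-sw", "server"]
probed = []


def fake_ping(hop: str) -> bool:
    probed.append(hop)
    return path.index(hop) < path.index("server-sw")


print(first_unreachable(path, fake_ping))  # → server-sw
print(probed)                              # → ['hq-core', 'server', 'server-sw']
```

Three probes instead of six. The caveat is that a device that filters ICMP looks “unreachable” without being down, so confirm a suspicious hop another way before declaring it the fault.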
Information gathering — what to ask first
Before any command, get answers from the user / requester:
- What’s the exact symptom? Error message verbatim. Screenshot.
- When did it start? Tied to a change?
- Who’s affected? One user, one VLAN, one site, everyone?
- What changed? Maintenance, deployment, patch, weather, power blip.
- Does it always fail or intermittently? Intermittent = different toolkit.
- Has anyone tried fixes? Often a non-engineer “fixed” something that made it worse.
For a high-pressure incident (large outage, public-facing), spend the first 5 minutes on information gathering only. Resist the urge to dive into the CLI.
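One way to make the questions hard to skip is to keep them as a record and check which are still blank. A minimal sketch, with `Intake` and its field names invented here for illustration:

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Intake:
    """One field per intake question; None means 'not asked yet'."""
    symptom: Optional[str] = None         # exact error message, verbatim
    started: Optional[str] = None         # when it began
    affected: Optional[str] = None        # one user, one VLAN, one site, everyone
    recent_change: Optional[str] = None   # maintenance, deployment, power blip...
    intermittent: Optional[bool] = None   # False is a valid answer, None is not
    fixes_tried: Optional[str] = None     # anything already attempted

    def unanswered(self) -> List[str]:
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


ticket = Intake(
    symptom="Browser says 'site can't be reached'",
    started="30 minutes ago",
    affected="whole team",
)
print(ticket.unanswered())  # → ['recent_change', 'intermittent', 'fixes_tried']
```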
The show-command toolkit
These cover most CCNA-level troubleshooting:
| Layer | Command | Tells you |
|---|---|---|
| L1 | show interfaces / ... status | Port up/down, errors, speed/duplex |
| L1 | show interfaces description | Quick port purpose |
| L1 | show power inline | PoE issues |
| L2 | show mac address-table | Where a MAC was learned |
| L2 | show vlan brief | VLANs configured + which ports |
| L2 | show spanning-tree | STP state, root bridge |
| L2 | show interfaces trunk | Allowed VLANs, native VLAN |
| L2 | show cdp neighbors / show lldp neighbors | What’s plugged into where |
| L3 | show ip interface brief | IP per interface, line state |
| L3 | show ip route | Routing table |
| L3 | show arp | IP↔MAC mappings learned |
| L3 | ping / traceroute | End-to-end reachability |
| L4 | telnet host 80 / nc -zv host 443 | TCP port reachability |
| L3 | show ip ospf neighbor | OSPF adjacencies |
| L3 | show ip bgp summary | BGP peering state |
| Misc | show log | Recent syslog events |
| Misc | `show running-config \| section X` | Only the part of the running config that matches X |
debug commands are powerful but dangerous in production — they can overwhelm CPU. Use debug ip packet detail with caution, always with a tight ACL.
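Output from the show commands above is regular enough to filter with a short script. As one example, this sketch picks out every interface in `show ip interface brief` that is not up/up. `SAMPLE` is hand-written in the usual column layout with documentation-range addresses, and `not_up_up` is a hypothetical helper. Column spacing differs between platforms, so the parser splits on whitespace instead of fixed widths.

```python
from typing import List, Tuple

# Hand-written sample in the usual `show ip interface brief` layout.
SAMPLE = """\
Interface              IP-Address      OK? Method Status                Protocol
GigabitEthernet0/0     192.0.2.1       YES manual up                    up
GigabitEthernet0/1     unassigned      YES unset  administratively down down
GigabitEthernet0/2     198.51.100.1    YES manual up                    down
"""


def not_up_up(text: str) -> List[Tuple[str, str, str, str]]:
    """Return (interface, ip, status, protocol) for every line that is not up/up."""
    problems = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6 or parts[0] == "Interface":
            continue  # skip the header and anything too short to be a data row
        name, ip = parts[0], parts[1]
        status = " ".join(parts[4:-1])   # status may be two words
        protocol = parts[-1]
        if status != "up" or protocol != "up":
            problems.append((name, ip, status, protocol))
    return problems


for row in not_up_up(SAMPLE):
    print(row)
```

In the sample, `GigabitEthernet0/1` is shut on purpose while `GigabitEthernet0/2` is up/down, which points at a Layer-1 or Layer-2 problem on that link. The script finds both; telling them apart is still your job.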
Real-world flow
User reports: “I can’t reach internal-app.example.com from my laptop.”
You ask: “How long? Anyone else affected? What error?” — Answer: “30 minutes. Yes, my whole team. Browser says ‘site can’t be reached.’”
Now scope = team-wide (probably one VLAN or one switch’s users). Recent change? Helpdesk says: “Power outage 1 hour ago at the branch.”
You skip top-down and go to divide-and-conquer:
- From your laptop (different site): can you resolve internal-app.example.com? Yes → DNS is fine globally.
- From a router at the affected branch: ping the branch’s gateway → works → L1/L2 are fine.
- Ping the HQ application server’s IP → fails. Now bisect: ping the HQ firewall → works. Ping the server’s L3 switch → fails.
- The server’s L3 switch is unreachable. SSH to a neighboring HQ switch → CDP no longer lists it as a neighbor.
- Walk to the rack. The power outage took out the switch’s UPS, so the switch has no power. Restore power and let it boot.
15 minutes from ticket to fix. Method beat luck.
When to stop and escalate
Sometimes the right next step is “ask for help” — not because you can’t continue but because you’re wasting time:
- You’ve spent 30 minutes and have no working hypothesis.
- The symptom contradicts everything you know about the stack.
- The fix would require a change with audit trail (firewall rule, BGP route).
- You’re outside business hours and the system isn’t critical — wait for the right people.
Senior engineers escalate often. Junior engineers escalate too late.
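The four triggers are simple enough to write down as a rule. This sketch is only a restatement of the list above: `escalation_reasons` and its parameter names are made up for illustration, and the 30-minute threshold is the one from the first bullet.

```python
from typing import List


def escalation_reasons(
    minutes_spent: int,
    has_hypothesis: bool,
    contradicts_expectations: bool = False,
    needs_audited_change: bool = False,
    after_hours: bool = False,
    critical: bool = True,
) -> List[str]:
    """Return every reason to escalate that currently applies (empty list = keep going)."""
    reasons: List[str] = []
    if minutes_spent >= 30 and not has_hypothesis:
        reasons.append("30+ minutes with no working hypothesis")
    if contradicts_expectations:
        reasons.append("symptom contradicts what you know about the stack")
    if needs_audited_change:
        reasons.append("fix needs a change with an audit trail")
    if after_hours and not critical:
        reasons.append("after hours and not critical: wait for the right people")
    return reasons


print(escalation_reasons(35, has_hypothesis=False))
print(escalation_reasons(10, has_hypothesis=True))
```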
Common mistakes
- Changing things before understanding them. “Let me just bounce that interface and see.” Now you’ve added a variable. Always understand first, change second.
- Multiple changes at once. Fixed it? Was it the ACL change, the route adjustment, or the routing-protocol restart? You don’t know, and next time you won’t know what to try.
- Forgetting to capture data before reload. Show outputs first, then reload. Logs are gone after restart.
- Treating symptoms instead of causes. Restarting the AP every hour because users complain isn’t a fix. Find why it’s locking up.
- Ignoring user observations. “It only happens when I’m on the call” is data. Don’t dismiss it because it sounds non-technical.
- Trusting one diagnostic without confirmation. A single ping success doesn’t mean the problem is fixed. Try the actual workflow.
- Skipping documentation after the fix. Same outage in 6 months, by a different engineer, takes 4 hours again because no one wrote it down.
- Confusing correlation with causation. “It started after I deployed X.” Maybe. Or maybe X is a coincidence. Test.
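For the capture-before-reload habit, a small wrapper makes it hard to forget. This is a sketch under assumptions: `run` is a hypothetical callable that sends one command to the device and returns its output (for example a thin wrapper around your SSH session), and the command list is just a starting set taken from the toolkit table.

```python
from datetime import datetime, timezone
from typing import Callable, Sequence

# Starting set only; extend with whatever is relevant to the fault.
PRE_RELOAD_COMMANDS = (
    "show log",
    "show interfaces",
    "show ip route",
    "show running-config",
)


def capture_snapshot(
    run: Callable[[str], str],
    commands: Sequence[str] = PRE_RELOAD_COMMANDS,
) -> str:
    """Run each command once, in order, and return one timestamped text snapshot."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    sections = [f"# captured {stamp}"]
    for command in commands:
        sections.append(f"\n### {command}\n{run(command)}")
    return "\n".join(sections)


# Usage with a stand-in for the real device session.
snapshot = capture_snapshot(lambda command: f"(output of {command})")
print(snapshot)
```

Write the returned text to a file or paste it into the ticket before typing `reload`. The timestamp is in UTC so it lines up with syslog from other devices.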
A pre-built mental checklist
For your first 60 seconds when paged:
- Scope. Who’s affected? One / some / all?
- Recent change. Anything deployed in last 24h?
- External vs internal. Cloud / WAN dependency?
- Multiple symptoms or one. Single fault or compound?
- Severity escalation rule. If 10+ users / mission-critical, page the team.
For your first 5 minutes:
- L1 sanity. Lights on the affected port/switch?
- L3 sanity. Default gateway reachable? DNS responding?
- Logs. Any syslog spam from devices around the path?
For your first 30 minutes:
- Hypothesis + test loop. Pick a likely cause, design a test, run it, refine.
- Documentation as you go. Output captured, timeline noted.
- Communication. Stakeholders know status. Don’t go silent.
Lab to try tonight
This topic is best practiced on a real lab with intentional breakage:
- Build any small topology in CML — say, 3 switches + 1 router + 2 hosts.
- Verify everything works (ping host-to-host).
- Ask a colleague to break one thing in your absence — shut a port, change a native VLAN, remove a route, mistype a password.
- Come back. One host can no longer reach the other. Now troubleshoot using the methodology: define scope, hypothesize, narrow down.
- Time yourself. Beat your previous time.
- Bonus: instead of one breakage, ask for two compound failures — that’s where method really shines vs guessing.
Real network engineers learn this skill from years of being paged at 3 AM. Lab practice cuts that learning curve.
Cheat strip
| Concept | Plain English |
|---|---|
| Define the problem | Specific, measurable. “Internet broken” is not a problem statement |
| Gather information | Scope + recent changes + symptoms first, commands second |
| Bottom-up | Layer 1 → up. Use for hardware-suspect issues |
| Top-down | App → down. Use for single-app failures |
| Divide-and-conquer | Split the path. Use for long paths with intermediate access |
| Two openers | “What changed?” and “Who’s affected?” |
| Show, not debug | show for normal triage; debug only when needed, with caution |
| One change at a time | If you fix it, you know what fixed it |
| Document the fix | Future you (or your replacement) will thank you |
| Escalate when stuck | After 30 min with no hypothesis, get a second pair of eyes |
| CCNA depth | Recognize the methodology, name the approaches, know the seven steps |