
Scott Ramey

Implementation | Integrations | Customer Success (Non-Engineering Portfolio)

API Troubleshooting | Integration Recovery | Data Validation

Case Study — Enterprise API Integration: Diagnostic & Recovery

All content is sanitized and anonymized. Platform names and user counts have been generalized. Technical methodology and diagnostic approach reflect real execution.


Problem

Four days after go-live, a large enterprise customer flagged that new employees were not appearing in their platform dashboards and e-receipt data from their expense system had stopped flowing. The integration had been live and clean for 96 hours — this was not a configuration issue from setup. Something had changed.

With 173,000+ end users across a multi-org structure, a silent data gap carried real risk: employees blocked from platform access, expense records incomplete, and a customer who had just gone live now questioning whether the integration was stable.

Initial Customer Report: "Three new hires from this week aren't showing up in the dashboard. And we're not seeing any e-receipts come through since Tuesday morning."

Diagnostic Approach — Work Top to Bottom

The first rule: don't touch anything until you know where the break is. Three questions before moving:

  • What specifically is wrong? — Employees missing from dashboard. E-receipts not flowing.
  • When did it start? — Tuesday morning. Not day one. Something changed after go-live.
  • Isolated or widespread? — All new hires affected. All e-receipts affected. Systemic, not one-off.

Systemic + post-go-live onset = integration layer, not configuration. Pulled the sync log first.

Step 1 — Read the Sync Log

// Sync Log — Expense Platform Roster Integration
Last sync: Tue Mar 18 — 3:14 AM ✗ FAILED
Previous: Mon Mar 17 — 3:11 AM ✓ SUCCESS
Status: Authentication Error
Records: Processed: 0 | Errors: 1
Error: 401 Unauthorized — Token expired

// E-receipt forwarding log
Last event: Tue Mar 18 — 2:58 AM ✗ FAILED
Error: 403 Forbidden — Insufficient permissions

What This Told Me: Two separate failures, same timestamp window. The roster sync hit a 401 (expired token): re-auth needed, no engineering required. The e-receipt stream hit a 403 (valid credentials, wrong permissions): a different root cause. Both are fixable without escalation if I sequence them right.
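
As a rough sketch of that first read (not the platform's own tooling), the two status codes can be mapped to the first place to look. The `SyncEntry` type, its field names, and the `classify` helper are hypothetical; only the status codes and record counts come from the log above.

```python
from dataclasses import dataclass


# Hypothetical record type: field names are illustrative, not the platform's schema.
@dataclass
class SyncEntry:
    stream: str
    status_code: int
    records_processed: int


def classify(entry: SyncEntry) -> str:
    """Map an HTTP status code to the first place to look."""
    if entry.status_code == 401:
        return "authentication: token missing, invalid, or expired; re-authenticate"
    if entry.status_code == 403:
        return "authorization: credentials accepted but permission missing; audit scopes"
    if 200 <= entry.status_code < 300:
        return "healthy"
    return "unclassified: inspect further"


# The two failures from the sync log.
entries = [
    SyncEntry("roster", 401, 0),
    SyncEntry("e-receipt", 403, 0),
]
triage = {entry.stream: classify(entry) for entry in entries}

for stream, verdict in triage.items():
    print(f"{stream}: {verdict}")
```

Keeping the two verdicts separate is the point: each failure gets its own fix path instead of one generic "auth is broken" ticket.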

Step 2 — Work the Diagnostic Ladder

1. Is the connection live? Checked all integration connections. Roster sync: red. E-receipt forwarding: red. Both dropped in the same time window Tuesday.
   ✗ Connection down: authentication failure
2. When did data last sync successfully? Last clean roster sync: Monday 3:11 AM. Last clean e-receipt event: Monday night. Gap of ~27 hours, counted from the first failed sync early Tuesday, by the time the customer reported.
   ✗ Abnormal gap: not a delay, a failure
3. Is data present but in the wrong place? No. Data wasn't arriving at all. Not a mapping issue. Source confirmed healthy: new hires existed in the expense system.
   ✗ Data not arriving: API or ingestion layer
4. Root cause isolated. Roster: expired access token (401). E-receipt: permission scope stripped during a credentials rotation the customer's IT team ran Monday night, which they didn't flag.
   → Fix identified. Both resolvable without engineering.
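
The "delay or failure" question in rung 2 can be sketched as a comparison of elapsed time against the sync cadence. This is a minimal illustration under stated assumptions: the daily cadence is inferred from the two ~3 AM log entries, the one-hour tolerance and the check times are hypothetical, the year is taken from the payload timestamps, and `gap_is_abnormal` is not a platform function.

```python
from datetime import datetime, timedelta


def gap_is_abnormal(last_success: datetime, checked_at: datetime,
                    cadence: timedelta,
                    tolerance: timedelta = timedelta(hours=1)) -> bool:
    """Return True when more than one full cadence (plus tolerance) has
    passed since the last clean sync, i.e. a cycle was missed outright."""
    return checked_at - last_success > cadence + tolerance


# Assumption: daily cadence, inferred from the two ~3 AM entries in the sync log.
DAILY = timedelta(hours=24)
last_clean_roster_sync = datetime(2026, 3, 17, 3, 11)

# A sync that is merely running late: checked 20 minutes after the usual slot.
running_late = gap_is_abnormal(
    last_clean_roster_sync, datetime(2026, 3, 18, 3, 31), DAILY)

# Hypothetical check time, taken from the end of the backfill window (Mar 19, 6 AM).
missed_cycle = gap_is_abnormal(
    last_clean_roster_sync, datetime(2026, 3, 19, 6, 0), DAILY)

print(running_late, missed_cycle)
```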

Step 3 — Validate With a Sample Payload

Before touching anything, I pulled a sample API response from the last failed sync to confirm what was actually being sent — and where it was breaking. Four fields to check every time.

✓ Last Healthy Payload (Mon)

{
  "employee_id": "EMP-774821",
  "email": "j.chen@corp.com",
  "org_unit": "US-West",
  "status": "active",
  "timestamp": "2026-03-17T03:11:42Z"
}

✗ Failed Payload (Tue)

{
  "error": "401 Unauthorized",
  "message": "Access token expired",
  "token_expiry": "2026-03-17T23:59:59Z",
  "records_processed": 0,
  "timestamp": "2026-03-18T03:14:01Z"
}

What the Payload Showed: The token had a hard expiry at 11:59 PM Monday, roughly three hours before the first failed sync at 3:14 AM Tuesday. The employee data structure was intact and correct on the last healthy sync. No data corruption, no mapping issue. Pure auth failure. Re-authenticate and reprocess.
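
The same payload check can be sketched in a few lines. The two payloads are the samples above; the expected field list is an assumption read off the healthy sample rather than a documented schema, and `parse_utc` is an illustrative helper.

```python
import json
from datetime import datetime

# Assumption: the expected field list is read off the sample healthy payload,
# not taken from a documented schema.
EXPECTED_FIELDS = {"employee_id", "email", "org_unit", "status", "timestamp"}

healthy = json.loads("""
{"employee_id": "EMP-774821", "email": "j.chen@corp.com",
 "org_unit": "US-West", "status": "active",
 "timestamp": "2026-03-17T03:11:42Z"}
""")

failed = json.loads("""
{"error": "401 Unauthorized", "message": "Access token expired",
 "token_expiry": "2026-03-17T23:59:59Z", "records_processed": 0,
 "timestamp": "2026-03-18T03:14:01Z"}
""")


def parse_utc(ts: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so swap it for an explicit UTC offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


# Structure check on the last healthy record.
missing_fields = EXPECTED_FIELDS - set(healthy)

# Timing check on the failed response: did the token expire before the sync ran?
expired_for = parse_utc(failed["timestamp"]) - parse_utc(failed["token_expiry"])

print("missing fields:", sorted(missing_fields))
print("token expired before the failed sync:", expired_for.total_seconds() > 0)
print("time between expiry and failure:", expired_for)
```

An empty `missing_fields` plus a positive `expired_for` points at the same conclusion: the data shape was fine and the token simply lapsed before the next sync cycle.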

Step 4 — Separate Issue: E-Receipt Permission Scope

The e-receipt 403 was a different problem. The credentials were valid — the integration could authenticate — but the permission scope had been narrowed. When the customer's IT team rotated credentials Monday night, they re-issued a token without the receipt.read scope attached.

Original Token Scope
  • roster.read
  • roster.write
  • receipt.read
  • org.read
Rotated Token Scope
  • roster.read
  • roster.write
  • receipt.read (stripped)
  • org.read
Key Distinction: A 401 is an identity problem (who are you?). A 403 is a permission problem (you're authenticated but not allowed). These require different fixes. Conflating them wastes time and creates customer confusion.
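
The scope comparison above amounts to a set difference. A minimal sketch, assuming the tokens carry OAuth-style space-delimited scope strings; the scope names come from the two lists, while `parse_scopes` and the variable names are illustrative.

```python
def parse_scopes(scope_string: str) -> set:
    """OAuth 2.0 scope values are space-delimited (RFC 6749, section 3.3)."""
    return set(scope_string.split())


# Scope names come from the lists above; the helper names are illustrative.
required = parse_scopes("roster.read roster.write receipt.read org.read")
rotated = parse_scopes("roster.read roster.write org.read")

stripped = required - rotated      # scopes the rotation dropped
unexpected = rotated - required    # scopes the rotation added

print("stripped:", sorted(stripped))
print("unexpected:", sorted(unexpected))
```

Running the same diff after IT re-issues the token gives a quick confirmation that nothing else went missing in the rotation.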

Escalation Decision

Before touching anything, I made a clear call on what I owned vs. what engineering needed to own.

I Fix — No Engineering
  • Re-authenticate roster sync (new token)
  • Walk customer IT through re-issuing token with correct permission scope
  • Verify connection status after each fix
  • Request backfill of missed sync window
  • Set proactive token expiry reminder
Escalate to Engineering If
  • Re-auth successful but data still not flowing
  • Correct permission scope confirmed but 403 persists
  • Records arriving but employees landing in wrong org
  • Backfill requested but historical data not reprocessing
Message to Engineering (Template Used)

"Hey — working on an integration where roster sync dropped with a 401 (token expired) and e-receipt forwarding hit a 403 (permission scope stripped during a credentials rotation). I've re-authenticated the roster connection and confirmed it's syncing cleanly. The 403 is resolved on the customer side — IT re-issued the token with receipt.read included. Both connections are now green. Requesting a backfill for the 27-hour gap (Mar 18 3AM – Mar 19 6AM). Can you confirm and queue?"


Resolution & Validation

Both issues were resolved without engineering intervention. Engineering was looped in only for the backfill request — framed with the problem, what had been ruled out, and the specific ask.

Hour 0: Customer reports missing employees and e-receipts
Pulled sync log immediately. Auth failure confirmed within 4 minutes.
Hour 0.5: Roster sync re-authenticated
New token issued. Connection tested. Confirmed clean sync cycle within 10 minutes. Communicated status to customer.
Hour 1.5: E-receipt permission scope corrected
Walked IT through re-issuing token with full required scopes. Tested event forwarding. Confirmed flowing.
Hour 2: Backfill requested + data parity check
27-hour gap identified. Engineering queued reprocess. Validated source vs. platform record counts matched after completion.
Hour 4 onward: 72-hour clean sync window confirmed
All three new hires visible in dashboard. All e-receipts flowing. Zero errors in subsequent sync cycles. Customer confirmed resolution.
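
The gap window and the parity check can be sketched as follows. The case study compared record counts; this illustration compares record IDs instead, a stricter variant that also shows which records are missing. The employee IDs other than EMP-774821 are hypothetical, the year is taken from the payload timestamps, and `parity_report` is not a platform function.

```python
from datetime import datetime

# Gap window from the backfill request (year taken from the payload timestamps).
gap_start = datetime(2026, 3, 18, 3, 0)
gap_end = datetime(2026, 3, 19, 6, 0)
gap_hours = (gap_end - gap_start).total_seconds() / 3600


def parity_report(source_ids, platform_ids):
    """Compare record IDs in the source system against the platform."""
    source, platform = set(source_ids), set(platform_ids)
    return {
        "counts_match": len(source) == len(platform),
        "missing_in_platform": sorted(source - platform),
        "unexpected_in_platform": sorted(platform - source),
    }


# Hypothetical IDs for illustration; only EMP-774821 appears in the case study.
source = ["EMP-774821", "EMP-000001", "EMP-000002", "EMP-000003"]
before_backfill = parity_report(source, ["EMP-774821"])
after_backfill = parity_report(source, source)

print(gap_hours)
print(before_backfill["missing_in_platform"])
print(after_backfill["counts_match"], after_backfill["missing_in_platform"])
```

Matching counts alone can hide a swap (one record missing, one unexpected), which is why the sketch reports both directions of the difference.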

Customer Communication — Four Beats

Beat 1 — Name the Symptom: "What you're seeing is a sync failure that started Tuesday morning: no new employees are being pulled in and e-receipt forwarding has stopped. This is an authentication issue, not a data problem."
Beat 2 — Explain the Cause: "Your access token expired overnight, which cut off the roster sync. Separately, when IT rotated credentials Monday night, the receipt permission wasn't included in the new token. Two separate causes, both fixable without engineering."
Beat 3 — State the Steps: "I'm re-authenticating the roster connection now. Once that's green, I'll walk your IT team through re-issuing the credential with the correct permission scope. Then I'll request a backfill for the gap window."
Beat 4 — Close With the Outcome: "Both connections are confirmed live. Your three new hires are now visible in the dashboard. E-receipts are flowing. The backfill is complete and no records were lost. I've also set a proactive reminder before your next token expiry so this doesn't happen again."

Outcomes

  • <2 hrs: Full resolution, start to close
  • 0: Records lost or permanently missing
  • 27 hrs: Gap window backfilled completely
  • 0: Engineering escalations required
  • 173K+: End users protected from disruption
  • 72 hrs: Clean sync window confirmed post-fix

Technical Components

  • API authentication — token lifecycle management, expiry handling, re-authentication flow
  • HTTP error code diagnosis — 401 vs. 403 distinction and appropriate response for each
  • Permission scope auditing — reading and comparing OAuth token scopes
  • JSON payload inspection — identifying healthy vs. failed response structure
  • Sync log analysis — reading cadence, status, error type, and records processed
  • Data parity validation — source system vs. platform record count comparison
  • Backfill coordination — gap identification, engineering handoff, confirmation
  • Customer communication — structured four-beat framework throughout investigation

Learnings

  • 401 and 403 are not the same problem. Conflating them adds time and confusion. Read the error code first — it tells you exactly where to look.
  • IT credential rotations are a silent risk post-go-live. Proactively document token expiry dates and required permission scopes at kickoff. Set calendar reminders. Don't wait for a 403.
  • Never go silent during investigation. A customer who knows you're working the problem doesn't panic. A customer who hears nothing does.
  • Know your lane. Both issues here were resolvable without engineering. When you do go to engineering, bring the problem, what you've ruled out, and the fix you're proposing, not an undiagnosed problem.
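
The kickoff documentation and reminder from the second learning could look something like the sketch below. Everything named here is hypothetical (`CredentialRecord`, `reminder_due`, the integration label, the 7-day lead time); the expiry and scope names are reused from the case study only as sample values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CredentialRecord:
    """Hypothetical kickoff record: what to write down per integration."""
    integration: str
    token_expiry: datetime
    required_scopes: frozenset


def reminder_due(record: CredentialRecord, now: datetime,
                 lead_time: timedelta = timedelta(days=7)) -> bool:
    """True once `now` is inside the lead-time window before expiry."""
    return now >= record.token_expiry - lead_time


# The expiry below is the one from the failed payload; the 7-day lead time
# is an assumption, not something the case study specifies.
record = CredentialRecord(
    integration="expense-platform",
    token_expiry=datetime(2026, 3, 17, 23, 59, 59, tzinfo=timezone.utc),
    required_scopes=frozenset(
        {"roster.read", "roster.write", "receipt.read", "org.read"}),
)

early = reminder_due(record, datetime(2026, 3, 1, tzinfo=timezone.utc))
inside_window = reminder_due(record, datetime(2026, 3, 12, tzinfo=timezone.utc))

print(early, inside_window)
```

Storing the required scopes next to the expiry date means a rotated token can be checked against the record before it goes live.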