August 28, 2025

Global Manufacturer Giant — Enterprise Exception Handling & Self-Healing Framework

Enterprise exception handling and self-healing—centralized taxonomy, ticketing, and lifecycle control

Summary

Implemented an enterprise exception handling & self-healing framework that normalized faults across SOA composites, OSB services, and ODI jobs, converted them into a single error taxonomy, and auto-raised incidents with context. The system throttled duplicates, shut down storming services safely, and exposed one-click reprocessing/bring-up, reducing MTTR and preventing cascades.


Problem

  • Scattered handling & noisy logs: Inconsistent patterns across Java/SOA/OSB/ODI meant errors were handled ad hoc; critical signals were buried.
  • High MTTR & operator fatigue: Repeated faults created storm conditions; tickets were raised late or without context.
  • No safe-stop controls: Faulty endpoints caused retries, dead-letter growth, and upstream saturation; bring-up steps were manual and error-prone.

Solution Mechanics

Primary pattern: Rules/validation (central taxonomy + fault policies + duplicate detection).
Secondary pattern: API-led orchestration (controlled lifecycle, ticketing, and reprocessing services).
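
Everything downstream keys off one normalized record. As a minimal Java illustration, assuming hypothetical field names rather than the actual AIA fault message schema, the canonical fault can be pictured as:

    import java.time.Instant;

    // Minimal sketch of a normalized fault record; field names are
    // illustrative assumptions, not the actual AIA fault message schema.
    public record CanonicalFault(
            String serviceName,    // emitting composite, proxy, or ODI scenario
            String businessId,     // business identifier, e.g. an order number
            String errorCode,      // taxonomy code the fault policies key on
            String severity,       // e.g. CRITICAL / MAJOR / MINOR
            String maskedPayload,  // PII-masked excerpt for diagnostics
            Instant raisedAt) {
    }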

  • Normalization & Capture

    • Standardized fault policies in composites; consistent catch/catch-all usage with the canonical AIA fault message.
    • OSB error handlers published fault details; ODI logged to a custom table and invoked the framework.
    • All paths converged on an AIA Error Topic (JMS) for one ingress (a publish sketch follows this list).
  • Error Listener & Classification

    • Error Listener service consumed the AIA topic and persisted records to ERROR_NOTIFICATION_INFO.
    • Duplicate detector (time-window, service + business ID) classified new vs. duplicate; throttled notifications (a classifier sketch follows this list).
  • Automated Response & Ticketing

    • New: routed to IMS_ERRORFALLOUT_Q; the Fallout Consumer assembled the Remedy payload (from REMEDY_GROUP_DETAILS), created an incident, and emailed owners.
    • Duplicate: fetched policy from ERROR_REFERENCE/SERVICE_REFERENCE; when shutdown was required, invoked the Lifecycle Service to disable OSB proxies or stop SOA services; status updated via ErrorHandlingDBService.
  • Controlled Bring-up & Reprocessing

    • Heartbeat Service (input: target name, optional START/STOP) invoked Error Reprocessing Service to walk TARGET_SERVICE_MAPPING and start/enable or stop/disable dependent services in order.
    • Message resubmission utilities and DLQ/replay scripts restored transactions once dependencies recovered.
  • Data & Governance

    • Canonical tables: ERROR_NOTIFICATION_INFO, ERROR_REFERENCE, SERVICE_REFERENCE, TARGET_SERVICE_MAPPING, REMEDY_GROUP_DETAILS.
    • Notification throttling; audit trails (who/what/when), policy versioning, and role-based visibility.
    • Enterprise Manager/BAM for drill-downs; masked PII in logs.

Diagram 1 - Context Diagram — Centralized error intake, classification, ticketing, and lifecycle control

Diagram 2 - Sequence — Fault → normalize → classify (new/duplicate) → ticket or controlled shutdown → heartbeat bring-up

Diagram 3 - Operations — Throttling, DLQ/replay, tables & policy versioning


Process Flow

  1. SOA/OSB/ODI emits an error → normalized into AIA fault message and published to the AIA Error Topic (JMS).
  2. Error Listener persists to ERROR_NOTIFICATION_INFO and classifies by duplicate window + business identifiers.
  3. If new → post to IMS_ERRORFALLOUT_Q; Fallout Consumer enriches from REMEDY_GROUP_DETAILS and creates Remedy incident; email sent to the owning group.
  4. If duplicate and shutdown required → fetch policy from ERROR_REFERENCE/SERVICE_REFERENCE → call Lifecycle Service to disable OSB / stop SOA; mark service DOWN.
  5. Operators view details in Enterprise Manager/BAM; DLQ items are parked and tracked.
  6. On recovery, the Heartbeat Service (START) triggers the Error Reprocessing Service to enable/start impacted services in TARGET_SERVICE_MAPPING order (a minimal ordering sketch follows this list).
  7. Replay/resubmit messages; update ERROR_NOTIFICATION_INFO/SERVICE_REFERENCE; close incident with context.
  8. Unknown patterns flow to an “unclassified” bucket for taxonomy update and policy tuning.
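
One way to picture step 6, with illustrative rows standing in for TARGET_SERVICE_MAPPING and a print standing in for the Lifecycle Service call; the column names and target name below are assumptions:

    import java.util.Comparator;
    import java.util.List;

    public class OrderedBringUp {
        // Illustrative stand-in for a TARGET_SERVICE_MAPPING row.
        record Mapping(String targetName, String dependentService, int startOrder) {}

        public static void start(String targetName, List<Mapping> mappings) {
            mappings.stream()
                    .filter(m -> m.targetName().equals(targetName))
                    .sorted(Comparator.comparingInt(Mapping::startOrder)) // lowest order first
                    .forEach(m -> enable(m.dependentService()));
        }

        private static void enable(String service) {
            // The real framework calls the Lifecycle Service here to
            // enable an OSB proxy or start a SOA composite.
            System.out.println("Starting " + service);
        }

        public static void main(String[] args) {
            start("TargetSystemA", List.of(
                    new Mapping("TargetSystemA", "DBAdapterService", 1),
                    new Mapping("TargetSystemA", "CoreOrderComposite", 2),
                    new Mapping("TargetSystemA", "NotificationProxy", 3)));
        }
    }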

Outcomes

  • Lower MTTR via auto-ticketing with context and consistent diagnostics.
  • Storm prevention by safely halting repeat offenders and coordinating restart.
  • Operational clarity with one error taxonomy, throttled notifications, and drill-downs.

Strategic Business Impact

  • MTTR reduced 30–60% (Modeled) — assumes baseline incident volumes and the duplicate suppression window; measured reductions on selected interfaces were extrapolated.
  • Ticket quality uplift (Proxy) — higher first-time-fix from enriched incidents (service, instance, business identifiers).
  • Outage containment (Proxy) — fewer cross-system cascades due to controlled shutdown and ordered bring-up.

Role & Scope

Owned framework architecture and rollout: fault policy standards, taxonomy, Error Listener, ErrorHandlingDBService, Lifecycle/Heartbeat/Reprocessing services, Fallout/Remedy integration, DLQ/replay tooling, tables and audits, and OEM/BAM dashboards.


Key Decisions & Trade-offs

  • Centralized taxonomy + topic intake vs per-app handling → uniform ops view; requires migration of legacy handlers.
  • Duplicate window & throttling curb alert fatigue; risk of masking rare bursts → mitigated by role-based overrides.
  • Controlled shutdown contains blast radius; demands ordered bring-up and dependency mapping.
  • Ticket automation (Remedy) accelerates response; tight coupling to ticket schema needs version governance.
  • Strict PII masking safeguards data; limits some ad-hoc queries → solved with redaction-aware dashboards.

Risks & Mitigations

  • False positives in duplicate detection → tune window + add business keys; manual override path.
  • Shutdown loops if the root cause persists → backoff strategy (sketched after this list); cap attempts; require operator ack beyond N cycles.
  • Ticket storms from upstream retries → throttling + roll-up incidents per service/tenant.
  • DLQ growth → aging alerts, replay windows, and auto-archive rules.
  • Policy drift across teams → versioned policies, pre-prod validation pack, and mandatory catch/catch-all linting.
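
A capped exponential backoff is one shape the shutdown-loop mitigation can take; the base delay, multiplier, and attempt cap below are assumptions to be tuned per service:

    import java.time.Duration;

    public class ShutdownBackoff {
        private static final int MAX_AUTO_ATTEMPTS = 5;            // beyond this, require operator ack
        private static final Duration BASE = Duration.ofMinutes(1); // assumed base delay

        // Delay before the next automated shutdown/bring-up cycle,
        // or null to escalate to an operator instead of looping.
        public static Duration nextDelay(int attempt) {
            if (attempt >= MAX_AUTO_ATTEMPTS) {
                return null;
            }
            long factor = 1L << attempt; // 1, 2, 4, 8, 16 ...
            return BASE.multipliedBy(factor);
        }

        public static void main(String[] args) {
            for (int attempt = 0; attempt <= MAX_AUTO_ATTEMPTS; attempt++) {
                System.out.println("attempt " + attempt + " -> " + nextDelay(attempt));
            }
        }
    }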

Suggested Metrics (run-time SLOs)

  • Error-to-ticket time p95 (ingest → incident created); computation sketched after this list.
  • Duplicate suppression rate and notification throttle hit rate.
  • Automatic shutdown mean time (first duplicate → action) and bring-up success rate.
  • DLQ depth & replay age per interface.
  • Unknown-to-classified ratio over rolling 30 days.
  • Policy compliance (% services with standardized fault policy & catch blocks).
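
The first metric is a straightforward percentile over paired timestamps. A nearest-rank sketch, assuming the latencies have already been extracted from ERROR_NOTIFICATION_INFO and the incident log:

    import java.time.Duration;
    import java.util.List;

    public class SloMetrics {
        // p95 of ingest->incident latencies, nearest-rank method.
        public static Duration p95(List<Duration> latencies) {
            List<Duration> sorted = latencies.stream().sorted().toList();
            int rank = (int) Math.ceil(0.95 * sorted.size()) - 1; // 0-based nearest rank
            return sorted.get(Math.max(rank, 0));
        }

        public static void main(String[] args) {
            List<Duration> sample = List.of(
                    Duration.ofSeconds(40), Duration.ofSeconds(55), Duration.ofSeconds(70),
                    Duration.ofSeconds(90), Duration.ofSeconds(300)); // one slow outlier
            System.out.println("p95 error-to-ticket: " + p95(sample)); // PT5M for this sample
        }
    }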

Closing principle

Treat errors as data, not noise—normalize, act safely, and learn on every repeat.

