Explainable AI Needs Explainable Infrastructure

Have you ever spent days and late nights debugging an AI system, only to realize the real issue was buried deep in your infrastructure layer? I recently ran into exactly this challenge, and the conclusion surprised me:
The sudden drops in model accuracy and the inconsistent predictions were not caused by faulty models. Instead, they were rooted in subtle infrastructure issues, such as latency spikes and other misconfigurations.
From this root-cause analysis, I learned that achieving true explainable AI (XAI) requires transparency not just in the model but also in the infrastructure layer on which the models run. This approach, which I term “explainable infrastructure,” bridges the critical gap between model transparency and operational observability.
The Real-World Problem: Unexplained Model Performance Drops
I was building a high-traffic recommendation system when I observed a sudden, unexplained drop in prediction accuracy. A rigorous investigation into the model itself turned up nothing; the root cause was eventually traced back to intermittent latency in the distributed storage layer, Amazon Simple Storage Service (S3) in this case.
According to Gartner’s 2023 report on cloud infrastructure reliability, 47% of unplanned downtime in AI/ML systems stems from infrastructure misconfigurations, including network latency and storage bottlenecks.
Why Infrastructure Transparency Matters
The performance of an AI model depends on the reliability of the underlying infrastructure. Fundamental infrastructure characteristics, such as database latency, network performance, and memory allocation, can indirectly influence AI model decisions, introducing subtle but impactful biases or inaccuracies.
Latency spikes in distributed systems account for ~35% of AI model performance degradation, often masked as model drift, as stated in the Google Cloud SRE Handbook, 2022.
To address this, I leveraged observability techniques typically used in large-scale distributed systems, specifically distributed tracing. This allowed me to bridge the gap between infrastructure metrics and AI model predictions.
Architecture for an Explainable AI Infrastructure
To visualize how the components interact, consider the following simplified architecture:
Figure 1. Architecture diagram for the infrastructure setup
OpenTelemetry Setup for AI inference pipeline
Here’s how I integrated OpenTelemetry into our AI inference pipeline to achieve transparency from the infrastructure up to the model’s decisions.
My OpenTelemetry Setup: We initialize OpenTelemetry to capture detailed spans across the entire inference pipeline, providing granular visibility into latency and performance bottlenecks.
# OpenTelemetry setup
import time

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize the tracer provider and register a Jaeger exporter
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(agent_host_name="localhost", agent_port=6831)
span_processor = BatchSpanProcessor(jaeger_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Implement distributed tracing across the inference pipeline
def ai_inference(input_data):
    with tracer.start_as_current_span("ai_inference_pipeline") as span:
        # Record infrastructure latency as a span attribute
        infra_latency = measure_storage_latency()
        span.set_attribute("storage_latency_ms", infra_latency)

        # Record the model output on the same span so infrastructure and
        # model signals can be correlated in a single trace
        prediction = run_model(input_data)
        span.set_attribute("model_prediction", prediction)
        return prediction

# Measure storage latency for calls to Amazon S3
def measure_storage_latency():
    start_time = time.time()
    perform_user_query()  # representative S3 read
    latency_ms = (time.time() - start_time) * 1000
    return latency_ms
Code 1. OpenTelemetry Setup for AI inference pipeline
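Code 1 assumes two helpers, run_model and perform_user_query, that live elsewhere in the pipeline. For completeness, here is a minimal sketch of what they might look like; the boto3 client, bucket name, and object key are illustrative placeholders rather than the actual production setup.

import boto3  # assumes the storage layer is Amazon S3, as in the incident above

s3_client = boto3.client("s3")

def perform_user_query():
    # A representative S3 read whose duration is attributed to the inference span.
    # Bucket and key are hypothetical; substitute your own feature-store location.
    response = s3_client.get_object(Bucket="feature-store-bucket", Key="user_features.parquet")
    return response["Body"].read()

def run_model(input_data):
    # Placeholder for the actual model call (e.g., a loaded scikit-learn or
    # TensorFlow model); returning a constant keeps the tracing example self-contained.
    return 0.0

# Example invocation: the resulting span records both the storage latency and the prediction.
ai_inference({"user_id": 123})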
Visualizing Metrics with Grafana Dashboards
We created Grafana dashboards to correlate infrastructure events with AI model performance. Here’s a simplified configuration:
Grafana Dashboard Panel for Latency Visualization: This panel tracks storage latency over time, enabling immediate identification of potential infrastructure bottlenecks.
{
  "title": "Storage Latency",
  "type": "graph",
  "datasource": "Jaeger",
  "targets": [
    {
      "expr": "rate(storage_latency_ms[5m])",
      "interval": "1m"
    }
  ],
  "yaxes": [
    { "format": "ms", "label": "Latency (ms)" },
    {}
  ]
}
Code 2. Grafana dashboard panel for visualizing storage latency
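One practical note: the panel’s query uses PromQL syntax, which assumes the latency is also exported as a scrapeable metric rather than only as a span attribute. Below is a minimal sketch of that bridge using the OpenTelemetry metrics API together with the opentelemetry-exporter-prometheus and prometheus_client packages; the package choice, port, and metric names are my assumptions, not part of the original setup.

from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Expose a /metrics endpoint for Prometheus to scrape (port is arbitrary here)
start_http_server(port=8000)

reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter(__name__)

# Histogram of storage latency; Prometheus exposes it as the
# storage_latency_ms_bucket / _sum / _count series, so a dashboard query would
# typically be rate(storage_latency_ms_sum[5m]) / rate(storage_latency_ms_count[5m]).
storage_latency_hist = meter.create_histogram(
    "storage_latency_ms",
    unit="ms",
    description="Latency of storage reads during inference",
)

# Inside ai_inference(), record the value alongside the span attribute:
# storage_latency_hist.record(infra_latency)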
Configuring Grafana Alerts for Latency Spikes
We monitor the infrastructure proactively using alerts. To detect latency issues as they emerge, I set up a simple Grafana alert rule:
{
  "alert": {
    "conditions": [
      {
        "evaluator": { "params": [300], "type": "gt" },
        "query": { "params": ["A", "5m", "now"] },
        "reducer": { "params": [], "type": "avg" },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "frequency": "1m",
    "handler": 1,
    "name": "High Storage Latency Alert",
    "noDataState": "no_data",
    "notifications": []
  },
  "title": "Storage Latency Alert",
  "type": "graph"
}
Code 3. Configuring alerts for latency spikes
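To make the rule’s semantics concrete, here is a small Python sketch of what gets evaluated every minute: reduce the last five minutes of latency samples with an average (the "avg" reducer) and compare the result against the 300 ms threshold (the "gt" evaluator). Grafana performs this evaluation itself against the datasource; the sketch is purely illustrative.

import time
from collections import deque

WINDOW_SECONDS = 300   # the rule's "5m" query window
THRESHOLD_MS = 300     # the evaluator's params: [300], type "gt"

samples = deque()      # (timestamp, latency_ms) pairs

def record_latency(latency_ms, now=None):
    now = time.time() if now is None else now
    samples.append((now, latency_ms))
    # Drop samples that have aged out of the evaluation window
    while samples and samples[0][0] < now - WINDOW_SECONDS:
        samples.popleft()

def should_alert():
    if not samples:
        return False   # mirrors "noDataState": "no_data" -- no decision without data
    average = sum(latency for _, latency in samples) / len(samples)  # reducer: "avg"
    return average > THRESHOLD_MS                                    # evaluator: "gt"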
Actionable Insights
- Unified Observability: Integrate your AI models’ metrics with your infrastructure metrics; the north-star goal is to track the end-to-end health of the system (see the sketch after this list).
- Proactive Alerting: Alerting on infrastructure-level anomalies enables early detection of issues, which shortens the time to a fix and improves the user experience.
- Regular Reviews: Check infrastructure health alongside model performance in routine operational reviews.
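As a concrete illustration of the unified-observability point, the sketch below records model-level signals next to the infrastructure latency histogram so that both land in the same dashboard. It reuses the meter and storage_latency_hist from the Prometheus sketch above; the metric names and the feedback mechanism are assumptions made for illustration.

# Unified observability: one meter, both model and infrastructure signals.
prediction_counter = meter.create_counter(
    "model_predictions_total",
    description="Number of predictions served",
)
online_accuracy_hist = meter.create_histogram(
    "model_online_accuracy",
    description="Accuracy derived from delayed user feedback (1.0 = correct)",
)

def record_inference(latency_ms, was_correct=None):
    storage_latency_hist.record(latency_ms)  # infrastructure signal
    prediction_counter.add(1)                # model traffic signal
    if was_correct is not None:
        # Model quality signal, recorded whenever feedback arrives
        online_accuracy_hist.record(1.0 if was_correct else 0.0)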
These explainable infrastructure practices, observability in particular, can help an organization reduce its troubleshooting times dramatically. They also mark a shift in mindset: debugging becomes proactive rather than reactive, which significantly enhances system reliability and builds trust in AI solutions.
Final Thoughts
In my opinion, the intersection of infrastructure observability and explainable AI is ripe for innovation. Future AI systems will rely heavily on transparent infrastructure observability tools, methodologies, and processes, which brings greater accountability for stakeholders and builds end-user confidence in AI systems. As MIT Technology Review stated in its 2024 research:
The next frontier for trustworthy AI isn’t just explainable models—it’s explainable infrastructure.
Explainable AI infrastructure is not merely a technical solution; it’s foundational and essential to building trustworthy, reliable AI. I’d love to hear your thoughts—how are you ensuring transparency across your AI systems?