Is LangSmith Down? How to Check LangSmith Status in Real-Time

Quick Answer: To check if LangSmith is down, visit apistatuscheck.com/api/langsmith for real-time monitoring, or check the official status.langchain.com page. Common signs include tracing connection failures, run logging delays, dashboard timeouts, evaluation pipeline errors, and API authentication issues affecting your LLM application monitoring.

When your LLM observability platform stops working, you lose critical visibility into your production AI systems. LangSmith is LangChain's essential tracing, debugging, and evaluation platform used by thousands of developers to monitor LLM applications, debug complex agent chains, and evaluate model performance. Any downtime means you're flying blind—unable to debug issues, trace errors, or monitor your AI applications in production. Knowing how to quickly verify LangSmith's status can save hours of troubleshooting and help you make informed decisions about your observability strategy.

How to Check LangSmith Status in Real-Time

1. API Status Check (Fastest Method)

The quickest way to verify LangSmith's operational status is through apistatuscheck.com/api/langsmith. This real-time monitoring service:

  • Tests actual API endpoints every 60 seconds
  • Shows response times and latency trends
  • Tracks historical uptime over 30/60/90 days
  • Provides instant alerts when issues are detected
  • Monitors tracing ingestion and API availability

Unlike status pages that rely on manual updates, API Status Check performs active health checks against LangSmith's production endpoints, giving you the most accurate real-time picture of service availability—critical when you need to know if your traces are being recorded.

2. Official LangChain Status Page

LangChain maintains status.langchain.com as their official communication channel for service incidents. The page displays:

  • Current operational status for all services
  • Active incidents and investigations
  • Scheduled maintenance windows
  • Historical incident reports
  • Component-specific status (API, Tracing, Dashboard, Evaluations)

Pro tip: Subscribe to status updates via email or webhook on the status page to receive immediate notifications when incidents occur. This is especially important during production deployments when tracing failures could mask critical issues.

3. Check Your LangSmith Dashboard

If the LangSmith Dashboard at smith.langchain.com is loading slowly or showing errors, this often indicates broader infrastructure issues. Pay attention to:

  • Login failures or timeouts
  • Project list loading errors
  • Trace data not appearing
  • Run details failing to load
  • Evaluation dashboard timeouts

Quick test: Navigate to a recent project and check if traces from the last hour are visible. Missing recent traces often indicates ingestion problems.
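The quick test above can be scripted: fetch recent run timestamps (e.g. via `Client.list_runs`) and measure the gap since the newest one. A minimal sketch with a hypothetical `recent_trace_gap_minutes` helper operating on plain datetimes:

```python
from datetime import datetime, timedelta

def recent_trace_gap_minutes(trace_times, now=None):
    """Return minutes since the newest trace, or None if no traces.

    A large gap relative to your normal traffic suggests ingestion problems.
    """
    if now is None:
        now = datetime.now()
    if not trace_times:
        return None
    newest = max(trace_times)
    return (now - newest).total_seconds() / 60

# Example: the newest trace arrived 75 minutes ago -> likely ingestion trouble
now = datetime(2026, 2, 4, 12, 0)
times = [now - timedelta(minutes=m) for m in (75, 90, 120)]
gap = recent_trace_gap_minutes(times, now=now)
print(gap)  # 75.0
```

Feed it the `start_time` values of your latest runs and alert when the gap exceeds your usual inter-trace interval.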

4. Test API Endpoints Directly

For developers, making a test API call can quickly confirm connectivity and authentication:

import requests
from datetime import datetime

# Test LangSmith API connectivity
api_key = "your_langsmith_api_key"
headers = {"x-api-key": api_key}

try:
    response = requests.get(
        "https://api.smith.langchain.com/api/v1/sessions",
        headers=headers,
        timeout=10
    )
    
    if response.status_code == 200:
        print(f"✓ LangSmith API responding normally ({response.elapsed.total_seconds():.2f}s)")
    else:
        print(f"✗ LangSmith API error: HTTP {response.status_code}")
        
except requests.exceptions.Timeout:
    print("✗ LangSmith API timeout - possible outage")
except requests.exceptions.ConnectionError:
    print("✗ Cannot connect to LangSmith API")

Look for HTTP response codes outside the 2xx range, timeout errors exceeding 10 seconds, or SSL/TLS handshake failures.
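Those failure modes can be folded into a single helper. A sketch; the status-code buckets and latency thresholds are illustrative assumptions, not LangSmith-defined values:

```python
def classify_api_health(status_code, elapsed_seconds, timeout_seconds=10):
    """Map an HTTP probe result to a coarse health verdict."""
    if status_code is None:
        return "unreachable"           # connection error or timeout before a response
    if elapsed_seconds >= timeout_seconds:
        return "timeout"
    if 200 <= status_code < 300:
        return "slow" if elapsed_seconds > 2 else "healthy"
    if status_code in (401, 403):
        return "auth_error"            # check your API key before assuming an outage
    if status_code >= 500:
        return "server_error"          # likely a LangSmith-side problem
    return "client_error"

print(classify_api_health(200, 0.4))   # healthy
print(classify_api_health(503, 0.9))   # server_error
```

Wire this to the `requests.get` probe above by passing `response.status_code` and `response.elapsed.total_seconds()`, or `None` on exceptions.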

5. Check LangChain Integration Logs

If you're using LangChain with LangSmith tracing enabled, check your application logs for tracing errors:

import logging
import os
from langchain_openai import ChatOpenAI

# Enable tracing plus debug logging so tracing failures surface in your logs
logging.basicConfig(level=logging.DEBUG)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key"

# If traces aren't appearing, check the debug output for connection errors
llm = ChatOpenAI(temperature=0)
response = llm.invoke("test query")

# Check your logs for errors like:
# "Failed to post run to LangSmith"
# "Connection timeout to api.smith.langchain.com"
# "Authentication failed"

Silent trace failures are common—your application continues working, but observability data is lost.

Common LangSmith Issues and How to Identify Them

Tracing Connection Failures

Symptoms:

  • Traces not appearing in dashboard despite being sent
  • ConnectionError or Timeout exceptions in LangSmith callbacks
  • Silent failures (no errors, but no traces recorded)
  • Inconsistent trace ingestion (some traces appear, others don't)

What it means: When tracing ingestion is degraded, your LLM applications continue functioning normally, but you lose critical observability data. This is particularly dangerous because you won't notice until you actively need to debug something.

Common error patterns:

# Error in LangChain logs
LangSmithConnectionError: Failed to upload run to LangSmith
  Caused by: requests.exceptions.ConnectionError: 
  HTTPSConnectionPool(host='api.smith.langchain.com', port=443): 
  Max retries exceeded

# Silent failure - check for this
import logging
logging.basicConfig(level=logging.DEBUG)
# Look for: "Failed to post run" without raising exceptions

How to detect:

  1. Check if traces from the last 5 minutes appear in your project
  2. Compare trace count with your application's actual LLM call volume
  3. Look for gaps in trace timestamps
  4. Monitor callback execution time (>5s indicates problems)
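Step 3 above (looking for gaps in trace timestamps) is easy to automate. A sketch over plain datetimes; the 5-minute threshold is an assumption to tune against your normal traffic:

```python
from datetime import datetime, timedelta

def find_trace_gaps(timestamps, max_gap_minutes=5):
    """Return (start, end) pairs where consecutive traces are further apart
    than max_gap_minutes - candidate ingestion-outage windows."""
    ts = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ts, ts[1:]):
        if (later - earlier) > timedelta(minutes=max_gap_minutes):
            gaps.append((earlier, later))
    return gaps

base = datetime(2026, 2, 4, 12, 0)
times = [base + timedelta(minutes=m) for m in (0, 1, 2, 20, 21)]
print(find_trace_gaps(times))  # one gap: 12:02 -> 12:20
```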

Run Logging Delays

Symptoms:

  • Traces appearing 5-30+ minutes after execution
  • Out-of-order trace timestamps
  • Incomplete run trees (parent runs visible but child runs missing)
  • Evaluation results delayed or not updating

Impact: Delayed trace ingestion makes real-time debugging impossible. If you're investigating a production issue happening right now, but traces won't appear for 20 minutes, you're effectively blind during the critical incident window.

Verification test:

import time
import uuid
from datetime import datetime, timedelta

from langsmith import Client

client = Client()

# Create a test run with a known ID and timestamp
test_run_id = str(uuid.uuid4())
start_time = datetime.now()

# Send a simple trace
client.create_run(
    id=test_run_id,
    name="delay_test",
    run_type="llm",
    inputs={"query": "test"},
    outputs={"response": "test"},
    start_time=start_time,
    end_time=start_time + timedelta(seconds=1)
)

# Wait 30 seconds and check if it's visible
time.sleep(30)
runs = client.list_runs(project_name="your_project", limit=10)

# If your test run isn't in the most recent runs, ingestion is delayed

Dashboard Timeouts

Signs the dashboard is impacted:

  • Infinite loading spinners on project pages
  • "Failed to load runs" errors
  • Trace detail pages timing out
  • Comparison view not rendering
  • Annotation interface unresponsive

Common patterns:

# Browser console errors indicating backend issues
Failed to fetch runs: 504 Gateway Timeout
Error loading trace: Network request failed
WebSocket connection to wss://smith.langchain.com failed

Dashboard issues often accompany API problems but can also occur independently. The dashboard relies heavily on real-time queries across large trace datasets, making it sensitive to backend performance degradation.

Workaround: During dashboard outages, you can still query runs programmatically:

from datetime import datetime, timedelta

from langsmith import Client

client = Client()

# Fetch recent runs via API when dashboard is down
runs = client.list_runs(
    project_name="production",
    start_time=datetime.now() - timedelta(hours=1),
    limit=100
)

for run in runs:
    if run.error:
        print(f"Error in {run.name}: {run.error}")

Evaluation Pipeline Errors

Indicators:

  • Evaluation runs stuck in "pending" status
  • Dataset comparisons failing to complete
  • Evaluator functions timing out
  • A/B test results not updating

Example failure:

from langsmith.evaluation import evaluate

# Evaluation may hang or fail during outages
# (my_llm_app and correctness_evaluator are your own target and evaluator)
results = evaluate(
    lambda inputs: my_llm_app(inputs["query"]),
    data="my_dataset",
    evaluators=[correctness_evaluator],
    # This may time out or fail silently
)

# Inspect per-example results for errors
for result in results:
    if result["run"].error:
        print(f"Evaluation run failed: {result['run'].error}")

Evaluation pipelines are particularly vulnerable because they involve:

  • Multiple concurrent API calls
  • Large dataset processing
  • Real-time LLM invocations
  • Complex result aggregation

A partial LangSmith outage may cause evaluations to fail midway through, wasting compute resources and time.
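One mitigation is to checkpoint evaluation progress so a mid-run failure only loses the in-flight item. A minimal sketch; this is a hand-rolled loop with an assumed item/result shape, not the `langsmith.evaluation` API:

```python
import json
import tempfile
from pathlib import Path

def evaluate_with_checkpoints(items, evaluator, checkpoint_path):
    """Evaluate items one at a time, persisting results after each so a
    partial outage only costs the in-flight item, not the whole run."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for item_id, item in items.items():
        if item_id in done:
            continue  # already completed before the interruption
        try:
            done[item_id] = evaluator(item)
        except Exception as e:
            done[item_id] = {"error": str(e)}
        path.write_text(json.dumps(done))  # checkpoint after every item
    return done

checkpoint = Path(tempfile.mkdtemp()) / "eval_checkpoint.json"
results = evaluate_with_checkpoints(
    {"a": "2+2", "b": "3+3"},
    evaluator=lambda q: {"score": 1.0},
    checkpoint_path=checkpoint,
)
print(sorted(results))  # ['a', 'b']
```

Re-running the same call after an outage skips everything already in the checkpoint file.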

API Authentication Issues

Common authentication errors:

# 401 Unauthorized
requests.exceptions.HTTPError: 401 Client Error: Unauthorized
Message: Invalid API key

# 403 Forbidden  
HTTPError: 403 Client Error: Forbidden for url
Message: API key does not have access to this resource

# Intermittent auth failures (cache issues)
Sometimes succeeds, sometimes fails with 401

During outages, you may see:

  • Previously working API keys suddenly returning 401
  • Intermittent authentication failures (works, then doesn't)
  • Session tokens expiring faster than normal
  • OAuth flows failing

Distinguishing outage from misconfiguration:

import os
import time

from langsmith import Client

def test_auth_stability():
    """Test if auth issues are intermittent (outage) or persistent (config error)"""
    client = Client(api_key=os.getenv("LANGSMITH_API_KEY"))
    
    results = []
    for i in range(5):
        try:
            client.list_runs(limit=1)
            results.append("success")
        except Exception as e:
            results.append(f"error: {str(e)}")
        time.sleep(2)
    
    # If results alternate between success/error, likely an outage
    # If all errors, likely config issue
    return results

# Output during outage: ['success', 'error: 401', 'success', 'error: 401', 'success']
# Output during config issue: ['error: 401', 'error: 401', 'error: 401', ...]

The Real Impact When LangSmith Goes Down

Lost Observability During Critical Incidents

The most dangerous impact of LangSmith downtime is losing visibility exactly when you need it most:

Scenario: Your production LLM application starts hallucinating or returning errors. Normally, you'd:

  1. Check LangSmith traces to see what's happening
  2. Identify which prompts are causing issues
  3. Review the chain-of-thought for problematic runs
  4. Compare current behavior against baselines

With LangSmith down: You're debugging blind. You can't see:

  • Which LLM calls are failing
  • What prompts users are sending
  • How your agents are reasoning
  • Whether the issue is in your code or the LLM provider

This transforms a 10-minute debugging session into hours of manual log analysis.

Unable to Debug Production Issues

Modern LLM applications are complex multi-step systems:

# Typical LangChain agent with multiple steps
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI

# When this fails in production, you NEED traces to debug
agent_executor = AgentExecutor(
    agent=create_openai_functions_agent(llm, tools, prompt),
    tools=tools,
    verbose=True,
    return_intermediate_steps=True  # Only useful if LangSmith is recording!
)

result = agent_executor.invoke({"input": user_query})

Without LangSmith traces:

  • You can't see which tool the agent chose
  • You can't see the reasoning for tool selection
  • You can't see intermediate LLM responses
  • You can't identify where in the chain it failed

Financial impact: If your AI application generates $10,000/hour in revenue, and a bug that would take 10 minutes to debug with traces takes 4 hours without, that's over $38,000 in additional lost revenue.
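That figure is straightforward arithmetic, using the illustrative numbers above:

```python
hourly_revenue = 10_000          # $/hour, illustrative
debug_with_traces_h = 10 / 60    # 10 minutes, expressed in hours
debug_without_traces_h = 4       # 4 hours

# Additional revenue lost to the slower, trace-less debugging session
extra_loss = (debug_without_traces_h - debug_with_traces_h) * hourly_revenue
print(round(extra_loss))  # 38333
```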

Halted Evaluation and Experimentation

Data science and ML teams rely on LangSmith for systematic evaluation:

# A/B testing different prompts
from langsmith.evaluation import evaluate

# Can't run this during outages
results_v1 = evaluate(
    lambda x: chain_v1.invoke(x),
    data="production_sample_1000",
    evaluators=[accuracy, relevance, tone]
)

results_v2 = evaluate(
    lambda x: chain_v2.invoke(x), 
    data="production_sample_1000",
    evaluators=[accuracy, relevance, tone]
)

# Decision blocked: which version to deploy?

Delayed decisions:

  • Can't validate prompt improvements before deploying
  • Can't compare new models against baselines
  • Can't run regression tests on agent changes
  • Can't evaluate fine-tuned model performance

For teams running continuous evaluation pipelines, LangSmith downtime means:

  • Blocked deployments waiting for eval results
  • Delayed experimentation cycles
  • Inability to validate production changes
  • Risk of deploying untested changes
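A deployment gate can make that dependency explicit by failing closed when evaluation results are missing or stale. A sketch; the result-record shape (`score`, `completed_at`) and the thresholds are assumptions:

```python
from datetime import datetime, timedelta

def can_deploy(eval_result, max_age_hours=24, min_score=0.9, now=None):
    """Allow deployment only with a fresh, passing evaluation result.
    Fails closed: no result (e.g. evals blocked by an outage) means no deploy."""
    now = now or datetime.now()
    if eval_result is None:
        return False
    fresh = now - eval_result["completed_at"] <= timedelta(hours=max_age_hours)
    return fresh and eval_result["score"] >= min_score

now = datetime(2026, 2, 4, 12, 0)
ok = {"score": 0.95, "completed_at": now - timedelta(hours=2)}
stale = {"score": 0.95, "completed_at": now - timedelta(hours=48)}
print(can_deploy(ok, now=now), can_deploy(stale, now=now), can_deploy(None, now=now))
# True False False
```

Failing closed trades deployment velocity for safety, which is usually the right default for production LLM changes.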

Compliance and Audit Trail Gaps

Many industries require complete audit trails of AI system behavior:

  • Healthcare AI: Need to trace every diagnostic recommendation
  • Financial services: Must log all AI-generated trading or lending decisions
  • Legal tech: Require complete chain-of-custody for AI-assisted analysis

When LangSmith is down:

  • Audit trails have gaps
  • Compliance requirements may be violated
  • Cannot retroactively generate required reports
  • Potential regulatory penalties

Example compliance requirement:

# Healthcare: Must log all patient-facing AI interactions
@trace_for_compliance
def diagnose_symptoms(symptoms: str) -> Diagnosis:
    """This MUST be traced for regulatory compliance"""
    diagnosis = medical_llm.invoke(symptoms)
    # If LangSmith is down, this trace is lost
    # Potential HIPAA/FDA compliance violation
    return diagnosis

Failed Chain-of-Thought Debugging

For complex reasoning tasks, LangSmith's chain-of-thought visibility is irreplaceable:

# Multi-step reasoning chain
from langchain.chains import LLMChain, SequentialChain

# Step 1: Analyze requirements
analyze_chain = LLMChain(llm=llm, prompt=analyze_prompt, output_key="analysis")

# Step 2: Generate solution
solution_chain = LLMChain(llm=llm, prompt=solution_prompt, output_key="solution")

# Step 3: Review quality
review_chain = LLMChain(llm=llm, prompt=review_prompt, output_key="review")

# Combined chain
full_chain = SequentialChain(
    chains=[analyze_chain, solution_chain, review_chain],
    input_variables=["requirements"],
    output_variables=["analysis", "solution", "review"]
)

With LangSmith working: You see each step's inputs, outputs, reasoning, and timing.
With LangSmith down: You only see final output—no idea which step failed or why.

Broken Monitoring and Alerting

Many teams build production monitoring on top of LangSmith data:

# Production monitoring that depends on LangSmith
from datetime import datetime, timedelta
from langsmith import Client

def check_production_health():
    client = Client()
    
    # Get last hour of production runs
    recent_runs = client.list_runs(
        project_name="production",
        start_time=datetime.now() - timedelta(hours=1)
    )
    
    # Calculate error rate (materialize the generator before reusing it)
    runs = list(recent_runs)
    total = len(runs)
    errors = len([r for r in runs if r.error])
    error_rate = errors / total if total > 0 else 0
    
    # Alert if error rate > 5%
    if error_rate > 0.05:
        send_alert(f"Production error rate: {error_rate:.1%}")

When LangSmith is down:

  • Production monitoring breaks
  • Alert systems go dark
  • SLA tracking impossible
  • Cannot detect production degradation

This creates a double-blind situation: LangSmith is down AND you can't monitor your actual application.
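One way out of the double-blind: compute the same error rate from a locally written JSONL log. A sketch assuming one JSON object per line with an `error` field (the format used by the fallback-logging approach in this article's playbook):

```python
import json

def error_rate_from_jsonl(lines):
    """Compute the error rate from locally logged runs (one JSON object per
    line, each with an "error" field) - works with LangSmith unreachable."""
    total = errors = 0
    for line in lines:
        record = json.loads(line)
        total += 1
        if record.get("error"):
            errors += 1
    return errors / total if total else 0.0

log_lines = [
    '{"run_name": "chat", "error": null}',
    '{"run_name": "chat", "error": "timeout"}',
    '{"run_name": "chat", "error": null}',
    '{"run_name": "chat", "error": null}',
]
rate = error_rate_from_jsonl(log_lines)
print(f"{rate:.0%}")  # 25%
```

In production you would read the lines from the fallback log file and feed the rate into the same alert threshold as before.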

Incident Response Playbook for LangSmith Outages

1. Implement Graceful Degradation

Configure LangSmith tracing to fail silently rather than breaking your application:

import os
from langsmith import Client
from langchain.callbacks import LangChainTracer

# Safe tracing configuration
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# Create tracer with timeout and error handling
try:
    tracer = LangChainTracer(
        project_name="production",
        client=Client(
            api_key=os.getenv("LANGSMITH_API_KEY"),
            timeout_ms=5000  # Fail fast if LangSmith is slow
        )
    )
except Exception as e:
    print(f"Warning: LangSmith tracer failed to initialize: {e}")
    tracer = None  # Application continues without tracing

# Use tracer only if available
callbacks = [tracer] if tracer else []

# Your LLM calls continue working even if LangSmith is down
response = llm.invoke(
    "user query",
    config={"callbacks": callbacks}
)

Key principle: Observability is valuable but should NEVER block production traffic.
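That principle can be enforced with a wrapper that logs and swallows observability failures. A framework-agnostic sketch (real LangChain callbacks would subclass `BaseCallbackHandler`; this just illustrates the fail-open pattern):

```python
import logging

def never_block(fn):
    """Wrap an observability hook so its failures are logged, not raised.
    The wrapped call can never take down the request path."""
    def safe(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            logging.warning("tracing hook failed (ignored): %s", e)
            return None
    return safe

@never_block
def record_trace(payload):
    # Hypothetical hook that would upload a trace; here it always fails
    raise ConnectionError("api.smith.langchain.com unreachable")

# Production code keeps working even though the hook raised
result = record_trace({"run": "demo"})
print(result)  # None
```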

2. Enable Local Trace Buffering

Queue traces locally when LangSmith is unavailable, then replay when service resumes:

import json
from pathlib import Path
from datetime import datetime
from langsmith import Client

class BufferedLangSmithClient:
    """Buffers traces locally during LangSmith outages"""
    
    def __init__(self, buffer_dir="./langsmith_buffer"):
        self.client = Client()
        self.buffer_dir = Path(buffer_dir)
        self.buffer_dir.mkdir(exist_ok=True)
        
    def create_run(self, **kwargs):
        try:
            # Try normal submission
            return self.client.create_run(**kwargs)
        except Exception as e:
            # Buffer locally on failure
            print(f"LangSmith unavailable, buffering trace: {e}")
            self._buffer_trace("create_run", kwargs)
            return None
    
    def _buffer_trace(self, operation, data):
        """Save trace to local file"""
        timestamp = datetime.now().isoformat()
        safe_ts = timestamp.replace(":", "-")  # filename-safe on all platforms
        filename = self.buffer_dir / f"{safe_ts}_{operation}.json"
        
        with open(filename, 'w') as f:
            json.dump({
                'operation': operation,
                'data': data,
                'timestamp': timestamp
            }, f, default=str)  # default=str serializes datetime values
    
    def replay_buffered_traces(self):
        """Replay all buffered traces when service resumes"""
        buffered_files = sorted(self.buffer_dir.glob("*.json"))
        
        print(f"Replaying {len(buffered_files)} buffered traces...")
        
        for trace_file in buffered_files:
            try:
                with open(trace_file) as f:
                    trace = json.load(f)
                
                # Replay the operation
                if trace['operation'] == 'create_run':
                    self.client.create_run(**trace['data'])
                
                # Delete successfully replayed trace
                trace_file.unlink()
                
            except Exception as e:
                print(f"Failed to replay {trace_file}: {e}")

# Usage
buffered_client = BufferedLangSmithClient()

# This continues working during outages
buffered_client.create_run(
    name="my_run",
    run_type="llm",
    inputs={"query": "test"}
)

# After outage resolves
buffered_client.replay_buffered_traces()

This ensures no trace data is lost during outages, maintaining complete observability history.

3. Implement Alternative Logging

When LangSmith is down, fall back to structured logging for basic observability:

import logging
import json
from datetime import datetime
from typing import Dict, Any

# Configure JSON logging as LangSmith fallback
class LLMLogger:
    """Fallback logging when LangSmith is unavailable"""
    
    def __init__(self, log_file="llm_traces.jsonl"):
        self.log_file = log_file
        self.logger = logging.getLogger("llm_traces")
        handler = logging.FileHandler(log_file)
        handler.setFormatter(logging.Formatter('%(message)s'))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)
    
    def log_run(self, 
                run_name: str,
                inputs: Dict[str, Any],
                outputs: Dict[str, Any],
                error: str = None,
                metadata: Dict[str, Any] = None):
        """Log LLM run in LangSmith-compatible format"""
        
        trace = {
            "timestamp": datetime.now().isoformat(),
            "run_name": run_name,
            "inputs": inputs,
            "outputs": outputs,
            "error": error,
            "metadata": metadata or {}
        }
        
        self.logger.info(json.dumps(trace))
    
    def search_traces(self, query: str = None, error_only: bool = False):
        """Basic trace search in logs"""
        traces = []
        with open(self.log_file) as f:
            for line in f:
                trace = json.loads(line)
                if error_only and not trace.get("error"):
                    continue
                if query and query not in json.dumps(trace):
                    continue
                traces.append(trace)
        return traces

# Use as fallback
fallback_logger = LLMLogger()

try:
    # Try LangSmith first
    langsmith_client.create_run(...)
except Exception:
    # Fall back to local logging
    fallback_logger.log_run(
        run_name="my_chain",
        inputs={"query": user_input},
        outputs={"response": llm_response}
    )

4. Monitor LangSmith Health Proactively

Don't wait for users to report issues—detect LangSmith problems automatically:

import os
import time
import requests
from datetime import datetime, timedelta

class LangSmithHealthMonitor:
    """Proactive LangSmith health monitoring"""
    
    def __init__(self, check_interval=60):
        self.check_interval = check_interval
        self.consecutive_failures = 0
        self.last_success = datetime.now()
    
    def check_health(self) -> bool:
        """Check if LangSmith is responding"""
        try:
            response = requests.get(
                "https://api.smith.langchain.com/api/v1/sessions",
                headers={"x-api-key": os.getenv("LANGSMITH_API_KEY")},
                timeout=10
            )
            
            if response.status_code == 200:
                self.consecutive_failures = 0
                self.last_success = datetime.now()
                return True
            else:
                self.consecutive_failures += 1
                return False
                
        except Exception as e:
            self.consecutive_failures += 1
            return False
    
    def get_status(self) -> dict:
        """Get current health status"""
        is_healthy = self.check_health()
        
        return {
            "healthy": is_healthy,
            "consecutive_failures": self.consecutive_failures,
            "last_success": self.last_success.isoformat(),
            "minutes_since_success": (datetime.now() - self.last_success).seconds // 60
        }
    
    def monitor(self, alert_callback):
        """Continuous monitoring with alerts"""
        was_down = False
        while True:
            # Capture downtime before the check, since a successful
            # check resets last_success
            downtime_minutes = int(
                (datetime.now() - self.last_success).total_seconds() // 60
            )
            status = self.get_status()
            
            # Alert after 3 consecutive failures
            if status["consecutive_failures"] >= 3:
                was_down = True
                alert_callback(
                    f"🚨 LangSmith appears to be down! "
                    f"{downtime_minutes} minutes since last success"
                )
            
            # Alert when service recovers
            elif status["healthy"] and was_down:
                was_down = False
                alert_callback(
                    f"✅ LangSmith has recovered after "
                    f"{downtime_minutes} minutes"
                )
            
            time.sleep(self.check_interval)

# Run in background
def send_slack_alert(message):
    # Your alert logic
    print(message)

monitor = LangSmithHealthMonitor(check_interval=60)
monitor.monitor(alert_callback=send_slack_alert)

5. Prepare Team Communication

Have templates ready for internal and external communication:

Internal Slack alert:

🚨 LangSmith Outage Detected

Status: LangSmith API unresponsive
Impact: Tracing and evaluation unavailable
Duration: 15 minutes so far
Action: Switched to fallback logging

Monitor: https://apistatuscheck.com/api/langsmith
Official: https://status.langchain.com

Team: Continue development. Traces are being buffered locally.

Customer-facing status:

We're currently experiencing delays in our AI monitoring dashboard 
due to issues with our observability provider (LangSmith). 

Your application functionality is NOT affected—only internal 
monitoring is impacted. We're tracking the issue and will provide 
updates as the situation develops.

6. Post-Outage Recovery Steps

After LangSmith service is restored:

# Helper names below (buffered_client, client, outage_start_time,
# get_expected_run_count, retry_failed_evaluations, refresh_dashboards)
# stand in for your own infrastructure
def post_outage_recovery():
    """Run after LangSmith outage resolves"""
    
    print("1. Replaying buffered traces...")
    buffered_client.replay_buffered_traces()
    
    print("2. Verifying trace completeness...")
    # Check for gaps in trace timeline
    recent_runs = client.list_runs(
        start_time=outage_start_time,
        end_time=datetime.now()
    )
    
    expected_count = get_expected_run_count()
    actual_count = len(list(recent_runs))
    
    if actual_count < expected_count * 0.95:
        print(f"⚠️  Warning: Only {actual_count}/{expected_count} expected traces present")
    
    print("3. Re-running failed evaluations...")
    # Retry any evaluations that failed during outage
    retry_failed_evaluations(outage_start_time, outage_end_time)
    
    print("4. Updating monitoring dashboards...")
    # Refresh any cached data
    refresh_dashboards()
    
    print("✅ Recovery complete")

Frequently Asked Questions

How often does LangSmith go down?

LangSmith maintains strong uptime, typically exceeding 99.9% availability. Major outages affecting all customers are infrequent (2-4 times per year), though brief degradations or regional issues may occur more often. As a relatively newer platform compared to established cloud services, LangSmith is actively improving infrastructure resilience. Most development teams experience minimal disruption from LangSmith downtime in a typical month.

What's the difference between LangSmith status page and API Status Check?

The official LangChain status page (status.langchain.com) is manually updated by LangChain's team during incidents, which can sometimes lag behind actual issues by several minutes. API Status Check performs automated health checks every 60 seconds against live LangSmith API endpoints, often detecting issues before they're officially reported. For production systems, use both: API Status Check for immediate detection and the official status page for detailed incident information and ETAs.

Will my LangChain application stop working if LangSmith is down?

No! LangSmith is purely an observability layer—your LangChain application will continue functioning normally even if LangSmith is completely unavailable. However, you'll lose tracing, debugging visibility, and evaluation capabilities during the outage. Best practice is to configure tracing callbacks with timeouts and error handling so LangSmith failures don't slow down your application's response times.

How do I prevent losing trace data during LangSmith outages?

Implement local buffering as shown in the incident playbook above. Queue traces locally when LangSmith is unavailable, then replay them when service resumes. This ensures complete trace history even through extended outages. Alternatively, implement dual-logging to both LangSmith and a local structured logging system, giving you fallback observability during outages.
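Dual-logging can be as simple as writing locally first and treating the remote upload as best-effort. A sketch; `upload` stands in for a call like `Client.create_run`:

```python
def dual_log(trace, local_store, upload):
    """Always persist the trace locally; treat remote upload as best-effort.
    Returns True when the remote upload also succeeded."""
    local_store.append(trace)  # the local copy survives any outage
    try:
        upload(trace)
        return True
    except Exception:
        return False

store = []

def flaky_upload(trace):
    # Simulates LangSmith being unreachable
    raise ConnectionError("LangSmith down")

ok = dual_log({"run": "demo"}, store, flaky_upload)
print(ok, len(store))  # False 1
```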

Can I use multiple tracing platforms simultaneously?

Yes! LangChain supports multiple callbacks, so you can send traces to both LangSmith and alternative platforms like Helicone or OpenLIT:

from langchain.callbacks import LangChainTracer

# Dual tracing for redundancy; the second entry is whatever callback
# handler your alternative platform provides (e.g. Helicone, OpenLIT)
callbacks = [
    LangChainTracer(project_name="production"),  # Primary
    my_backup_callback_handler,                  # Backup (platform-specific)
]

response = llm.invoke(query, config={"callbacks": callbacks})

This provides redundancy: if LangSmith is down, you still have traces in Helicone.

Does LangSmith downtime affect OpenAI, Anthropic, or other LLM providers?

No. LangSmith is independent from LLM providers. If LangSmith is down but OpenAI is operational or Anthropic is working, your LLM calls will continue normally—you'll just lose observability. Conversely, if your LLM provider is down but LangSmith is up, you'll see the provider failures clearly in LangSmith traces (which is actually quite helpful for diagnosis).

How do I debug my LLM application when LangSmith is down?

Fall back to these techniques:

  1. Verbose mode: Enable verbose=True in LangChain chains to see console output
  2. Local logging: Implement structured JSON logging as shown in the playbook
  3. LLM provider dashboards: Check OpenAI, Anthropic, or other provider dashboards directly
  4. Traditional debugging: Use debuggers, print statements, and application logs
  5. Buffered traces: If you implemented buffering, review local trace files

What SLA does LangSmith provide?

LangSmith's SLA terms vary by plan. Enterprise customers typically receive 99.9% uptime guarantees with credits for violations, while free and Plus tiers have best-effort availability without SLA credits. Review your specific plan at langchain.com or contact LangChain sales for enterprise SLA options. Remember that even with SLAs, you should implement graceful degradation since credits don't prevent the operational impact of downtime.

Should I pay for LangSmith Plus/Enterprise for better reliability?

Paid tiers often receive priority during incidents and may have access to dedicated infrastructure with better reliability. However, the core API infrastructure is generally shared. The bigger value of paid tiers is:

  • Higher rate limits: Less likely to hit throttling during high-volume tracing
  • Longer data retention: 30+ days vs 14 days, critical for long-term analysis
  • Team features: Shared projects, better for production collaboration
  • Support SLA: Faster incident response when issues occur

For production systems, Plus ($39/month) or Enterprise is usually worth it for the data retention alone, regardless of uptime differences.

Is there a LangSmith downtime notification service?

Yes, several options exist:

  • Subscribe to official updates at status.langchain.com
  • Use API Status Check for automated alerts via email, Slack, or webhook
  • Implement custom monitoring as shown in the playbook above
  • Set up synthetic monitoring with tools like Datadog, Pingdom, or New Relic

Recommended setup: Combine official status page subscription (for detailed updates) with API Status Check automated monitoring (for immediate detection).

Related Observability Platforms

LangSmith is part of the broader LLM observability ecosystem; platforms such as Helicone and OpenLIT are worth monitoring alongside it.

Stay Ahead of LangSmith Outages

Don't lose visibility into your LLM applications when you need it most. Subscribe to real-time LangSmith alerts and get notified instantly when issues are detected—before your team notices missing traces.

API Status Check monitors LangSmith 24/7 with:

  • 60-second health checks of API and tracing endpoints
  • Instant alerts via email, Slack, Discord, or webhook
  • Historical uptime tracking and incident reports
  • Multi-platform monitoring for your entire LLM stack

Start monitoring LangSmith now →


Last updated: February 4, 2026. LangSmith status information is provided in real-time based on active monitoring. For official incident reports, always refer to status.langchain.com.
