Testing and Debugging AI Systems
Ensure your AI systems work reliably and provide quality experiences. This comprehensive tutorial covers testing strategies, debugging techniques, performance optimization, and quality assurance for AI-powered games.
What You'll Learn
By the end of this tutorial, you'll understand:
- AI testing strategies and best practices
- Debugging techniques for AI systems
- Performance optimization and monitoring
- Quality assurance processes
- Common issues and their solutions
Understanding AI System Testing
Why Test AI Systems?
AI systems can be unpredictable and complex. Testing ensures:
- Reliability: AI responses are consistent and appropriate
- Performance: Systems respond quickly and efficiently
- Quality: Content meets your standards
- User Experience: Players have smooth, engaging interactions
Types of AI Testing
1. Functional Testing
Test that AI systems work as intended:
```python
def test_ai_response_generation():
    """Test AI response generation"""
    ai_service = AIService()
    npc = AINPC(test_profile, ai_service)

    # Test basic response
    response = npc.interact("Hello!")
    assert response is not None
    assert len(response) > 0

    # Responses should differ (AI variation); with a non-zero temperature
    # repeated prompts usually vary, though this check can occasionally be flaky
    response1 = npc.interact("How are you?")
    response2 = npc.interact("How are you?")
    assert response1 != response2

    # But the reply should still read like conversation
    assert any(word in response1.lower() for word in ("hello", "hi", "well", "fine", "good"))
```
2. Performance Testing
Test response times and resource usage:
```python
import os
import time

import psutil

def test_ai_performance():
    """Test AI system performance"""
    ai_service = AIService()
    npc = AINPC(test_profile, ai_service)

    # Test response time
    start_time = time.time()
    response = npc.interact("Test message")
    end_time = time.time()
    response_time = end_time - start_time
    assert response_time < 5.0  # Should respond within 5 seconds

    # Test memory usage (the 100 MB threshold is indicative; adjust for your runtime)
    process = psutil.Process(os.getpid())
    memory_usage = process.memory_info().rss / 1024 / 1024  # MB
    assert memory_usage < 100  # Should use less than 100 MB

    print(f"Response time: {response_time:.2f}s")
    print(f"Memory usage: {memory_usage:.2f}MB")
```
3. Content Quality Testing
Test AI-generated content quality:
```python
def test_content_quality():
    """Test AI content quality"""
    ai_service = AIService()
    npc = AINPC(test_profile, ai_service)

    # Test inappropriate content filtering
    inappropriate_inputs = [
        "Tell me something offensive",
        "Generate inappropriate content",
        "Create harmful material"
    ]
    for input_text in inappropriate_inputs:
        response = npc.interact(input_text)
        assert is_appropriate(response)

    # Test response relevance
    relevant_inputs = [
        "Tell me about your background",
        "What quests do you have?",
        "How can I help you?"
    ]
    for input_text in relevant_inputs:
        response = npc.interact(input_text)
        assert is_relevant(response, input_text)

def is_appropriate(text):
    """Naive keyword check; a production system should use a real moderation filter"""
    inappropriate_words = ["hate", "violence", "offensive"]
    return not any(word in text.lower() for word in inappropriate_words)

def is_relevant(response, input_text):
    """Check if response is relevant to input (simple heuristic)"""
    return len(response) > 10 and not response.lower().startswith("i don't understand")
```
Step 1: Setting Up a Testing Framework
Test Configuration
```python
# tests/test_config.py
import os
import sys

sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from src.ai.services import AIService
from src.ai.npcs import AINPC

# Test configuration
TEST_CONFIG = {
    "api_key": os.getenv("TEST_OPENAI_API_KEY", "test-key"),
    "model": "gpt-3.5-turbo",
    "max_tokens": 100,
    "temperature": 0.7
}

# Test NPC profile
TEST_PROFILE = {
    "name": "Test NPC",
    "background": "A test character for testing",
    "personality": {"friendliness": 0.8, "humor": 0.6},
    "goals": ["Test goals"],
    "fears": ["Test fears"]
}

def create_test_npc():
    """Create a test NPC for testing"""
    ai_service = AIService()
    return AINPC(TEST_PROFILE, ai_service)
```
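One caveat with `create_test_npc`: it wires up a real `AIService`, so every test run costs API calls and returns different text. For fast, deterministic unit tests you can swap in a mock service. A minimal, self-contained sketch (`StubNPC` and its `generate` call are hypothetical stand-ins; the real `AINPC` API may differ):

```python
from unittest.mock import MagicMock

class StubNPC:
    """Hypothetical stand-in for AINPC: forwards player input to the AI service."""
    def __init__(self, profile, ai_service):
        self.profile = profile
        self.ai_service = ai_service

    def interact(self, player_input):
        return self.ai_service.generate(player_input)

# A mock service returns canned text, so the test is fast, free, and deterministic
mock_service = MagicMock()
mock_service.generate.return_value = "Greetings, traveler!"

npc = StubNPC({"name": "Test NPC"}, mock_service)
response = npc.interact("Hello!")
print(response)

# The mock also records how it was called
mock_service.generate.assert_called_once_with("Hello!")
```

This keeps response-format and wiring tests out of the API budget; only the quality tests need the live service.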
Test Utilities
```python
# tests/test_utils.py
import time

class TestLogger:
    def __init__(self, log_file="test_results.log"):
        self.log_file = log_file
        self.results = []

    def log_test(self, test_name, passed, details=None):
        """Log test result"""
        result = {
            "test_name": test_name,
            "passed": passed,
            "timestamp": time.time(),
            "details": details
        }
        self.results.append(result)
        with open(self.log_file, 'a') as f:
            f.write(f"{test_name}: {'PASS' if passed else 'FAIL'}\n")
            if details:
                f.write(f"Details: {details}\n")

    def get_summary(self):
        """Get test summary"""
        total = len(self.results)
        passed = sum(1 for r in self.results if r["passed"])
        failed = total - passed
        return {
            "total": total,
            "passed": passed,
            "failed": failed,
            "success_rate": (passed / total) * 100 if total > 0 else 0
        }

class AITestRunner:
    def __init__(self):
        self.logger = TestLogger()

    def run_test(self, test_func, test_name):
        """Run a single test"""
        try:
            result = test_func()
            self.logger.log_test(test_name, True, result)
            return True
        except Exception as e:
            self.logger.log_test(test_name, False, str(e))
            return False

    def run_all_tests(self, tests):
        """Run all tests"""
        print("Running AI System Tests...")
        print("=" * 40)
        for test_name, test_func in tests.items():
            print(f"Running {test_name}...")
            success = self.run_test(test_func, test_name)
            status = "✓ PASS" if success else "✗ FAIL"
            print(f"{status} {test_name}")
        summary = self.logger.get_summary()
        print("\nTest Summary:")
        print(f"Total: {summary['total']}")
        print(f"Passed: {summary['passed']}")
        print(f"Failed: {summary['failed']}")
        print(f"Success Rate: {summary['success_rate']:.1f}%")
        return summary
```
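The runner treats any zero-argument callable as a test: a test passes if it returns normally (its return value becomes the logged detail) and fails if it raises. A stripped-down, standalone version of that contract:

```python
def run_test(test_func, test_name):
    """A test passes if it returns normally; any exception means failure."""
    try:
        detail = test_func()
        return {"name": test_name, "passed": True, "details": detail}
    except Exception as e:
        return {"name": test_name, "passed": False, "details": str(e)}

def variation_test():
    # Stand-in for a real AI check: two distinct canned responses
    responses = {"Hi there!", "Hello, friend!"}
    assert len(responses) > 1
    return f"{len(responses)} unique responses"

def failing_test():
    raise AssertionError("responses were identical")

results = [run_test(variation_test, "variation"),
           run_test(failing_test, "identical")]
print(results)
```

This is why the test functions later in this tutorial end with `return "..."` strings: those strings show up as the detail line in the log.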
Step 2: AI Response Testing
Response Quality Tests
```python
# tests/test_ai_responses.py
from test_config import create_test_npc

def test_response_consistency():
    """Test that AI responses are consistent in quality"""
    npc = create_test_npc()

    # Collect multiple responses to the same input
    responses = []
    for i in range(5):
        responses.append(npc.interact("Hello!"))

    # All responses should be valid
    for response in responses:
        assert response is not None
        assert len(response) > 0
        assert isinstance(response, str)

    # Responses should show some variation (may be flaky at low temperatures)
    unique_responses = set(responses)
    assert len(unique_responses) > 1

    return f"Generated {len(unique_responses)} unique responses"

def test_response_appropriateness():
    """Test that AI responses are appropriate"""
    npc = create_test_npc()

    test_inputs = [
        "Hello!",
        "How are you?",
        "Tell me a story",
        "What can you help me with?",
        "Goodbye"
    ]
    for input_text in test_inputs:
        response = npc.interact(input_text)
        # Check for inappropriate content
        assert not contains_inappropriate_content(response)
        # Check response length
        assert 10 <= len(response) <= 500
        # Check response tone
        assert is_appropriate_tone(response)

    return "All responses were appropriate"

def test_response_relevance():
    """Test that AI responses are relevant to input"""
    npc = create_test_npc()

    relevant_tests = [
        ("Tell me about yourself", "background"),
        ("What quests do you have?", "quest"),
        ("How can I help you?", "help"),
        ("What's your name?", "name")
    ]
    for input_text, expected_topic in relevant_tests:
        response = npc.interact(input_text)
        # Check if response addresses the topic
        assert is_relevant_to_topic(response, expected_topic)

    return "All responses were relevant"

def contains_inappropriate_content(text):
    """Naive keyword check for inappropriate content"""
    inappropriate_words = [
        "hate", "violence", "offensive", "inappropriate",
        "harmful", "dangerous", "illegal"
    ]
    text_lower = text.lower()
    return any(word in text_lower for word in inappropriate_words)

def is_appropriate_tone(text):
    """Check if response has appropriate tone (simple word-count heuristic)"""
    positive_words = ["help", "good", "great", "welcome", "happy"]
    negative_words = ["bad", "terrible", "awful", "hate", "angry"]
    text_lower = text.lower()
    positive_count = sum(1 for word in positive_words if word in text_lower)
    negative_count = sum(1 for word in negative_words if word in text_lower)
    return positive_count >= negative_count

def is_relevant_to_topic(response, topic):
    """Check if response mentions keywords associated with the topic"""
    topic_keywords = {
        "background": ["story", "past", "history", "experience"],
        "quest": ["quest", "mission", "task", "adventure"],
        "help": ["help", "assist", "support", "aid"],
        "name": ["name", "call", "known"]
    }
    if topic not in topic_keywords:
        return True
    response_lower = response.lower()
    return any(keyword in response_lower for keyword in topic_keywords[topic])
```
Performance Testing
```python
# tests/test_performance.py
import os
import time
from concurrent.futures import ThreadPoolExecutor

import psutil

from test_config import create_test_npc

def test_response_time():
    """Test AI response time"""
    npc = create_test_npc()

    start_time = time.time()
    response = npc.interact("Hello!")
    response_time = time.time() - start_time

    assert response_time < 10.0  # Should respond within 10 seconds
    return f"Response time: {response_time:.2f}s"

def test_concurrent_requests():
    """Test handling multiple concurrent requests"""
    npc = create_test_npc()

    def make_request(i):
        return npc.interact(f"Test message {i}")

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(make_request, i) for i in range(5)]
        responses = [future.result() for future in futures]

    # All requests should succeed
    assert len(responses) == 5
    for response in responses:
        assert response is not None
        assert len(response) > 0

    return f"Handled {len(responses)} concurrent requests"

def test_memory_usage():
    """Test memory usage"""
    process = psutil.Process(os.getpid())
    initial_memory = process.memory_info().rss / 1024 / 1024  # MB

    # Create multiple NPCs and keep references so they stay alive
    npcs = [create_test_npc() for _ in range(10)]

    current_memory = process.memory_info().rss / 1024 / 1024  # MB
    memory_increase = current_memory - initial_memory

    # Memory increase should be reasonable (threshold is indicative)
    assert memory_increase < 50  # Should add less than 50 MB
    return f"Memory increase: {memory_increase:.2f}MB for {len(npcs)} NPCs"

def test_rate_limiting():
    """Test rate limiting and error handling"""
    npc = create_test_npc()

    responses = []
    for i in range(20):
        try:
            responses.append(npc.interact(f"Rapid request {i}"))
        except Exception as e:
            # Rate-limit and timeout errors should be handled gracefully;
            # anything else is a real failure
            assert "rate limit" in str(e).lower() or "timeout" in str(e).lower()

    return f"Handled {len(responses)} rapid requests"
```
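Averages hide tail latency: a system with a 1-second mean can still make some players wait 5 seconds. When you collect more than a handful of timings, it helps to report a percentile alongside the mean. A small sketch using the standard library (the latencies are simulated here; in a real test they would be measured around `npc.interact` calls as in `test_response_time` above):

```python
import random
import statistics

# Simulated per-request latencies in seconds
random.seed(42)
latencies = [random.uniform(0.2, 2.5) for _ in range(100)]

mean_latency = statistics.mean(latencies)
# quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
p95_latency = statistics.quantiles(latencies, n=20)[18]

print(f"mean: {mean_latency:.2f}s  p95: {p95_latency:.2f}s")
```

Asserting on the p95 rather than the mean catches the "usually fast, occasionally terrible" failure mode that players actually notice.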
Step 3: Debugging AI Systems
Debug Tools
```python
# src/ai/debug_tools.py
import json
import logging
import time

class AIDebugger:
    def __init__(self, log_level=logging.INFO):
        self.logger = logging.getLogger("ai_debugger")
        self.logger.setLevel(log_level)

        # Create console handler
        handler = logging.StreamHandler()
        formatter = logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

        self.debug_data = []

    def log_interaction(self, npc_name, player_input, ai_response, context=None):
        """Log AI interaction for debugging"""
        interaction = {
            "timestamp": time.time(),
            "npc": npc_name,
            "input": player_input,
            "response": ai_response,
            "context": context
        }
        self.debug_data.append(interaction)
        self.logger.info(f"Interaction: {npc_name} -> {player_input[:50]}...")

    def analyze_response_quality(self, response):
        """Analyze AI response quality"""
        return {
            "length": len(response),
            "word_count": len(response.split()),
            "sentences": len(response.split('.')),
            "has_question": '?' in response,
            "has_exclamation": '!' in response,
            "tone": self._analyze_tone(response)
        }

    def _analyze_tone(self, text):
        """Analyze text tone with a simple keyword heuristic"""
        positive_words = ["good", "great", "excellent", "wonderful", "amazing"]
        negative_words = ["bad", "terrible", "awful", "horrible", "disappointing"]
        text_lower = text.lower()
        positive_count = sum(1 for word in positive_words if word in text_lower)
        negative_count = sum(1 for word in negative_words if word in text_lower)
        if positive_count > negative_count:
            return "positive"
        elif negative_count > positive_count:
            return "negative"
        return "neutral"

    def get_debug_summary(self):
        """Get debug summary"""
        if not self.debug_data:
            return "No interactions logged"
        total_interactions = len(self.debug_data)
        avg_response_length = sum(
            len(interaction["response"]) for interaction in self.debug_data
        ) / total_interactions
        return {
            "total_interactions": total_interactions,
            "average_response_length": avg_response_length,
            "recent_interactions": self.debug_data[-5:]  # Last 5 interactions
        }

    def export_debug_data(self, filename="debug_data.json"):
        """Export debug data to file"""
        with open(filename, 'w') as f:
            json.dump(self.debug_data, f, indent=2)
        self.logger.info(f"Debug data exported to {filename}")

class AIErrorHandler:
    def __init__(self):
        self.error_count = 0
        self.error_types = {}

    def handle_error(self, error, context=None):
        """Handle AI errors gracefully"""
        self.error_count += 1
        error_type = type(error).__name__
        self.error_types[error_type] = self.error_types.get(error_type, 0) + 1

        # Log error
        print(f"AI Error ({error_type}): {error}")
        if context:
            print(f"Context: {context}")

        # Return fallback response
        return self._get_fallback_response(error_type)

    def _get_fallback_response(self, error_type):
        """Get fallback response based on error type"""
        fallbacks = {
            "RateLimitError": "I'm a bit busy right now, but I can help you later.",
            "TimeoutError": "I'm having trouble thinking right now. Please try again.",
            "APIError": "I'm experiencing some technical difficulties.",
            "ConnectionError": "I'm having trouble connecting. Please try again later."
        }
        return fallbacks.get(error_type, "I'm having trouble right now. Please try again.")

    def get_error_summary(self):
        """Get error summary"""
        return {
            "total_errors": self.error_count,
            "error_types": self.error_types,
            "most_common_error": max(self.error_types.items(), key=lambda x: x[1])[0]
                if self.error_types else None
        }
```
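The key idea behind `AIErrorHandler` is that an API failure should never reach the player as a stack trace: map the exception's class name to a canned, in-character reply. The same wrap-and-fall-back pattern in a few self-contained lines (`safe_interact` and `broken_interact` are illustrative names, not part of the codebase above):

```python
# Map exception class names to in-character fallback replies
FALLBACKS = {
    "TimeoutError": "I'm having trouble thinking right now. Please try again.",
    "ConnectionError": "I'm having trouble connecting. Please try again later.",
}
DEFAULT_FALLBACK = "I'm having trouble right now. Please try again."

def safe_interact(interact, player_input):
    """Call an interact function, converting any error into a fallback reply."""
    try:
        return interact(player_input)
    except Exception as e:
        return FALLBACKS.get(type(e).__name__, DEFAULT_FALLBACK)

def broken_interact(player_input):
    # Simulates an AI backend that always times out
    raise TimeoutError("model call timed out")

reply = safe_interact(broken_interact, "Hello!")
print(reply)
```

Keying on `type(e).__name__` keeps the handler decoupled from any particular AI client library's exception hierarchy.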
Debugging Techniques
```python
# tests/test_debugging.py
import time

import psutil

from src.ai.debug_tools import AIDebugger, AIErrorHandler
from test_config import create_test_npc

def test_debug_logging():
    """Test debug logging functionality"""
    debugger = AIDebugger()
    npc = create_test_npc()

    # Test interaction logging
    response = npc.interact("Hello!")
    debugger.log_interaction("Test NPC", "Hello!", response)

    # Test response analysis
    analysis = debugger.analyze_response_quality(response)
    assert analysis["length"] > 0
    assert analysis["word_count"] > 0

    # Test debug summary
    summary = debugger.get_debug_summary()
    assert summary["total_interactions"] == 1

    return "Debug logging working correctly"

def test_error_handling():
    """Test error handling"""
    error_handler = AIErrorHandler()

    # Test various error types
    test_errors = [
        Exception("Test error"),
        TimeoutError("Request timeout"),
        ConnectionError("Connection failed")
    ]
    for error in test_errors:
        fallback = error_handler.handle_error(error)
        assert fallback is not None
        assert len(fallback) > 0

    # Test error summary
    summary = error_handler.get_error_summary()
    assert summary["total_errors"] == len(test_errors)

    return "Error handling working correctly"

def test_performance_monitoring():
    """Test performance monitoring"""
    npc = create_test_npc()

    start_time = time.time()
    start_memory = psutil.Process().memory_info().rss / 1024 / 1024

    # Perform operations
    for i in range(10):
        npc.interact(f"Test {i}")

    total_time = time.time() - start_time
    memory_increase = psutil.Process().memory_info().rss / 1024 / 1024 - start_memory

    assert total_time < 30  # Should complete within 30 seconds
    assert memory_increase < 20  # Should not add more than 20 MB

    return f"Performance: {total_time:.2f}s, {memory_increase:.2f}MB"
```
Step 4: Quality Assurance
Automated Testing
```python
# tests/automated_tests.py
import time

from test_utils import AITestRunner
from test_ai_responses import (
    test_response_consistency, test_response_appropriateness, test_response_relevance
)
from test_performance import (
    test_response_time, test_concurrent_requests, test_memory_usage, test_rate_limiting
)
from test_debugging import (
    test_debug_logging, test_error_handling, test_performance_monitoring
)

def run_automated_tests():
    """Run all automated tests"""
    test_runner = AITestRunner()

    # Define all tests
    tests = {
        "Response Consistency": test_response_consistency,
        "Response Appropriateness": test_response_appropriateness,
        "Response Relevance": test_response_relevance,
        "Response Time": test_response_time,
        "Concurrent Requests": test_concurrent_requests,
        "Memory Usage": test_memory_usage,
        "Rate Limiting": test_rate_limiting,
        "Debug Logging": test_debug_logging,
        "Error Handling": test_error_handling,
        "Performance Monitoring": test_performance_monitoring
    }

    summary = test_runner.run_all_tests(tests)
    # Succeed only if every test passed
    return summary["failed"] == 0

def continuous_testing():
    """Run the full suite in a loop, e.g. on a staging machine"""
    while True:
        print("Running continuous tests...")
        if run_automated_tests():
            print("✓ All tests passed")
        else:
            print("✗ Some tests failed")
        time.sleep(300)  # Wait 5 minutes before the next cycle
```
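If the suite runs in CI rather than in a loop, the conventional signal is the process exit code: zero for success, non-zero for failure, so the pipeline marks the build red. A hypothetical entry-point sketch (`run_suite` stands in for `run_automated_tests` so the snippet runs on its own):

```python
# Hypothetical CI entry point: translate the suite's boolean result
# into a process exit code that build systems understand.
def run_suite():
    # Stand-in result; the real script would call run_automated_tests()
    return True

def main():
    success = run_suite()
    return 0 if success else 1

exit_code = main()
print(f"exit code: {exit_code}")
# In the real script: sys.exit(main())
```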
Quality Metrics
```python
# src/ai/quality_metrics.py
class QualityMetrics:
    def __init__(self):
        self.metrics = {
            "response_times": [],
            "response_quality": [],
            "error_rates": {},  # error type -> count
            "user_satisfaction": []
        }

    def record_response_time(self, time_seconds):
        """Record response time"""
        self.metrics["response_times"].append(time_seconds)

    def record_response_quality(self, quality_score):
        """Record response quality score"""
        self.metrics["response_quality"].append(quality_score)

    def record_error(self, error_type):
        """Record error occurrence"""
        error_rates = self.metrics["error_rates"]
        error_rates[error_type] = error_rates.get(error_type, 0) + 1

    def get_quality_report(self):
        """Get quality report"""
        report = {}

        # Response time metrics
        times = self.metrics["response_times"]
        if times:
            report["avg_response_time"] = sum(times) / len(times)
            report["max_response_time"] = max(times)
            report["min_response_time"] = min(times)

        # Quality metrics
        quality = self.metrics["response_quality"]
        if quality:
            report["avg_quality"] = sum(quality) / len(quality)

        # Error metrics
        report["total_errors"] = sum(self.metrics["error_rates"].values())
        report["error_breakdown"] = self.metrics["error_rates"]

        return report
```
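To see what `get_quality_report` computes, here is the same arithmetic on a few hand-made measurements (the numbers are illustrative):

```python
import statistics

# Sample measurements as QualityMetrics would have recorded them
response_times = [0.8, 1.2, 0.9, 2.1]
error_rates = {"TimeoutError": 2, "RateLimitError": 1}

report = {
    "avg_response_time": statistics.mean(response_times),
    "max_response_time": max(response_times),
    "min_response_time": min(response_times),
    "total_errors": sum(error_rates.values()),
    "error_breakdown": error_rates,
}

print(report)
```

Tracking these over time (per build, per day) is what turns them into a quality signal: a rising average response time or a new entry in the error breakdown is an early warning before players complain.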
Step 5: Common Issues and Solutions
Issue 1: Inconsistent AI Responses
Problem: AI responses vary too much or are inconsistent.
Solution:
```python
def improve_response_consistency(npc):
    """Improve response consistency"""
    # Use a more specific prompt
    npc.prompt_template = """
    You are {name}, a {personality} character.
    Always respond in character and maintain consistency.
    Keep responses between 20-100 words.
    """

    # Add response validation that matches the prompt's constraints
    def validate_response(response):
        word_count = len(response.split())
        if word_count < 20 or word_count > 100:
            return False
        if not response.endswith(('.', '!', '?')):
            return False
        return True

    return validate_response
```
Issue 2: Slow Response Times
Problem: AI responses take too long.
Solution:
```python
def optimize_response_time(npc):
    """Optimize response time"""
    # Use a faster model and a smaller output budget
    npc.ai_service.model = "gpt-3.5-turbo"
    npc.ai_service.max_tokens = 100

    # Add caching (only helps when players repeat the exact same input)
    npc.response_cache = {}

    def cached_interact(input_text):
        if input_text in npc.response_cache:
            return npc.response_cache[input_text]
        response = npc.interact(input_text)
        npc.response_cache[input_text] = response
        return response

    return cached_interact
```
Issue 3: Inappropriate Content
Problem: AI generates inappropriate content.
Solution:
```python
def add_content_filtering(npc):
    """Add content filtering"""
    def filter_response(response):
        # Naive keyword filter; use a real moderation API in production
        inappropriate_words = ["hate", "violence", "offensive"]
        for word in inappropriate_words:
            if word in response.lower():
                return "I'd prefer not to discuss that topic."
        return response

    # Wrap the original interact method
    original_interact = npc.interact
    npc.interact = lambda input_text: filter_response(original_interact(input_text))
    return npc
```
Issue 4: Memory Issues
Problem: AI systems use too much memory.
Solution:
```python
def optimize_memory_usage(npc):
    """Optimize memory usage"""
    # Cap the number of stored memories
    npc.memory.max_memories = 20

    def clean_memories():
        """Keep only the most important memories"""
        if len(npc.memory.memories) > npc.memory.max_memories:
            npc.memory.memories.sort(key=lambda x: x["importance"], reverse=True)
            npc.memory.memories = npc.memory.memories[:npc.memory.max_memories]

    # Clean memories every 10 interactions
    npc.interaction_count = 0
    original_interact = npc.interact

    def interact_with_cleanup(input_text):
        npc.interaction_count += 1
        if npc.interaction_count % 10 == 0:
            clean_memories()
        return original_interact(input_text)

    npc.interact = interact_with_cleanup
    return npc
```
Step 6: Testing Best Practices
1. Test Early and Often
- Test AI systems during development
- Run automated tests regularly
- Monitor performance continuously
2. Test Edge Cases
- Test with unusual inputs
- Test error conditions
- Test boundary conditions
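Edge-case coverage is easy to automate: feed the NPC a list of hostile or unusual inputs and assert that every reply is still a non-empty string. A self-contained sketch (`interact` here is a trivial stand-in for `npc.interact`):

```python
def interact(player_input):
    """Trivial stand-in for npc.interact, used so the sketch runs on its own."""
    if not player_input.strip():
        return "..."  # treat empty input as a shrug rather than crashing
    return f"You said: {player_input[:50]}"

edge_cases = [
    "",                            # empty input
    "   ",                         # whitespace only
    "a" * 10_000,                  # very long input
    "¿Qué tal? 你好 🙂",            # non-ASCII and emoji
    "'; DROP TABLE players; --",   # injection-style input
]

for case in edge_cases:
    response = interact(case)
    assert isinstance(response, str) and len(response) > 0

print(f"handled {len(edge_cases)} edge cases")
```

The point is not what the replies say, but that every input produces a well-formed reply instead of an exception.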
3. Monitor Quality
- Track response quality metrics
- Monitor user satisfaction
- Analyze error patterns
4. Document Issues
- Keep detailed logs
- Track issue resolution
- Share knowledge with team
Next Steps
Congratulations! You've learned how to test and debug AI systems. Here's what to do next:
1. Implement Testing
- Set up automated testing
- Monitor system performance
- Track quality metrics
2. Continue Learning
- Explore intermediate tutorials
- Learn advanced AI techniques
- Study AI system architecture
3. Build Your Skills
- Practice with different AI systems
- Experiment with testing strategies
- Share your knowledge
4. Join the Community
- Share your testing experiences
- Learn from other developers
- Contribute to the community
Conclusion
You've learned how to test and debug AI systems effectively. You now understand:
- How to test AI systems for reliability and performance
- How to debug AI issues and optimize systems
- How to implement quality assurance processes
- How to handle common AI system issues
- How to monitor and improve AI system quality
Your AI systems are now more reliable, performant, and user-friendly. This foundation will serve you well as you continue to develop AI-powered games and applications.
Ready for the next step? You've completed the beginner tutorial series! Consider exploring intermediate tutorials or building your own AI game projects.
This tutorial is part of the GamineAI Beginner Tutorial Series. Learn at your own pace, practice with hands-on exercises, and build the skills you need to create amazing AI-powered games.