Agentic Website Manager
A modular AI-powered system for automatically managing research website content using specialized agents and LLM orchestration.
Overview
This system replaces the monolithic website_content_manager.py with a modular, agent-based architecture that provides the same functionality with better organization, maintainability, and extensibility.
Architecture
The system consists of four specialized agents, coordinated by an orchestrator:
🕷️ WebAgent
- Purpose: Web searching and paper downloading
- Tools: arXiv API search, PDF downloading, rate limiting
- Output: List of papers with metadata and local file paths
🔍 ParseAgent
- Purpose: PDF processing and figure extraction
- Tools: PDF text extraction, figure quality assessment, image processing
- Output: Extracted figures, text content, and metadata
📝 ContentAgent
- Purpose: Paper classification and content generation
- Tools: LLM-based classification, summary generation, HTML creation
- Output: Research categories, summaries, and formatted content
🔍 CriticAgent
- Purpose: Validation and quality assurance
- Tools: Classification validation, content quality checks, consistency analysis
- Output: Validation reports and recommendations
🎯 Orchestrator
- Purpose: Coordinates all agents in complex workflows
- Features: Pipeline management, error handling, progress tracking
Key Features
✅ Modular Design
- Each agent has a single responsibility
- Easy to modify, extend, or replace individual components
- Clear interfaces between agents
✅ Configuration Management
- All prompts, settings, and configurations in config.py
- Easy to customize for different users or use cases
- No hardcoded values throughout the system
✅ LangChain Integration
- Uses LangChain for agent orchestration
- ReAct agents with tool access
- Structured agent communication
✅ Comprehensive Validation
- Automatic quality checks on all generated content
- Classification accuracy validation
- Data consistency verification
- Detailed reporting
✅ Error Handling & Logging
- Robust error handling at every stage
- Detailed logging and progress tracking
- Graceful degradation when components fail
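The error-handling behaviour listed above could look roughly like the following. This is a minimal sketch, not the package's actual orchestrator code: `run_stages`, the stage names, and the `outputs`/`skipped` keys are hypothetical, chosen only to illustrate how a failing optional stage can be logged and skipped while a failing required stage aborts the run. The `success` and `error` keys mirror the result dictionary used in the Python API example.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger("pipeline_sketch")

# (name, callable, required): a hypothetical stage description for this sketch
Stage = Tuple[str, Callable[[Dict[str, Any]], Any], bool]


def run_stages(stages: List[Stage]) -> Dict[str, Any]:
    """Run stages in order; each stage receives the outputs collected so far."""
    outputs: Dict[str, Any] = {}
    skipped: List[str] = []
    for name, func, required in stages:
        try:
            outputs[name] = func(outputs)
            logger.info("stage %s finished", name)
        except Exception as exc:
            logger.exception("stage %s failed", name)
            if required:
                # A required stage failed: stop and report the error.
                return {
                    "success": False,
                    "error": f"{name}: {exc}",
                    "outputs": outputs,
                    "skipped": skipped,
                }
            # An optional stage failed: record it and carry on.
            skipped.append(name)
    return {"success": True, "outputs": outputs, "skipped": skipped}


def _failing_figures(outputs: Dict[str, Any]) -> Any:
    raise RuntimeError("no figures found")


demo = run_stages([
    ("search", lambda out: ["paper-a", "paper-b"], True),
    ("figures", _failing_figures, False),  # optional, so the run continues
    ("summaries", lambda out: [p.upper() for p in out["search"]], True),
])
print(demo["success"], demo["skipped"])
```

Marking each stage as required or optional keeps the decision about what counts as a fatal failure in one place instead of spreading it across the agents.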
✅ Portability
- Self-contained package structure
- Clear dependency management
- Easy to deploy to different environments
Installation
- Install dependencies:
cd scripts/agentic_website_manager
pip install -r requirements.txt
- Set up API keys:
export GEMINI_API_KEY="your-gemini-api-key"
- Verify installation:
python orchestrator.py --help
Usage
Command Line Interface
# Full website update (publications + research + portfolio)
python orchestrator.py --full-update
# Research and portfolio only (preserve existing publications)
python orchestrator.py --research-only
# Skip validation checks (faster)
python orchestrator.py --full-update --no-validation
Python API
from agentic_website_manager import WebsiteOrchestrator
# Initialize orchestrator
orchestrator = WebsiteOrchestrator()
# Run full pipeline
results = orchestrator.run_full_pipeline(validate=True)
# Research-only update
results = orchestrator.run_research_only_update(validate=True)
# Check results
if results['success']:
    print(f"Processed {results['final_results']['papers_processed']} papers")
    print(f"Generated {results['final_results']['categories_created']} categories")
else:
    print(f"Pipeline failed: {results['error']}")
Individual Agent Usage
from agentic_website_manager import WebAgent, ParseAgent, ContentAgent, CriticAgent
# Use individual agents
web_agent = WebAgent()
papers = web_agent.execute("search_and_download")
parse_agent = ParseAgent()
figures = parse_agent.execute(papers['papers'], "extract_figures")
content_agent = ContentAgent()
content = content_agent.execute(papers['papers'], figures['figures_extracted'])
critic_agent = CriticAgent()
validation = critic_agent.execute({
    'papers': papers['papers'],
    'categories': content['categories'],
    'figures': figures['figures_extracted']
})
Configuration
Customizing Prompts
Edit config.py to modify LLM prompts:
config.prompts['paper_classification'] = """
Your custom classification prompt here...
{paper_list}
"""
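The `{paper_list}` placeholder suggests the template is filled in before it is sent to the LLM. The sketch below shows one way that step might work, assuming plain `str.format` substitution; the actual mechanism in config.py is not shown in this README. The `prompts` dictionary here is a stand-in, and `build_classification_prompt` and the `title` key are hypothetical names used for illustration.

```python
from typing import Dict, List

# Stand-in for config.prompts; the real template lives in config.py.
prompts = {
    "paper_classification": (
        "Classify each paper into one research category.\n"
        "{paper_list}\n"
    )
}


def build_classification_prompt(papers: List[Dict[str, str]]) -> str:
    """Render a numbered list of paper titles into the template."""
    paper_list = "\n".join(
        f"{i}. {paper['title']}" for i, paper in enumerate(papers, start=1)
    )
    return prompts["paper_classification"].format(paper_list=paper_list)


prompt = build_classification_prompt(
    [{"title": "First Paper"}, {"title": "Second Paper"}]
)
print(prompt)
```

If substitution does use `str.format`, any literal braces in a custom prompt (for example a JSON sample) need to be doubled as `{{` and `}}`, otherwise formatting raises an error.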
Search Configuration
Modify search terms and parameters:
config.search_config['search_terms'] = [
    'au:"Your Name"',
    'au:"Name, Your"'
]
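The `au:` prefix is the arXiv API's author field. As a rough illustration of how such terms might end up in a request, the sketch below joins them with `OR` and encodes them into a query URL for the public arXiv API endpoint. How WebAgent actually combines the terms is an assumption here, and `build_arxiv_query_url` and its `max_results` default are hypothetical.

```python
from typing import List
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_query_url(search_terms: List[str], max_results: int = 50) -> str:
    """Join the terms with OR and encode them as an arXiv API query URL."""
    params = {
        "search_query": " OR ".join(search_terms),
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes the quotes and colons in the author terms
    return f"{ARXIV_API_URL}?{urlencode(params)}"


url = build_arxiv_query_url(['au:"Your Name"', 'au:"Name, Your"'])
print(url)
```

Listing both name orders, as the configuration above does, helps catch papers where the author field is stored as "Last, First".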
Classification Categories
Customize research categories:
config.classification_config['categories']['new-category'] = {
    'name': 'New Research Area',
    'description': 'Description of the research area',
    'keywords': ['keyword1', 'keyword2']
}
File Structure
agentic_website_manager/
├── __init__.py # Package initialization
├── config.py # All configuration and prompts
├── base_agent.py # Base agent class
├── web_agent.py # Web search and download agent
├── parse_agent.py # PDF parsing and figure extraction
├── content_agent.py # Content classification and generation
├── critic_agent.py # Validation and quality assurance
├── orchestrator.py # Main coordination and CLI
├── requirements.txt # Dependencies
└── README.md # This file
Extending the System
Adding New Agents
- Inherit from BaseAgent
- Define tools using the @tool decorator
- Implement the execute() method
- Add the agent to the orchestrator workflow
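A minimal sketch of the subclassing steps follows. The `BaseAgent` defined here is a stand-in written for this example, because the real class in base_agent.py is not shown in this README and its constructor and `execute()` signature may differ. `KeywordAgent` is hypothetical, and the `@tool` step is left out so the example runs without LangChain installed.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseAgent(ABC):
    """Stand-in for the package's BaseAgent; the real interface may differ."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def execute(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        """Run the agent's task and return a result dictionary."""


class KeywordAgent(BaseAgent):
    """Hypothetical agent that tags papers whose titles contain a keyword."""

    def __init__(self, keywords: List[str]) -> None:
        super().__init__(name="keyword_agent")
        self.keywords = [k.lower() for k in keywords]

    def execute(self, papers: List[Dict[str, str]]) -> Dict[str, Any]:
        tagged = []
        for paper in papers:
            title = paper["title"].lower()
            tags = [k for k in self.keywords if k in title]
            tagged.append({**paper, "tags": tags})
        return {"success": True, "papers": tagged}


agent = KeywordAgent(["graph", "neural"])
result = agent.execute([{"title": "Graph Neural Networks"}, {"title": "Optics"}])
print(result["papers"])
```

Returning a dictionary with a `success` flag keeps the new agent's output in the same shape as the results shown in the usage examples.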
Adding New Tools
from langchain.tools import tool
@tool
def your_custom_tool(input_param: str) -> str:
    """Description of what your tool does."""
    # Tool implementation goes here; this placeholder just echoes the input
    result = f"Processed: {input_param}"
    return result
Custom Workflows
# Create custom orchestration
from agentic_website_manager import WebAgent
def custom_workflow():
    web_agent = WebAgent()
    results = web_agent.execute("search_and_download")
    # ... custom logic
    return results
Validation and Quality Checks
The system includes comprehensive validation:
- Classification Validation: Ensures all papers are properly categorized
- Content Quality: Checks HTML/Markdown formatting and completeness
- Data Consistency: Verifies cross-references between papers, categories, and figures
- Error Reporting: Detailed reports with recommendations
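To make the data consistency check concrete, here is a minimal sketch of what cross-reference verification between papers, categories, and figures might involve. It is not CriticAgent's actual logic: `check_consistency` and the `id`, `paper_ids`, `paper_id`, and `path` keys are hypothetical, and the sample identifiers are placeholders.

```python
from typing import Any, Dict, List


def check_consistency(
    papers: List[Dict[str, Any]],
    categories: Dict[str, Dict[str, Any]],
    figures: List[Dict[str, Any]],
) -> List[str]:
    """Return human-readable problems; an empty list means consistent."""
    issues: List[str] = []
    paper_ids = {paper["id"] for paper in papers}

    # Every paper referenced by a category must exist.
    categorized = set()
    for cat_name, category in categories.items():
        for pid in category["paper_ids"]:
            if pid not in paper_ids:
                issues.append(
                    f"category '{cat_name}' references unknown paper '{pid}'"
                )
            categorized.add(pid)

    # Every known paper should appear in at least one category.
    for pid in sorted(paper_ids - categorized):
        issues.append(f"paper '{pid}' is not assigned to any category")

    # Every figure must belong to a known paper.
    for figure in figures:
        if figure["paper_id"] not in paper_ids:
            issues.append(
                f"figure '{figure['path']}' references unknown paper "
                f"'{figure['paper_id']}'"
            )
    return issues


issues = check_consistency(
    papers=[{"id": "p1"}, {"id": "p2"}],
    categories={"ml": {"paper_ids": ["p1", "p3"]}},
    figures=[
        {"paper_id": "p2", "path": "fig1.png"},
        {"paper_id": "p9", "path": "fig2.png"},
    ],
)
for issue in issues:
    print(issue)
```

Collecting all issues in a list, instead of raising on the first one, matches the idea of a validation report that covers every problem found in a single run.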
Comparison with Original System
Feature | Original | Agentic System |
---|---|---|
Architecture | Monolithic | Modular agents |
Configuration | Hardcoded | Centralized config |
Validation | Basic | Comprehensive |
Extensibility | Limited | Highly extensible |
Error Handling | Basic | Robust |
Testing | Difficult | Agent-level testing |
Maintainability | Low | High |
Troubleshooting
Common Issues
- API Key Error: Ensure GEMINI_API_KEY is set
- Missing Dependencies: Run pip install -r requirements.txt
- Permission Errors: Check file system permissions
- Network Issues: Check internet connection for arXiv access
Debug Mode
Enable verbose logging:
import logging
logging.basicConfig(level=logging.DEBUG)
Validation Failures
Run with validation to identify issues:
python orchestrator.py --full-update # validation enabled by default
License
This system is designed to be portable and can be easily adapted for different users and use cases. The modular architecture ensures that individual components can be modified or replaced without affecting the entire system.
Future Enhancements
- Support for additional paper sources (PubMed, IEEE, etc.)
- Advanced figure analysis and captioning
- Multi-language support
- Integration with citation managers
- Real-time content updates
- Web dashboard for monitoring and control