Universal Website Content Manager

A general-purpose Python script that automatically manages academic website content for any researcher with publications on arXiv. It uses LLM-based content generation to write research summaries, extracts figures from papers, and maintains publication listings.

Features

Researcher-Agnostic: Works with any researcher by using configuration files
arXiv Integration: Automatically fetches publications from the arXiv API
LLM-Powered: Uses Google Gemini for intelligent content generation and paper classification
Figure Extraction: Automatically extracts scientific figures from PDF papers
Research Categorization: Dynamically groups papers into research themes
Portfolio Generation: Creates detailed research portfolio pages
Jekyll Compatible: Generates Jekyll-compatible markdown files

Setup

Install Dependencies:

pip install google-generativeai PyMuPDF Pillow numpy requests

Get Google Gemini API Key:
- Visit: https://makersuite.google.com/app/apikey
- Set environment variable: export GEMINI_API_KEY='your-api-key-here'

Create Configuration File: Copy the researcher_config.json template and customize it for your researcher
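
Before running the script, it can help to confirm that the API key is actually visible to Python. The following is a minimal sketch, not code from the script itself; the helper name resolve_api_key is hypothetical, and the default variable name matches the export command above.

```python
import os


def resolve_api_key(env_name="GEMINI_API_KEY", environ=None):
    """Return the API key stored in the named environment variable.

    Hypothetical helper for illustration; `environ` can be overridden
    with a plain dict, which makes the check easy to test.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_name} is not set; "
            "export it before running the script."
        )
    return key


if __name__ == "__main__":
    try:
        resolve_api_key()
        print("API key found.")
    except RuntimeError as err:
        print(err)
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the LLM client.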

Configuration

Create a JSON configuration file with the following structure:

{
  "researcher": {
    "name": "Full Name",
    "title": "Dr.",
    "search_terms": [
      "au:\"Full Name\"",
      "au:\"Last, First\"",
      "au:\"F Last\""
    ],
    "author_patterns": [
      "(Last.*First)",
      "(Last.*F[\\s\\.])"
    ]
  },
  "website": {
    "base_dir": ".",
    "publications_dir": "_publications",
    "portfolio_dir": "_portfolio",
    "research_page": "_pages/research.html",
    "figures_dir": "images/research/figures",
    "papers_dir": "temp_papers"
  },
  "research_intro": "Research description paragraph...",
  "llm": {
    "api_key_env": "GEMINI_API_KEY",
    "model": "gemini-2.5-flash"
  }
}
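
A configuration with this structure can be sanity-checked before a run. The sketch below is illustrative only: the function names and the choice of which fields count as required are assumptions, not the script's actual validation logic.

```python
import json

# Assumption: these fields are treated as required for illustration.
# The script itself may be more or less strict.
REQUIRED_FIELDS = {
    "researcher": ("name", "search_terms"),
    "website": ("base_dir",),
    "llm": ("api_key_env", "model"),
}


def validate_config(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for section, keys in REQUIRED_FIELDS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing field: {section}.{key}")
    if "research_intro" not in config:
        problems.append("missing field: research_intro")
    return problems


def load_config(path):
    """Load a JSON config file and raise ValueError if it looks incomplete."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    problems = validate_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    return config
```

Calling load_config("researcher_config.json") would then either return the parsed dictionary or report every missing field at once.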

Configuration Fields

researcher.name: Full name as it appears in publications
researcher.search_terms: List of arXiv search queries to find papers
researcher.author_patterns: Regex patterns to match author names (optional)
website.base_dir: Base directory for website files
research_intro: Introductory paragraph for research page
llm.api_key_env: Environment variable name for API key
llm.model: Gemini model to use
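
To see how researcher.search_terms could translate into an arXiv API request, here is a minimal sketch. It assumes the terms are combined with OR into a single query against the public endpoint http://export.arxiv.org/api/query; the script may instead issue one request per term. The function name build_query_url is hypothetical.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(search_terms, max_results=100):
    """Build an arXiv API URL that matches any of the given search terms.

    Assumption: terms are joined with the API's OR operator.
    """
    query = " OR ".join(search_terms)
    params = {"search_query": query, "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    terms = ['au:"Full Name"', 'au:"F Last"']
    print(build_query_url(terms, max_results=10))
```

Opening the printed URL in a browser is a quick way to check whether a set of search terms returns the expected papers.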

Usage

Basic Usage

# Full update with default config
python scripts/universal_website_content_manager.py --config researcher_config.json --full-update

# Research and portfolio only (preserve existing publications)
python scripts/universal_website_content_manager.py --config researcher_config.json --research-only

Specific Updates

# Update publications only
python scripts/universal_website_content_manager.py --config researcher_config.json --update-publications

# Update research page only
python scripts/universal_website_content_manager.py --config researcher_config.json --update-research

# Update portfolio pages only
python scripts/universal_website_content_manager.py --config researcher_config.json --update-portfolio

What It Creates

Publication Files (_publications/):
- Individual markdown files for each paper
- Jekyll front matter with metadata
- arXiv links and citations

Research Page (_pages/research.html):
- Dynamically categorized research overview
- Extracted figures from papers
- LLM-generated category summaries

Portfolio Pages (_portfolio/):
- Detailed research area descriptions
- Figure galleries
- Interactive modals for images

Figure Directory (images/research/figures/):
- High-quality scientific figures extracted from PDFs
- Automatically named and organized
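
As an illustration of what a generated publication file might contain, the sketch below renders Jekyll front matter for one paper. The field names (collection, paperurl, citation), the filename scheme, and the helper names are assumptions modeled on common academic Jekyll themes; the script's real output may differ.

```python
import re


def slugify(title):
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def publication_filename(title, date):
    """Hypothetical naming scheme: <YYYY-MM-DD>-<slug>.md"""
    return f"{date}-{slugify(title)}.md"


def render_publication(title, date, arxiv_id, authors):
    """Return markdown text consisting of a Jekyll front matter block."""
    safe_title = title.replace('"', '\\"')  # keep the YAML string valid
    year = date[:4]
    author_text = ", ".join(authors)
    citation = f"{author_text} ({year}). {safe_title}. arXiv:{arxiv_id}"
    lines = [
        "---",
        f'title: "{safe_title}"',
        "collection: publications",
        f"date: {date}",
        f"paperurl: https://arxiv.org/abs/{arxiv_id}",
        f'citation: "{citation}"',
        "---",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(publication_filename("An Example Paper", "2024-01-15"))
    print(render_publication("An Example Paper", "2024-01-15",
                             "0000.00000", ["F. Last"]))
```

The arXiv identifier in the usage example is a placeholder, not a real paper.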

Examples

See the included example configurations:

researcher_config.json - Template for Nesar Ramachandra
example_researcher_config.json - Example for a different researcher

Customization Tips

Search Terms: Include all name variations your researcher uses
Author Patterns: Use regex for complex name matching
Directory Structure: Adjust paths to match your Jekyll site structure
Research Introduction: Write a compelling overview paragraph
LLM Model: Choose between different Gemini models based on needs
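
Author patterns are easiest to tune by testing them against sample author strings before a full run. The sketch below assumes a paper is kept when any pattern matches any of its author names; the function name matches_author is hypothetical and the script's actual filtering may differ.

```python
import re


def matches_author(author_names, patterns):
    """Return True if any regex pattern matches any author name."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return any(rx.search(name) for name in author_names for rx in compiled)


if __name__ == "__main__":
    # Patterns written as they appear after JSON decoding.
    patterns = [r"(Last.*First)", r"(Last.*F[\s\.])"]
    for names in (["Last, First"], ["Last, F."], ["First Last"]):
        print(names, matches_author(names, patterns))
```

Patterns of the form Last.*First only match names written surname first, so a name stored as "First Last" needs its own pattern.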

Troubleshooting

No papers found: Check search terms and author patterns
LLM errors: Verify API key and network connection
PDF download fails: Some papers may have access restrictions
Figure extraction issues: Requires PyMuPDF and valid PDF files

Migration from Original Script

To migrate from the hardcoded website_content_manager.py:

Create a configuration file based on researcher_config.json
Update search terms and patterns for your specific researcher
Test with --research-only first to verify configuration
Run a full update with --full-update once the configuration works correctly

Nesar Ramachandra