Releases: jamwithai/arxiv-paper-curator
Week 1: The Infrastructure That Powers RAG Systems
What's included:
• Complete Docker Compose infrastructure setup
• FastAPI backend with health monitoring endpoints
• PostgreSQL database for metadata storage
• OpenSearch hybrid search engine with dashboards
• Apache Airflow for workflow orchestration
• Ollama integration for local LLMs
• Interactive Jupyter notebook tutorial (notebooks/week1/week1_setup.ipynb)
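For illustration, a minimal sketch of a health endpoint like the one above, assuming default local ports and a hypothetical /health route (the repository's actual checks may differ):
import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health() -> dict:
    # Report per-service status instead of failing the whole endpoint
    status = {"api": "up"}
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get("http://localhost:9200")  # OpenSearch default port
            status["opensearch"] = "up" if resp.status_code == 200 else "down"
    except httpx.HTTPError:
        status["opensearch"] = "down"
    return status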
Key learning outcomes:
• Set up production-grade RAG infrastructure
• Configure and orchestrate multiple services
• Implement health checks and monitoring
• Build async REST APIs with FastAPI
• Work with vector and text search in OpenSearch
Prerequisites:
• Docker Desktop with Docker Compose
• Python 3.12+
• UV Package Manager
• 8GB+ RAM, 20GB+ disk space
Getting started:
git clone --branch week1.0 <repository-url>
cd arxiv-paper-curator
uv sync
docker compose up --build -d
Week 6: Production-ready RAG: Monitoring & Caching
Production-ready RAG system with comprehensive observability and performance optimization:
✅ Langfuse Integration
- End-to-end RAG pipeline tracing and analytics
- Real-time performance monitoring dashboards
- Query pattern analysis and success rate tracking
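As a hedged sketch of what Langfuse tracing can look like (v2-style decorator import; the function is a placeholder, not the repository's pipeline):
from langfuse.decorators import observe

@observe()  # records this call as a trace in Langfuse (requires LANGFUSE_* env keys)
def answer_question(query: str) -> str:
    # Placeholder for the real pipeline: retrieve chunks, call the LLM, return the answer
    return "stubbed answer for: " + query

print(answer_question("What is retrieval-augmented generation?"))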
✅ Redis Caching Layer
- 150-400x performance improvement for repeated queries
- Intelligent cache key strategies with TTL management
- 60%+ cache hit rate eliminating redundant LLM calls
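For illustration, a minimal sketch of the cache-key-plus-TTL idea, assuming a local Redis on the default port; the key format and TTL here are illustrative only:
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cache_key(query: str, model: str) -> str:
    # Hash the normalized query + model so the same question reuses one entry
    raw = json.dumps({"q": query.strip().lower(), "model": model})
    return "rag:answer:" + hashlib.sha256(raw.encode()).hexdigest()

def get_or_compute(query: str, model: str, compute) -> str:
    key = cache_key(query, model)
    cached = r.get(key)
    if cached is not None:
        return cached.decode()       # cache hit: skip the LLM call entirely
    answer = compute(query)          # cache miss: run the full RAG pipeline
    r.setex(key, 3600, answer)       # expire after 1 hour (TTL)
    return answer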
Week 5: Complete RAG System with LLM Integration
Major Features:
- Add Ollama service integration with llama3.2 models
- Implement dual API design (standard + streaming endpoints)
- Create optimized prompt templates with minimal context
- Build Gradio interface for interactive RAG testing
- Add production configuration and error handling
- Include comprehensive documentation and examples
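For illustration, a minimal Gradio front end wired to an assumed /ask endpoint on the FastAPI service (route name and response shape are assumptions, not the repository's exact API):
import gradio as gr
import httpx

def ask(question: str) -> str:
    # Call the (assumed) FastAPI RAG endpoint and return the generated answer
    resp = httpx.post("http://localhost:8000/ask", json={"query": question}, timeout=120)
    resp.raise_for_status()
    return resp.json().get("answer", "")

demo = gr.Interface(fn=ask, inputs="text", outputs="text", title="ArXiv Paper Curator")

if __name__ == "__main__":
    demo.launch()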
Technical Improvements:
- Streaming RAG responses via Server-Sent Events
- Clean prompt engineering with 80% context reduction
- Automatic source deduplication and citation formatting
- Production-ready error handling and health checks
- Configurable model selection (llama3.2:1b, 3b, etc.)
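As a rough sketch of the Server-Sent Events pattern used for streaming responses (the token generator is stubbed; the real pipeline streams from Ollama instead):
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(query: str):
    # Stand-in for streamed LLM output; each chunk is sent as one SSE "data:" line
    for token in ["Retrieval", "-augmented ", "generation ", "..."]:
        yield f"data: {token}\n\n"
        await asyncio.sleep(0.05)

@app.get("/ask/stream")
async def ask_stream(query: str):
    return StreamingResponse(generate_tokens(query), media_type="text/event-stream")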
Week 4.0: Document Chunking and Hybrid Search
Major Features:
- Section-based document chunking with intelligent overlaps
- Jina AI embeddings for semantic similarity search
- Hybrid search with RRF (Reciprocal Rank Fusion)
- Unified OpenSearch index architecture
- Production FastAPI endpoints with error handling
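As a rough sketch of the overlap idea (the repository's section-aware chunker is more sophisticated; sizes below are illustrative):
def chunk_text(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap  # reuse the last `overlap` words for context continuity
    return chunks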
Technical Implementation:
- Real 1024-dimensional vector embeddings
- Automatic embedding generation in API endpoints
- Graceful fallback from hybrid to BM25 search
- Comprehensive chunking strategies for academic papers
- Enhanced search relevance with semantic understanding
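For reference, Reciprocal Rank Fusion itself is only a few lines; the sketch below fuses two ranked ID lists with the commonly used k=60 constant:
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with a vector-search ranking
print(rrf([["p1", "p2", "p3"], ["p3", "p1", "p4"]]))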
Week 3: The Search Foundation Every RAG System Needs
- OpenSearch BM25 keyword search implementation
- Production-grade search service with factory patterns
- Multi-field search with field boosting (title 3x, abstract 2x)
- Advanced query features: filtering, pagination, highlighting
- Search API endpoints with comprehensive validation
- Airflow pipeline integration for real-time indexing
- Complete end-to-end search functionality
Key Features:
- BM25 relevance scoring algorithm
- Category filtering and date sorting
- Sub-100ms search performance
- Comprehensive test coverage
- Clean architecture with dependency injection
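For illustration, a minimal boosted multi-field query with the opensearch-py client; the index and field names (arxiv-papers, title, abstract, full_text, categories) are assumptions, not necessarily the repository's mapping:
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

body = {
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": "graph neural networks",
                    "fields": ["title^3", "abstract^2", "full_text"],  # field boosting
                }
            },
            "filter": [{"term": {"categories": "cs.LG"}}],  # category filtering
        }
    },
    "highlight": {"fields": {"abstract": {}}},
    "from": 0,
    "size": 10,
}

results = client.search(index="arxiv-papers", body=body)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))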
Blog: https://jamwithai.substack.com/p/the-search-foundation-every-rag-system
Week 2: Building the ArXiv Paper Ingestion Pipeline
What's included:
• ArXiv API client with rate limiting and date filtering
• PDF downloader service with local caching
• Docling parser for structured PDF content extraction
• Metadata fetcher orchestration pipeline
• PostgreSQL integration for paper storage
• Airflow DAG for automated daily ingestion (weekdays only)
• Interactive Jupyter notebook tutorial (notebooks/week2/week2_arxiv_integration.ipynb)
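For illustration, a minimal async client against the public arXiv export API, with a simple delay standing in for the repository's rate limiter:
import asyncio
import httpx

ARXIV_API = "http://export.arxiv.org/api/query"

async def fetch_recent(search_query: str, max_results: int = 10) -> str:
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.get(
            ARXIV_API,
            params={
                "search_query": search_query,
                "start": 0,
                "max_results": max_results,
                "sortBy": "submittedDate",
                "sortOrder": "descending",
            },
        )
        resp.raise_for_status()
        await asyncio.sleep(3)  # stay under arXiv's requested request rate
        return resp.text        # Atom XML feed; parsed downstream

if __name__ == "__main__":
    print(asyncio.run(fetch_recent("cat:cs.CL"))[:500])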
Key learning outcomes:
• Build async API clients with rate limiting
• Implement PDF processing pipelines
• Work with document parsing and extraction
• Design robust data ingestion workflows
• Create Airflow DAGs for automation
• Handle errors gracefully in data pipelines
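As a hedged sketch of the weekdays-only schedule (Airflow 2.x-style; the dag_id and task logic are placeholders for the repository's actual pipeline):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_papers() -> None:
    # Placeholder: fetch metadata, download PDFs, parse with Docling, store in PostgreSQL
    print("ingesting today's arXiv papers")

with DAG(
    dag_id="arxiv_daily_ingestion",
    schedule="0 6 * * 1-5",          # 06:00, Monday through Friday only
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_papers", python_callable=ingest_papers)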
Prerequisites:
• Completed Week 1 infrastructure setup
• Docker Desktop with Docker Compose running
• Python 3.12+ with UV Package Manager
• All services healthy (FastAPI, PostgreSQL, OpenSearch, Airflow, Ollama)
Getting started:
git clone --branch week2.0 <repository-url>
cd arxiv-paper-curator
uv sync
docker compose down -v
docker compose up --build -d