# Web Crawler with Content Analysis

A web crawler built with PocketFlow that crawls websites and analyzes their content with an LLM.
## Features
- Crawls websites while respecting domain boundaries
- Extracts text content and links from pages
- Analyzes content using GPT-4 to generate:
  - Page summaries
  - Main topics/keywords
  - Content type classification
- Processes pages in batches for efficiency
- Generates a comprehensive analysis report
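
For reference, each analyzed page can be thought of as a small structured record combining the items above. The field names below are illustrative only, not the exact schema used by `nodes.py` or `tools/parser.py`:

```python
# Illustrative shape of one analyzed page (field names are assumptions,
# not the actual schema produced by tools/parser.py).
example_result = {
    "url": "https://example.com/about",
    "summary": "One-paragraph LLM-generated summary of the page.",
    "topics": ["company", "history", "team"],
    "content_type": "informational",  # e.g. article, product page, documentation
}
```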
## Installation

- Clone the repository
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set your OpenAI API key:

  ```bash
  export OPENAI_API_KEY='your-api-key'
  ```
## Usage

Run the crawler:

```bash
python main.py
```
You will be prompted to:
- Enter the website URL to crawl
- Specify the maximum number of pages to crawl (default: 10)
The tool will then:
- Crawl the specified website
- Extract and analyze content using GPT-4
- Generate a report with findings
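
Roughly speaking, `main.py` gathers these inputs and hands them to a PocketFlow flow through a shared store. The sketch below is an approximation; the actual function and key names in `flow.py` and `main.py` may differ:

```python
# Approximate wiring of main.py (create_flow and the shared-store keys are
# hypothetical names, not necessarily those used in this repository).
from flow import create_flow

def run(url: str, max_pages: int = 10) -> None:
    shared = {"base_url": url, "max_pages": max_pages}  # shared store read/written by the nodes
    crawl_flow = create_flow()
    crawl_flow.run(shared)  # crawl -> analyze -> report
    print(shared.get("report", "No report generated."))

if __name__ == "__main__":
    url = input("Website URL to crawl: ").strip()
    pages = input("Max pages to crawl [10]: ").strip()
    run(url, int(pages) if pages else 10)
```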
## Project Structure

```
pocketflow-tool-crawler/
├── tools/
│   ├── crawler.py        # Web crawling functionality
│   └── parser.py         # Content analysis using LLM
├── utils/
│   └── call_llm.py       # LLM API wrapper
├── nodes.py              # PocketFlow nodes
├── flow.py               # Flow configuration
├── main.py               # Main script
└── requirements.txt      # Dependencies
```
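
As a concrete example of the layout above, `utils/call_llm.py` is typically a thin wrapper around the OpenAI client. A minimal sketch, assuming the openai>=1.0 Python SDK and `OPENAI_API_KEY` set in the environment (the real file may differ):

```python
# utils/call_llm.py -- minimal sketch, not necessarily the exact implementation.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def call_llm(prompt: str, model: str = "gpt-4") -> str:
    """Send a single-turn prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```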
## Limitations
- Only crawls within the same domain
- Text content only (no images/media)
- Rate limited by the OpenAI API
- Basic error handling
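
The same-domain restriction above is usually enforced by comparing hostnames before a link is queued. A minimal sketch of such a check (not necessarily how `tools/crawler.py` implements it):

```python
from urllib.parse import urljoin, urlparse

def same_domain(base_url: str, link: str) -> bool:
    """Return True if `link` resolves to the same host as `base_url`."""
    absolute = urljoin(base_url, link)  # resolve relative links against the base
    return urlparse(absolute).netloc == urlparse(base_url).netloc
```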
## Dependencies
- pocketflow: Flow-based processing
- requests: HTTP requests
- beautifulsoup4: HTML parsing
- openai: GPT-4 API access