72 lines
1.6 KiB
Markdown
72 lines
1.6 KiB
Markdown
# Web Crawler with Content Analysis
|
|
|
|
A web crawler tool built with PocketFlow that crawls websites and analyzes content using LLM.
|
|
|
|
## Features
|
|
|
|
- Crawls websites while respecting domain boundaries
|
|
- Extracts text content and links from pages
|
|
- Analyzes content using GPT-4 to generate:
|
|
- Page summaries
|
|
- Main topics/keywords
|
|
- Content type classification
|
|
- Processes pages in batches for efficiency
|
|
- Generates a comprehensive analysis report
|
|
|
|
## Installation
|
|
|
|
1. Clone the repository
|
|
2. Install dependencies:
|
|
```bash
|
|
pip install -r requirements.txt
|
|
```
|
|
3. Set your OpenAI API key:
|
|
```bash
|
|
export OPENAI_API_KEY='your-api-key'
|
|
```
|
|
|
|
## Usage
|
|
|
|
Run the crawler:
|
|
```bash
|
|
python main.py
|
|
```
|
|
|
|
You will be prompted to:
|
|
1. Enter the website URL to crawl
|
|
2. Specify maximum number of pages to crawl (default: 10)
|
|
|
|
The tool will then:
|
|
1. Crawl the specified website
|
|
2. Extract and analyze content using GPT-4
|
|
3. Generate a report with findings
|
|
|
|
## Project Structure
|
|
|
|
```
|
|
pocketflow-tool-crawler/
|
|
├── tools/
|
|
│ ├── crawler.py # Web crawling functionality
|
|
│ └── parser.py # Content analysis using LLM
|
|
├── utils/
|
|
│ └── call_llm.py # LLM API wrapper
|
|
├── nodes.py # PocketFlow nodes
|
|
├── flow.py # Flow configuration
|
|
├── main.py # Main script
|
|
└── requirements.txt # Dependencies
|
|
```
|
|
|
|
## Limitations
|
|
|
|
- Only crawls within the same domain
|
|
- Text content only (no images/media)
|
|
- Rate limited by OpenAI API
|
|
- Basic error handling
|
|
|
|
## Dependencies
|
|
|
|
- pocketflow: Flow-based processing
|
|
- requests: HTTP requests
|
|
- beautifulsoup4: HTML parsing
|
|
- openai: GPT-4 API access
|