# Web Crawler with Content Analysis

A web crawler tool built with PocketFlow that crawls a website and analyzes each page's content using an LLM.

## Features

- Crawls websites while respecting domain boundaries
- Extracts text content and links from pages
- Analyzes content using GPT-4 to generate:
  - Page summaries
  - Main topics/keywords
  - Content type classification
- Processes pages in batches for efficiency
- Generates a comprehensive analysis report
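
As a rough illustration of the domain-boundary and batching features above, here is a minimal sketch using only the Python standard library. The names `same_domain_links` and `batched` are hypothetical and are not taken from `tools/crawler.py`; the sketch also assumes that "same domain" means an exact host match, which may differ from what the project actually does.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique links on the same host.

    Assumption: "same domain" is an exact match on the host part of the URL,
    so subdomains count as different sites.
    """
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute and drop any "#fragment" suffix.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parsed.netloc != base_host:
            continue  # skip links that leave the starting host
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


links = same_domain_links(
    "https://example.com/docs/",
    ["intro.html", "/about#team", "https://other.org/x", "mailto:a@b.c"],
)
batches = list(batched(links, 1))
```

Filtering before queuing keeps the crawl from wandering onto external sites, and batching lets several pages be handed to the analysis step together.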

## Installation

1. Clone the repository
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set your OpenAI API key:
   ```bash
   export OPENAI_API_KEY='your-api-key'
   ```

## Usage

Run the crawler:
```bash
python main.py
```

You will be prompted to:
1. Enter the URL of the website to crawl
2. Specify the maximum number of pages to crawl (default: 10)

The tool will then:
1. Crawl the specified website
2. Extract and analyze content using GPT-4
3. Generate a report with findings
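
The crawl step with a page limit can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `crawl` and `fetch_links` are hypothetical names, the traversal order used by the real crawler is not documented here, and the example uses an in-memory link map instead of real HTTP requests so that it runs offline.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=10):
    """Visit pages breadth-first from start_url, stopping after max_pages.

    fetch_links is a hypothetical callable that takes a URL and returns
    the list of URLs linked from that page.
    """
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, to avoid revisiting
    visited = []         # URLs in the order they were crawled
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Stand-in for a website: each page maps to the pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": [],
}

first_three = crawl("/", lambda url: site.get(url, []), max_pages=3)
everything = crawl("/", lambda url: site.get(url, []))
```

In a real run, `fetch_links` would download the page and extract its links, and each visited page would then be passed on for analysis.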

## Project Structure

```
pocketflow-tool-crawler/
├── tools/
│   ├── crawler.py     # Web crawling functionality
│   └── parser.py      # Content analysis using LLM
├── utils/
│   └── call_llm.py    # LLM API wrapper
├── nodes.py           # PocketFlow nodes
├── flow.py            # Flow configuration
├── main.py            # Main script
└── requirements.txt   # Dependencies
```

## Limitations

- Only crawls pages within the starting URL's domain
- Extracts text content only (no images or other media)
- Analysis throughput is bounded by OpenAI API rate limits
- Error handling is basic
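
One common way to soften the rate-limit and error-handling limitations is to retry failed calls with exponential backoff. The sketch below is not part of this project as described; `call_with_backoff` and its parameters are hypothetical, and it retries on any exception, whereas production code would usually retry only on rate-limit or transient network errors.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait after each attempt.

    The sleep parameter is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}
delays = []


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
```

A wrapper like this could sit around the LLM call so that a temporary rate-limit response delays one page's analysis instead of aborting the whole run.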

## Dependencies

- pocketflow: Flow-based processing
- requests: HTTP requests
- beautifulsoup4: HTML parsing
- openai: GPT-4 API access