Alan ALves 557a14f695 feat: add new examples from pocketflow-academy		2025-03-19 10:31:04 -03:00

Web Crawler with Content Analysis

A web crawler tool built with PocketFlow that crawls a website and analyzes each page's content using an LLM.

Features

Crawls websites while respecting domain boundaries
Extracts text content and links from pages
Analyzes content using GPT-4 to generate:
- Page summaries
- Main topics/keywords
- Content type classification
Processes pages in batches for efficiency
Generates a comprehensive analysis report
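
The first two features (staying inside one domain, and pulling text and links out of a page) can be sketched as follows. This is a minimal illustration, not the code in `tools/crawler.py`: the names `PageParser` and `extract_page` are hypothetical, and the standard-library `html.parser` is swapped in for `beautifulsoup4` so the sketch runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects visible text and href targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(base_url, html):
    """Return (text, same_domain_links) for a page fetched from base_url."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc
    links = []
    for href in parser.links:
        # Resolve relative links and drop the #fragment part.
        absolute = urljoin(base_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        same_domain = parsed.netloc == base_host
        if parsed.scheme in ("http", "https") and same_domain and absolute not in links:
            links.append(absolute)
    return " ".join(parser.text_parts), links
```

Comparing `netloc` values treats `www.example.com` and `example.com` as different hosts; whether the actual crawler is that strict is not stated in this README.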

Installation

Clone the repository
Install dependencies:
```
pip install -r requirements.txt
```
Set your OpenAI API key:
```
export OPENAI_API_KEY='your-api-key'
```
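
A script that depends on this variable can fail early with a clear message when it is missing. The helper below is a hypothetical sketch (`get_api_key` is not a function from this repository); it only assumes the variable name `OPENAI_API_KEY` shown above.

```python
import os


def get_api_key(env=os.environ):
    """Return the OpenAI API key, or raise a clear error if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running main.py")
    return key
```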

Usage

Run the crawler:

```
python main.py
```

You will be prompted to:

Enter the website URL to crawl
Specify maximum number of pages to crawl (default: 10)
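
One way to handle the page-limit prompt, including the default of 10, is sketched below. The helper name `parse_max_pages` and the choice to fall back to the default on invalid input are assumptions, not behavior confirmed for `main.py`.

```python
DEFAULT_MAX_PAGES = 10  # default stated in the prompt description above


def parse_max_pages(raw, default=DEFAULT_MAX_PAGES):
    """Convert the user's answer into a positive page limit."""
    raw = raw.strip()
    if not raw:
        return default  # user pressed Enter
    try:
        value = int(raw)
    except ValueError:
        return default  # not a number (assumed fallback)
    return value if value > 0 else default
```

It could be called as `parse_max_pages(input("Maximum pages (default: 10): "))`.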

The tool will then:

Crawl the specified website
Extract and analyze content using GPT-4
Generate a report with findings
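
The three steps above can be sketched as a small pipeline. Everything here is illustrative: `crawl`, `analyze_in_batches`, `build_report`, the `fetch_page` and `analyze` callables, and the batch size of 5 are hypothetical names and values, and the real project wires these steps together as PocketFlow nodes rather than plain functions.

```python
from collections import deque


def crawl(start_url, max_pages, fetch_page):
    """Breadth-first crawl; fetch_page(url) must return (text, links)."""
    queue = deque([start_url])
    visited = set()
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch_page(url)
        pages.append({"url": url, "text": text})
        queue.extend(link for link in links if link not in visited)
    return pages


def analyze_in_batches(pages, analyze, batch_size=5):
    """Apply analyze(page) to the pages, one fixed-size batch at a time."""
    results = []
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        results.extend(analyze(page) for page in batch)
    return results


def build_report(results):
    """Render one line per analyzed page under a short header."""
    lines = [f"Pages analyzed: {len(results)}"]
    for result in results:
        lines.append(f"{result['url']}: {result['summary']}")
    return "\n".join(lines)
```

Passing `fetch_page` and `analyze` in as arguments keeps the loop testable without network access or an API key; in practice `analyze` would wrap the LLM call.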

Project Structure

pocketflow-tool-crawler/
├── tools/
│   ├── crawler.py     # Web crawling functionality
│   └── parser.py      # Content analysis using LLM
├── utils/
│   └── call_llm.py    # LLM API wrapper
├── nodes.py           # PocketFlow nodes
├── flow.py            # Flow configuration
├── main.py            # Main script
└── requirements.txt   # Dependencies

Limitations

Only crawls within the same domain
Processes text content only (images and other media are ignored)
Throughput is bounded by OpenAI API rate limits
Basic error handling
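
A common way to soften the rate-limit and error-handling limitations is to retry failed calls with exponential backoff. The sketch below is generic and hypothetical (`call_with_retry` is not part of this repository); it catches every `Exception` for brevity, whereas real code would likely narrow that to the client's rate-limit error.

```python
import time


def call_with_retry(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists only so the delay can be replaced in tests; with the defaults, the waits are 1, 2, and 4 seconds.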

Dependencies

pocketflow: Flow-based processing
requests: HTTP requests
beautifulsoup4: HTML parsing
openai: GPT-4 API access