pocketflow/cookbook/pocketflow-tool-crawler
Alan ALves 557a14f695 feat: add new examples from pocketflow-academy 2025-03-19 10:31:04 -03:00

README.md

Web Crawler with Content Analysis

A web crawler tool built with PocketFlow that crawls websites and analyzes their content using an LLM.

Features

  • Crawls websites while respecting domain boundaries
  • Extracts text content and links from pages
  • Analyzes content using GPT-4 to generate:
    • Page summaries
    • Main topics/keywords
    • Content type classification
  • Processes pages in batches for efficiency
  • Generates a comprehensive analysis report
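
The first two features (staying inside one domain, extracting text and links) can be sketched as follows. This is a minimal illustration, not the code in tools/crawler.py: the project lists beautifulsoup4 as its HTML parser, while this sketch swaps in the standard-library html.parser so it runs without extra packages. The names LinkAndTextParser and extract_page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects href values from <a> tags and visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(base_url, html):
    """Return (text, links) where links are absolute and on base_url's host."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    base_host = urlparse(base_url).netloc
    links = []
    for href in parser.links:
        # Resolve relative links and drop the fragment so /a and /a#x are one page.
        absolute = urljoin(base_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        same_site = parsed.scheme in ("http", "https") and parsed.netloc == base_host
        if same_site and absolute not in links:
            links.append(absolute)
    return " ".join(parser.text_parts), links
```

Comparing netloc values treats example.com and www.example.com as different domains; whether the real crawler is that strict is not stated here.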

Installation

  1. Clone the repository
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Set your OpenAI API key:
    export OPENAI_API_KEY='your-api-key'
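
A missing key is easier to diagnose if it is checked before the crawl starts. The helper below is a hypothetical sketch (get_api_key is not a function from this repository); it only assumes the OPENAI_API_KEY variable named in the step above.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env  # env is injectable for testing
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running main.py")
    return key
```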
    

Usage

Run the crawler:

python main.py

You will be prompted to:

  1. Enter the website URL to crawl
  2. Specify the maximum number of pages to crawl (default: 10)
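
One way to handle the page-limit prompt with its default of 10 is sketched below. The function name parse_max_pages and the fallback behaviour for invalid input are assumptions for illustration; main.py may treat bad input differently.

```python
def parse_max_pages(raw, default=10):
    """Turn the user's answer into a positive page limit.

    Empty, non-numeric, or non-positive input falls back to the default.
    """
    raw = raw.strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

In an interactive script this could be called as parse_max_pages(input("Maximum pages to crawl (default: 10): ")).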

The tool will then:

  1. Crawl the specified website
  2. Extract and analyze content using GPT-4
  3. Generate a report with findings
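
The three steps above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the actual flow.py or nodes.py: fetch_page and analyze_batch are hypothetical callables supplied by the caller (standing in for the HTTP fetch and the GPT-4 call), and the batch size of 5 is arbitrary.

```python
from collections import deque


def crawl(start_url, max_pages, fetch_page):
    """Breadth-first crawl; fetch_page(url) must return (text, links)."""
    queue = deque([start_url])
    seen = {start_url}
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            text, links = fetch_page(url)
        except Exception:
            continue  # basic error handling: skip pages that fail to load
        pages.append({"url": url, "text": text})
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def analyze_pages(pages, analyze_batch, batch_size=5):
    """Send pages to analyze_batch in groups; it returns one result per page."""
    results = []
    for batch in batches(pages, batch_size):
        results.extend(analyze_batch(batch))
    return results


def build_report(results):
    """Format results (dicts with 'url' and 'summary') as a plain-text report."""
    lines = [f"Pages analyzed: {len(results)}"]
    for result in results:
        lines.append(f"- {result['url']}: {result['summary']}")
    return "\n".join(lines)
```

Passing the fetch and analysis functions in as arguments keeps the crawl logic testable without network access or an API key.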

Project Structure

pocketflow-tool-crawler/
├── tools/
│   ├── crawler.py     # Web crawling functionality
│   └── parser.py      # Content analysis using LLM
├── utils/
│   └── call_llm.py    # LLM API wrapper
├── nodes.py           # PocketFlow nodes
├── flow.py            # Flow configuration
├── main.py            # Main script
└── requirements.txt   # Dependencies

Limitations

  • Only crawls within the same domain
  • Text content only (no images/media)
  • Throughput is bounded by OpenAI API rate limits
  • Basic error handling
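
A common way to soften the rate-limit constraint is to retry failed calls with exponential backoff. The sketch below is a generic pattern, not something this project is stated to implement; call_with_backoff and its parameters are hypothetical, and in real use retry_on should be narrowed to the API client's rate-limit exception rather than Exception.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt and retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter is injectable so the delay schedule can be checked without actually waiting.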

Dependencies

  • pocketflow: Flow-based processing
  • requests: HTTP requests
  • beautifulsoup4: HTML parsing
  • openai: GPT-4 API access