batch

2024-12-28 21:26:41 +00:00 · 2024-12-28 21:26:41 +00:00 · 215ae8e089
parent c3e8c75968
commit 215ae8e089
1 changed file with 22 additions and 116 deletions
--- a/docs/batch.md
+++ b/docs/batch.md
@@ -7,30 +7,29 @@ nav_order: 4

 # Batch

-**Batch** functionality in Mini LLM Flow makes it easier to handle a **list** of items in one Node or **rerun** a Flow multiple times. This is particularly useful for:
-
-- **Chunk-based processing** (e.g., summarizing large texts in parts).  
+**Batch** makes it easier to handle large inputs in one Node or **rerun** a Flow multiple times. Useful for:
+- **Chunk-based** processing (e.g., large texts in parts).  
 - **Multi-file** processing.  
-- **Iterating** over lists of parameters (e.g., user queries, documents, or URLs).
+- **Iterating** over lists of params (e.g., user queries, documents, URLs).

 ## 1. BatchNode

-A **BatchNode** extends `Node` but changes how `prep()` and `exec()` behave:
+A **BatchNode** extends `Node` but changes `prep()` and `exec()`:

-- **`prep(shared)`**: Should return an **iterable** (list, generator, etc.) of items. 
-- **`exec(shared, item)`**: Is called **once per item** in that iterable. 
-- **`post(shared, prep_res, exec_res_list)`**: Receives a **list** of results from all the `exec()` calls. You can combine or store them.
+- **`prep(shared)`**: returns an **iterable** (list, generator, etc.).
+- **`exec(shared, item)`**: called **once** per item in that iterable.
+- **`post(shared, prep_res, exec_res_list)`**: receives a **list** of all `exec()` results, can combine or store them, and returns an **Action**.

-### Example: Map Summaries
+
+### Example: Summarize a Large File

 ```python
 class MapSummaries(BatchNode):
    def prep(self, shared):
-        # Suppose we have a big file; we want to chunk it
+        # Suppose we have a big file; chunk it
        content = shared["data"].get("large_text.txt", "")
        chunk_size = 10000
        chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
-        # Return this list. The exec() method will be called once per chunk
        return chunks

    def exec(self, shared, chunk):
@@ -39,140 +38,47 @@ class MapSummaries(BatchNode):
        return summary

    def post(self, shared, prep_res, exec_res_list):
-        # prep_res is the list of chunks
-        # exec_res_list is the list of summaries from each chunk
        combined = "\n".join(exec_res_list)
        shared["summary"]["large_text.txt"] = combined
        return "default"
-```

-**Flow** usage:
-```python
 map_summaries = MapSummaries()
 flow = Flow(start=map_summaries)
 flow.run(shared)
 ```

-- After `prep()` returns multiple chunks, `exec()` is called for each chunk. 
-- The aggregated `exec_res_list` is passed to `post()`, where you can do final processing.
-
-### Key Differences from a Normal Node
-
-1. **`exec()`** is called once per item returned by `prep()`.  
-2. The final **output** of `exec()` calls is collected into a list and given to `post()`.  
-3. `post()` still returns an **action**—just like a regular Node.
-
 ---

 ## 2. BatchFlow

-A **BatchFlow** runs a **Flow** multiple times, each time with a different set of `params`. You can think of it as a loop that replays the Flow for each parameter set.
+A **BatchFlow** runs a **Flow** multiple times, each with different `params`. Think of it as a loop that replays the Flow for each param set.
+

 ### Example: Summarize Many Files

 ```python
 class SummarizeAllFiles(BatchFlow):
    def prep(self, shared):
-        # Return a list of parameter dicts (one per file)
        filenames = list(shared["data"].keys())  # e.g., ["file1.txt", "file2.txt", ...]
-        params_list = [{"filename": fn} for fn in filenames]
-        return params_list
+        return [{"filename": fn} for fn in filenames]

-    # No custom exec() or post(), so we rely on BatchFlow’s default
-```
-
-Then define a **Flow** that handles **one** file. Suppose we have `Flow(start=summarize_file)`.  
-
-```python
-# Example "per-file" flow (just one node):
-summarize_file = SummarizeFile()
-
-# Or possibly something more elaborate:
+# Suppose we have a per-file flow:
 # load_file >> summarize >> reduce etc.
+summarize_file = SummarizeFile(start=load_file)

-# Then we wrap it into a BatchFlow:
 summarize_all_files = SummarizeAllFiles(start=summarize_file)
-
-# Running it:
 summarize_all_files.run(shared)
 ```

 **Under the hood**:
-1. `prep(shared)` in `SummarizeAllFiles` returns a list of param dicts, e.g., `[{filename: "file1.txt"}, {filename: "file2.txt"}, ...]`.
-2. The BatchFlow **iterates** over these param dicts. For each one, it sets the params on the sub-Flow (in this case, `summarize_file` or a bigger flow) and calls `flow.run(shared)`.
-3. Once done, you have run the same Flow for each item.
+1. `prep(shared)` returns a list of param dicts (e.g., `[{"filename": "file1.txt"}, {"filename": "file2.txt"}, ...]`).
+2. The BatchFlow **iterates** over them, sets params on the sub-Flow, and calls `flow.run(shared)` each time.
+3. The Flow is run repeatedly, once per item.

 ### Nested or Multi-level Batches

-You could nest a BatchFlow inside another BatchFlow. For instance, if you wanted to:
+You can nest a BatchFlow in another BatchFlow. For example:
+- Outer batch: iterate over directories.
+- Inner batch: summarize each file in a directory.

-- Outer batch: iterate over directories (Flow that enumerates files in each directory).  
-- Inner batch: summarize each file in that directory.
-
-This can be done by making the **outer** BatchFlow’s `exec()` return a list of files, which triggers the **inner** BatchFlow each time. For most simpler use cases, a single BatchFlow is enough.
-
---
-
-## 3. Best Practices & Tips
-
-1. **Plan your Input**: For a BatchNode, design `prep()` to yield only the minimal necessary data (e.g., text chunks).  
-2. **Aggregating Results**: `post()` is the place to combine partial results from `exec_res_list`.  
-3. **Large Batches**: If you have **thousands of items**, consider processing in chunks (e.g., yield 100 items at a time) or using an **Async** approach for concurrency.  
-4. **Hierarchy**:
-   - **BatchNode** is good for a single-step repeated operation (e.g., chunk-based summarization).
-   - **BatchFlow** is good if you have a **multi-step** process you want to repeat for a list of parameters.
-
---
-
-## 4. Putting It All Together
-
-```python
-# We'll combine the ideas:
-class MapSummaries(BatchNode):
-    def prep(self, shared):
-        content = shared["data"].get("bigfile.txt", "")
-        chunk_size = 10000
-        return [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
-
-    def exec(self, shared, chunk):
-        return call_llm(f"Summarize chunk: {chunk}")
-
-    def post(self, shared, prep_res, exec_res_list):
-        combined = "\n".join(exec_res_list)
-        shared["summary"]["bigfile.txt"] = combined
-        return "default"
-
-map_summaries_node = MapSummaries()
-map_flow = Flow(start=map_summaries_node)
-
-# If we want to do the above for multiple big files in shared['data']:
-class SummarizeAllFiles(BatchFlow):
-    def prep(self, shared):
-        # Generate param dicts, each specifying a file
-        return [{"filename": fn} for fn in shared["data"]]
-
-# But to handle chunking inside the Flow, we might do:
-# 1) A node that sets a param "filename" in a shared place 
-# 2) Or combine logic differently.
-
-# For now, let's just show usage:
-summarize_all = SummarizeAllFiles(start=map_flow)
-summarize_all.run(shared)
-```
-
-In this snippet:
-
-- `MapSummaries` is a `BatchNode` that chunk-summarizes one file.  
-- `map_flow` is a `Flow` with that single BatchNode.  
-- `SummarizeAllFiles` is a `BatchFlow` that runs `map_flow` for every file in `shared["data"]`.
-
-**Result**: Each file is chunked by `MapSummaries`, and you get a summary for each.
-
---
-
-## Summary
-
-- **BatchNode**: Single-step repetition. `prep()` returns a list, `exec()` is called once per item, `post()` aggregates results.  
-- **BatchFlow**: Repeatedly runs a Flow with different params. Great for multi-step or nested processes.  
-
-By mixing these two patterns, you can easily handle **large data** or **multiple inputs** in a streamlined, scalable way. 
+The **outer** BatchFlow’s `exec()` can return a list of directories; the **inner** BatchFlow then processes each file in those dirs.
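
Note on the nested-batch paragraph: the snippet below is a minimal, self-contained sketch of how an outer batch over directories could drive an inner batch over files. It is not Mini LLM Flow's implementation. `SketchNode`, `SketchBatchFlow`, `SketchSummarizeFile`, `FilesInDirectory`, and `AllDirectories` are hypothetical stand-ins, and two behaviors are assumptions rather than anything this diff confirms: that each level supplies its param dicts from `prep()`, and that params from the outer level are merged into the inner level's params.

```python
# Hypothetical stand-ins for illustration; NOT the Mini LLM Flow classes.
class SketchNode:
    """A single step that reads its current params and the shared store."""

    def __init__(self):
        self.params = {}

    def set_params(self, params):
        self.params = params

    def run(self, shared):
        raise NotImplementedError


class SketchBatchFlow(SketchNode):
    """Replays `start` once per param dict returned by prep()."""

    def __init__(self, start):
        super().__init__()
        self.start = start

    def prep(self, shared):
        return []

    def run(self, shared):
        for batch_params in self.prep(shared):
            # Assumption: params from outer levels are merged with this level's.
            self.start.set_params({**self.params, **batch_params})
            self.start.run(shared)


class SketchSummarizeFile(SketchNode):
    def run(self, shared):
        directory = self.params["directory"]
        filename = self.params["filename"]
        text = shared["data"][directory][filename]
        # Placeholder for an LLM call: keep the first 20 characters.
        shared["summary"][(directory, filename)] = text[:20]


class FilesInDirectory(SketchBatchFlow):
    """Inner batch: one run per file in the current directory."""

    def prep(self, shared):
        directory = self.params["directory"]
        return [{"filename": fn} for fn in shared["data"][directory]]


class AllDirectories(SketchBatchFlow):
    """Outer batch: one run of the inner batch per directory."""

    def prep(self, shared):
        return [{"directory": d} for d in shared["data"]]


shared = {
    "data": {
        "docs": {"a.txt": "alpha", "b.txt": "beta"},
        "notes": {"c.txt": "gamma"},
    },
    "summary": {},
}
AllDirectories(start=FilesInDirectory(start=SketchSummarizeFile())).run(shared)
```

After the run, `shared["summary"]` holds one entry per `(directory, filename)` pair. Merging params is what lets the innermost step see both `directory` and `filename`; a real framework may pass this context differently.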