---
layout: default
title: "Text Chunking"
parent: "Utility Function"
nav_order: 5
---

## Text Chunking

Below are sample implementations of commonly used text chunking approaches.

> Text Chunking is more of a micro-optimization compared to the Flow Design.
>
> It's recommended to start with the Naive Chunking and optimize later.
{: .best-practice }

---

## Example Python Code Samples

### 1. Naive (Fixed-Size) Chunking
Splits text into pieces of a fixed number of characters, ignoring sentence or semantic boundaries.

```python
def fixed_size_chunk(text, chunk_size=100):
    # chunk_size is measured in characters, not words
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i : i + chunk_size])
    return chunks
```

However, words and sentences are often cut awkwardly, losing coherence.

### 2. Sentence-Based Chunking
Groups a fixed number of consecutive sentences into each chunk.

```python
import nltk

# nltk.sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand
def sentence_based_chunk(text, max_sentences=2):
    sentences = nltk.sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunks.append(" ".join(sentences[i : i + max_sentences]))
    return chunks
```

However, this approach might not handle very long sentences or paragraphs well, since chunk size is bounded by sentence count rather than length.

### 3. Other Chunking

- **Paragraph-Based**: Split text by paragraphs (e.g., newlines). Large paragraphs can create big chunks.
- **Semantic**: Use embeddings or topic modeling to chunk by semantic boundaries.
- **Agentic**: Use an LLM to decide chunk boundaries based on context or meaning.
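
The paragraph-based approach listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `paragraph_based_chunk` and the optional `max_chars` parameter are hypothetical, and it assumes paragraphs are separated by blank lines. When `max_chars` is set, oversized paragraphs fall back to fixed-size slicing, which is one possible way to address the "big chunks" problem.

```python
import re

def paragraph_based_chunk(text, max_chars=None):
    # Assumption: paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if max_chars is None:
        return paragraphs

    chunks = []
    for p in paragraphs:
        # Hypothetical fallback: slice long paragraphs into max_chars-sized pieces
        for i in range(0, len(p), max_chars):
            chunks.append(p[i : i + max_chars])
    return chunks
```

Calling `paragraph_based_chunk(text)` returns one chunk per paragraph, while `paragraph_based_chunk(text, max_chars=500)` additionally caps each chunk at 500 characters.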