diff --git a/docs/index.md b/docs/index.md
index e0ac2e8..a3b0524 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -39,7 +39,9 @@ We model the LLM workflow as a **Graph + Shared Store**:
 - [Tool](./utility_function/tool.md)
 - [(Optional) Viz and Debug](./utility_function/viz.md)
 - [(Optional) Web Search](./utility_function/websearch.md)
-- Chunking
+- [(Optional) Chunking](./utility_function/chunking.md)
+- [(Optional) Embedding](./utility_function/embedding.md)
+- [(Optional) Vector](./utility_function/vector.md)
 
 > We do not provide built-in utility functions. Example implementations are provided as reference.
 {: .warning }
diff --git a/docs/utility_function/chunking.md b/docs/utility_function/chunking.md
new file mode 100644
index 0000000..9bd609b
--- /dev/null
+++ b/docs/utility_function/chunking.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: "Text Chunking"
+parent: "Utility Function"
+nav_order: 5
+---
+
+# Text Chunking
+
+We recommend some implementations of commonly used text chunking approaches.
+
+> Text chunking is more of a micro-optimization compared to the Flow Design.
+>
+> It's recommended to start with naive chunking and optimize later.
+{: .best-practice }
+
+---
+
+## Example Python Code
+
+### 1. Naive (Fixed-Size) Chunking
+Splits text into chunks of a fixed number of characters, ignoring sentence or semantic boundaries.
+
+```python
+def fixed_size_chunk(text, chunk_size=100):
+    chunks = []
+    for i in range(0, len(text), chunk_size):
+        chunks.append(text[i : i + chunk_size])
+    return chunks
+```
+
+However, sentences are often cut awkwardly, losing coherence.
+
+### 2. Sentence-Based Chunking
+
+```python
+import nltk
+
+# Requires the Punkt tokenizer: nltk.download('punkt')
+def sentence_based_chunk(text, max_sentences=2):
+    sentences = nltk.sent_tokenize(text)
+    chunks = []
+    for i in range(0, len(sentences), max_sentences):
+        chunks.append(" ".join(sentences[i : i + max_sentences]))
+    return chunks
+```
+
+However, this approach might not handle very long sentences or paragraphs well.
+
+### 3. Other Chunking
+
+- **Paragraph-Based**: Split text by paragraphs (e.g., newlines), as sketched below. Large paragraphs can create big chunks.
+- **Semantic**: Use embeddings or topic modeling to chunk by semantic boundaries.
+- **Agentic**: Use an LLM to decide chunk boundaries based on context or meaning.
\ No newline at end of file
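+
+For illustration, a minimal paragraph-based chunker might look like the sketch below. It assumes paragraphs are separated by blank lines and packs consecutive paragraphs into chunks up to a rough character budget (`max_chars` and the function name are illustrative choices, not a fixed convention):
+
+```python
+def paragraph_chunk(text, max_chars=500):
+    # Illustrative sketch: assumes paragraphs are separated by blank lines.
+    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
+    chunks, current = [], ""
+    for para in paragraphs:
+        # Start a new chunk once adding this paragraph would exceed the budget.
+        if current and len(current) + len(para) > max_chars:
+            chunks.append(current)
+            current = para
+        else:
+            current = f"{current}\n\n{para}" if current else para
+    if current:
+        chunks.append(current)
+    return chunks
+```
+
+Note that a single paragraph longer than `max_chars` still ends up as one oversized chunk, which is the caveat mentioned above.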
diff --git a/docs/utility_function/embedding.md b/docs/utility_function/embedding.md
new file mode 100644
index 0000000..90dcca9
--- /dev/null
+++ b/docs/utility_function/embedding.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: "Embedding"
+parent: "Utility Function"
+nav_order: 6
+---
+
+# Embedding
+
+Below you will find an overview table of various text embedding APIs, along with example Python code.
+
+> Embedding is more of a micro-optimization compared to the Flow Design.
+>
+> It's recommended to start with the most convenient one and optimize later.
+{: .best-practice }
+
+| **API** | **Free Tier** | **Pricing** | **Docs** |
+| --- | --- | --- | --- |
+| **OpenAI** | ~$5 credit | ~$0.0001/1K tokens | [OpenAI Embeddings](https://platform.openai.com/docs/api-reference/embeddings) |
+| **Azure OpenAI** | $200 credit | Same as OpenAI (~$0.0001/1K tokens) | [Azure OpenAI Embeddings](https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?tabs=portal) |
+| **Google Vertex AI** | $300 credit | ~$0.025 / million chars | [Vertex AI Embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) |
+| **AWS Bedrock** | No free tier, but AWS credits may apply | ~$0.00002/1K tokens (Titan V2) | [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/) |
+| **Cohere** | Limited free tier | ~$0.0001/1K tokens | [Cohere Embeddings](https://docs.cohere.com/docs/cohere-embed) |
+| **Hugging Face** | ~$0.10 free compute monthly | Pay per second of compute | [HF Inference API](https://huggingface.co/docs/api-inference) |
+| **Jina** | 1M tokens free | Pay per token after | [Jina Embeddings](https://jina.ai/embeddings/) |
+
+## Example Python Code
+
+### 1. OpenAI
+```python
+import openai
+
+# Uses the pre-1.0 `openai` SDK interface.
+openai.api_key = "YOUR_API_KEY"
+resp = openai.Embedding.create(model="text-embedding-ada-002", input="Hello world")
+vec = resp["data"][0]["embedding"]
+print(vec)
+```
+
+### 2. Azure OpenAI
+```python
+import openai
+
+# Pre-1.0 `openai` SDK style; `engine` is your Azure deployment name.
+openai.api_type = "azure"
+openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com"
+openai.api_version = "2023-03-15-preview"
+openai.api_key = "YOUR_AZURE_API_KEY"
+
+resp = openai.Embedding.create(engine="ada-embedding", input="Hello world")
+vec = resp["data"][0]["embedding"]
+print(vec)
+```
+
+### 3. Google Vertex AI
+```python
+from vertexai.preview.language_models import TextEmbeddingModel
+import vertexai
+
+vertexai.init(project="YOUR_GCP_PROJECT_ID", location="us-central1")
+model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
+
+emb = model.get_embeddings(["Hello world"])
+vec = emb[0].values
+print(vec)
+```
+
+### 4. AWS Bedrock
+```python
+import boto3, json
+
+client = boto3.client("bedrock-runtime", region_name="us-east-1")
+body = {"inputText": "Hello world"}
+resp = client.invoke_model(modelId="amazon.titan-embed-text-v2:0", contentType="application/json", body=json.dumps(body))
+resp_body = json.loads(resp["body"].read())
+vec = resp_body["embedding"]
+print(vec)
+```
+
+### 5. Cohere
+```python
+import cohere
+
+co = cohere.Client("YOUR_API_KEY")
+resp = co.embed(texts=["Hello world"])
+vec = resp.embeddings[0]
+print(vec)
+```
+
+### 6. Hugging Face
+```python
+import requests
+
+API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
+HEADERS = {"Authorization": "Bearer YOUR_HF_TOKEN"}
+
+res = requests.post(API_URL, headers=HEADERS, json={"inputs": "Hello world"})
+vec = res.json()[0]
+print(vec)
+```
+
+### 7. Jina
+```python
+import requests
+
+# Jina's embeddings endpoint follows an OpenAI-style request/response format.
+url = "https://api.jina.ai/v1/embeddings"
+headers = {"Authorization": "Bearer YOUR_JINA_TOKEN"}
+payload = {"input": ["Hello world"], "model": "jina-embeddings-v3"}
+res = requests.post(url, headers=headers, json=payload)
+vec = res.json()["data"][0]["embedding"]
+print(vec)
+```
+
diff --git a/docs/utility_function/vector.md b/docs/utility_function/vector.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/utility_function/websearch.md b/docs/utility_function/websearch.md
index 27d13fb..d13a13d 100644
--- a/docs/utility_function/websearch.md
+++ b/docs/utility_function/websearch.md
@@ -17,7 +17,7 @@ We recommend some implementations of commonly used web search tools.
 | **SerpApi** | 100 searches/month free | Start at $75/month for 5,000 searches| [Link](https://serpapi.com/) |
 | **RapidAPI** | Many options | Many options | [Link](https://rapidapi.com/search?term=search&sortBy=ByRelevance) |
 
-## Example Python Code Samples
+## Example Python Code
 
 ### 1. Google Custom Search JSON API
 ```python