text_chunker

Available in: Cloud, Self-Managed

Breaks down text-based message content into manageable chunks using a configurable strategy. This processor is ideal for creating vector embeddings of large text documents.

    # Common configuration fields, showing default values
    label: ""
    text_chunker:
      strategy: "" # No default (required)
      chunk_size: 512
      chunk_overlap: 100
      separators:
        - "\n\n"
        - "\n"
        - " "
        - ""
      length_measure: runes
      include_code_blocks: false
      keep_reference_links: false

    # All configuration fields, showing default values
    label: ""
    text_chunker:
      strategy: "" # No default (required)
      chunk_size: 512
      chunk_overlap: 100
      separators:
        - "\n\n"
        - "\n"
        - " "
        - ""
      length_measure: runes
      token_encoding: cl100k_base # No default (optional)
      allowed_special: []
      disallowed_special:
        - all
      include_code_blocks: false
      keep_reference_links: false

Fields

allowed_special[]

A list of special tokens to include in the output from this processor.

Type: array
Default: []

chunk_overlap

The number of characters duplicated in adjacent chunks of text.

Type: int
Default: 100

chunk_size

The maximum size of each chunk, using the selected length_measure.

Type: int
Default: 512

disallowed_special[]

A list of special tokens to exclude from the output of this processor.

Type: array
Default:

    - all

include_code_blocks

When set to true, this processor includes code blocks in the output.

Type: bool
Default: false

keep_reference_links

When set to true, this processor includes reference links in the output.

Type: bool
Default: false

length_measure

Choose a method to measure the length of a string.

Type: string
Default: runes

Option      Summary
graphemes   Use Unicode graphemes to determine the length of a string.
runes       Use the number of codepoints to determine the length of a string.
token       Use the number of tokens (using the token_encoding tokenizer) to determine the length of a string.
utf8        Determine the length of text using the number of UTF-8 bytes.

separators[]

A list of strings to use as separators between chunks when the recursive_character strategy option is specified. By default, the following separators are tried in turn until one is successful:

- Double newlines ("\n\n")
- Single newlines ("\n")
- Spaces (" ")

Type: array
Default:

    - "\n\n"
    - "\n"
    - " "
    - ""

strategy

Choose a strategy for breaking content down into chunks.

Type: string

Option               Summary
markdown             Split text by markdown headers.
recursive_character  Split text recursively by characters (defined in separators).
token                Split text by tokens.

token_encoding

The type of encoding to use for tokenization.

Type: string

    # Examples:
    token_encoding: cl100k_base
    token_encoding: r50k_base
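As a minimal sketch of how these fields fit together, the fragment below splits each message with the recursive_character strategy. The surrounding pipeline and processors keys and the chosen values (256 and 32) are illustrative assumptions, not recommendations taken from this page.

```yaml
# Illustrative sketch only: the values are assumptions, not tuned recommendations.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 256        # maximum chunk length, measured by length_measure
        chunk_overlap: 32      # amount of text duplicated in adjacent chunks
        separators:            # same entries and order as the documented default
          - "\n\n"
          - "\n"
          - " "
          - ""
        length_measure: runes  # the documented default, stated explicitly
```

To measure chunk length in tokens instead, the field descriptions suggest setting length_measure to token and choosing a token_encoding such as cl100k_base.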