text_chunker

Breaks text-based message content into manageable chunks using a configurable strategy. This processor is well suited to preparing large text documents for vector embedding.

# Common configuration fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  include_code_blocks: false
  keep_reference_links: false

# All configuration fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  token_encoding: cl100k_base # No default (optional)
  allowed_special: []
  disallowed_special:
    - all
  include_code_blocks: false
  keep_reference_links: false
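
As a rough sketch of where this processor might sit in a pipeline, the configuration below reads text files, chunks their contents, and prints the result. The `file` input, its path, and the `stdout` output are illustrative assumptions and not part of this reference. The `text_chunker` fields are the ones shown in the configuration above.

```yaml
# Illustrative pipeline; the input path is hypothetical.
input:
  file:
    paths: [ "./docs/*.txt" ]
    scanner:
      to_the_end: {} # Read each file as a single message

pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character # Required, has no default
        chunk_size: 512
        chunk_overlap: 100

output:
  stdout: {}
```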

Fields

allowed_special[]

A list of special tokens to include in the output from this processor.

Type: array

Default: []

chunk_overlap

The number of characters duplicated in adjacent chunks of text.

Type: int

Default: 100

chunk_size

The maximum size of each chunk, using the selected length_measure.

Type: int

Default: 512
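
To illustrate how chunk_size and chunk_overlap relate, here is a minimal sketch. The values are illustrative, not recommendations: with the default runes measure, each chunk holds at most 1000 code points, and adjacent chunks repeat roughly 200 characters so that context at a chunk boundary is not lost.

```yaml
# Illustrative values only
text_chunker:
  strategy: recursive_character
  chunk_size: 1000   # Upper bound per chunk, counted using length_measure
  chunk_overlap: 200 # Text shared between neighbouring chunks
  length_measure: runes
```

Keeping chunk_overlap well below chunk_size is the usual choice, because a large overlap means most of each chunk repeats its neighbour.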

disallowed_special[]

A list of special tokens to exclude from the output of this processor.

Type: array

Default:

- all

include_code_blocks

When set to true, this processor includes code blocks in the output.

Type: bool

Default: false

keep_reference_links

When set to true, this processor includes reference links in the output.

Type: bool

Default: false

length_measure

Choose a method to measure the length of a string.

Type: string

Default: runes

Option Summary

graphemes

Use the number of Unicode graphemes to determine the length of a string.

runes

Use the number of Unicode code points to determine the length of a string.

token

Use the number of tokens (using the token_encoding tokenizer) to determine the length of a string.

utf8

Use the number of UTF-8 bytes to determine the length of a string.
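
When chunks need to fit a model's token limit, measuring length in tokens can be more useful than counting characters. The sketch below assumes a 256-token budget, which is an illustrative value. It pairs the token measure with the token_encoding field, because that field selects the tokenizer used for counting.

```yaml
# Illustrative: size chunks by token count rather than code points
text_chunker:
  strategy: recursive_character
  chunk_size: 256              # Interpreted as tokens under this measure
  length_measure: token
  token_encoding: cl100k_base  # Tokenizer used to count tokens
```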

separators[]

A list of strings to use as separators between chunks when the recursive_character strategy option is specified.

By default, the following separators are tried in turn until one is successful:

  • Double newlines (`\n\n`)

  • Single newlines (`\n`)

  • Spaces (`" "`)

  • Empty string (`""`)

Type: array

Default:

- "\n\n"
- "\n"
- " "
- ""
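
As a sketch of a custom separator list, the example below adds a sentence-ending separator between the newline and space entries. The `". "` entry is an assumption chosen for illustration. The order matters, because separators are tried in turn.

```yaml
# Illustrative: prefer paragraph, then line, then sentence boundaries
text_chunker:
  strategy: recursive_character
  separators:
    - "\n\n" # Paragraphs
    - "\n"   # Lines
    - ". "   # Sentences (illustrative addition)
    - " "    # Words
    - ""     # Last entry, as in the default list
```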

strategy

Choose a strategy for breaking content down into chunks.

Type: string

Option Summary

markdown

Split text by Markdown headers.

recursive_character

Split text recursively by characters (defined in separators).

token

Split text by tokens.
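
For Markdown sources, a configuration along these lines may be a reasonable starting point. It assumes that include_code_blocks and keep_reference_links are most relevant to Markdown input, which this reference does not state explicitly.

```yaml
# Illustrative: chunk Markdown documents by header
text_chunker:
  strategy: markdown
  chunk_size: 512
  chunk_overlap: 100
  include_code_blocks: true  # Keep code blocks in the output
  keep_reference_links: true # Keep reference links in the output
```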

token_encoding

The type of encoding to use for tokenization.

Type: string

# Examples:
token_encoding: cl100k_base
token_encoding: r50k_base
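
The token-related fields can be combined as in the sketch below. The `<|endoftext|>` entry is an illustrative special token, and how allowed_special interacts with the default disallowed_special value is an assumption here and not something this reference spells out.

```yaml
# Illustrative: split by tokens and permit one special token
text_chunker:
  strategy: token
  chunk_size: 512
  token_encoding: cl100k_base
  allowed_special:
    - "<|endoftext|>" # Illustrative special token
  disallowed_special:
    - all             # Default value
```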