Class: RelevantChunks::Chunker
- Inherits:
- Object (hierarchy: Object > RelevantChunks::Chunker)
- Defined in:
- lib/relevant_chunks/chunker.rb
Overview
Handles text chunking with smart boundary detection and configurable overlap.
The Chunker class splits text into chunks while trying to maintain natural boundaries like sentence endings and paragraphs. It also supports overlapping chunks to ensure context is maintained across chunk boundaries.
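The idea of boundary-aware, overlapping chunks can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the gem's implementation: `sketch_chunks`, `max_size`, and `overlap` are names invented for the example, and sizes are measured in characters for simplicity.

```ruby
# Hypothetical stand-in, NOT the RelevantChunks implementation.
# Splits text into windows of at most max_size characters, preferring to end
# each window at a sentence terminator, and restarts each new window
# `overlap` characters before the previous one ended.
def sketch_chunks(text, max_size:, overlap:)
  raise ArgumentError, "overlap must be smaller than max_size" if overlap >= max_size
  return [text] if text.length <= max_size

  chunks = []
  start = 0
  while start < text.length
    stop = [start + max_size, text.length].min
    if stop < text.length
      # Look for the last ". ", "! " or "? " inside the window.
      cut = text[start...stop].rindex(/[.!?]\s/)
      # Only use it if the chunk would still be longer than the overlap,
      # so that every iteration makes forward progress.
      stop = start + cut + 1 if cut && cut + 1 > overlap
    end
    chunks << text[start...stop].strip
    break if stop >= text.length

    start = stop - overlap # step back to create the overlap
  end
  chunks
end

sketch_chunks("One. Two. Three. Four. Five.", max_size: 12, overlap: 3).each do |chunk|
  puts chunk.inspect
end
```

Because the overlap here is counted in raw characters, a chunk may begin mid-word; a real implementation could snap the overlap start to a word boundary as well.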
Class Attribute Summary
- .configuration ⇒ Object
  Returns the value of attribute configuration.
Instance Attribute Summary
- #max_tokens ⇒ Integer (readonly)
  Maximum number of tokens per chunk.
- #overlap_size ⇒ Integer (readonly)
  Size of overlap between chunks.
Instance Method Summary
- #chunk_text(text) ⇒ Array<String>
  Split text into chunks with smart boundary detection.
- #initialize(max_tokens: 1000, overlap_size: 100) ⇒ Chunker (constructor)
  Initialize a new Chunker instance.
Constructor Details
#initialize(max_tokens: 1000, overlap_size: 100) ⇒ Chunker
Initialize a new Chunker instance.
    # File 'lib/relevant_chunks/chunker.rb', line 31

    def initialize(max_tokens: 1000, overlap_size: 100)
      @max_tokens = max_tokens
      @overlap_size = overlap_size
      @logger = Logger.new($stdout)
    end
Class Attribute Details
.configuration ⇒ Object
Returns the value of attribute configuration.
    # File 'lib/relevant_chunks/chunker.rb', line 17

    def configuration
      @configuration
    end
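A class-level reader like this is commonly produced by declaring an attribute on the singleton class. The sketch below shows that general Ruby idiom under stated assumptions: `SketchChunks`, the `Configuration` struct, and the existence of a writer are all hypothetical, since this page documents only the reader.

```ruby
# Hypothetical stand-in module; only the `configuration` reader is documented
# for the real class. The struct and the writer are assumptions.
module SketchChunks
  Configuration = Struct.new(:max_tokens, :overlap_size)

  class Chunker
    class << self
      # Defines Chunker.configuration and Chunker.configuration= on the class itself.
      attr_accessor :configuration
    end
  end
end

SketchChunks::Chunker.configuration = SketchChunks::Configuration.new(1000, 100)
puts SketchChunks::Chunker.configuration.max_tokens # prints 1000
```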
Instance Attribute Details
#max_tokens ⇒ Integer (readonly)
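Returns the maximum number of tokens per chunk. Note that the early-return check in #chunk_text compares this value against text.length, which is a character count.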
    # File 'lib/relevant_chunks/chunker.rb', line 24

    def max_tokens
      @max_tokens
    end
#overlap_size ⇒ Integer (readonly)
Returns the size of the overlap between chunks.
    # File 'lib/relevant_chunks/chunker.rb', line 21

    def overlap_size
      @overlap_size
    end
Instance Method Details
#chunk_text(text) ⇒ Array<String>
Split text into chunks with smart boundary detection.
    # File 'lib/relevant_chunks/chunker.rb', line 44

    def chunk_text(text)
      @logger.info "Starting chunk_text with text length: #{text.length}"
      return [text] if text.length <= max_tokens

      chunks = []
      current_position = 0

      while current_position < text.length
        chunk_end = find_chunk_boundary(text, current_position)
        add_chunk(text, current_position, chunk_end, chunks)

        # If we've reached the end, break
        break if chunk_end >= text.length - 1

        # Calculate next position with overlap
        next_position = calculate_next_position(current_position, chunk_end)
        break if should_stop_chunking?(next_position, current_position, text)

        current_position = next_position
        @logger.info "Moving to position: #{current_position}"
      end

      @logger.info "Final chunks: #{chunks.inspect}"
      chunks
    end
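The method relies on four helpers (find_chunk_boundary, add_chunk, calculate_next_position, should_stop_chunking?) whose bodies are not shown on this page. The sketch below is one plausible way they could behave, written as a hypothetical `SketchChunker` class: every helper body is a guess made for illustration, logging is omitted, and sizes are treated as character counts to match the `text.length` comparisons above.

```ruby
# Hypothetical stand-in: the loop mirrors the documented chunk_text, but the
# four private helper bodies are assumptions, not the gem's source.
class SketchChunker
  attr_reader :max_tokens, :overlap_size

  def initialize(max_tokens: 1000, overlap_size: 100)
    @max_tokens = max_tokens
    @overlap_size = overlap_size
  end

  def chunk_text(text)
    return [text] if text.length <= max_tokens

    chunks = []
    current_position = 0
    while current_position < text.length
      chunk_end = find_chunk_boundary(text, current_position)
      add_chunk(text, current_position, chunk_end, chunks)
      break if chunk_end >= text.length - 1

      next_position = calculate_next_position(current_position, chunk_end)
      break if should_stop_chunking?(next_position, current_position, text)

      current_position = next_position
    end
    chunks
  end

  private

  # Assumed: prefer a paragraph break, then a sentence end, then a hard cut.
  def find_chunk_boundary(text, start)
    hard_end = [start + max_tokens, text.length].min
    return hard_end if hard_end == text.length

    window = text[start...hard_end]
    cut = window.rindex("\n\n") || window.rindex(/[.!?]\s/)
    cut ? start + cut + 1 : hard_end
  end

  # Assumed: chunk_end is exclusive and blank chunks are skipped.
  def add_chunk(text, start, chunk_end, chunks)
    chunk = text[start...chunk_end].strip
    chunks << chunk unless chunk.empty?
  end

  # Assumed: step back by overlap_size, but always advance at least one character.
  def calculate_next_position(current_position, chunk_end)
    [chunk_end - overlap_size, current_position + 1].max
  end

  # Assumed: stop when no progress was made or the text is exhausted.
  def should_stop_chunking?(next_position, current_position, text)
    next_position <= current_position || next_position >= text.length
  end
end

chunker = SketchChunker.new(max_tokens: 20, overlap_size: 5)
chunker.chunk_text("Alpha beta. Gamma delta. Epsilon zeta.").each { |c| puts c.inspect }
```

The guard in calculate_next_position is a design choice of the sketch: without it, an overlap larger than the chunk just produced would move the position backwards and the loop would never finish.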