LlmTranslate

AI-powered Markdown translator that preserves formatting while translating content using various AI providers.

Features

  • 🤖 AI-Powered Translation: Support for OpenAI, Anthropic, and Ollama
  • 📝 Markdown Format Preservation: Keeps code blocks, links, images, and formatting intact
  • 📄 Document Splitting: Intelligent splitting of large documents for optimal translation
  • 🔧 Flexible Configuration: YAML-based configuration with environment variable support
  • 📁 Batch Processing: Recursively processes entire directory structures
  • 🚀 CLI Interface: Easy-to-use command-line interface with Thor
  • 📊 Progress Tracking: Built-in logging and reporting
  • ⚡ Error Handling: Robust error handling with retry mechanisms
  • 🎯 Customizable: Custom prompts, file patterns, and output strategies

Installation

Add this line to your application's Gemfile:

gem 'llm_translate'

And then execute:

bundle install

Or install it yourself as:

gem install llm_translate

Dependencies

The gem requires the rubyllm gem for AI integration:

gem install rubyllm

Quick Start

  1. Initialize a configuration file:

    llm_translate init
    
  2. Set your API key:

    export LLM_TRANSLATE_API_KEY="your-api-key-here"
    
  3. Translate your markdown files:

    llm_translate translate --config ./llm_translate.yml
    

Configuration

The translator uses a YAML configuration file. Here's a minimal example:

# llm_translate.yml
ai:
  api_key: ${LLM_TRANSLATE_API_KEY}
  provider: "openai"
  model: "gpt-4"
  temperature: 0.3

translation:
  target_language: "zh-CN"
  preserve_formatting: true

  # Document Splitting Configuration
  enable_splitting: true
  max_chars: 20000
  every_chars: 20000

  default_prompt: |
    Please translate the following Markdown content to Chinese, keeping all formatting intact:
    - Preserve code blocks, links, images, and other Markdown syntax
    - Keep English technical terms and product names
    - Ensure natural and fluent translation

    Content:
    {content}

files:
  input_directory: "./docs"
  output_directory: "./docs-translated"
  filename_suffix: ".zh"

logging:
  level: "info"
  output: "console"
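
The files settings above control where translated output lands. A rough Ruby sketch of how input_directory, output_directory, and filename_suffix could combine (the output_path helper is hypothetical, shown only to illustrate the settings, not the gem's internals):

```ruby
# Illustrative only: maps an input path under input_directory to an
# output path under output_directory, inserting the configured suffix
# before the extension (e.g. guide.md -> guide.zh.md).
require "pathname"

def output_path(input, input_dir:, output_dir:, suffix:)
  rel = Pathname(input).relative_path_from(Pathname(input_dir))
  out = Pathname(output_dir) / rel
  out.sub_ext("#{suffix}#{out.extname}").to_s
end
```

With the example configuration, docs/guide.md would map to docs-translated/guide.zh.md, and nested paths keep their directory structure.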

AI Providers

OpenAI

ai:
  provider: "openai"
  api_key: ${OPENAI_API_KEY}
  model: "gpt-4"

Anthropic

ai:
  provider: "anthropic"
  api_key: ${ANTHROPIC_API_KEY}
  model: "claude-3-sonnet-20240229"

Ollama (Local)

ai:
  provider: "ollama"
  model: "llama2"
  # Set OLLAMA_HOST environment variable if not using default

Document Splitting

For large documents that exceed token limits, the translator includes an intelligent splitting feature that breaks them into manageable chunks before translation.

How It Works

  1. Automatic Detection: When a document exceeds the configured max_chars threshold, splitting is automatically triggered
  2. Smart Splitting: Documents are split at natural Markdown boundaries (headers, code blocks, lists, etc.)
  3. Individual Translation: Each chunk is translated separately with proper context
  4. Seamless Merging: Translated chunks are automatically merged back into a complete document
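
The steps above can be sketched in a few lines of Ruby. This is a simplified illustration of header-boundary splitting, not the gem's actual implementation (which also considers code blocks and lists); the method name is hypothetical:

```ruby
# Minimal sketch of Markdown-aware splitting: accumulate lines into
# chunks, starting a new chunk at a header once the current chunk has
# reached the target size, so no split lands mid-section.
def split_markdown(text, every_chars: 20_000)
  chunks = []
  current = +""
  text.each_line do |line|
    if line.start_with?("#") && current.length >= every_chars
      chunks << current
      current = +""
    end
    current << line
  end
  chunks << current unless current.empty?
  chunks
end
```

Because chunks are cut only at line boundaries, concatenating the translated chunks in order reproduces the document structure exactly.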

Configuration

translation:
  # Enable document splitting
  enable_splitting: true

  # Trigger splitting when document exceeds this character count
  max_chars: 20000

  # Target size for each chunk
  every_chars: 20000

  # Number of chunks to translate concurrently
  concurrent_chunks: 3
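
The concurrent_chunks setting can be pictured as a small worker pool. A minimal sketch of the idea (the translate_chunks helper is an assumption for illustration, not the gem's API):

```ruby
# Illustrative worker pool: translates chunks with up to `concurrency`
# threads while preserving the original chunk order in the result.
def translate_chunks(chunks, concurrency: 3, &translate)
  queue = Queue.new
  chunks.each_with_index { |chunk, i| queue << [chunk, i] }
  results = Array.new(chunks.size)
  threads = Array.new([concurrency, chunks.size].min) do
    Thread.new do
      loop do
        chunk, i = begin
          queue.pop(true)   # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break             # queue drained, worker exits
        end
        results[i] = translate.call(chunk)
      end
    end
  end
  threads.each(&:join)
  results
end
```

Indexing results by the chunk's original position is what makes the later merge step order-safe even when chunks finish out of order.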

Benefits

  • Large Document Support: Handle documents of any size without token limit issues
  • Better Translation Quality: Smaller chunks allow for more focused translation
  • Concurrent Processing: Translate multiple chunks simultaneously for faster processing
  • Format Preservation: Maintains Markdown structure across splits
  • Automatic Processing: No manual intervention required

Example

# Translate a large GitLab documentation file (65,000+ characters)
llm_translate translate --config ./config.yml --input ./large_doc.md --output ./large_doc.zh.md

# Output:
# [INFO] Document size (65277 chars) exceeds limit, splitting...
# [INFO] Document split into 4 chunks
# [INFO] Translating 4 chunks with 3 concurrent workers...
# [INFO] Translating chunk 1/4 (18500 chars)...
# [INFO] Translating chunk 2/4 (19200 chars)...
# [INFO] Translating chunk 3/4 (17800 chars)...
# [INFO] ✓ Completed chunk 1/4
# [INFO] ✓ Completed chunk 2/4
# [INFO] ✓ Completed chunk 3/4
# [INFO] Translating chunk 4/4 (9777 chars)...
# [INFO] ✓ Completed chunk 4/4
# [INFO] Merging translated chunks...

Usage

Basic Translation

Directory Mode (Default)

llm_translate translate --config ./llm_translate.yml

Single File Mode

To translate a single file, configure input_file and output_file in your configuration:

files:
  # Single file mode
  input_file: "./README.md"
  output_file: "./README.zh.md"

When both input_file and output_file are specified, the translator operates in single-file mode and ignores directory-related settings.
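
That decision amounts to a simple presence check on the two keys; sketched here with a hypothetical helper name:

```ruby
# Illustrative only: single-file mode wins when both keys are present,
# otherwise the translator falls back to directory mode.
def translation_mode(files_config)
  if files_config["input_file"] && files_config["output_file"]
    :single_file
  else
    :directory
  end
end
```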

Command Line Options

llm_translate translate [OPTIONS]

Options:
  -c, --config PATH      Configuration file path (default: ./llm_translate.yml)
  -i, --input PATH       Input directory (overrides config)
  -o, --output PATH      Output directory (overrides config)
  -p, --prompt TEXT      Custom translation prompt (overrides config)
  -v, --verbose          Enable verbose output
  -d, --dry-run          Perform a dry run without actual translation

Other Commands:
  llm_translate init        Initialize a new configuration file
  llm_translate version     Show version information

Configuration File Structure

# AI Configuration
ai:
  api_key: ${LLM_TRANSLATE_API_KEY}
  provider: "openai"  # openai, anthropic, ollama
  model: "gpt-4"
  temperature: 0.3
  max_tokens: 4000
  retry_attempts: 3
  retry_delay: 2
  timeout: 60

# Translation Settings
translation:
  target_language: "zh-CN"
  source_language: "auto"
  default_prompt: "Your custom prompt with {content} placeholder"
  preserve_formatting: true
  translate_code_comments: false

  # Document Splitting Settings
  enable_splitting: true      # Enable document splitting for large files
  max_chars: 20000           # Trigger splitting when document exceeds this size
  every_chars: 20000         # Target size for each chunk
  concurrent_chunks: 3       # Number of chunks to translate concurrently

# File Processing
files:
  input_directory: "./docs"
  output_directory: "./docs-translated"
  filename_strategy: "suffix"  # suffix, replace, directory
  filename_suffix: ".zh"
  include_patterns:
    - "**/*.md"
    - "**/*.markdown"
  exclude_patterns:
    - "**/node_modules/**"
    - "**/.*"
  preserve_directory_structure: true
  overwrite_policy: "ask"  # ask, overwrite, skip, backup
  backup_directory: "./backups"

# Logging
logging:
  level: "info"  # debug, info, warn, error
  output: "console"  # console, file, both
  file_path: "./logs/translator.log"
  verbose_translation: false
  error_log_path: "./logs/errors.log"

# Error Handling
error_handling:
  on_error: "log_and_continue"  # stop, log_and_continue, skip_file
  max_consecutive_errors: 5
  retry_on_failure: 2
  generate_error_report: true
  error_report_path: "./logs/error_report.md"

# Performance
performance:
  concurrent_files: 3
  request_interval: 1  # seconds between requests
  max_memory_mb: 500

# Output
output:
  show_progress: true
  show_statistics: true
  generate_report: true
  report_path: "./reports/translation_report.md"
  format: "markdown"
  include_metadata: true
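
The retry_attempts and retry_delay settings in the ai section suggest a straightforward retry loop around each provider request. A hedged sketch of that pattern (the with_retries helper is hypothetical, not the gem's code):

```ruby
# Retry a block up to `attempts` times, sleeping `delay` seconds
# between failures and re-raising once attempts are exhausted.
# Mirrors ai.retry_attempts / ai.retry_delay; illustrative only.
def with_retries(attempts: 3, delay: 2)
  tries = 0
  begin
    yield
  rescue StandardError => e
    tries += 1
    raise e if tries >= attempts
    sleep delay
    retry
  end
end
```

A request wrapped this way fails fast only after the configured number of attempts, which pairs naturally with the on_error policies in the error_handling section.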

Examples

Translate Documentation

# Translate all markdown files in ./docs to Chinese
llm_translate translate --input ./docs --output ./docs-zh

# Use custom prompt
llm_translate translate --prompt "翻译以下内容为中文，保持技术术语不变: {content}"

# Dry run to see what would be translated
llm_translate translate --dry-run --verbose

Batch Translation

# Translate multiple language versions
for lang in zh-CN ja-JP ko-KR; do
  llm_translate translate --config "./configs/llm_translate-${lang}.yml"
done

Development

After checking out the repo, run:

bundle install

To run tests:

bundle exec rspec

To run linting:

bundle exec rubocop

To install this gem onto your local machine:

bundle exec rake install

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/llm_translate/llm_translate.

License

The gem is available as open source under the terms of the MIT License.

Changelog

v0.2.0

  • NEW: Document Splitting feature for large files
  • NEW: Intelligent Markdown-aware splitting at natural boundaries
  • NEW: Automatic chunk translation and merging
  • IMPROVED: Better handling of large documents (65k+ characters)
  • IMPROVED: Enhanced configuration options for document processing
  • IMPROVED: Optimized performance settings for split document workflows

v0.1.0

  • Initial release
  • Support for OpenAI, Anthropic, and Ollama providers
  • Markdown format preservation
  • Configurable translation prompts
  • Batch file processing
  • Comprehensive error handling and logging