# LlmTranslate

AI-powered Markdown translator that preserves formatting while translating content using various AI providers.
## Features

- **AI-Powered Translation**: Support for OpenAI, Anthropic, and Ollama
- **Markdown Format Preservation**: Keeps code blocks, links, images, and formatting intact
- **Document Splitting**: Intelligent splitting of large documents for optimal translation
- **Flexible Configuration**: YAML-based configuration with environment variable support
- **Batch Processing**: Recursively processes entire directory structures
- **CLI Interface**: Easy-to-use command-line interface built with Thor
- **Progress Tracking**: Built-in logging and reporting
- **Error Handling**: Robust error handling with retry mechanisms
- **Customizable**: Custom prompts, file patterns, and output strategies
## Installation

Add this line to your application's Gemfile:

```ruby
gem 'llm_translate'
```

And then execute:

```shell
bundle install
```

Or install it yourself as:

```shell
gem install llm_translate
```
## Dependencies

The gem requires the `rubyllm` gem for AI integration:

```shell
gem install rubyllm
```
## Quick Start

1. Initialize a configuration file:

   ```shell
   llm_translate init
   ```

2. Set your API key:

   ```shell
   export LLM_TRANSLATE_API_KEY="your-api-key-here"
   ```

3. Translate your Markdown files:

   ```shell
   llm_translate translate --config ./llm_translate.yml
   ```
## Configuration

The translator uses a YAML configuration file. Here's a minimal example:

```yaml
# llm_translate.yml
ai:
  api_key: ${LLM_TRANSLATE_API_KEY}
  provider: "openai"
  model: "gpt-4"
  temperature: 0.3

translation:
  target_language: "zh-CN"
  preserve_formatting: true
  # Document splitting configuration
  enable_splitting: true
  max_chars: 20000
  every_chars: 20000
  default_prompt: |
    Please translate the following Markdown content to Chinese, keeping all formatting intact:
    - Preserve code blocks, links, images, and other Markdown syntax
    - Keep English technical terms and product names
    - Ensure natural and fluent translation
    Content:
    {content}

files:
  input_directory: "./docs"
  output_directory: "./docs-translated"
  filename_suffix: ".zh"

logging:
  level: "info"
  output: "console"
```
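The `${...}` placeholders above are resolved from environment variables when the configuration is loaded. The following is a minimal sketch of how such interpolation can work; it is illustrative only, and the gem's actual loader may behave differently (for example, raising on unset variables):

```ruby
require "yaml"

# Replace ${VAR} placeholders with values from ENV before parsing the YAML.
# Unset variables are left untouched here; a real loader might raise instead.
def interpolate_env(text)
  text.gsub(/\$\{([A-Z0-9_]+)\}/) do
    ENV.fetch(Regexp.last_match(1), "${#{Regexp.last_match(1)}}")
  end
end

raw = <<~YAML
  ai:
    api_key: ${LLM_TRANSLATE_API_KEY}
    provider: "openai"
YAML

ENV["LLM_TRANSLATE_API_KEY"] = "sk-test"
config = YAML.safe_load(interpolate_env(raw))
puts config["ai"]["api_key"]  # => sk-test
```

Interpolating before parsing (rather than walking the parsed tree) keeps the substitution logic independent of the YAML structure.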
## AI Providers

### OpenAI

```yaml
ai:
  provider: "openai"
  api_key: ${OPENAI_API_KEY}
  model: "gpt-4"
```

### Anthropic

```yaml
ai:
  provider: "anthropic"
  api_key: ${ANTHROPIC_API_KEY}
  model: "claude-3-sonnet-20240229"
```

### Ollama (Local)

```yaml
ai:
  provider: "ollama"
  model: "llama2"
  # Set the OLLAMA_HOST environment variable if not using the default host
```
## Document Splitting

For large documents that exceed token limits or need more manageable processing, the translator includes an intelligent document splitting feature.

### How It Works

- **Automatic Detection**: When a document exceeds the configured `max_chars` threshold, splitting is triggered automatically
- **Smart Splitting**: Documents are split at natural Markdown boundaries (headers, code blocks, lists, etc.)
- **Individual Translation**: Each chunk is translated separately with proper context
- **Seamless Merging**: Translated chunks are automatically merged back into a complete document
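The splitting step described above can be sketched roughly as follows. This is an illustrative implementation that breaks only at header lines, not the gem's actual algorithm (which also respects code blocks and lists):

```ruby
# Split Markdown text into chunks of roughly `every_chars` characters,
# breaking only at header lines so that sections stay intact.
def split_markdown(text, every_chars: 20_000)
  chunks = []
  current = +""
  text.each_line do |line|
    # Start a new chunk only at a header, once the current chunk is big enough.
    if line.start_with?("#") && current.length >= every_chars
      chunks << current
      current = +""
    end
    current << line
  end
  chunks << current unless current.empty?
  chunks
end

doc = ("# Section\n" + "word " * 50 + "\n") * 10
chunks = split_markdown(doc, every_chars: 500)
# Every chunk starts at a header, and the chunks rejoin to the original text.
```

Because chunks only ever begin at a header, concatenating the translated chunks in order reproduces the document structure without any extra bookkeeping.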
### Configuration

```yaml
translation:
  # Enable document splitting
  enable_splitting: true
  # Trigger splitting when the document exceeds this character count
  max_chars: 20000
  # Target size for each chunk
  every_chars: 20000
  # Number of chunks to translate concurrently
  concurrent_chunks: 3
```
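Concurrent chunk translation, as controlled by `concurrent_chunks`, can be approximated with a small thread pool. The sketch below is an assumption about the mechanism, not the gem's internals, and `translate_chunk` is a stand-in for the real API request:

```ruby
# Translate chunks with at most `concurrency` requests in flight,
# preserving the original chunk order in the result array.
def translate_all(chunks, concurrency: 3)
  queue = Queue.new
  chunks.each_with_index { |c, i| queue << [c, i] }
  results = Array.new(chunks.size)
  workers = Array.new([concurrency, chunks.size].min) do
    Thread.new do
      loop do
        # Non-blocking pop raises ThreadError when the queue is drained.
        chunk, idx = queue.pop(true) rescue break
        results[idx] = translate_chunk(chunk)
      end
    end
  end
  workers.each(&:join)
  results
end

# Stand-in translator used for illustration only.
def translate_chunk(chunk)
  "[zh] #{chunk}"
end

puts translate_all(["Hello", "World"], concurrency: 2).inspect
# => ["[zh] Hello", "[zh] World"]
```

Indexing results by the chunk's original position means the merge step never depends on which worker finished first.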
### Benefits

- **Large Document Support**: Handle documents of any size without token limit issues
- **Better Translation Quality**: Smaller chunks allow for more focused translation
- **Concurrent Processing**: Translate multiple chunks simultaneously for faster processing
- **Format Preservation**: Maintains Markdown structure across splits
- **Automatic Processing**: No manual intervention required
### Example

```shell
# Translate a large GitLab documentation file (65,000+ characters)
llm_translate translate --config ./config.yml --input ./large_doc.md --output ./large_doc.zh.md

# Output:
# [INFO] Document size (65277 chars) exceeds limit, splitting...
# [INFO] Document split into 4 chunks
# [INFO] Translating 4 chunks with 3 concurrent workers...
# [INFO] Translating chunk 1/4 (18500 chars)...
# [INFO] Translating chunk 2/4 (19200 chars)...
# [INFO] Translating chunk 3/4 (17800 chars)...
# [INFO] ✓ Completed chunk 1/4
# [INFO] ✓ Completed chunk 2/4
# [INFO] ✓ Completed chunk 3/4
# [INFO] Translating chunk 4/4 (9777 chars)...
# [INFO] ✓ Completed chunk 4/4
# [INFO] Merging translated chunks...
```
## Usage

### Basic Translation

#### Directory Mode (Default)

```shell
llm_translate translate --config ./llm_translate.yml
```

#### Single File Mode

To translate a single file, configure `input_file` and `output_file` in your configuration:

```yaml
files:
  # Single file mode
  input_file: "./README.md"
  output_file: "./README.zh.md"
```

When both `input_file` and `output_file` are specified, the translator operates in single file mode, ignoring directory-related settings.
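That precedence rule fits in a few lines. A sketch assuming a parsed `files` hash (illustrative only, not the gem's internal API):

```ruby
# Decide the processing mode from the files section of the config:
# single-file mode only when both paths are present, otherwise directory mode.
def processing_mode(files)
  if files["input_file"] && files["output_file"]
    :single_file
  else
    :directory
  end
end

processing_mode("input_file" => "./README.md", "output_file" => "./README.zh.md")
# => :single_file
```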
### Command Line Options

```text
llm_translate translate [OPTIONS]

Options:
  -c, --config PATH     Configuration file path (default: ./llm_translate.yml)
  -i, --input PATH      Input directory (overrides config)
  -o, --output PATH     Output directory (overrides config)
  -p, --prompt TEXT     Custom translation prompt (overrides config)
  -v, --verbose         Enable verbose output
  -d, --dry-run         Perform a dry run without actual translation

Other commands:
  llm_translate init     Initialize a new configuration file
  llm_translate version  Show version information
```
## Configuration File Structure

```yaml
# AI configuration
ai:
  api_key: ${LLM_TRANSLATE_API_KEY}
  provider: "openai"              # openai, anthropic, ollama
  model: "gpt-4"
  temperature: 0.3
  max_tokens: 4000
  retry_attempts: 3
  retry_delay: 2
  timeout: 60

# Translation settings
translation:
  target_language: "zh-CN"
  source_language: "auto"
  default_prompt: "Your custom prompt with {content} placeholder"
  preserve_formatting: true
  translate_code_comments: false
  # Document splitting settings
  enable_splitting: true          # Enable document splitting for large files
  max_chars: 20000                # Trigger splitting when the document exceeds this size
  every_chars: 20000              # Target size for each chunk
  concurrent_chunks: 3            # Number of chunks to translate concurrently

# File processing
files:
  input_directory: "./docs"
  output_directory: "./docs-translated"
  filename_strategy: "suffix"     # suffix, replace, directory
  filename_suffix: ".zh"
  include_patterns:
    - "**/*.md"
    - "**/*.markdown"
  exclude_patterns:
    - "**/node_modules/**"
    - "**/.*"
  preserve_directory_structure: true
  overwrite_policy: "ask"         # ask, overwrite, skip, backup
  backup_directory: "./backups"

# Logging
logging:
  level: "info"                   # debug, info, warn, error
  output: "console"               # console, file, both
  file_path: "./logs/translator.log"
  verbose_translation: false
  error_log_path: "./logs/errors.log"

# Error handling
error_handling:
  on_error: "log_and_continue"    # stop, log_and_continue, skip_file
  max_consecutive_errors: 5
  retry_on_failure: 2
  generate_error_report: true
  error_report_path: "./logs/error_report.md"

# Performance
performance:
  concurrent_files: 3
  request_interval: 1             # seconds between requests
  max_memory_mb: 500

# Output
output:
  show_progress: true
  show_statistics: true
  generate_report: true
  report_path: "./reports/translation_report.md"
  format: "markdown"
  include_metadata: true
```
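The `filename_strategy` option above determines how each output path is derived from an input file. A rough sketch of the three strategies follows; the exact naming rules are an assumption for illustration, so check the gem's actual behavior:

```ruby
# Derive an output path for one of the three filename strategies:
#   suffix    -> README.md becomes README.zh.md in the output directory
#   replace   -> same filename, written to the output directory
#   directory -> same filename, nested under a language subdirectory
def output_path(input, out_dir, strategy:, suffix: ".zh", lang: "zh-CN")
  name = File.basename(input, ".*")
  ext  = File.extname(input)
  case strategy
  when "suffix"    then File.join(out_dir, "#{name}#{suffix}#{ext}")
  when "replace"   then File.join(out_dir, "#{name}#{ext}")
  when "directory" then File.join(out_dir, lang, "#{name}#{ext}")
  else raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

puts output_path("./docs/README.md", "./docs-translated", strategy: "suffix")
# => ./docs-translated/README.zh.md
```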
## Examples

### Translate Documentation

```shell
# Translate all markdown files in ./docs to Chinese
llm_translate translate --input ./docs --output ./docs-zh

# Use a custom prompt
llm_translate translate --prompt "Translate the following content to Chinese, keeping technical terms unchanged: {content}"

# Dry run to see what would be translated
llm_translate translate --dry-run --verbose
```

### Batch Translation

```shell
# Translate multiple language versions
for lang in zh-CN ja-JP ko-KR; do
  llm_translate translate --config "./configs/llm_translate-${lang}.yml"
done
```
## Development

After checking out the repo, run:

```shell
bundle install
```

To run tests:

```shell
bundle exec rspec
```

To run linting:

```shell
bundle exec rubocop
```

To install this gem onto your local machine:

```shell
bundle exec rake install
```
## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/llm_translate/llm_translate.

## License

The gem is available as open source under the terms of the MIT License.
## Changelog

### v0.2.0

- **NEW**: Document splitting feature for large files
- **NEW**: Intelligent Markdown-aware splitting at natural boundaries
- **NEW**: Automatic chunk translation and merging
- **IMPROVED**: Better handling of large documents (65k+ characters)
- **IMPROVED**: Enhanced configuration options for document processing
- **IMPROVED**: Optimized performance settings for split document workflows

### v0.1.0

- Initial release
- Support for OpenAI, Anthropic, and Ollama providers
- Markdown format preservation
- Configurable translation prompts
- Batch file processing
- Comprehensive error handling and logging