RubyLLM::RedCandle
A RubyLLM plugin that enables local LLM execution using quantized GGUF models through the Red Candle gem.
What Makes This Different
While all other RubyLLM providers communicate via HTTP APIs, Red Candle runs models locally using the Candle Rust crate. This brings true local inference to Ruby with:
- No network latency - Models run directly on your machine
- No API costs - Free execution once models are downloaded
- Privacy - Your data never leaves your machine
- Structured output - Generate JSON conforming to schemas using grammar-constrained generation
- Streaming - Token-by-token output for responsive UIs
- Hardware acceleration - Metal (macOS), CUDA (NVIDIA), or CPU
Installation
Add this line to your application's Gemfile:
gem 'ruby_llm-red_candle'
And then execute:
$ bundle install
Note: The red-candle gem requires a Rust toolchain to compile native extensions. See the Red Candle installation guide for details.
Quick Start
require 'ruby_llm'
require 'ruby_llm-red_candle'
# Create a chat with a local model
chat = RubyLLM.chat(provider: :red_candle, model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF')
# Ask a question
response = chat.ask("What is the capital of France?")
puts response.content
Usage
Basic Chat (Non-Streaming)
The simplest way to use Red Candle is with synchronous chat:
chat = RubyLLM.chat(
provider: :red_candle,
model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF'
)
response = chat.ask("What are the benefits of functional programming?")
puts response.content
# Access token estimates
puts "Input tokens: #{response.input_tokens}"
puts "Output tokens: #{response.output_tokens}"
Multi-Turn Conversations
RubyLLM maintains conversation history automatically:
chat = RubyLLM.chat(
provider: :red_candle,
model: 'Qwen/Qwen2.5-1.5B-Instruct-GGUF'
)
chat.ask("My name is Alice.")
chat.ask("I'm a software engineer who loves Ruby.")
response = chat.ask("What do you know about me?")
# The model remembers previous messages
puts response.content
Streaming Output
For responsive UIs, stream tokens as they're generated:
chat = RubyLLM.chat(
provider: :red_candle,
model: 'TheBloke/Mistral-7B-Instruct-v0.2-GGUF'
)
chat.ask("Explain recursion step by step") do |chunk|
print chunk.content # Print each token as it arrives
$stdout.flush
end
puts # Final newline
The block receives a RubyLLM::Chunk object for each generated token.
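If you also need the complete text after streaming (for logging or persistence), one option is to accumulate the chunk contents yourself. A minimal sketch, using only the chunk.content attribute shown above:
buffer = +""  # mutable string that collects the streamed tokens
chat.ask("Explain recursion step by step") do |chunk|
  buffer << chunk.content.to_s  # to_s is defensive in case a chunk arrives without content
  print chunk.content
  $stdout.flush
end
puts
puts "Collected #{buffer.length} characters in total."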
Structured Output (JSON Schema)
Generate JSON output that conforms to a schema using grammar-constrained generation:
chat = RubyLLM.chat(
provider: :red_candle,
model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF'
)
schema = {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'integer' },
occupation: { type: 'string' }
},
required: ['name', 'age', 'occupation']
}
response = chat.with_schema(schema).ask("Generate a profile for a 30-year-old software engineer named Alice")
# response.content is automatically parsed as a Hash
puts response.content
# => {"name"=>"Alice", "age"=>30, "occupation"=>"Software Engineer"}
puts "Name: #{response.content['name']}"
puts "Age: #{response.content['age']}"
How it works: Red Candle uses the Rust outlines-core crate to constrain token generation to only produce valid JSON matching your schema. This ensures 100% valid output structure.
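Because the parsed content is a plain Hash that matches the schema, you can map it onto Ruby objects without defensive parsing. A small illustrative sketch (the Profile struct below is hypothetical, not part of this gem):
Profile = Struct.new(:name, :age, :occupation, keyword_init: true)
profile = Profile.new(
  name: response.content['name'],
  age: response.content['age'],
  occupation: response.content['occupation']
)
puts "#{profile.name} (#{profile.age}) - #{profile.occupation}"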
Structured Output with Enums
Constrain values to specific options:
schema = {
type: 'object',
properties: {
sentiment: {
type: 'string',
enum: ['positive', 'negative', 'neutral']
},
confidence: { type: 'number' }
},
required: ['sentiment', 'confidence']
}
response = chat.with_schema(schema).ask("Analyze the sentiment: 'I love this product!'")
puts response.content['sentiment'] # => "positive"
Using with ruby_llm-schema
For complex schemas, use ruby_llm-schema:
require 'ruby_llm/schema'
class PersonProfile
include RubyLLM::Schema
schema do
string :name, description: "Person's full name"
integer :age, description: "Age in years"
string :occupation
array :skills, items: { type: 'string' }
end
end
chat = RubyLLM.chat(provider: :red_candle, model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF')
response = chat.with_schema(PersonProfile).ask("Generate a Ruby developer profile")
Temperature Control
Adjust creativity vs determinism:
# More creative/varied responses (higher temperature)
chat = RubyLLM.chat(
provider: :red_candle,
model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF'
)
chat.with_temperature(1.2).ask("Write a creative story opening")
# More focused/deterministic responses (lower temperature)
chat.with_temperature(0.3).ask("What is 15 + 27?")
Temperature range: 0.0 (deterministic) to 2.0 (very creative). Default is 0.7 for regular generation, 0.3 for structured output.
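If you want structured output to be as repeatable as possible, lower the temperature explicitly instead of relying on the default. A sketch, assuming with_temperature and with_schema can be chained together (each is chainable on its own in the examples above):
schema = {
  type: 'object',
  properties: { answer: { type: 'string' } },
  required: ['answer']
}
response = chat.with_temperature(0.2).with_schema(schema).ask("What is 15 + 27?")
puts response.content['answer']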
System Prompts
Set context for the conversation:
chat = RubyLLM.chat(
provider: :red_candle,
model: 'Qwen/Qwen2.5-1.5B-Instruct-GGUF'
)
chat.with_instructions("You are a helpful coding assistant specializing in Ruby. Always provide code examples.")
response = chat.ask("How do I read a file in Ruby?")
Supported Models
| Model | Context Window (tokens) | Size | Best For |
|---|---|---|---|
| `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` | 2,048 | ~600MB | Testing, quick prototypes |
| `Qwen/Qwen2.5-1.5B-Instruct-GGUF` | 32,768 | ~900MB | General chat, long context |
| `google/gemma-3-4b-it-qat-q4_0-gguf` | 8,192 | ~2.5GB | Balanced quality/speed |
| `microsoft/Phi-3-mini-4k-instruct` | 4,096 | ~2GB | Reasoning tasks |
| `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` | 32,768 | ~4GB | High quality responses |
Models are automatically downloaded from HuggingFace on first use.
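Because that first use triggers the download and model load, it can take noticeably longer than subsequent calls. An optional warm-up sketch (a usage pattern, not a gem API) that pays this cost at boot rather than on the first user request:
require 'ruby_llm'
require 'ruby_llm-red_candle'
chat = RubyLLM.chat(provider: :red_candle, model: 'Qwen/Qwen2.5-1.5B-Instruct-GGUF')
chat.ask("ping")  # first call downloads the GGUF weights if needed, then loads the model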
Listing Available Models
# Get all Red Candle models
models = RubyLLM.models.all.select { |m| m.provider == 'red_candle' }
models.each do |m|
puts "#{m.id} - #{m.context_window} tokens"
end
Configuration
Device Selection
By default, Red Candle uses the best available device:
- Metal on macOS (Apple Silicon)
- CUDA if NVIDIA GPU is available
- CPU as fallback
Override with:
RubyLLM.configure do |config|
config.red_candle_device = 'cpu' # Force CPU
config.red_candle_device = 'metal' # Force Metal (macOS)
config.red_candle_device = 'cuda' # Force CUDA (NVIDIA)
end
HuggingFace Authentication
Some models require HuggingFace authentication (especially gated models like Mistral):
# Install the HuggingFace CLI
pip install huggingface_hub
# Login (creates ~/.huggingface/token)
huggingface-cli login
See the Red Candle HuggingFace guide for details.
Custom JSON Instruction Template
By default, structured generation appends instructions to guide the model to output JSON. You can customize this template for different models or use cases:
# View the default template
RubyLLM::RedCandle::Configuration.json_instruction_template
# => "\n\nRespond with ONLY a valid JSON object containing: {schema_description}"
# Set a custom template (use {schema_description} as placeholder)
RubyLLM::RedCandle::Configuration.json_instruction_template = <<~TEMPLATE
You must respond with valid JSON matching this structure: {schema_description}
Do not include any other text, only the JSON object.
TEMPLATE
# Reset to default
RubyLLM::RedCandle::Configuration.reset!
Different models may respond better to different phrasings. Experiment with templates if you're getting inconsistent structured output.
Debug Logging
Enable debug logging to troubleshoot issues:
RubyLLM.logger.level = Logger::DEBUG
This shows:
- Schema normalization details
- Prompt construction
- Generation parameters
- Raw model outputs
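To keep the noise contained to a single problematic call, you can raise the log level temporarily and restore it afterwards. A minimal sketch, assuming RubyLLM.logger is a standard Ruby Logger as the snippet above suggests:
require 'logger'
previous_level = RubyLLM.logger.level
RubyLLM.logger.level = Logger::DEBUG
begin
  chat.with_schema(schema).ask("Generate a profile")  # the call you're troubleshooting
ensure
  RubyLLM.logger.level = previous_level  # restore the original verbosity
end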
Error Handling
begin
chat = RubyLLM.chat(provider: :red_candle, model: 'TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF')
response = chat.ask("Hello!")
rescue RubyLLM::Error => e
puts "Error: #{e.}"
end
Common errors:
- Model not found - Check model ID spelling
- Failed to load tokenizer - Model may require HuggingFace login
- Context length exceeded - Reduce conversation length or use a model with a larger context window
- Invalid schema - Schema must be `type: 'object'` with `properties` defined
- Structured generation failed - Schema may be too complex; try simplifying
Schema Validation
Schemas are validated before generation. Invalid schemas produce helpful error messages:
# This will fail with a descriptive error
invalid_schema = { type: "array" } # Must be 'object' with properties
chat.with_schema(invalid_schema).ask("...")
# => RubyLLM::Error: Invalid schema for structured generation:
# - Schema type must be 'object' for structured generation, got 'array'
# - Schema must have a 'properties' field...
Valid schemas must have:
- `type: 'object'`
- A `properties` hash with at least one property
- Each property must have a `type` field
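For reference, a minimal schema that satisfies these requirements (an illustrative example in the same shape as the profile schema earlier):
minimal_schema = {
  type: 'object',
  properties: {
    answer: { type: 'string' }
  },
  required: ['answer']
}
chat.with_schema(minimal_schema).ask("In one word, is Ruby dynamically typed?")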
Limitations
- No tool/function calling - Red Candle models don't support tool use
- No vision - Text-only input supported
- No embeddings - Chat models only (embedding support planned)
- No audio - Text-only modality
Performance Tips
- Choose the right model size - TinyLlama (1.1B) is fast but less capable; Mistral (7B) is slower but higher quality
- Use streaming for long responses - Better UX than waiting for full generation
- Lower temperature for structured output - More deterministic JSON generation
- Reuse chat instances - Model loading is expensive; reuse loaded models (see the sketch below)
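A minimal memoization sketch for that last tip (the red_candle_chat helper is hypothetical, not part of the gem):
require 'ruby_llm'
require 'ruby_llm-red_candle'
CHATS = {}
def red_candle_chat(model)
  CHATS[model] ||= RubyLLM.chat(provider: :red_candle, model: model)  # build and load each model once
end
chat = red_candle_chat('TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF')
chat.ask("First question")   # model downloads (if needed) and loads here
chat.ask("Second question")  # reuses the already-loaded model
Keep in mind that a long-lived chat also keeps its conversation history, which counts toward the model's context window.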
Development
After checking out the repo, run bin/setup to install dependencies.
Running Tests
# Fast tests with mocked responses (default)
bundle exec rspec
# Real inference tests (slow, downloads models)
RED_CANDLE_REAL_INFERENCE=true bundle exec rspec
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/ruby_llm-red_candle.
License
The gem is available as open source under the terms of the MIT License.
Related Projects
- RubyLLM - The unified Ruby LLM interface
- Red Candle - Ruby bindings for the Candle ML framework
- ruby_llm-mcp - MCP protocol support for RubyLLM
- ruby_llm-schema - JSON Schema DSL for RubyLLM