Module: EvalRuby

Defined in:
lib/eval_ruby.rb,
lib/eval_ruby/rspec.rb,
lib/eval_ruby/report.rb,
lib/eval_ruby/result.rb,
lib/eval_ruby/dataset.rb,
lib/eval_ruby/version.rb,
lib/eval_ruby/minitest.rb,
lib/eval_ruby/evaluator.rb,
lib/eval_ruby/comparison.rb,
lib/eval_ruby/judges/base.rb,
lib/eval_ruby/metrics/mrr.rb,
lib/eval_ruby/metrics/base.rb,
lib/eval_ruby/metrics/ndcg.rb,
lib/eval_ruby/configuration.rb,
lib/eval_ruby/judges/openai.rb,
lib/eval_ruby/judges/anthropic.rb,
lib/eval_ruby/metrics/relevance.rb,
lib/eval_ruby/metrics/correctness.rb,
lib/eval_ruby/metrics/recall_at_k.rb,
lib/eval_ruby/metrics/faithfulness.rb,
lib/eval_ruby/metrics/context_recall.rb,
lib/eval_ruby/metrics/precision_at_k.rb,
lib/eval_ruby/metrics/context_precision.rb

Overview

An evaluation framework for LLM and RAG applications. It measures quality metrics such as faithfulness, relevance, context precision, and answer correctness. Think Ragas or DeepEval, but for Ruby.

Examples:

Quick evaluation

result = EvalRuby.evaluate(
  question: "What is Ruby?",
  answer: "A programming language",
  context: ["Ruby is a dynamic, open source programming language."],
  ground_truth: "Ruby is a programming language created by Matz."
)
puts result.faithfulness  # => 0.95
puts result.overall       # => 0.87

Retrieval evaluation

result = EvalRuby.evaluate_retrieval(
  question: "What is Ruby?",
  retrieved: ["doc_a", "doc_b", "doc_c"],
  relevant: ["doc_a", "doc_c"]
)
puts result.precision_at_k(3) # => 0.67
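
Here two of the three retrieved documents (doc_a and doc_c) appear in the relevant set, so precision@3 = 2/3 ≈ 0.67.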

Defined Under Namespace

Modules: Assertions, Judges, Metrics, RSpecMatchers

Classes: APIError, Comparison, Configuration, Dataset, Error, Evaluator, InvalidResponseError, Report, Result, RetrievalResult, TimeoutError

Constant Summary

VERSION = "0.2.0"

Class Method Summary

  • .compare(report_a, report_b) ⇒ Comparison
  • .configuration ⇒ Configuration
  • .configure {|config| ... } ⇒ void
  • .evaluate(question:, answer:, context: [], ground_truth: nil) ⇒ Result
  • .evaluate_batch(dataset, pipeline: nil) ⇒ Report
  • .evaluate_retrieval(question:, retrieved:, relevant:) ⇒ RetrievalResult
  • .reset_configuration! ⇒ Configuration

Class Method Details

.compare(report_a, report_b) ⇒ Comparison

Compares two evaluation reports with statistical significance testing.

Parameters:

  • report_a (Report)

    baseline report

  • report_b (Report)

    comparison report

Returns:

  • (Comparison)

# File 'lib/eval_ruby.rb', line 134

def compare(report_a, report_b)
  Comparison.new(report_a, report_b)
end
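
For example, two reports produced by .evaluate_batch can be compared directly. The pipeline objects below are hypothetical stand-ins responding to #query; the sketch only builds the Comparison, since its accessors are not documented in this section:

baseline  = EvalRuby.evaluate_batch(dataset, pipeline: old_pipeline)
candidate = EvalRuby.evaluate_batch(dataset, pipeline: new_pipeline)

comparison = EvalRuby.compare(baseline, candidate)
puts comparison.inspect  # inspect the significance-test results carried by the Comparison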

.configuration ⇒ Configuration

Returns the current configuration.

Returns:

  • (Configuration)

# File 'lib/eval_ruby.rb', line 53

def configuration
  @configuration ||= Configuration.new
end

.configure {|config| ... } ⇒ void

This method returns an undefined value.

Yields the configuration for modification.

Yield Parameters:

  • config (Configuration)

# File 'lib/eval_ruby.rb', line 61

def configure
  yield(configuration)
end
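
A minimal usage sketch, assuming the Configuration object exposes judge-related settings; the attribute names below (judge, api_key) are illustrative assumptions, not documented here:

EvalRuby.configure do |config|
  config.judge   = :openai                 # hypothetical setting (Judges::OpenAI ships with the gem)
  config.api_key = ENV["OPENAI_API_KEY"]   # hypothetical setting
end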

.evaluate(question:, answer:, context: [], ground_truth: nil) ⇒ Result

Evaluates an LLM response across multiple quality metrics.

Parameters:

  • question (String)

    the input question

  • answer (String)

    the LLM-generated answer

  • context (Array<String>) (defaults to: [])

    retrieved context chunks

  • ground_truth (String, nil) (defaults to: nil)

    expected correct answer

Returns:

  • (Result)

# File 'lib/eval_ruby.rb', line 79

def evaluate(question:, answer:, context: [], ground_truth: nil)
  Evaluator.new.evaluate(
    question: question,
    answer: answer,
    context: context,
    ground_truth: ground_truth
  )
end

.evaluate_batch(dataset, pipeline: nil) ⇒ Report

Evaluates a batch of samples, optionally running them through a pipeline.

Parameters:

  • dataset (Dataset, Array<Hash>)

    samples to evaluate

  • pipeline (#query, nil) (defaults to: nil)

    optional RAG pipeline to run queries through

Returns:

  • (Report)

# File 'lib/eval_ruby.rb', line 107

def evaluate_batch(dataset, pipeline: nil)
  samples = dataset.is_a?(Dataset) ? dataset.samples : dataset
  evaluator = Evaluator.new
  start_time = Time.now

  results = samples.map do |sample|
    if pipeline
      response = pipeline.query(sample[:question])
      evaluator.evaluate(
        question: sample[:question],
        answer: response.respond_to?(:text) ? response.text : response.to_s,
        context: response.respond_to?(:context) ? response.context : sample[:context],
        ground_truth: sample[:ground_truth]
      )
    else
      evaluator.evaluate(**sample.slice(:question, :answer, :context, :ground_truth))
    end
  end

  Report.new(results: results, samples: samples, duration: Time.now - start_time)
end
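
With precomputed answers (no pipeline), each sample is a plain Hash using the keys the evaluator slices out (:question, :answer, :context, :ground_truth):

report = EvalRuby.evaluate_batch([
  {
    question: "What is Ruby?",
    answer: "A programming language",
    context: ["Ruby is a dynamic, open source programming language."],
    ground_truth: "Ruby is a programming language created by Matz."
  }
])

With a pipeline, the answer comes from pipeline.query(sample[:question]); as the source above shows, the return value may be a plain String or any object responding to #text and #context, and the sample's own :context is used as a fallback.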

.evaluate_retrieval(question:, retrieved:, relevant:) ⇒ RetrievalResult

Evaluates retrieval quality using information-retrieval (IR) metrics.

Parameters:

  • question (String)

    the input question

  • retrieved (Array<String>)

    retrieved document IDs

  • relevant (Array<String>)

    ground-truth relevant document IDs

Returns:

  • (RetrievalResult)

# File 'lib/eval_ruby.rb', line 94

def evaluate_retrieval(question:, retrieved:, relevant:)
  Evaluator.new.evaluate_retrieval(
    question: question,
    retrieved: retrieved,
    relevant: relevant
  )
end
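
The metric files listed under "Defined in" (mrr, ndcg, recall_at_k) suggest the RetrievalResult carries related IR scores alongside precision_at_k, but the second accessor below is an assumption mirroring lib/eval_ruby/metrics/recall_at_k.rb, not documented API:

result = EvalRuby.evaluate_retrieval(
  question: "What is Ruby?",
  retrieved: ["doc_a", "doc_b", "doc_c"],
  relevant: ["doc_a", "doc_c"]
)
puts result.precision_at_k(3)  # documented in the Overview example
puts result.recall_at_k(3)     # hypothetical accessor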

.reset_configuration! ⇒ Configuration

Resets configuration to defaults.

Returns:

  • (Configuration)

# File 'lib/eval_ruby.rb', line 68

def reset_configuration!
  @configuration = Configuration.new
end
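
This is mainly useful in test suites, so configuration changes do not leak between examples. A sketch using RSpec:

RSpec.configure do |config|
  config.after(:each) { EvalRuby.reset_configuration! }
end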