Class: EvalRuby::Metrics::Correctness

Inherits:
Base
  • Object
Defined in:
lib/eval_ruby/metrics/correctness.rb

Overview

Measures the factual correctness of an answer against a ground truth. Uses an LLM judge when available, falling back to a token-overlap F1 score.

Examples:

With LLM judge

metric = Correctness.new(judge: judge)
result = metric.call(answer: "Paris", ground_truth: "Paris")

Without judge (string similarity)

metric = Correctness.new
result = metric.call(answer: "The capital is Paris", ground_truth: "Paris is the capital")

Constant Summary

PROMPT_TEMPLATE =
<<~PROMPT
  Given the following answer and ground truth, evaluate whether the answer
  is factually correct.

  Answer:
  %{answer}

  Ground Truth:
  %{ground_truth}

  Evaluate correctness on a scale from 0.0 to 1.0 where:
  - 1.0 = the answer is completely correct and matches the ground truth
  - 0.5 = the answer is partially correct
  - 0.0 = the answer is completely wrong

  Consider both semantic meaning and factual accuracy, not just exact string matching.

  Respond in JSON: {"reasoning": "...", "score": 0.0}
PROMPT
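The `%{answer}` and `%{ground_truth}` placeholders follow Ruby's format-string syntax, which suggests the class fills the template with `String#%` and a keyword Hash. A minimal sketch of that interpolation, using an abbreviated stand-in template rather than the constant above:

```ruby
# Abbreviated stand-in for PROMPT_TEMPLATE, for illustration only.
template = <<~PROMPT
  Answer:
  %{answer}

  Ground Truth:
  %{ground_truth}
PROMPT

# String#% substitutes each %{key} placeholder from the Hash.
prompt = template % { answer: "Paris", ground_truth: "Paris is the capital" }
puts prompt
```

Note that `String#%` raises `KeyError` if a placeholder key is missing from the Hash, so callers must supply both `:answer` and `:ground_truth`.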

Instance Attribute Summary

Attributes inherited from Base

#judge

Instance Method Summary

Methods inherited from Base

#initialize

Constructor Details

This class inherits a constructor from EvalRuby::Metrics::Base

Instance Method Details

#call(answer:, ground_truth:, **_kwargs) ⇒ Hash

Returns a Hash with :score (Float, 0.0-1.0) and :details.

Parameters:

  • answer (String)

    the LLM-generated answer

  • ground_truth (String)

    the expected correct answer

Returns:

  • (Hash)

    :score (Float 0.0-1.0) and :details



# File 'lib/eval_ruby/metrics/correctness.rb', line 39

def call(answer:, ground_truth:, **_kwargs)
  if judge
    llm_score(answer, ground_truth)
  else
    string_similarity_score(answer, ground_truth)
  end
end
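The token-overlap F1 fallback mentioned in the overview is not shown on this page. Below is a plausible sketch of what `string_similarity_score` could compute; the method name `token_f1`, the `\w+` tokenization, and the bare-Float return value are assumptions for illustration, not the library's actual code (the real method presumably wraps the score in the :score/:details Hash):

```ruby
# Hypothetical sketch of a token-overlap F1 score, assumed to resemble
# the string_similarity_score fallback. Tokenizes case-insensitively on
# word characters and counts each ground-truth token at most once.
def token_f1(answer, ground_truth)
  a = answer.downcase.scan(/\w+/)
  g = ground_truth.downcase.scan(/\w+/)
  return 0.0 if a.empty? || g.empty?

  overlap = 0
  remaining = g.dup
  a.each do |tok|
    idx = remaining.index(tok)
    next unless idx
    overlap += 1
    remaining.delete_at(idx)   # consume the matched token
  end
  return 0.0 if overlap.zero?

  precision = overlap.to_f / a.size
  recall    = overlap.to_f / g.size
  2 * precision * recall / (precision + recall)
end
```

Under this sketch, the second usage example above ("The capital is Paris" vs. "Paris is the capital") scores 1.0, since the token multisets match despite the different word order.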