lex-eval

LLM output evaluation framework for LegionIO. Provides LLM-as-judge and code-based evaluators for scoring LLM outputs against expected results, with per-row results and summary statistics.

Overview

lex-eval runs structured evaluation suites against LLM outputs. Each evaluation takes a list of input/output/expected triples, scores them with the chosen evaluator, and returns a result set with pass/fail per row and an aggregate score.

Installation

gem 'lex-eval'

Usage

require 'legion/extensions/eval'

client = Legion::Extensions::Eval::Client.new

# Run an LLM-judge evaluation
result = client.run_evaluation(
  evaluator_name: 'accuracy',
  evaluator_config: { type: :llm_judge, criteria: 'factual correctness' },
  inputs: [
    { input: 'What is BGP?', output: 'Border Gateway Protocol', expected: 'Border Gateway Protocol' },
    { input: 'What is OSPF?', output: 'Open Shortest Path First', expected: 'Open Shortest Path First' }
  ]
)
# => { evaluator: 'accuracy',
#      results: [{ passed: true, score: 1.0, row_index: 0 }, ...],
#      summary: { total: 2, passed: 2, failed: 0, avg_score: 1.0 } }

# Run a code-based evaluation
client.run_evaluation(
  evaluator_name: 'json-validity',
  evaluator_config: { type: :code },
  inputs: [{ input: 'parse this', output: '{"valid": true}', expected: nil }]
)

# List built-in evaluator templates
client.list_evaluators
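The summary block shown in the first example can be derived from the per-row results. A minimal sketch in plain Ruby (illustrative only; the gem computes this internally, and the exact field set is taken from the example output above):

```ruby
# Per-row results in the shape run_evaluation returns.
results = [
  { passed: true,  score: 1.0, row_index: 0 },
  { passed: false, score: 0.5, row_index: 1 }
]

# Aggregate into the summary hash: counts plus mean score.
summary = {
  total:     results.size,
  passed:    results.count { |r| r[:passed] },
  failed:    results.count { |r| !r[:passed] },
  avg_score: results.sum { |r| r[:score] } / results.size
}
# => { total: 2, passed: 1, failed: 1, avg_score: 0.75 }
```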

Evaluator Types

Type        Description
:llm_judge  Uses legion-llm to score output against expected using natural language criteria
:code       Runs a Ruby proc or checks structural validity
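As an illustration of what a :code check can do, here is a standalone JSON-validity check like the json-validity example above. The row-in / hash-out shape is an assumption for the sketch, not the gem's documented proc contract:

```ruby
require 'json'

# Hypothetical :code-style check: pass if the output parses as JSON.
json_validity = lambda do |row|
  begin
    JSON.parse(row[:output])
    { passed: true, score: 1.0 }
  rescue JSON::ParserError
    { passed: false, score: 0.0 }
  end
end

json_validity.call(output: '{"valid": true}') # => { passed: true,  score: 1.0 }
json_validity.call(output: 'not json')        # => { passed: false, score: 0.0 }
```

Code-based checks like this are deterministic and cheap, which makes them a good first gate before spending LLM-judge calls.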

Built-In Templates

12 YAML evaluator templates ship with the gem and are returned by list_evaluators:

hallucination, relevance, toxicity, faithfulness, qa_correctness, sql_generation, code_generation, code_readability, tool_calling, human_vs_ai, rag_relevancy, summarization

Annotation Queues

Human-in-the-loop annotation for labeling LLM outputs:

client = Legion::Extensions::Eval::Client.new(db: Sequel.sqlite)
# Create the annotation tables on the client's database handle
Legion::Extensions::Eval::Helpers::AnnotationSchema.create_tables(client.instance_variable_get(:@db))

client.create_queue(name: 'review', description: 'Manual review queue')
client.enqueue_items(queue_name: 'review', items: [{ input: 'q', output: 'a' }])
client.assign_next(queue_name: 'review', annotator: 'alice', count: 5)
client.complete_annotation(item_id: 1, label_score: 0.9, label_category: 'correct')
client.queue_stats(queue_name: 'review')
client.export_to_dataset(queue_name: 'review')

Agentic Review

An AI reviewer scores another model's output, with confidence-based escalation to human review:

client = Legion::Extensions::Eval::Client.new
result = client.review_output(input: 'question', output: 'answer')
# => { confidence: 0.92, recommendation: 'approve', issues: [], explanation: '...' }

result = client.review_with_escalation(input: 'q', output: 'a')
# => { action: :auto_approve, escalated: false, ... }  (confidence > 0.9)
# => { action: :light_review, escalated: true, priority: :low, ... }  (0.6-0.9)
# => { action: :full_review, escalated: true, priority: :high, ... }  (< 0.6)
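The three tiers above can be sketched as a plain dispatch on confidence. The method name and exact boundary handling here are assumptions; only the tier names and the 0.9 / 0.6 cutoffs come from the output shown above:

```ruby
# Map a reviewer confidence score to an escalation tier.
def escalation_tier(confidence)
  if confidence > 0.9
    { action: :auto_approve, escalated: false }
  elsif confidence >= 0.6
    { action: :light_review, escalated: true, priority: :low }
  else
    { action: :full_review, escalated: true, priority: :high }
  end
end

escalation_tier(0.92) # => { action: :auto_approve, escalated: false }
escalation_tier(0.75) # => { action: :light_review, escalated: true, priority: :low }
escalation_tier(0.40) # => { action: :full_review, escalated: true, priority: :high }
```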

Development

bundle install
bundle exec rspec
bundle exec rubocop

License

MIT