static_genderizer

static_genderizer is a small Ruby library that detects probable gender and splits first/last name tokens using static CSV datasets loaded at startup. It's intentionally simple and fast — it uses in-memory lookups from CSV files that you provide.

Key features

  • Load static per-language CSV files (name,gender) from a configurable data directory.
  • Classify tokens: tokens that have a gender value (M/F) are treated as first names; tokens without gender are treated as last names.
  • Analyze a name string and return an object with:
    • first_names: Array of detected first-name tokens (original token casing is preserved)
    • last_names: Array of detected last-name tokens
    • language: the language chosen for the result (symbol) or nil
    • gender: one of :male, :female or :unknown
  • Configure the set of languages to load; if no language is requested, the gem searches across all loaded languages and selects the best match.

Installation

Add this gem to your Gemfile (local development):

# Gemfile
gem "static_genderizer"

Or build and install the gem locally:

gem build static_genderizer.gemspec
gem install ./static_genderizer-0.1.0.gem

Configuration & CSV files

  • Provide CSV files named xx.csv (where xx is the language code, e.g. pl, en) in a data directory.
  • CSV format: the file must have headers and at least two columns: name and gender.
    • name: the name token (e.g. "Anna", "Kowalski")
    • gender: M, F (case-insensitive) or empty
    • non-empty gender => token will be treated as a first name (and the gender recorded)
    • empty gender => token will be treated as a last name
  • Example spec/data/pl.csv:
name,gender
Jan,M
Anna,F
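
The format rules above can be illustrated with a minimal sketch of how such a file might be parsed into lookup tables. This is not the gem's actual loader: load_name_table is a hypothetical helper, the extra Kowalski row is added only to show an empty gender, and skipping rows with an unrecognized gender value is an assumption.

```ruby
require "csv"

# Hypothetical helper (not part of the gem's documented API): parse CSV text
# with a header row into a first-name => gender map and a list of last names.
def load_name_table(csv_text)
  first_names = {}
  last_names = []

  CSV.parse(csv_text, headers: true) do |row|
    name = row["name"].to_s.strip
    next if name.empty?

    # Gender is case-insensitive; a missing value arrives as nil, hence to_s.
    case row["gender"].to_s.strip.upcase
    when "M" then first_names[name] = :male
    when "F" then first_names[name] = :female
    when ""  then last_names << name
    end
    # Assumption: rows with any other gender value are silently skipped.
  end

  { first_names: first_names, last_names: last_names }
end

sample_csv = <<~DATA
  name,gender
  Jan,M
  Anna,f
  Kowalski,
DATA

table = load_name_table(sample_csv)
p table[:first_names]
p table[:last_names]
```

Keeping first names in a Hash makes the later gender lookup a single key access, which matches the in-memory lookup approach described above.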

Quickstart (programmatic)

require "static_genderizer"

# configure and load CSVs
StaticGenderizer.configure do |c|
  c.data_path = File.expand_path("data", __dir__) # path containing pl.csv, en.csv ...
  c.languages = [:pl, :en]                        # languages to load
  c.case_sensitive = false
end

# analyze a name (language optional)
result = StaticGenderizer.analyze("Jan Kowalski", language: :pl)

puts result.first_names.inspect  # => ["Jan"]
puts result.last_names.inspect   # => ["Kowalski"]
puts result.language             # => :pl
puts result.gender               # => :male

Behavior & heuristics

  • Tokenization: input string is split on whitespace and punctuation (commas/semicolons). Apostrophes are preserved (e.g., "O'Connor").
  • Classification:
    • If a token is present in a language's first_names map (i.e. found in CSV with non-empty gender) — it is treated as a first name for that language.
    • Otherwise the token is treated as a last name.
  • Language selection:
    • If you pass language: with a language that is loaded, analysis is done only for that language.
    • If you do not pass a language, the gem analyzes across all configured languages and picks the language that produced the highest "match" score (based on tokens found).
  • Gender decision:
    • Derived from detected first-name tokens' recorded genders (M/F). Majority rule applies; ties or no matches => :unknown.
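
The heuristics above can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the gem's implementation: tokenize, analyze_for and analyze_sketch are hypothetical names, the tables hash stands in for the loaded CSV data (keys downcased to mimic case_sensitive = false), and the "match" score is assumed to be the number of tokens found as first names.

```ruby
# Split on whitespace, commas and semicolons; apostrophes stay inside tokens.
def tokenize(input)
  input.split(/[\s,;]+/).reject(&:empty?)
end

# Classify tokens against one language's first-name map (name => :male/:female)
# and apply the majority rule for gender.
def analyze_for(tokens, first_names)
  firsts, lasts = tokens.partition { |t| first_names.key?(t.downcase) }
  genders = firsts.map { |t| first_names[t.downcase] }
  males = genders.count(:male)
  females = genders.count(:female)
  gender = if males > females then :male
           elsif females > males then :female
           else :unknown # tie or no first names found
           end
  { first_names: firsts, last_names: lasts, gender: gender, score: firsts.size }
end

# With a language, analyze only that language; otherwise pick the language
# with the highest score. Returning language nil when nothing matches is an
# assumption of this sketch.
def analyze_sketch(input, tables, language: nil)
  tokens = tokenize(input)
  if language
    return analyze_for(tokens, tables.fetch(language)).merge(language: language)
  end

  best = tables
    .map { |lang, names| analyze_for(tokens, names).merge(language: lang) }
    .max_by { |result| result[:score] }
  return best if best && best[:score] > 0

  { first_names: [], last_names: tokens, gender: :unknown, score: 0, language: nil }
end

tables = {
  pl: { "jan" => :male, "anna" => :female },
  en: { "john" => :male, "anna" => :female }
}

p analyze_sketch("Jan Kowalski", tables, language: :pl)
p analyze_sketch("O'Connor, John", tables)
```

When two languages score equally, max_by keeps the first one in the hash, so the order of configured languages would act as the tie-breaker in this sketch.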

API

  • StaticGenderizer.configure { |c| ... } — configure data_path, languages, case_sensitive and load CSVs.
  • StaticGenderizer.analyze(name_string, language: nil) returns a StaticGenderizer::Result
    • Result provides first_names, last_names, language, and gender.
    • Example: StaticGenderizer.analyze("Anna Nowak")

Testing

Specs are included under spec/. To run:

bundle install
bundle exec rspec

Project layout (relevant)

  • lib/static_genderizer/*.rb — gem implementation
  • spec/ — RSpec tests and sample CSV data (spec/data/*.csv)
  • static_genderizer.gemspec, Gemfile, Rakefile

Notes, limitations & next steps

  • CSV-only: this gem is designed for static CSV lookup. It does not use external services.
  • Declension: the current implementation covers only name splitting and gender detection using xx.csv files. Declension/inflection data (e.g., xx_declination.csv) and morphological support are not implemented; the codebase is structured so that a declension loader could be added and its output included in Result.
  • Ambiguity & heuristics: heuristics are intentionally simple. For better accuracy you can:
    • extend CSVs with frequency scores,
    • add a type column (first, last, both),
    • add language-specific rules or suffix heuristics.
  • Case sensitivity is configurable via configuration.case_sensitive.

Contributing

  • Add CSV data under spec/data for tests or in your own data/ folder when using the gem.
  • Add specs under spec/ and run bundle exec rspec.
  • Open pull requests for bug fixes or improvements.

License

MIT. See the LICENSE file.

Contact / support

Open an issue or PR in the repository where this gem is hosted.