static_genderizer

static_genderizer is a small Ruby library that detects probable gender and splits a name into first-name and last-name tokens using static CSV datasets loaded at startup. It is intentionally simple and fast: all lookups are in-memory, against CSV files that you provide.

Key features

Load static per-language CSV files (name,gender) from a configurable data directory.
Classify tokens: tokens that have a gender value (M/F) are treated as first names; tokens without gender are treated as last names.
Analyze a name string and return an object with:
- first_names: Array of detected first-name tokens (original token casing is preserved)
- last_names: Array of detected last-name tokens
- language: the language chosen for the result (symbol) or nil
- gender: one of :male, :female or :unknown
Configure the set of languages to load; if no language is requested, the gem searches across all loaded languages and selects the best match.

Installation

Add this gem to your Gemfile (local development):

# Gemfile
gem "static_genderizer"

Or build and install the gem locally:

gem build static_genderizer.gemspec
gem install ./static_genderizer-0.1.0.gem

Configuration & CSV files

Provide CSV files named xx.csv (where xx is the language code, e.g. pl, en) in a data directory.
CSV format: the file must have headers and at least two columns: name and gender.
- name: the name token (e.g. "Anna", "Kowalski")
- gender: M, F (case-insensitive) or empty
  - non-empty gender => token will be treated as a first name (and the gender recorded)
  - empty gender => token will be treated as a last name
Example spec/data/pl.csv:

name,gender
Jan,M
Anna,F
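
To make the CSV rules above concrete, here is a minimal sketch of how such a file could be parsed into lookup structures with Ruby's standard csv library. It is not the gem's actual loader: load_name_rows and the shape of the returned hash are hypothetical, and skipping rows whose gender is neither M, F nor empty is an assumption.

```ruby
require "csv"

# Hypothetical helper (not part of the gem's documented API): parse CSV text
# with a header row into a first-name map and a last-name list.
def load_name_rows(csv_text)
  first_names = {} # name => :male or :female
  last_names = []

  CSV.parse(csv_text, headers: true) do |row|
    name = row["name"].to_s.strip
    next if name.empty?

    # Gender is matched case-insensitively; an empty value marks a last name.
    # Any other value is skipped in this sketch (an assumption).
    case row["gender"].to_s.strip.upcase
    when "M" then first_names[name] = :male
    when "F" then first_names[name] = :female
    when ""  then last_names << name
    end
  end

  { first_names: first_names, last_names: last_names }
end

sample = <<~CSV
  name,gender
  Jan,M
  Anna,f
  Kowalski,
CSV

data = load_name_rows(sample)
p data[:first_names] # => {"Jan"=>:male, "Anna"=>:female}
p data[:last_names]  # => ["Kowalski"]
```

Keeping first names in a Hash gives constant-time lookups, which fits the in-memory design described above.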

Quickstart (programmatic)

require "static_genderizer"

# configure and load CSVs
StaticGenderizer.configure do |c|
  c.data_path = File.expand_path("data", __dir__) # path containing pl.csv, en.csv ...
  c.languages = [:pl, :en]                        # languages to load
  c.case_sensitive = false
end

# analyze a name (language optional)
result = StaticGenderizer.analyze("Jan Kowalski", language: :pl)

puts result.first_names.inspect  # => ["Jan"]
puts result.last_names.inspect   # => ["Kowalski"]
puts result.language.inspect     # => :pl
puts result.gender.inspect       # => :male

Behavior & heuristics

Tokenization: the input string is split on whitespace and on commas and semicolons. Apostrophes are preserved (e.g., "O'Connor").
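
The tokenization rule can be sketched as follows. This is a minimal illustration rather than the gem's code: tokenize is a hypothetical helper name, and the exact separator set the gem uses may differ.

```ruby
# Hypothetical tokenizer: split on runs of whitespace, commas and semicolons.
# Apostrophes are not separators, so "O'Connor" stays a single token.
def tokenize(name_string)
  name_string.to_s.split(/[\s,;]+/).reject(&:empty?)
end

p tokenize("Kowalski, Jan")  # => ["Kowalski", "Jan"]
p tokenize("Sean O'Connor")  # => ["Sean", "O'Connor"]
```
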
Classification:
- If a token is present in a language's first_names map (i.e. it appears in that language's CSV with a non-empty gender), it is treated as a first name for that language.
- Otherwise the token is treated as a last name.
Language selection:
- If you pass a language: option that names a loaded language, analysis is done only for that language.
- If you do not pass a language, the gem analyzes across all configured languages and picks the language that produced the highest "match" score (based on tokens found).
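
The classification and language-selection rules can be sketched together. This is an illustration under stated assumptions, not the gem's implementation: classify, best_language and the sample data are hypothetical, the score is assumed to be the number of tokens recognized as first names, and lookups here are exact-match only.

```ruby
# Hypothetical in-memory data: language => { first-name token => gender }.
first_names_by_language = {
  pl: { "Jan" => :male, "Anna" => :female },
  en: { "John" => :male, "Mary" => :female }
}

# Tokens found in the language's first-name map are first names;
# every other token is treated as a last name.
def classify(tokens, first_name_map)
  first, last = tokens.partition { |token| first_name_map.key?(token) }
  { first_names: first, last_names: last }
end

# Score each language by how many tokens it recognizes as first names
# (an assumed scoring rule) and return the best one, or nil if nothing matched.
def best_language(tokens, data)
  scores = data.transform_values { |map| classify(tokens, map)[:first_names].size }
  language, score = scores.max_by { |_lang, count| count }
  score.to_i.positive? ? language : nil
end

tokens = ["Jan", "Kowalski"]
language = best_language(tokens, first_names_by_language)
result = classify(tokens, first_names_by_language[language])

p language # => :pl
p result   # => {first_names: ["Jan"], last_names: ["Kowalski"]} (Hash#inspect formatting varies by Ruby version)
```

How the gem breaks ties between languages is not stated in this README; the sketch simply keeps whichever language max_by returns.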
Gender decision:
- Derived from detected first-name tokens' recorded genders (M/F). Majority rule applies; ties or no matches => :unknown.
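
The majority rule can be sketched in a few lines. decide_gender is a hypothetical helper name; it takes the genders recorded for the detected first-name tokens and is only an illustration of the rule described above.

```ruby
# Hypothetical helper: majority vote over the genders recorded for the
# detected first names. Ties and empty input give :unknown.
def decide_gender(genders)
  males = genders.count(:male)
  females = genders.count(:female)
  return :male if males > females
  return :female if females > males

  :unknown
end

p decide_gender([:male])                    # => :male
p decide_gender([:female, :female, :male])  # => :female
p decide_gender([:male, :female])           # => :unknown
p decide_gender([])                         # => :unknown
```
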

API

StaticGenderizer.configure { |c| ... }: sets data_path, languages and case_sensitive, then loads the CSVs.
StaticGenderizer.analyze(name_string, language: nil): returns a StaticGenderizer::Result.
- Result provides first_names, last_names, language, and gender.
- Example: StaticGenderizer.analyze("Anna Nowak")

Testing

Specs are included under spec/. To run:

bundle install
bundle exec rspec

Project layout (relevant)

lib/static_genderizer/*.rb — gem implementation
spec/ — RSpec tests and sample CSV data (spec/data/*.csv)
static_genderizer.gemspec, Gemfile, Rakefile

Notes, limitations & next steps

CSV-only: this gem is designed for static CSV lookup. It does not use external services.
Declension: the current implementation focuses on name splitting and gender detection using xx.csv files. Declension/inflection data (e.g., xx_declination.csv) and morphological support are not implemented; the codebase is structured so that a declension loader could be added and its output included in Result.
Ambiguity & heuristics: heuristics are intentionally simple. For better accuracy you can:
- extend CSVs with frequency scores,
- add a type column (first, last, both),
- add language-specific rules or suffix heuristics.
Case sensitivity is configurable via configuration.case_sensitive.

Contributing

Add CSV data under spec/data for tests or in your own data/ folder when using the gem.
Add specs under spec/ and run bundle exec rspec.
Open pull requests for bug fixes or improvements.

License

MIT. See the LICENSE file.

Contact / support

Open an issue or PR in the repository where this gem is hosted.