static_genderizer
static_genderizer is a small Ruby library that detects probable gender and splits first/last name tokens using static CSV datasets loaded at startup. It's intentionally simple and fast — it uses in-memory lookups from CSV files that you provide.
Key features
- Load static per-language CSV files (name,gender) from a configurable data directory.
- Classify tokens: tokens that have a gender value (M/F) are treated as first names; tokens without gender are treated as last names.
- Analyze a name string and return an object with:
  - first_names: Array of detected first-name tokens (original token casing is preserved)
  - last_names: Array of detected last-name tokens
  - language: the language chosen for the result (symbol) or nil
  - gender: one of :male, :female or :unknown
- Configure the set of languages to load; if no language is requested, the gem searches across all loaded languages and selects the best match.
Installation
Add this gem to your Gemfile (local development):
# Gemfile
gem "static_genderizer"
Or build and install the gem locally:
gem build static_genderizer.gemspec
gem install ./static_genderizer-0.1.0.gem
Configuration & CSV files
- Provide CSV files named xx.csv (where xx is the language code, e.g. pl, en) in a data directory.
- CSV format: the file must have headers and at least two columns: name and gender.
  - name: the name token (e.g. "Anna", "Kowalski")
  - gender: M, F (case-insensitive) or empty
    - non-empty gender => token will be treated as a first name (and the gender recorded)
    - empty gender => token will be treated as a last name
- Example spec/data/pl.csv:
name,gender
Jan,M
Anna,F
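The CSV rules above can be illustrated with a minimal sketch. This is not the gem's actual loader: load_names is a hypothetical helper, and the choice to downcase keys (matching case_sensitive = false) and to skip rows with an unrecognized gender value are assumptions made for the example.

```ruby
require "csv"

# Hypothetical helper (not part of the gem): parse CSV text with
# name,gender headers into a first-name map and a last-name map.
def load_names(csv_text)
  first_names = {} # downcased name => :male or :female
  last_names  = {} # downcased name => true

  CSV.parse(csv_text, headers: true) do |row|
    name = row["name"].to_s.strip
    next if name.empty?

    # Gender is matched case-insensitively; an empty value means "last name".
    case row["gender"].to_s.strip.upcase
    when "M" then first_names[name.downcase] = :male
    when "F" then first_names[name.downcase] = :female
    when ""  then last_names[name.downcase] = true
    end # any other gender value is ignored in this sketch
  end

  { first_names: first_names, last_names: last_names }
end

sample = <<~DATA
  name,gender
  Jan,M
  Anna,f
  Kowalski,
DATA

data = load_names(sample)
puts data[:first_names]["anna"].inspect   # => :female
puts data[:last_names].key?("kowalski")   # => true
```

Keeping two separate hashes makes each lookup a single in-memory hash access, which fits the "static data loaded at startup" design described earlier.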
Quickstart (programmatic)
require "static_genderizer"
# configure and load CSVs
StaticGenderizer.configure do |c|
c.data_path = File.expand_path("data", __dir__) # path containing pl.csv, en.csv ...
c.languages = [:pl, :en] # languages to load
c.case_sensitive = false
end
# analyze a name (language optional)
result = StaticGenderizer.analyze("Jan Kowalski", language: :pl)
puts result.first_names.inspect # => ["Jan"]
puts result.last_names.inspect # => ["Kowalski"]
puts result.language # => :pl
puts result.gender # => :male
Behavior & heuristics
- Tokenization: input string is split on whitespace and punctuation (commas/semicolons). Apostrophes are preserved (e.g., "O'Connor").
- Classification:
  - If a token is present in a language's first_names map (i.e. found in the CSV with a non-empty gender), it is treated as a first name for that language.
  - Otherwise the token is treated as a last name.
- Language selection:
  - If you pass a language: that is loaded, analysis is done only for that language.
  - If you do not pass a language, the gem analyzes across all configured languages and picks the language that produced the highest "match" score (based on tokens found).
- Gender decision:
  - Derived from detected first-name tokens' recorded genders (M/F). Majority rule applies; ties or no matches => :unknown.
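The heuristics above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the gem's implementation: FIRST_NAMES, tokenize, analyze_for and analyze are hypothetical names, lookups are assumed to be case-insensitive, and the "match" score is assumed to be the number of first-name tokens found.

```ruby
# Hypothetical in-memory data standing in for maps loaded from pl.csv / en.csv.
FIRST_NAMES = {
  pl: { "jan" => :male, "anna" => :female },
  en: { "john" => :male, "anna" => :female }
}.freeze

# Split on whitespace, commas and semicolons; apostrophes stay inside tokens.
def tokenize(name)
  name.split(/[\s,;]+/).reject(&:empty?)
end

# Classify tokens for one language, then apply the majority rule.
def analyze_for(tokens, lang)
  names = FIRST_NAMES.fetch(lang) # raises KeyError if the language is not loaded
  first, last = tokens.partition { |t| names.key?(t.downcase) }

  counts  = first.map { |t| names[t.downcase] }.tally
  males   = counts.fetch(:male, 0)
  females = counts.fetch(:female, 0)

  gender = :unknown # ties and "no first names" both stay :unknown
  gender = :male if males > females
  gender = :female if females > males

  { first_names: first, last_names: last, language: lang, gender: gender }
end

# With no language given, try every language and keep the best-scoring result.
# Assumption: score = number of first-name tokens matched.
def analyze(name, language: nil)
  tokens = tokenize(name)
  langs  = language ? [language] : FIRST_NAMES.keys
  langs.map { |l| analyze_for(tokens, l) }.max_by { |r| r[:first_names].size }
end

puts analyze("Jan Kowalski", language: :pl)[:gender].inspect  # => :male
puts analyze("John Smith")[:language].inspect                 # => :en
```

Unlike the Result described above, this sketch always reports a language, even when no token matched; how the gem decides to return nil is not covered here.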
API
- StaticGenderizer.configure { |c| ... } — configure data_path, languages, case_sensitive and load CSVs.
- StaticGenderizer.analyze(name_string, language: nil) => returns a StaticGenderizer::Result
  - Result provides first_names, last_names, language, and gender.
  - Example: StaticGenderizer.analyze("Anna Nowak")
Testing
Specs are included under spec/. To run:
bundle install
bundle exec rspec
Project layout (relevant)
- lib/static_genderizer/*.rb — gem implementation
- spec/ — RSpec tests and sample CSV data (spec/data/*.csv)
- static_genderizer.gemspec, Gemfile, Rakefile
Notes, limitations & next steps
- CSV-only: this gem is designed for static CSV lookup. It does not use external services.
- Declension: the current implementation focuses on name splitting and gender detection using xx.csv files. If you need declension/inflection data (e.g., xx_declination.csv) or morphological support, this can be added: the codebase is structured so that a declinations loader and its inclusion in Result can be implemented.
- Ambiguity & heuristics: heuristics are intentionally simple. For better accuracy you can:
  - extend CSVs with frequency scores,
  - add a type column (first, last, both),
  - add language-specific rules or suffix heuristics.
- Case sensitivity is configurable via configuration.case_sensitive.
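As an illustration of the type column idea, the sketch below shows one way such a file could be read. Nothing here exists in the gem today: the name,gender,type layout, the load_typed helper and the index structure are all assumptions made for the example.

```ruby
require "csv"

# Hypothetical extended format: type is one of first, last, both.
EXTENDED = <<~DATA
  name,gender,type
  Anna,F,first
  Kowalski,,last
  Martin,M,both
DATA

# Hypothetical loader: a token marked "both" lands in both lookups,
# so it can be treated as a first name or a last name depending on context.
def load_typed(csv_text)
  index = { first: {}, last: {} }
  genders = { "M" => :male, "F" => :female }

  CSV.parse(csv_text, headers: true) do |row|
    key  = row["name"].to_s.strip.downcase
    type = row["type"].to_s.strip.downcase
    next if key.empty?

    index[:first][key] = genders[row["gender"].to_s.strip.upcase] if %w[first both].include?(type)
    index[:last][key]  = true if %w[last both].include?(type)
  end

  index
end

index = load_typed(EXTENDED)
puts index[:first]["martin"].inspect  # => :male
puts index[:last].key?("martin")      # => true
```

With an explicit type, classification no longer has to infer "last name" from an empty gender, which is what makes ambiguous tokens such as "Martin" representable.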
Contributing
- Add CSV data under spec/data for tests, or in your own data/ folder when using the gem.
- Add specs under spec/ and run bundle exec rspec.
- Open pull requests for bug fixes or improvements.
License
MIT. See the LICENSE file.
Contact / support
Open an issue or PR in the repository where this gem is hosted.