eiwa / 英和

Parses the Japanese-English version of JMDict, a daily export of the WWWJDIC online Japanese dictionary.

Usage

Install

Install the gem:

gem install eiwa

Or add it to your Gemfile:

gem 'eiwa'

Download a supported dictionary

Get your hands on a supported dictionary. Right now eiwa only parses JMDict, which can be fetched from the Monash ftp site or with a script like this, for the Japanese-English export:

curl http://ftp.monash.edu/pub/nihongo/JMdict_e -o jmdict.xml

This file is updated daily, and is essentially an export of all vocabulary on the WWWJDIC application

Parse the dictionary

The eiwa gem implements an evented SAX parser via nokogiri to efficiently work through the very large XML file, as loading a full DOM into memory is very resource-intensive. In consideration of this, eiwa's parsing method provides two modes, one that will return every dictionary entry in an array and one that will invoke a provided block with each entry, but which won't retain a reference to the entries, allowing Ruby to garbage collect them as it goes.

Parsing the dictionary is CPU intensive, and takes about 13 seconds on my 2019 13" MacBook Pro.

Passing a block

If you just want to do some processing on each entry, it probably makes sense to invoke the library by passing a block

Eiwa.parse_file("path/to/some.xml", type: :jmdict_e) do |entry|
  # Do something with that entry
end

This approach can parse the entire JMDICT-E dictionary in a 15MB Ruby 2.6 process.

Return the results in an array

If you're just going to add all the entries to an array or otherwise retain them in memory, you can call the same method without a block, and it will return all the entries in an array.

entries = Eiwa.parse_file("path/to/some.xml", type: :jmdict_e)

Note that for the abridged Japanese-English dictionary, this will consume about 500MB of RAM.