Method: Extcite.extract_from_metadata

Defined in: lib/extcite.rb

.extract_from_metadata(path:) ⇒ `Object`

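Try to extract DOIs from the XMP metadata of one or more PDF files. If the metadata contains no DOI, the method falls back to a regex search of the PDF's text. As written in the source below, the `return` sits inside the loop, so the result for the first path is what gets returned.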

Return: DOI string

Examples:

require 'extcite'
require 'faraday'
# download a paper in PDF format
path = '2068.pdf'
res = Faraday.get('https://peerj.com/articles/2068.pdf')
# write the response body in binary mode so the PDF is not corrupted
File.binwrite(path, res.body)
# extract the DOI from the PDF
Extcite.extract_from_metadata(path: path)

Parameters:

path (String) —

Path to a PDF file, or a folder of PDF files
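
The `path` argument accepts either a single file or a folder. The `make_paths` helper that handles this is not shown on this page, so the following is only a minimal sketch of how such an expansion could work. `expand_pdf_paths` is a hypothetical name and is not part of extcite.

```ruby
# Hypothetical helper (not extcite's make_paths): turn a file-or-folder
# argument into a list of PDF paths.
def expand_pdf_paths(path)
  if File.directory?(path)
    # A folder: collect the PDF files directly inside it, in a stable order.
    Dir.glob(File.join(path, '*.pdf')).sort
  else
    # A single file: wrap it so callers can always iterate.
    [path]
  end
end
```

Returning an array in both cases lets the caller iterate with `each` without checking which kind of argument it received.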

# File 'lib/extcite.rb', line 137

def self.extract_from_metadata(path:)
  path = make_paths(path)
  path.each do |x|
    # try PDF metadata first
    ids = nil
    rr = PDF::Reader.new(x)
    pdfmeta = rr.metadata
    if !pdfmeta.nil?
      begin
        xml = Oga.parse_xml(pdfmeta);
      rescue Exception => e
        xml = nil
      end

      if !xml.nil?
        begin
          tt = xml.xpath('//rdf:Description')
          # try dc:identifier attribute
          ss = tt.attr('dc:identifier')[0]
          if !ss.nil?
            ids = ss.text.sub(/doi:/, '')
          else
            # try prism:doi node
            pdoi = xml.xpath('//rdf:Description//prism:doi')
            if pdoi.length == 1
              ids = pdoi.text
            else
              # try pdf:WPS-ARTICLEDOI node
              wpsdoi = xml.xpath('//rdf:Description//pdf:WPS-ARTICLEDOI')
              if wpsdoi.length == 1
                ids = wpsdoi.text
              else
                # try pdfx:WPS-ARTICLEDOI node
                pdfxwpsdoi = xml.xpath('//rdf:Description//pdfx:WPS-ARTICLEDOI')
                if pdfxwpsdoi.length == 1
                  ids = pdfxwpsdoi.text
                else
                  ids = nil
                end
              end
            end
          end
        rescue
          ids = nil
        end
      end
    end

    # if not found, try regexing for DOI
    if ids.nil?
      ids = Extcite.get_ids(txt: Extcite.extract_text_one(x))
    end

    return ids
  end
end
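
When no DOI is found in the metadata, the method falls back to `Extcite.get_ids`, whose implementation is not shown here. As an illustration of what a regex-based fallback might look like, here is a hedged sketch. The `find_dois` name and the pattern are assumptions made for this example, not the library's actual code, and the sample DOI is made up.

```ruby
# Hypothetical stand-in for a regex-based DOI search (not Extcite.get_ids).
def find_dois(text)
  # Assumed pattern: "10.", a 4-9 digit registrant code, a slash, then a
  # suffix that runs until whitespace, a quote, or an angle bracket.
  pattern = %r{\b10\.\d{4,9}/[^\s"<>]+}
  text.scan(pattern)
      .map { |doi| doi.sub(/[.,;)]+\z/, '') } # drop trailing punctuation
      .uniq
end

find_dois('Cite as doi:10.1234/example.5678. See also 10.1234/example.5678')
```

The trailing-punctuation cleanup matters because a DOI at the end of a sentence would otherwise pick up the full stop. Real DOI suffixes are loosely specified, so any such pattern is a heuristic.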
