Class: Stevedore::StevedoreBlob

Inherits: Object

Defined in: lib/parsers/stevedore_blob.rb

Direct Known Subclasses

StevedoreCsvRow, StevedoreEmail, StevedoreHTML

Instance Attribute Summary

#download_url ⇒ Object

Returns the value of attribute download_url.

#extra ⇒ Object

Returns the value of attribute extra.

#text ⇒ Object

Returns the value of attribute text.

#title ⇒ Object

Returns the value of attribute title.

Class Method Summary

.new_from_tika(content, metadata, download_url, filename) ⇒ Object

Instance Method Summary

#analyze! ⇒ Object

#clean_text ⇒ Object

#initialize(title, text, download_url = nil, extra = {}) ⇒ StevedoreBlob (constructor)

A new instance of StevedoreBlob.

#to_hash ⇒ Object

Constructor Details

#initialize(title, text, download_url = nil, extra = {}) ⇒ `StevedoreBlob`

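Returns a new instance of StevedoreBlob. If title is nil, download_url is used as the title.

Raises:

(ArgumentError) if extra is not empty; support for extra is not yet implemented.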

# File 'lib/parsers/stevedore_blob.rb', line 7

def initialize(title, text, download_url=nil, extra={})
  self.title = title || download_url
  self.text = text
  self.download_url = download_url
  self.extra = extra
  raise ArgumentError, "StevedoreBlob extra support not yet implemented" if extra.keys.size > 0
end
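
The two behaviors of the constructor, the title fallback and the guard on extra, can be tried out with a small stand-in. This is only a sketch: BlobSketch is a hypothetical class that copies the constructor shown above and nothing else, and the URL is a placeholder.

```ruby
# Hypothetical stand-in that mirrors only the documented constructor.
class BlobSketch
  attr_accessor :title, :text, :download_url, :extra

  def initialize(title, text, download_url = nil, extra = {})
    self.title = title || download_url # a nil title falls back to the URL
    self.text = text
    self.download_url = download_url
    self.extra = extra
    raise ArgumentError, "extra support not yet implemented" if extra.keys.size > 0
  end
end

blob = BlobSketch.new(nil, "body text", "https://example.com/a.txt")
puts blob.title # prints https://example.com/a.txt

begin
  BlobSketch.new("t", "body", nil, { "Content-Type" => "text/html" })
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Because the guard runs after the assignments, any non-empty extra hash is rejected regardless of the other arguments.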

Instance Attribute Details

#download_url ⇒ `Object`

Returns the value of attribute download_url.



# File 'lib/parsers/stevedore_blob.rb', line 6

def download_url
  @download_url
end

#extra ⇒ `Object`

Returns the value of attribute extra.



# File 'lib/parsers/stevedore_blob.rb', line 6

def extra
  @extra
end

#text ⇒ `Object`

Returns the value of attribute text.



# File 'lib/parsers/stevedore_blob.rb', line 6

def text
  @text
end

#title ⇒ `Object`

Returns the value of attribute title.



# File 'lib/parsers/stevedore_blob.rb', line 6

def title
  @title
end

Class Method Details

.new_from_tika(content, metadata, download_url, filename) ⇒ `Object`



# File 'lib/parsers/stevedore_blob.rb', line 19

def self.new_from_tika(content, metadata, download_url, filename)
  # Prefer Tika's title unless it is missing or the "Untitled" placeholder.
  title = metadata["title"]
  title = File.basename(filename) if !title || title == "Untitled"
  new(title, content, download_url)
end
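
The title rule in new_from_tika can be looked at on its own. The sketch below isolates it in a hypothetical helper, tika_title, which is not part of Stevedore; the metadata hashes and file path are made-up inputs.

```ruby
# Hypothetical helper isolating the title rule used by new_from_tika:
# use metadata["title"] unless it is missing or equal to "Untitled",
# otherwise use the base name of the file.
def tika_title(metadata, filename)
  title = metadata["title"]
  (title && title != "Untitled") ? title : File.basename(filename)
end

puts tika_title({ "title" => "Annual Report" }, "/docs/report.pdf") # prints Annual Report
puts tika_title({ "title" => "Untitled" }, "/docs/report.pdf")      # prints report.pdf
puts tika_title({}, "/docs/report.pdf")                             # prints report.pdf
```

Note that the metadata argument is not passed on to the constructor, so only the title derived from it reaches the new blob.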

Instance Method Details

#analyze! ⇒ `Object`

# File 'lib/parsers/stevedore_blob.rb', line 23

def analyze!
  # Probably a no-op for plain blobs.
  # For HTML, this should perform boilerplate extraction.
end

#clean_text ⇒ `Object`



# File 'lib/parsers/stevedore_blob.rb', line 15

def clean_text
  @clean_text ||= text.gsub(/<\/?[^>]+>/, '') # removes all tags
end
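
To see what the pattern in clean_text does and does not remove, here is a minimal sketch that applies the same regular expression to a few sample strings. The helper name strip_tags and the inputs are assumptions made for illustration.

```ruby
# Same pattern as clean_text: remove anything shaped like <...> or </...>.
TAG_PATTERN = /<\/?[^>]+>/

# Hypothetical helper wrapping the gsub call from clean_text.
def strip_tags(text)
  text.gsub(TAG_PATTERN, "")
end

puts strip_tags("<p>Hello <b>world</b></p>") # prints Hello world

# Entities are left as they are; only angle-bracket tags are removed.
puts strip_tags("a &amp; b<br/>")            # prints a &amp; b

# Caveat: a bare "<" followed later by ">" is treated as a tag too,
# so the text between them is dropped.
puts strip_tags("1 < 2 and 3 > 2").inspect
```

The real method also memoizes its result in @clean_text, so changing text after the first call to clean_text does not change what clean_text returns.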

#to_hash ⇒ `Object`

# File 'lib/parsers/stevedore_blob.rb', line 28

def to_hash
  # download_url must be a String; Digest::SHA1.hexdigest raises on nil.
  sha = Digest::SHA1.hexdigest(download_url)
  # TODO should merge in or something?
  # NOTE: title.to_s is always truthy, so the "Untitled Document" fallback
  # in the two title entries below is never reached as written.
  {
    "sha1" => sha,
    "id" => sha,
    "_id" => sha,
    "title" => title.to_s || "Untitled Document: #{HumanHash::HumanHasher.new.humanize(sha)}",
    "source_url" => download_url.to_s,
    "file" => {
      "title" => title.to_s || "Untitled Document: #{HumanHash::HumanHasher.new.humanize(sha)}",
      "file" => clean_text.to_s
    },
    "analyzed" => {
      "body" => clean_text.to_s,
      "metadata" => {
        "Content-Type" => extra["Content-Type"] || "text/plain"
      }
    },
    "_updatedAt" => Time.now
  }
end
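
Two details of to_hash are easy to check in isolation: the document id is the SHA-1 hex digest of download_url, and the `title.to_s || ...` expression always yields `title.to_s`. The sketch below uses only Ruby's standard library; the helper names document_id and title_or_fallback are hypothetical, the URL is a placeholder, and the HumanHash call from the original is replaced by a plain string.

```ruby
require "digest/sha1"

# Hypothetical helper: the id used for "sha1", "id" and "_id".
def document_id(download_url)
  Digest::SHA1.hexdigest(download_url)
end

# Hypothetical helper reproducing the shape of the title expression.
# to_s never returns nil or false, so the right-hand side is never used.
def title_or_fallback(title)
  title.to_s || "Untitled Document"
end

id = document_id("https://example.com/a.txt")
puts id.length                   # prints 40
puts id == document_id("https://example.com/a.txt") # prints true

p title_or_fallback(nil)         # prints ""
p title_or_fallback("Report")    # prints "Report"
```

Since the id depends only on the URL, two blobs built from the same download_url map to the same id, whatever their text.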
