Class: Spacy::Doc

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/ruby-spacy.rb

Overview

See also the spaCy Python API documentation for Doc.

Constructor Details

#initialize(nlp_id, text) ⇒ Doc

Creates a new instance of Spacy::Doc.

Parameters:

  • nlp_id (String)

    The identifier string of the nlp object, an instance of the Language class

  • text (String)

    The text string to be analyzed



# File 'lib/ruby-spacy.rb', line 285

def initialize(nlp_id, text)
  @text = text
  @spacy_nlp_id = nlp_id
  @spacy_doc_id = "doc_#{text.object_id}"
  quoted = text.gsub('"', '\"')
  PyCall.exec(%Q[text_#{text.object_id} = """#{quoted}"""])
  PyCall.exec("#{@spacy_doc_id} = #{nlp_id}(text_#{text.object_id})")
  @py_doc = PyCall.eval(@spacy_doc_id)
end
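
The constructor assembles two Python statements as strings before handing them to PyCall. The sketch below reproduces only that string-building step in plain Ruby, so it runs without PyCall or spaCy. The helper name `python_statements_for` and the explicit `object_id` argument are illustrative assumptions, not part of ruby-spacy.

```ruby
# Hypothetical helper (not part of ruby-spacy): returns the two Python
# statements that the constructor would pass to PyCall.exec.
def python_statements_for(nlp_id, text, object_id)
  # Escape double quotes so the text can sit inside a Python """...""" literal.
  # The block form of gsub inserts the replacement verbatim.
  quoted = text.gsub('"') { '\"' }
  assign_text = %Q[text_#{object_id} = """#{quoted}"""]
  create_doc  = "doc_#{object_id} = #{nlp_id}(text_#{object_id})"
  [assign_text, create_doc]
end

statements = python_statements_for("nlp_1", 'She said "hi"', 42)
puts statements
```

Only double quotes are escaped here, as in the source above; text containing backslashes would probably need additional handling.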

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args) ⇒ Object

Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.



# File 'lib/ruby-spacy.rb', line 430

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
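
The delegation pattern can be tried without Python by wrapping a plain Ruby object. In this minimal sketch, `FakePyDoc` and `DocLike` are hypothetical stand-ins for the PyCall proxy and for Spacy::Doc. Unlike the source above, the sketch also defines `respond_to_missing?` and falls back to `super` for unknown names, which is a common Ruby convention rather than something ruby-spacy is known to do.

```ruby
# Stand-in for the wrapped Python object; in ruby-spacy this is a PyCall proxy.
class FakePyDoc
  def text_with_ws
    "Hello world "
  end
end

# Minimal wrapper that forwards unknown methods to the wrapped object.
class DocLike
  def initialize(py_doc)
    @py_doc = py_doc
  end

  def method_missing(name, *args)
    # Assumption: raise NoMethodError for names the wrapped object lacks.
    return super unless @py_doc.respond_to?(name)

    @py_doc.send(name, *args)
  end

  # Keeps respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @py_doc.respond_to?(name) || super
  end
end

doc = DocLike.new(FakePyDoc.new)
puts doc.text_with_ws
```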

Instance Attribute Details

#py_doc ⇒ Object (readonly)

Returns a Python Doc instance accessible via PyCall.

Returns:

  • (Object)

    a Python Doc instance accessible via PyCall



# File 'lib/ruby-spacy.rb', line 271

def py_doc
  @py_doc
end

#spacy_doc_id ⇒ String (readonly)

Returns an identifier string that can be used to refer to the Python Doc object inside PyCall::exec or PyCall::eval.

Returns:

  • (String)

    an identifier string that can be used to refer to the Python Doc object inside PyCall::exec or PyCall::eval



# File 'lib/ruby-spacy.rb', line 268

def spacy_doc_id
  @spacy_doc_id
end

#spacy_nlp_id ⇒ String (readonly)

Returns an identifier string that can be used to refer to the Python Language (nlp) object inside PyCall::exec or PyCall::eval.

Returns:

  • (String)

    an identifier string that can be used to refer to the Python Language (nlp) object inside PyCall::exec or PyCall::eval



# File 'lib/ruby-spacy.rb', line 265

def spacy_nlp_id
  @spacy_nlp_id
end

#text ⇒ String (readonly)

Returns the text string of the document.

Returns:

  • (String)

    the text string of the document



# File 'lib/ruby-spacy.rb', line 274

def text
  @text
end

Instance Method Details

#[](range) ⇒ Object

Returns a span if given a range object; returns a token if given an integer representing a position in the doc.

Parameters:

  • range (Range)

    an ordinary Ruby range object such as 0..3, 1...4, or 3..-1, or an integer position in the doc



# File 'lib/ruby-spacy.rb', line 405

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    return Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    return Token.new(@py_doc[range])
  end
end
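
The source above passes `py_span.end - 1` to Span.new because Python spans report an exclusive end, while Span takes an inclusive `end_index`. The following sketch shows the resulting bounds for the three example ranges, using a plain array of positions instead of a real doc. The helper `inclusive_bounds` is hypothetical and exists only for illustration.

```ruby
# Hypothetical helper: resolve a Ruby Range against a doc of `size` tokens
# and return [start_index, end_index], where end_index is inclusive.
def inclusive_bounds(range, size)
  positions = (0...size).to_a[range]
  [positions.first, positions.last]
end

p inclusive_bounds(0..3, 6)   # inclusive range
p inclusive_bounds(1...4, 6)  # exclusive range
p inclusive_bounds(3..-1, 6)  # negative end counts from the back
```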

#displacy(style: "dep", compact: false) ⇒ String

Visualize the document in one of two styles: dep (dependencies) or ent (named entities).

Parameters:

  • style (String) (defaults to: "dep")

    Either dep or ent

  • compact (Boolean) (defaults to: false)

    Only relevant to the dep style

Returns:

  • (String)

    SVG markup for the dep style, or HTML for the ent style



# File 'lib/ruby-spacy.rb', line 425

def displacy(style: "dep", compact: false)
  PyCall.eval("displacy.render(#{@spacy_doc_id}, style='#{style}', options={'compact': #{compact.to_s.capitalize}}, jupyter=False)")
end
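
The method works by interpolating Ruby values into a Python expression, and `compact.to_s.capitalize` is what turns Ruby's `true`/`false` into Python's `True`/`False`. A minimal sketch of just the string construction, with the hypothetical helper name `displacy_expression` and a made-up doc identifier:

```ruby
# Hypothetical helper: builds the Python expression that displacy would
# evaluate, without calling PyCall.
def displacy_expression(doc_id, style: "dep", compact: false)
  python_bool = compact.to_s.capitalize # true -> "True", false -> "False"
  "displacy.render(#{doc_id}, style='#{style}', " \
    "options={'compact': #{python_bool}}, jupyter=False)"
end

puts displacy_expression("doc_1")
puts displacy_expression("doc_1", style: "ent", compact: true)
```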

#each ⇒ Object

Iterates over the tokens in the doc, yielding a Token instance for each.



# File 'lib/ruby-spacy.rb', line 344

def each
  PyCall::List.(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end
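
Because the class includes Enumerable, defining `each` is enough to make methods such as `map`, `select`, and `first` available. The sketch below demonstrates this with stand-in classes: `TokenLike` and `TokenSequence` are hypothetical, and the `enum_for` guard is an addition that the source above does not have.

```ruby
# Stand-in for Spacy::Token, holding only the token text.
TokenLike = Struct.new(:text)

# Stand-in for Spacy::Doc: wraps each element and yields it from #each.
class TokenSequence
  include Enumerable

  def initialize(words)
    @words = words
  end

  def each
    # Assumption: return an Enumerator when no block is given.
    return enum_for(:each) unless block_given?

    @words.each { |word| yield TokenLike.new(word) }
  end
end

seq = TokenSequence.new(%w[I like Ruby])
puts seq.map(&:text).join(" | ")
puts seq.select { |token| token.text.length > 1 }.map(&:text).inspect
puts seq.first.text
```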

#ents ⇒ Array<Span>

Returns an array of spans representing named entities.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 394

def ents
  # so that ents can be "each"-ed in Ruby
  ent_array = []
  PyCall::List.(@py_doc.ents).each do |ent|
    ent_array << ent
  end
  ent_array
end

#noun_chunks ⇒ Array<Span>

Returns an array of spans representing noun chunks.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 372

def noun_chunks
  chunk_array = []
  py_chunks = PyCall::List.(@py_doc.noun_chunks)
  py_chunks.each do |py_chunk|
    chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
  chunk_array
end

#retokenize(start_index, end_index, attributes = {}) ⇒ Object

Retokenizes the text merging a span into a single token.

Parameters:

  • start_index (Integer)

    The start position of the span to be retokenized in the document

  • end_index (Integer)

    The end position (inclusive) of the span to be retokenized in the document

  • attributes (Hash) (defaults to: {})

    Attributes to set on the merged token



# File 'lib/ruby-spacy.rb', line 300

def retokenize(start_index, end_index, attributes = {})
  py_attrs = PyCall::Dict.(attributes)
  PyCall.exec(<<PY)
with #{@spacy_doc_id}.retokenize() as retokenizer:
    retokenizer.merge(#{@spacy_doc_id}[#{start_index} : #{end_index + 1}], attrs=#{py_attrs})
PY
  @py_doc = PyCall.eval(@spacy_doc_id)
end
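
Two details of the generated Python matter here: the body of the `with` block must be indented, and the inclusive Ruby `end_index` is converted to Python's exclusive slice end by adding 1. The sketch below builds the same snippet as a string so it can be inspected without PyCall. The helper `merge_snippet` is hypothetical, and the attributes are passed in as an already-rendered Python literal, which is an assumption made for this illustration.

```ruby
# Hypothetical helper: returns the Python source that a merge would execute.
# attrs_literal is assumed to be a ready-made Python dict literal.
def merge_snippet(doc_id, start_index, end_index, attrs_literal = "{}")
  # The squiggly heredoc strips the common leading indentation but keeps
  # the extra four spaces that Python requires inside the with block.
  <<~PY
    with #{doc_id}.retokenize() as retokenizer:
        retokenizer.merge(#{doc_id}[#{start_index} : #{end_index + 1}], attrs=#{attrs_literal})
  PY
end

puts merge_snippet("doc_1", 2, 3, "{'LEMMA': 'New York'}")
```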

#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object

Retokenizes the text splitting the specified token.

Parameters:

  • pos_in_doc (Integer)

    The position of the token to be split in the document

  • split_array (Array<String>)

    The text strings of the split results

  • head_pos_in_split (Integer)

    The position of the head element within the split elements

  • ancestor_pos (Integer)

    The position of the immediate ancestor element of the split elements in the document

  • attributes (Hash) (defaults to: {})

    The attributes of the split elements



# File 'lib/ruby-spacy.rb', line 314

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  py_attrs = PyCall::Dict.(attributes)
  py_split_array = PyCall::List.(split_array)
  PyCall.exec(<<PY)
with #{@spacy_doc_id}.retokenize() as retokenizer:
    heads = [(#{@spacy_doc_id}[#{pos_in_doc}], #{head_pos_in_split}), #{@spacy_doc_id}[#{ancestor_pos}]]
    attrs = #{py_attrs}
    split_array = #{py_split_array}
    retokenizer.split(#{@spacy_doc_id}[#{pos_in_doc}], split_array, heads=heads, attrs=attrs)
PY
  @py_doc = PyCall.eval(@spacy_doc_id)
end

#sents ⇒ Array<Span>

Returns an array of spans representing sentences.

Returns:

  • (Array<Span>)



# File 'lib/ruby-spacy.rb', line 383

def sents
  sentence_array = []
  py_sentences = PyCall::List.(@py_doc.sents)
  py_sentences.each do |py_sent|
    sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
  sentence_array
end

#similarity(other) ⇒ Float

Returns a semantic similarity estimate.

Parameters:

  • other (Doc)

    the other doc to which a similarity estimation is made

Returns:

  • (Float)


# File 'lib/ruby-spacy.rb', line 417

def similarity(other)
  PyCall.eval("#{@spacy_doc_id}.similarity(#{other.spacy_doc_id})")
end

#span(range_or_start, optional_size = nil) ⇒ Span

Returns a span of the specified range within the doc. The method can be called in either of two ways: Doc#span(range) or Doc#span(start_pos, size_of_span).

Parameters:

  • range_or_start (Range, Integer)

    A range object, or, alternatively, an integer that represents the start position of the span

  • optional_size (Integer) (defaults to: nil)

    An integer representing the size of the span

Returns:

  • (Span)



# File 'lib/ruby-spacy.rb', line 355

def span(range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    temp = tokens[start_index ... start_index + optional_size]
  else
    start_index = range_or_start.first
    range = range_or_start
    temp = tokens[range]
  end

  end_index = start_index + temp.size - 1

  Span.new(self, start_index: start_index, end_index: end_index)
end
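
The index arithmetic for the two call forms can be checked on a plain array. This sketch mirrors the logic above but returns the computed bounds instead of a Span. The helper `span_bounds` and the sample tokens are hypothetical.

```ruby
# Hypothetical helper mirroring the index arithmetic of Doc#span.
# Returns [start_index, end_index] with an inclusive end.
def span_bounds(tokens, range_or_start, optional_size = nil)
  if optional_size
    # Form 2: start position plus size
    start_index = range_or_start
    selected = tokens[start_index...start_index + optional_size]
  else
    # Form 1: a Range
    start_index = range_or_start.first
    selected = tokens[range_or_start]
  end
  [start_index, start_index + selected.size - 1]
end

tokens = %w[Give it back ! He pleaded .]
p span_bounds(tokens, 2..4)  # range form
p span_bounds(tokens, 1, 3)  # start and size form
p span_bounds(tokens, 4..-1) # range with a negative end
```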

#to_s ⇒ String

String representation of the document.

Returns:

  • (String)


# File 'lib/ruby-spacy.rb', line 329

def to_s
  @text
end

#tokens ⇒ Array<Token>

Returns an array of tokens contained in the doc.

Returns:

  • (Array<Token>)



# File 'lib/ruby-spacy.rb', line 335

def tokens
  results = []
  PyCall::List.(@py_doc).each do |py_token|
    results << Token.new(py_token)
  end
  results
end