Method: Aws::Textract::Client#get_document_analysis

Defined in:
lib/aws-sdk-textract/client.rb

#get_document_analysis(params = {}) ⇒ Types::GetDocumentAnalysisResponse

Gets the results for an Amazon Textract asynchronous operation that analyzes text in a document.

You start asynchronous text analysis by calling `StartDocumentAnalysis`, which returns a job identifier (`JobId`). When the text analysis operation finishes, Amazon Textract publishes a completion status to the Amazon Simple Notification Service (Amazon SNS) topic that's registered in the initial call to `StartDocumentAnalysis`. To get the results of the text analysis operation, first check that the status value published to the Amazon SNS topic is `SUCCEEDED`. If so, call `GetDocumentAnalysis`, and pass the job identifier (`JobId`) from the initial call to `StartDocumentAnalysis`.
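The status check described above can be sketched as follows. This is a minimal illustration, not part of the SDK: `fetch_analysis_if_succeeded` is a hypothetical helper, `client` is assumed to be any object that responds to `get_document_analysis` (such as an `Aws::Textract::Client`), and `sns_status` is assumed to be the status string you have already extracted from the Amazon SNS notification.

```ruby
# Hypothetical helper (not part of the SDK).
# `client` is assumed to respond to #get_document_analysis,
# for example an Aws::Textract::Client instance.
# `sns_status` is assumed to be the status string taken from the
# Amazon SNS completion notification.
def fetch_analysis_if_succeeded(client, job_id:, sns_status:)
  # Only request results once the published status is SUCCEEDED.
  return nil unless sns_status == "SUCCEEDED"

  client.get_document_analysis({ job_id: job_id })
end
```

The helper returns `nil` for any other status, so the caller can decide whether to wait, retry, or report a failure.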

`GetDocumentAnalysis` returns an array of `Block` objects. The following types of information are returned:

  • Form data (key-value pairs). The related information is returned in two `Block` objects, each of type `KEY_VALUE_SET`: a KEY `Block` object and a VALUE `Block` object. For example, *Name: Ana Silva Carolina* contains a key and value. *Name:* is the key. *Ana Silva Carolina* is the value.

  • Table and table cell data. A TABLE `Block` object contains information about a detected table. A CELL `Block` object is returned for each cell in a table.

  • Lines and words of text. A LINE `Block` object contains one or more WORD `Block` objects. All lines and words that are detected in the document are returned (including text that doesn't have a relationship with the value of the `StartDocumentAnalysis` `FeatureTypes` input parameter).

  • Query. A QUERY `Block` object contains the query text, the alias, and a link to the associated QUERY_RESULT `Block` object.

  • Query Results. A QUERY_RESULT `Block` object contains the answer to the query and an ID that connects it to the query asked. This `Block` also contains a confidence score.
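As an illustration of how the KEY and VALUE blocks described above relate to each other, here is a minimal sketch that pairs them into a Hash. The helper names `child_text` and `form_fields` are hypothetical. The sketch assumes each block exposes the readers listed in the response structure (`id`, `block_type`, `text`, `entity_types`, `relationships`), that a KEY block points to its VALUE block through a `VALUE` relationship, and that the text of either block is carried by WORD blocks linked through `CHILD` relationships.

```ruby
# Hypothetical helpers (not part of the SDK).

# Join the text of the WORD blocks that `block` links to via CHILD
# relationships. `by_id` maps block IDs to blocks.
def child_text(block, by_id)
  (block.relationships || [])
    .select { |rel| rel.type == "CHILD" }
    .flat_map(&:ids)
    .map { |id| by_id[id] }
    .compact
    .select { |child| child.block_type == "WORD" }
    .map(&:text)
    .join(" ")
end

# Return a Hash of key text => value text for every KEY block.
def form_fields(blocks)
  by_id = blocks.to_h { |b| [b.id, b] }

  keys = blocks.select do |b|
    b.block_type == "KEY_VALUE_SET" && (b.entity_types || []).include?("KEY")
  end

  keys.to_h do |key|
    # Follow the VALUE relationship from the KEY block to its VALUE block(s).
    value_ids = (key.relationships || [])
                .select { |rel| rel.type == "VALUE" }
                .flat_map(&:ids)
    value_text = value_ids.map { |id| by_id[id] }
                          .compact
                          .map { |value| child_text(value, by_id) }
                          .join(" ")
    [child_text(key, by_id), value_text]
  end
end
```

Because blocks reference each other only by ID, the lookup table must be built from all blocks of the job, which may span several pages of results.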

Note: While processing a document with queries, look out for `INVALID_REQUEST_PARAMETERS` output. This indicates that either the per-page query limit has been exceeded or the operation is trying to query a page that doesn't exist in the document.

Selection elements such as check boxes and option buttons (radio buttons) can be detected in form data and in tables. A SELECTION_ELEMENT `Block` object contains information about a selection element, including the selection status.
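A minimal sketch of reading the selection status from SELECTION_ELEMENT blocks is shown here. `selection_states` is a hypothetical helper, and `blocks` is assumed to be an array of objects exposing the `id`, `block_type`, `page`, and `selection_status` readers listed in the response structure.

```ruby
# Hypothetical helper (not part of the SDK).
# Returns one Hash per SELECTION_ELEMENT block with its ID, page number,
# and whether it was detected as selected.
def selection_states(blocks)
  blocks
    .select { |b| b.block_type == "SELECTION_ELEMENT" }
    .map do |b|
      { id: b.id, page: b.page, selected: b.selection_status == "SELECTED" }
    end
end
```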

Use the `MaxResults` parameter to limit the number of blocks that are returned. If there are more results than specified in `MaxResults`, the value of `NextToken` in the operation response contains a pagination token for getting the next set of results. To get the next page of results, call `GetDocumentAnalysis`, and populate the `NextToken` request parameter with the token value that's returned from the previous call to `GetDocumentAnalysis`.
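The pagination loop described above might look like the following sketch. `all_blocks` is a hypothetical helper, and `client` is assumed to be any object that responds to `get_document_analysis` and returns a response with `blocks` and `next_token` readers (such as an `Aws::Textract::Client`).

```ruby
# Hypothetical helper (not part of the SDK).
# Collects the blocks from every page of results for one job.
def all_blocks(client, job_id, max_results: 1000)
  blocks = []
  next_token = nil

  loop do
    params = { job_id: job_id, max_results: max_results }
    # Send next_token only when a previous response supplied one.
    params[:next_token] = next_token if next_token

    resp = client.get_document_analysis(params)
    blocks.concat(resp.blocks || [])

    next_token = resp.next_token
    break if next_token.nil? || next_token.empty?
  end

  blocks
end
```

Collecting every page before processing keeps relationship IDs resolvable, at the cost of holding all blocks in memory. For very large documents, processing page by page may be preferable.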

For more information, see [Document Text Analysis][1].

[1]: docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html

Examples:

Request syntax with placeholder values


resp = client.get_document_analysis({
  job_id: "JobId", # required
  max_results: 1,
  next_token: "PaginationToken",
})

Response structure


resp.document_metadata.pages #=> Integer
resp.job_status #=> String, one of "IN_PROGRESS", "SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"
resp.next_token #=> String
resp.blocks #=> Array
resp.blocks[0].block_type #=> String, one of "KEY_VALUE_SET", "PAGE", "LINE", "WORD", "TABLE", "CELL", "SELECTION_ELEMENT", "MERGED_CELL", "TITLE", "QUERY", "QUERY_RESULT", "SIGNATURE", "TABLE_TITLE", "TABLE_FOOTER", "LAYOUT_TEXT", "LAYOUT_TITLE", "LAYOUT_HEADER", "LAYOUT_FOOTER", "LAYOUT_SECTION_HEADER", "LAYOUT_PAGE_NUMBER", "LAYOUT_LIST", "LAYOUT_FIGURE", "LAYOUT_TABLE", "LAYOUT_KEY_VALUE"
resp.blocks[0].confidence #=> Float
resp.blocks[0].text #=> String
resp.blocks[0].text_type #=> String, one of "HANDWRITING", "PRINTED"
resp.blocks[0].row_index #=> Integer
resp.blocks[0].column_index #=> Integer
resp.blocks[0].row_span #=> Integer
resp.blocks[0].column_span #=> Integer
resp.blocks[0].geometry.bounding_box.width #=> Float
resp.blocks[0].geometry.bounding_box.height #=> Float
resp.blocks[0].geometry.bounding_box.left #=> Float
resp.blocks[0].geometry.bounding_box.top #=> Float
resp.blocks[0].geometry.polygon #=> Array
resp.blocks[0].geometry.polygon[0].x #=> Float
resp.blocks[0].geometry.polygon[0].y #=> Float
resp.blocks[0].geometry.rotation_angle #=> Float
resp.blocks[0].id #=> String
resp.blocks[0].relationships #=> Array
resp.blocks[0].relationships[0].type #=> String, one of "VALUE", "CHILD", "COMPLEX_FEATURES", "MERGED_CELL", "TITLE", "ANSWER", "TABLE", "TABLE_TITLE", "TABLE_FOOTER"
resp.blocks[0].relationships[0].ids #=> Array
resp.blocks[0].relationships[0].ids[0] #=> String
resp.blocks[0].entity_types #=> Array
resp.blocks[0].entity_types[0] #=> String, one of "KEY", "VALUE", "COLUMN_HEADER", "TABLE_TITLE", "TABLE_FOOTER", "TABLE_SECTION_TITLE", "TABLE_SUMMARY", "STRUCTURED_TABLE", "SEMI_STRUCTURED_TABLE"
resp.blocks[0].selection_status #=> String, one of "SELECTED", "NOT_SELECTED"
resp.blocks[0].page #=> Integer
resp.blocks[0].query.text #=> String
resp.blocks[0].query.alias #=> String
resp.blocks[0].query.pages #=> Array
resp.blocks[0].query.pages[0] #=> String
resp.warnings #=> Array
resp.warnings[0].error_code #=> String
resp.warnings[0].pages #=> Array
resp.warnings[0].pages[0] #=> Integer
resp.status_message #=> String
resp.analyze_document_model_version #=> String

Parameters:

  • params (Hash) (defaults to: {})

Options Hash (params):

  • :job_id (required, String)

    A unique identifier for the text analysis job. The `JobId` is returned from `StartDocumentAnalysis`. A `JobId` value is only valid for 7 days.

  • :max_results (Integer)

    The maximum number of results to return per paginated call. The largest value that you can specify is 1,000. If you specify a value greater than 1,000, a maximum of 1,000 results is returned. The default value is 1,000.

  • :next_token (String)

    If the previous response was incomplete (because there are more blocks to retrieve), Amazon Textract returns a pagination token in the response. You can use this pagination token to retrieve the next set of blocks.

Returns:

  • (Types::GetDocumentAnalysisResponse)

# File 'lib/aws-sdk-textract/client.rb', line 1476

def get_document_analysis(params = {}, options = {})
  req = build_request(:get_document_analysis, params)
  req.send_request(options)
end