Class: Pdfcrowd::PdfToTextClient

Inherits:
Object
Defined in:
lib/pdfcrowd.rb

Overview

Conversion from PDF to text.

Constructor Details

#initialize(user_name, api_key) ⇒ PdfToTextClient

Constructor for the Pdfcrowd API client.

  • user_name - Your username at Pdfcrowd.

  • api_key - Your API key.



# File 'lib/pdfcrowd.rb', line 5839

def initialize(user_name, api_key)
    @helper = ConnectionHelper.new(user_name, api_key)
    @fields = {
        'input_format'=>'pdf',
        'output_format'=>'txt'
    }
    @file_id = 1
    @files = {}
    @raw_data = {}
end

Instance Method Details

#convertFile(file) ⇒ Object

Convert a local file.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • Returns - Byte array containing the conversion output.



# File 'lib/pdfcrowd.rb', line 5900

def convertFile(file)
    if (!(File.file?(file) && !File.zero?(file)))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file, "convertFile", "pdf-to-text", "The file must exist and not be empty.", "convert_file"), 470);
    end
    
    @files['file'] = file
    @helper.post(@fields, @files, @raw_data)
end
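
The guard at the top of convertFile can be reproduced locally to check an input path before calling the client. A minimal sketch; the helper name convertible_file? is hypothetical and not part of the library:

```ruby
require 'tempfile'

# Hypothetical helper (not part of Pdfcrowd): mirrors the guard in convertFile.
# True only when the path is a regular file with a non-zero size.
def convertible_file?(path)
  File.file?(path) && !File.zero?(path)
end

empty = Tempfile.new('empty')
filled = Tempfile.new('filled')
filled.write('%PDF-1.4')
filled.flush

puts convertible_file?(empty.path)       # existing but empty file
puts convertible_file?(filled.path)      # existing, non-empty file
puts convertible_file?('/no/such/file')  # missing path
```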

#convertFileToFile(file, file_path) ⇒ Object

Convert a local file and write the result to a local file.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • file_path - The output file path. The string must not be empty.



# File 'lib/pdfcrowd.rb', line 5926

def convertFileToFile(file, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertFileToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_file_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertFileToStream(file, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end
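
The listing above removes the partially written output file when the conversion raises Error; exceptions of other classes are not covered by that rescue clause. The same cleanup pattern can be sketched in isolation. The helper name write_or_cleanup is hypothetical, and unlike the library code this sketch rescues StandardError:

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of Pdfcrowd): opens path for binary writing,
# yields the handle, and deletes the partial file if the block raises.
def write_or_cleanup(path)
  output = File.open(path, 'wb')
  begin
    yield output
    output.close
  rescue StandardError
    output.close
    FileUtils.rm(path)
    raise
  end
end

Dir.mktmpdir do |dir|
  ok_path = File.join(dir, 'ok.txt')
  write_or_cleanup(ok_path) { |io| io.write('converted text') }
  puts File.read(ok_path)

  bad_path = File.join(dir, 'bad.txt')
  begin
    write_or_cleanup(bad_path) { |_io| raise 'conversion failed' }
  rescue RuntimeError => e
    puts "#{e.message}; file left behind? #{File.exist?(bad_path)}"
  end
end
```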

#convertFileToStream(file, out_stream) ⇒ Object

Convert a local file and write the result to an output stream.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • out_stream - The output stream that will contain the conversion output.



# File 'lib/pdfcrowd.rb', line 5913

def convertFileToStream(file, out_stream)
    if (!(File.file?(file) && !File.zero?(file)))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file, "convertFileToStream::file", "pdf-to-text", "The file must exist and not be empty.", "convert_file_to_stream"), 470);
    end
    
    @files['file'] = file
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertRawData(data) ⇒ Object

Convert raw data.

  • data - The raw content to be converted.

  • Returns - Byte array with the output.



# File 'lib/pdfcrowd.rb', line 5946

def convertRawData(data)
    @raw_data['file'] = data
    @helper.post(@fields, @files, @raw_data)
end

#convertRawDataToFile(data, file_path) ⇒ Object

Convert raw data to a file.

  • data - The raw content to be converted.

  • file_path - The output file path. The string must not be empty.



# File 'lib/pdfcrowd.rb', line 5964

def convertRawDataToFile(data, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertRawDataToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_raw_data_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertRawDataToStream(data, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertRawDataToStream(data, out_stream) ⇒ Object

Convert raw data and write the result to an output stream.

  • data - The raw content to be converted.

  • out_stream - The output stream that will contain the conversion output.



# File 'lib/pdfcrowd.rb', line 5955

def convertRawDataToStream(data, out_stream)
    @raw_data['file'] = data
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertStream(in_stream) ⇒ Object

Convert the contents of an input stream.

  • in_stream - The input stream with source data.

  • Returns - Byte array containing the conversion output.



# File 'lib/pdfcrowd.rb', line 5984

def convertStream(in_stream)
    @raw_data['stream'] = in_stream.read
    @helper.post(@fields, @files, @raw_data)
end
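
As the listing shows, convertStream calls in_stream.read once, so the whole input is buffered in memory and the stream is left at its end. A small offline sketch of that behavior, using StringIO as a stand-in input; the helper name buffer_stream is hypothetical:

```ruby
require 'stringio'

# Hypothetical helper (not part of Pdfcrowd): buffers a stream the same way
# convertStream does, by calling #read once with no arguments.
def buffer_stream(in_stream)
  in_stream.read
end

in_stream = StringIO.new('%PDF-1.4 sample bytes')

first = buffer_stream(in_stream)   # entire content
second = buffer_stream(in_stream)  # stream is already at EOF

puts first.bytesize
puts second.empty?

in_stream.rewind                   # rewind before reusing the same stream
puts buffer_stream(in_stream) == first
```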

#convertStreamToFile(in_stream, file_path) ⇒ Object

Convert the contents of an input stream and write the result to a local file.

  • in_stream - The input stream with source data.

  • file_path - The output file path. The string must not be empty.



# File 'lib/pdfcrowd.rb', line 6002

def convertStreamToFile(in_stream, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertStreamToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_stream_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertStreamToStream(in_stream, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertStreamToStream(in_stream, out_stream) ⇒ Object

Convert the contents of an input stream and write the result to an output stream.

  • in_stream - The input stream with source data.

  • out_stream - The output stream that will contain the conversion output.



# File 'lib/pdfcrowd.rb', line 5993

def convertStreamToStream(in_stream, out_stream)
    @raw_data['stream'] = in_stream.read
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertUrl(url) ⇒ Object

Convert a PDF located at the given URL.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • Returns - Byte array containing the conversion output.



# File 'lib/pdfcrowd.rb', line 5854

def convertUrl(url)
    unless /(?i)^https?:\/\/.*$/.match(url)
        raise Error.new(Pdfcrowd.create_invalid_value_message(url, "convertUrl", "pdf-to-text", "The supported protocols are http:// and https://.", "convert_url"), 470);
    end
    
    @fields['url'] = url
    @helper.post(@fields, @files, @raw_data)
end
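
The URL check can be tried offline with the same pattern. A minimal sketch; the helper name supported_url? is hypothetical. Note that in Ruby ^ and $ anchor at line boundaries, so a multi-line string passes if any one of its lines matches:

```ruby
# Hypothetical helper (not part of Pdfcrowd): applies the pattern used in
# convertUrl. (?i) makes the match case-insensitive.
def supported_url?(url)
  !/(?i)^https?:\/\/.*$/.match(url).nil?
end

puts supported_url?('https://example.com/doc.pdf')
puts supported_url?('HTTP://EXAMPLE.COM/DOC.PDF')
puts supported_url?('ftp://example.com/doc.pdf')
puts supported_url?("note\nhttp://example.com/doc.pdf")  # second line matches
```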

#convertUrlToFile(url, file_path) ⇒ Object

Convert a PDF located at the given URL and write the result to a local file.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • file_path - The output file path. The string must not be empty.



# File 'lib/pdfcrowd.rb', line 5880

def convertUrlToFile(url, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertUrlToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_url_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertUrlToStream(url, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertUrlToStream(url, out_stream) ⇒ Object

Convert a PDF located at the given URL and write the result to an output stream.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • out_stream - The output stream that will contain the conversion output.



# File 'lib/pdfcrowd.rb', line 5867

def convertUrlToStream(url, out_stream)
    unless /(?i)^https?:\/\/.*$/.match(url)
        raise Error.new(Pdfcrowd.create_invalid_value_message(url, "convertUrlToStream::url", "pdf-to-text", "The supported protocols are http:// and https://.", "convert_url_to_stream"), 470);
    end
    
    @fields['url'] = url
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#getConsumedCreditCount ⇒ Object

Get the number of credits consumed by the last conversion.

  • Returns - The number of credits.



# File 'lib/pdfcrowd.rb', line 6221

def getConsumedCreditCount()
    return @helper.getConsumedCreditCount()
end

#getDebugLogUrl ⇒ Object

Get the URL of the debug log for the last conversion.

  • Returns - The link to the debug log.



# File 'lib/pdfcrowd.rb', line 6206

def getDebugLogUrl()
    return @helper.getDebugLogUrl()
end

#getJobId ⇒ Object

Get the job id.

  • Returns - The unique job identifier.



# File 'lib/pdfcrowd.rb', line 6227

def getJobId()
    return @helper.getJobId()
end

#getOutputSize ⇒ Object

Get the size of the output in bytes.

  • Returns - The count of bytes.



# File 'lib/pdfcrowd.rb', line 6239

def getOutputSize()
    return @helper.getOutputSize()
end

#getPageCount ⇒ Object

Get the number of pages in the output document.

  • Returns - The page count.



# File 'lib/pdfcrowd.rb', line 6233

def getPageCount()
    return @helper.getPageCount()
end

#getRemainingCreditCount ⇒ Object

Get the number of conversion credits available in your account. This method can only be called after a call to one of the convertXtoY methods. The returned value can differ from the actual count if you run parallel conversions. The special value 999999 is returned if the information is not available.

  • Returns - The number of credits.



# File 'lib/pdfcrowd.rb', line 6215

def getRemainingCreditCount()
    return @helper.getRemainingCreditCount()
end
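
Because 999999 is a sentinel rather than a real balance, callers may want to map it to nil before displaying or comparing it. A minimal sketch; the helper name known_credit_count is hypothetical:

```ruby
# Hypothetical helper (not part of Pdfcrowd): turns the documented sentinel
# value 999999 ("information not available") into nil.
def known_credit_count(raw_count)
  raw_count == 999_999 ? nil : raw_count
end

p known_credit_count(999_999)  # sentinel: balance unknown
p known_credit_count(120)      # a real balance passes through unchanged
```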

#getVersion ⇒ Object

Get the version details.

  • Returns - API version, converter version, and client version.



# File 'lib/pdfcrowd.rb', line 6245

def getVersion()
    return "client " + CLIENT_VERSION + ", API v2, converter " + @helper.getConverterVersion()
end

#setCropArea(x, y, width, height) ⇒ Object

Set the crop area. This lets you extract just a part of a PDF page.

  • x - Set the top left X coordinate of the crop area in points. Must be a positive integer number or 0.

  • y - Set the top left Y coordinate of the crop area in points. Must be a positive integer number or 0.

  • width - Set the width of the crop area in points. Must be a positive integer number or 0.

  • height - Set the height of the crop area in points. Must be a positive integer number or 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6187

def setCropArea(x, y, width, height)
    setCropAreaX(x)
    setCropAreaY(y)
    setCropAreaWidth(width)
    setCropAreaHeight(height)
    self
end
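
Every setter returns the converter object, so calls can be chained. The sketch below uses a stand-in class (FieldRecorder, hypothetical and not the real client) that mirrors only the store-a-field-and-return-self shape of the crop setters, without their validation, so the pattern can be run offline:

```ruby
# Stand-in class, not the real client: it copies only the setter shape
# (store a field, return self) from the listings in this page.
class FieldRecorder
  attr_reader :fields

  def initialize
    @fields = {}
  end

  def setCropAreaX(x)
    @fields['crop_area_x'] = x
    self
  end

  def setCropAreaY(y)
    @fields['crop_area_y'] = y
    self
  end

  def setCropAreaWidth(width)
    @fields['crop_area_width'] = width
    self
  end

  def setCropAreaHeight(height)
    @fields['crop_area_height'] = height
    self
  end

  # Same delegation as setCropArea in the listing above.
  def setCropArea(x, y, width, height)
    setCropAreaX(x)
    setCropAreaY(y)
    setCropAreaWidth(width)
    setCropAreaHeight(height)
    self
  end
end

# Values are in points (72 points = 1 inch).
recorder = FieldRecorder.new.setCropArea(72, 72, 468, 648)
p recorder.fields
```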

#setCropAreaHeight(height) ⇒ Object

Set the height of the crop area in points.

  • height - Must be a positive integer number or 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6171

def setCropAreaHeight(height)
    if (!(Integer(height) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(height, "setCropAreaHeight", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_height"), 470);
    end
    
    @fields['crop_area_height'] = height
    self
end
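
The crop setters validate with Integer(value) >= 0. In Ruby, Integer() raises ArgumentError for a non-numeric string and TypeError for nil, so such inputs surface as those exceptions before the guard can raise Error. A hypothetical pre-check (not part of the library) that folds both cases into a boolean:

```ruby
# Hypothetical pre-check (not part of Pdfcrowd): true when Integer(value)
# succeeds and the result is >= 0, false when the conversion itself fails.
def non_negative_integer?(value)
  Integer(value) >= 0
rescue ArgumentError, TypeError
  false
end

puts non_negative_integer?(0)       # zero is allowed
puts non_negative_integer?('36')    # numeric strings convert cleanly
puts non_negative_integer?(-5)      # negative values are rejected
puts non_negative_integer?('12pt')  # Integer() raises ArgumentError
puts non_negative_integer?(nil)     # Integer() raises TypeError
```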

#setCropAreaWidth(width) ⇒ Object

Set the width of the crop area in points.

  • width - Must be a positive integer number or 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6158

def setCropAreaWidth(width)
    if (!(Integer(width) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(width, "setCropAreaWidth", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_width"), 470);
    end
    
    @fields['crop_area_width'] = width
    self
end

#setCropAreaX(x) ⇒ Object

Set the top left X coordinate of the crop area in points.

  • x - Must be a positive integer number or 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6132

def setCropAreaX(x)
    if (!(Integer(x) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(x, "setCropAreaX", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_x"), 470);
    end
    
    @fields['crop_area_x'] = x
    self
end

#setCropAreaY(y) ⇒ Object

Set the top left Y coordinate of the crop area in points.

  • y - Must be a positive integer number or 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6145

def setCropAreaY(y)
    if (!(Integer(y) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(y, "setCropAreaY", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_y"), 470);
    end
    
    @fields['crop_area_y'] = y
    self
end

#setCustomPageBreak(page_break) ⇒ Object

Specify the custom page break.

  • page_break - String to insert between the pages.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6079

def setCustomPageBreak(page_break)
    @fields['custom_page_break'] = page_break
    self
end
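
Since the page break string is inserted between pages, the text output can be split back into per-page chunks on the client side. A sketch under assumptions: the marker value and the helper name split_pages are hypothetical, the sample text stands in for real conversion output, and the custom string presumably takes effect together with setPageBreakMode('custom'):

```ruby
# Hypothetical helper (not part of Pdfcrowd): splits conversion output on
# the page break string that was passed to setCustomPageBreak.
def split_pages(text, marker)
  text.split(marker).map(&:strip)
end

marker = '<<<PAGE>>>'  # any string unlikely to occur in the document text
sample_output = "First page\n#{marker}\nSecond page\n#{marker}\nThird page"

split_pages(sample_output, marker).each_with_index do |page, index|
  puts "page #{index + 1}: #{page}"
end
```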

#setDebugLog(value) ⇒ Object

Turn on debug logging. Details about the conversion are stored in the debug log. The URL of the log can be obtained from the getDebugLogUrl method and is also available in the conversion statistics.

  • value - Set to true to enable the debug logging.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6199

def setDebugLog(value)
    @fields['debug_log'] = value
    self
end

#setEol(eol) ⇒ Object

Set the end-of-line convention for the text output.

  • eol - Allowed values are unix, dos, mac.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6053

def setEol(eol)
    unless /(?i)^(unix|dos|mac)$/.match(eol)
        raise Error.new(Pdfcrowd.create_invalid_value_message(eol, "setEol", "pdf-to-text", "Allowed values are unix, dos, mac.", "set_eol"), 470);
    end
    
    @fields['eol'] = eol
    self
end

#setHttpProxy(proxy) ⇒ Object

Set a proxy server used by the Pdfcrowd conversion process for accessing source URLs with the HTTP scheme. It can help to circumvent regional restrictions or provide limited access to your intranet.

  • proxy - The value must have format DOMAIN_OR_IP_ADDRESS:PORT.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6262

def setHttpProxy(proxy)
    unless /(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{1,}:\d+$/.match(proxy)
        raise Error.new(Pdfcrowd.create_invalid_value_message(proxy, "setHttpProxy", "pdf-to-text", "The value must have format DOMAIN_OR_IP_ADDRESS:PORT.", "set_http_proxy"), 470);
    end
    
    @fields['http_proxy'] = proxy
    self
end
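
The proxy pattern requires at least one dot-separated label before the port, so a bare host name such as localhost:8080 is rejected, as is a value with a scheme prefix. A minimal offline check using the same pattern; the helper name proxy_format_ok? is hypothetical:

```ruby
# Hypothetical helper (not part of Pdfcrowd): applies the pattern used by
# setHttpProxy and setHttpsProxy.
def proxy_format_ok?(proxy)
  !/(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{1,}:\d+$/.match(proxy).nil?
end

puts proxy_format_ok?('proxy.example.com:8080')         # domain with port
puts proxy_format_ok?('10.0.0.1:3128')                  # IPv4 address with port
puts proxy_format_ok?('localhost:8080')                 # no dot in the host
puts proxy_format_ok?('http://proxy.example.com:8080')  # scheme prefix
puts proxy_format_ok?('proxy.example.com')              # missing port
```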

#setHttpsProxy(proxy) ⇒ Object

Set a proxy server used by the Pdfcrowd conversion process for accessing source URLs with the HTTPS scheme. It can help to circumvent regional restrictions or provide limited access to your intranet.

  • proxy - The value must have format DOMAIN_OR_IP_ADDRESS:PORT.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6275

def setHttpsProxy(proxy)
    unless /(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{1,}:\d+$/.match(proxy)
        raise Error.new(Pdfcrowd.create_invalid_value_message(proxy, "setHttpsProxy", "pdf-to-text", "The value must have format DOMAIN_OR_IP_ADDRESS:PORT.", "set_https_proxy"), 470);
    end
    
    @fields['https_proxy'] = proxy
    self
end

#setLineSpacingThreshold(threshold) ⇒ Object

Set the maximum line spacing when the paragraph detection mode is enabled.

  • threshold - A string holding a non-negative integer percentage, such as 10%. The validation pattern also accepts a bare 0.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6101

def setLineSpacingThreshold(threshold)
    unless /(?i)^0$|^[0-9]+%$/.match(threshold)
        raise Error.new(Pdfcrowd.create_invalid_value_message(threshold, "setLineSpacingThreshold", "pdf-to-text", "The value must be a positive integer percentage.", "set_line_spacing_threshold"), 470);
    end
    
    @fields['line_spacing_threshold'] = threshold
    self
end
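
The threshold is validated as text: the pattern accepts a bare 0 or digits followed by a percent sign, and nothing else. A minimal offline check using the same pattern; the helper name threshold_format_ok? is hypothetical:

```ruby
# Hypothetical helper (not part of Pdfcrowd): applies the pattern used by
# setLineSpacingThreshold. The argument is expected to be a String.
def threshold_format_ok?(threshold)
  !/(?i)^0$|^[0-9]+%$/.match(threshold).nil?
end

puts threshold_format_ok?('10%')    # digits followed by %
puts threshold_format_ok?('0')      # bare zero
puts threshold_format_ok?('10')     # missing percent sign
puts threshold_format_ok?('12.5%')  # fractional value
puts threshold_format_ok?('-5%')    # negative value
```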

#setNoLayout(value) ⇒ Object

Ignore the original PDF layout.

  • value - Set to true to ignore the layout.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6044

def setNoLayout(value)
    @fields['no_layout'] = value
    self
end

#setPageBreakMode(mode) ⇒ Object

Specify the page break mode for the text output.

  • mode - Allowed values are none, default, custom.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6066

def setPageBreakMode(mode)
    unless /(?i)^(none|default|custom)$/.match(mode)
        raise Error.new(Pdfcrowd.create_invalid_value_message(mode, "setPageBreakMode", "pdf-to-text", "Allowed values are none, default, custom.", "set_page_break_mode"), 470);
    end
    
    @fields['page_break_mode'] = mode
    self
end

#setParagraphMode(mode) ⇒ Object

Specify the paragraph detection mode.

  • mode - Allowed values are none, bounding-box, characters.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6088

def setParagraphMode(mode)
    unless /(?i)^(none|bounding-box|characters)$/.match(mode)
        raise Error.new(Pdfcrowd.create_invalid_value_message(mode, "setParagraphMode", "pdf-to-text", "Allowed values are none, bounding-box, characters.", "set_paragraph_mode"), 470);
    end
    
    @fields['paragraph_mode'] = mode
    self
end

#setPdfPassword(password) ⇒ Object

Set the password used to open an encrypted PDF file.

  • password - The input PDF password.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6022

def setPdfPassword(password)
    @fields['pdf_password'] = password
    self
end

#setPrintPageRange(pages) ⇒ Object

Set the page range to print.

  • pages - A comma-separated list of page numbers or ranges, for example 1,3-5. A range may be open at either end, such as 2- or -4.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6031

def setPrintPageRange(pages)
    unless /^(?:\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*,\s*)*\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*$/.match(pages)
        raise Error.new(Pdfcrowd.create_invalid_value_message(pages, "setPrintPageRange", "pdf-to-text", "A comma separated list of page numbers or ranges.", "set_print_page_range"), 470);
    end
    
    @fields['print_page_range'] = pages
    self
end
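
The page range pattern accepts single page numbers, closed ranges, and ranges open at one end, separated by commas with optional whitespace. A minimal offline check using the same pattern; the helper name page_range_ok? is hypothetical:

```ruby
# Hypothetical helper (not part of Pdfcrowd): applies the pattern used by
# setPrintPageRange.
def page_range_ok?(pages)
  pattern = /^(?:\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*,\s*)*\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*$/
  !pattern.match(pages).nil?
end

puts page_range_ok?('1,3-5')     # single page plus a closed range
puts page_range_ok?('2-')        # open-ended range
puts page_range_ok?('-4')        # range open at the start
puts page_range_ok?('1 - 3, 7')  # whitespace is tolerated
puts page_range_ok?('1,,2')      # empty list item
puts page_range_ok?('all')       # not a number or range
```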

#setProxy(host, port, user_name, password) ⇒ Object

Specifies an HTTP proxy that the API client library will use to connect to the internet.

  • host - The proxy hostname.

  • port - The proxy port.

  • user_name - The username.

  • password - The password.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6310

def setProxy(host, port, user_name, password)
    @helper.setProxy(host, port, user_name, password)
    self
end

#setRemoveEmptyLines(value) ⇒ Object

Remove empty lines from the text output.

  • value - Set to true to remove empty lines.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6123

def setRemoveEmptyLines(value)
    @fields['remove_empty_lines'] = value
    self
end

#setRemoveHyphenation(value) ⇒ Object

Remove the hyphen character from the end of lines.

  • value - Set to true to remove hyphens.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6114

def setRemoveHyphenation(value)
    @fields['remove_hyphenation'] = value
    self
end

#setRetryCount(count) ⇒ Object

Specify the number of automatic retries when a 502 or 503 HTTP status code is received. These status codes indicate a temporary network issue. Set the count to 0 to disable retries.

  • count - Number of retries.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6319

def setRetryCount(count)
    @helper.setRetryCount(count)
    self
end
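
The retry logic itself lives in the connection helper, which is not shown on this page. The sketch below only illustrates the documented behavior (retry on 502 or 503, up to count extra attempts) with a hypothetical helper named with_retries and a block that stands in for one HTTP request; it is not the library's implementation:

```ruby
# Hypothetical helper (not part of Pdfcrowd): calls the block, which returns
# an HTTP status code, and repeats it while the status is 502 or 503 and
# retries remain. Returns the final status and the number of attempts made.
def with_retries(retry_count)
  attempts = 0
  loop do
    attempts += 1
    status = yield
    retryable = [502, 503].include?(status)
    return [status, attempts] unless retryable && attempts <= retry_count
  end
end

# Canned statuses stand in for real responses.
responses = [503, 502, 200]
status, attempts = with_retries(2) { responses.shift }
puts "status=#{status} attempts=#{attempts}"
```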

#setTag(tag) ⇒ Object

Tag the conversion with a custom value. The tag is used in conversion statistics. A value longer than 32 characters is cut off.

  • tag - A string with the custom tag.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6253

def setTag(tag)
    @fields['tag'] = tag
    self
end
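
The setter stores the tag as given; the 32-character limit is described above but not enforced in this listing. To preview what would be kept, assuming the cut keeps the first 32 characters, a hypothetical helper (not part of the library) can be used:

```ruby
# Hypothetical helper (not part of Pdfcrowd): previews a tag under the
# assumption that only the first 32 characters are kept.
def effective_tag(tag)
  tag[0, 32]
end

long_tag = 'invoice-batch-2024-quarter-3-europe-region'
puts effective_tag(long_tag)
puts effective_tag(long_tag).length
puts effective_tag('nightly')
```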

#setUseHttp(value) ⇒ Object

Specify whether the client communicates with the Pdfcrowd API over HTTP or HTTPS. Warning: Using HTTP is insecure because data sent over HTTP is not encrypted. Enable this option only if you know what you are doing.

  • value - Set to true to use HTTP.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6289

def setUseHttp(value)
    @helper.setUseHttp(value)
    self
end

#setUserAgent(agent) ⇒ Object

Set a custom user agent HTTP header. It can be useful if you are behind a proxy or a firewall.

  • agent - The user agent string.

  • Returns - The converter object.



# File 'lib/pdfcrowd.rb', line 6298

def setUserAgent(agent)
    @helper.setUserAgent(agent)
    self
end