Class: Pdfcrowd::PdfToTextClient

Inherits:
Object
  • Object
show all
Defined in:
lib/pdfcrowd.rb

Overview

Conversion from PDF to text.

Instance Method Summary collapse

Constructor Details

#initialize(user_name, api_key) ⇒ PdfToTextClient

Constructor for the Pdfcrowd API client.

  • user_name - Your username at Pdfcrowd.

  • api_key - Your API key.



6033
6034
6035
6036
6037
6038
6039
6040
6041
6042
# File 'lib/pdfcrowd.rb', line 6033

def initialize(user_name, api_key)
    @helper = ConnectionHelper.new(user_name, api_key)
    @fields = {
        'input_format'=>'pdf',
        'output_format'=>'txt'
    }
    @file_id = 1
    @files = {}
    @raw_data = {}
end

Instance Method Details

#convertFile(file) ⇒ Object

Convert a local file.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • Returns - Byte array containing the conversion output.



6094
6095
6096
6097
6098
6099
6100
6101
# File 'lib/pdfcrowd.rb', line 6094

def convertFile(file)
    if (!(File.file?(file) && !File.zero?(file)))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file, "convertFile", "pdf-to-text", "The file must exist and not be empty.", "convert_file"), 470);
    end
    
    @files['file'] = file
    @helper.post(@fields, @files, @raw_data)
end

#convertFileToFile(file, file_path) ⇒ Object

Convert a local file and write the result to a local file.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • file_path - The output file path. The string must not be empty.



6120
6121
6122
6123
6124
6125
6126
6127
6128
6129
6130
6131
6132
6133
6134
# File 'lib/pdfcrowd.rb', line 6120

def convertFileToFile(file, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertFileToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_file_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertFileToStream(file, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertFileToStream(file, out_stream) ⇒ Object

Convert a local file and write the result to an output stream.

  • file - The path to a local file to convert. The file must exist and not be empty.

  • out_stream - The output stream that will contain the conversion output.



6107
6108
6109
6110
6111
6112
6113
6114
# File 'lib/pdfcrowd.rb', line 6107

def convertFileToStream(file, out_stream)
    if (!(File.file?(file) && !File.zero?(file)))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file, "convertFileToStream::file", "pdf-to-text", "The file must exist and not be empty.", "convert_file_to_stream"), 470);
    end
    
    @files['file'] = file
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertRawData(data) ⇒ Object

Convert raw data.

  • data - The raw content to be converted.

  • Returns - Byte array with the output.



6140
6141
6142
6143
# File 'lib/pdfcrowd.rb', line 6140

def convertRawData(data)
    @raw_data['file'] = data
    @helper.post(@fields, @files, @raw_data)
end

#convertRawDataToFile(data, file_path) ⇒ Object

Convert raw data to a file.

  • data - The raw content to be converted.

  • file_path - The output file path. The string must not be empty.



6158
6159
6160
6161
6162
6163
6164
6165
6166
6167
6168
6169
6170
6171
6172
# File 'lib/pdfcrowd.rb', line 6158

def convertRawDataToFile(data, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertRawDataToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_raw_data_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertRawDataToStream(data, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertRawDataToStream(data, out_stream) ⇒ Object

Convert raw data and write the result to an output stream.

  • data - The raw content to be converted.

  • out_stream - The output stream that will contain the conversion output.



6149
6150
6151
6152
# File 'lib/pdfcrowd.rb', line 6149

def convertRawDataToStream(data, out_stream)
    @raw_data['file'] = data
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertStream(in_stream) ⇒ Object

Convert the contents of an input stream.

  • in_stream - The input stream with source data.

  • Returns - Byte array containing the conversion output.



6178
6179
6180
6181
# File 'lib/pdfcrowd.rb', line 6178

def convertStream(in_stream)
    @raw_data['stream'] = in_stream.read
    @helper.post(@fields, @files, @raw_data)
end

#convertStreamToFile(in_stream, file_path) ⇒ Object

Convert the contents of an input stream and write the result to a local file.

  • in_stream - The input stream with source data.

  • file_path - The output file path. The string must not be empty.



6196
6197
6198
6199
6200
6201
6202
6203
6204
6205
6206
6207
6208
6209
6210
# File 'lib/pdfcrowd.rb', line 6196

def convertStreamToFile(in_stream, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertStreamToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_stream_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertStreamToStream(in_stream, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertStreamToStream(in_stream, out_stream) ⇒ Object

Convert the contents of an input stream and write the result to an output stream.

  • in_stream - The input stream with source data.

  • out_stream - The output stream that will contain the conversion output.



6187
6188
6189
6190
# File 'lib/pdfcrowd.rb', line 6187

def convertStreamToStream(in_stream, out_stream)
    @raw_data['stream'] = in_stream.read
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#convertUrl(url) ⇒ Object

Convert a PDF.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • Returns - Byte array containing the conversion output.



6048
6049
6050
6051
6052
6053
6054
6055
# File 'lib/pdfcrowd.rb', line 6048

def convertUrl(url)
    unless /(?i)^https?:\/\/.*$/.match(url)
        raise Error.new(Pdfcrowd.create_invalid_value_message(url, "convertUrl", "pdf-to-text", "The supported protocols are http:// and https://.", "convert_url"), 470);
    end
    
    @fields['url'] = url
    @helper.post(@fields, @files, @raw_data)
end

#convertUrlToFile(url, file_path) ⇒ Object

Convert a PDF and write the result to a local file.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • file_path - The output file path. The string must not be empty.



6074
6075
6076
6077
6078
6079
6080
6081
6082
6083
6084
6085
6086
6087
6088
# File 'lib/pdfcrowd.rb', line 6074

def convertUrlToFile(url, file_path)
    if (!(!file_path.nil? && !file_path.empty?))
        raise Error.new(Pdfcrowd.create_invalid_value_message(file_path, "convertUrlToFile::file_path", "pdf-to-text", "The string must not be empty.", "convert_url_to_file"), 470);
    end
    
    output_file = open(file_path, "wb")
    begin
        convertUrlToStream(url, output_file)
        output_file.close()
    rescue Error => why
        output_file.close()
        FileUtils.rm(file_path)
        raise
    end
end

#convertUrlToStream(url, out_stream) ⇒ Object

Convert a PDF and write the result to an output stream.

  • url - The address of the PDF to convert. The supported protocols are http:// and https://.

  • out_stream - The output stream that will contain the conversion output.



6061
6062
6063
6064
6065
6066
6067
6068
# File 'lib/pdfcrowd.rb', line 6061

def convertUrlToStream(url, out_stream)
    unless /(?i)^https?:\/\/.*$/.match(url)
        raise Error.new(Pdfcrowd.create_invalid_value_message(url, "convertUrlToStream::url", "pdf-to-text", "The supported protocols are http:// and https://.", "convert_url_to_stream"), 470);
    end
    
    @fields['url'] = url
    @helper.post(@fields, @files, @raw_data, out_stream)
end

#getConsumedCreditCountObject

Get the number of credits consumed by the last conversion.

  • Returns - The number of credits.



6415
6416
6417
# File 'lib/pdfcrowd.rb', line 6415

def getConsumedCreditCount()
    return @helper.getConsumedCreditCount()
end

#getDebugLogUrlObject

Get the URL of the debug log for the last conversion.

  • Returns - The link to the debug log.



6400
6401
6402
# File 'lib/pdfcrowd.rb', line 6400

def getDebugLogUrl()
    return @helper.getDebugLogUrl()
end

#getJobIdObject

Get the job id.

  • Returns - The unique job identifier.



6421
6422
6423
# File 'lib/pdfcrowd.rb', line 6421

def getJobId()
    return @helper.getJobId()
end

#getOutputSizeObject

Get the size of the output in bytes.

  • Returns - The count of bytes.



6433
6434
6435
# File 'lib/pdfcrowd.rb', line 6433

def getOutputSize()
    return @helper.getOutputSize()
end

#getPageCountObject

Get the number of pages in the output document.

  • Returns - The page count.



6427
6428
6429
# File 'lib/pdfcrowd.rb', line 6427

def getPageCount()
    return @helper.getPageCount()
end

#getRemainingCreditCountObject

Get the number of conversion credits available in your account. This method can only be called after a call to one of the convertXtoY methods. The returned value can differ from the actual count if you run parallel conversions. The special value 999999 is returned if the information is not available.

  • Returns - The number of credits.



6409
6410
6411
# File 'lib/pdfcrowd.rb', line 6409

def getRemainingCreditCount()
    return @helper.getRemainingCreditCount()
end

#getVersionObject

Get the version details.

  • Returns - API version, converter version, and client version.



6439
6440
6441
# File 'lib/pdfcrowd.rb', line 6439

def getVersion()
    return "client " + CLIENT_VERSION + ", API v2, converter " + @helper.getConverterVersion()
end

#setClientUserAgent(agent) ⇒ Object

Specifies the User-Agent HTTP header that the client library will use when interacting with the API.

  • agent - The user agent string.

  • Returns - The converter object.



6492
6493
6494
6495
# File 'lib/pdfcrowd.rb', line 6492

def setClientUserAgent(agent)
    @helper.setUserAgent(agent)
    self
end

#setCropArea(x, y, width, height) ⇒ Object

Set the crop area. It allows to extract just a part of a PDF page.

  • x - Set the top left X coordinate of the crop area in points. Must be a positive integer number or 0.

  • y - Set the top left Y coordinate of the crop area in points. Must be a positive integer number or 0.

  • width - Set the width of the crop area in points. Must be a positive integer number or 0.

  • height - Set the height of the crop area in points. Must be a positive integer number or 0.

  • Returns - The converter object.



6381
6382
6383
6384
6385
6386
6387
# File 'lib/pdfcrowd.rb', line 6381

def setCropArea(x, y, width, height)
    setCropAreaX(x)
    setCropAreaY(y)
    setCropAreaWidth(width)
    setCropAreaHeight(height)
    self
end

#setCropAreaHeight(height) ⇒ Object

Set the height of the crop area in points.

  • height - Must be a positive integer number or 0.

  • Returns - The converter object.



6365
6366
6367
6368
6369
6370
6371
6372
# File 'lib/pdfcrowd.rb', line 6365

def setCropAreaHeight(height)
    if (!(Integer(height) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(height, "setCropAreaHeight", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_height"), 470);
    end
    
    @fields['crop_area_height'] = height
    self
end

#setCropAreaWidth(width) ⇒ Object

Set the width of the crop area in points.

  • width - Must be a positive integer number or 0.

  • Returns - The converter object.



6352
6353
6354
6355
6356
6357
6358
6359
# File 'lib/pdfcrowd.rb', line 6352

def setCropAreaWidth(width)
    if (!(Integer(width) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(width, "setCropAreaWidth", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_width"), 470);
    end
    
    @fields['crop_area_width'] = width
    self
end

#setCropAreaX(x) ⇒ Object

Set the top left X coordinate of the crop area in points.

  • x - Must be a positive integer number or 0.

  • Returns - The converter object.



6326
6327
6328
6329
6330
6331
6332
6333
# File 'lib/pdfcrowd.rb', line 6326

def setCropAreaX(x)
    if (!(Integer(x) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(x, "setCropAreaX", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_x"), 470);
    end
    
    @fields['crop_area_x'] = x
    self
end

#setCropAreaY(y) ⇒ Object

Set the top left Y coordinate of the crop area in points.

  • y - Must be a positive integer number or 0.

  • Returns - The converter object.



6339
6340
6341
6342
6343
6344
6345
6346
# File 'lib/pdfcrowd.rb', line 6339

def setCropAreaY(y)
    if (!(Integer(y) >= 0))
        raise Error.new(Pdfcrowd.create_invalid_value_message(y, "setCropAreaY", "pdf-to-text", "Must be a positive integer number or 0.", "set_crop_area_y"), 470);
    end
    
    @fields['crop_area_y'] = y
    self
end

#setCustomPageBreak(page_break) ⇒ Object

Specify the custom page break.

  • page_break - String to insert between the pages.

  • Returns - The converter object.



6273
6274
6275
6276
# File 'lib/pdfcrowd.rb', line 6273

def setCustomPageBreak(page_break)
    @fields['custom_page_break'] = page_break
    self
end

#setDebugLog(value) ⇒ Object

Turn on the debug logging. Details about the conversion are stored in the debug log. The URL of the log can be obtained from the getDebugLogUrl method or available in conversion statistics.

  • value - Set to true to enable the debug logging.

  • Returns - The converter object.



6393
6394
6395
6396
# File 'lib/pdfcrowd.rb', line 6393

def setDebugLog(value)
    @fields['debug_log'] = value
    self
end

#setEol(eol) ⇒ Object

The end-of-line convention for the text output.

  • eol - Allowed values are unix, dos, mac.

  • Returns - The converter object.



6247
6248
6249
6250
6251
6252
6253
6254
# File 'lib/pdfcrowd.rb', line 6247

def setEol(eol)
    unless /(?i)^(unix|dos|mac)$/.match(eol)
        raise Error.new(Pdfcrowd.create_invalid_value_message(eol, "setEol", "pdf-to-text", "Allowed values are unix, dos, mac.", "set_eol"), 470);
    end
    
    @fields['eol'] = eol
    self
end

#setHttpProxy(proxy) ⇒ Object

A proxy server used by Pdfcrowd conversion process for accessing the source URLs with HTTP scheme. It can help to circumvent regional restrictions or provide limited access to your intranet.

  • proxy - The value must have format DOMAIN_OR_IP_ADDRESS:PORT.

  • Returns - The converter object.



6456
6457
6458
6459
6460
6461
6462
6463
# File 'lib/pdfcrowd.rb', line 6456

def setHttpProxy(proxy)
    unless /(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{1,}:\d+$/.match(proxy)
        raise Error.new(Pdfcrowd.create_invalid_value_message(proxy, "setHttpProxy", "pdf-to-text", "The value must have format DOMAIN_OR_IP_ADDRESS:PORT.", "set_http_proxy"), 470);
    end
    
    @fields['http_proxy'] = proxy
    self
end

#setHttpsProxy(proxy) ⇒ Object

A proxy server used by Pdfcrowd conversion process for accessing the source URLs with HTTPS scheme. It can help to circumvent regional restrictions or provide limited access to your intranet.

  • proxy - The value must have format DOMAIN_OR_IP_ADDRESS:PORT.

  • Returns - The converter object.



6469
6470
6471
6472
6473
6474
6475
6476
# File 'lib/pdfcrowd.rb', line 6469

def setHttpsProxy(proxy)
    unless /(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z0-9]{1,}:\d+$/.match(proxy)
        raise Error.new(Pdfcrowd.create_invalid_value_message(proxy, "setHttpsProxy", "pdf-to-text", "The value must have format DOMAIN_OR_IP_ADDRESS:PORT.", "set_https_proxy"), 470);
    end
    
    @fields['https_proxy'] = proxy
    self
end

#setLineSpacingThreshold(threshold) ⇒ Object

Set the maximum line spacing when the paragraph detection mode is enabled.

  • threshold - The value must be a positive integer percentage.

  • Returns - The converter object.



6295
6296
6297
6298
6299
6300
6301
6302
# File 'lib/pdfcrowd.rb', line 6295

def setLineSpacingThreshold(threshold)
    unless /(?i)^0$|^[0-9]+%$/.match(threshold)
        raise Error.new(Pdfcrowd.create_invalid_value_message(threshold, "setLineSpacingThreshold", "pdf-to-text", "The value must be a positive integer percentage.", "set_line_spacing_threshold"), 470);
    end
    
    @fields['line_spacing_threshold'] = threshold
    self
end

#setNoLayout(value) ⇒ Object

Ignore the original PDF layout.

  • value - Set to true to ignore the layout.

  • Returns - The converter object.



6238
6239
6240
6241
# File 'lib/pdfcrowd.rb', line 6238

def setNoLayout(value)
    @fields['no_layout'] = value
    self
end

#setPageBreakMode(mode) ⇒ Object

Specify the page break mode for the text output.

  • mode - Allowed values are none, default, custom.

  • Returns - The converter object.



6260
6261
6262
6263
6264
6265
6266
6267
# File 'lib/pdfcrowd.rb', line 6260

def setPageBreakMode(mode)
    unless /(?i)^(none|default|custom)$/.match(mode)
        raise Error.new(Pdfcrowd.create_invalid_value_message(mode, "setPageBreakMode", "pdf-to-text", "Allowed values are none, default, custom.", "set_page_break_mode"), 470);
    end
    
    @fields['page_break_mode'] = mode
    self
end

#setParagraphMode(mode) ⇒ Object

Specify the paragraph detection mode.

  • mode - Allowed values are none, bounding-box, characters.

  • Returns - The converter object.



6282
6283
6284
6285
6286
6287
6288
6289
# File 'lib/pdfcrowd.rb', line 6282

def setParagraphMode(mode)
    unless /(?i)^(none|bounding-box|characters)$/.match(mode)
        raise Error.new(Pdfcrowd.create_invalid_value_message(mode, "setParagraphMode", "pdf-to-text", "Allowed values are none, bounding-box, characters.", "set_paragraph_mode"), 470);
    end
    
    @fields['paragraph_mode'] = mode
    self
end

#setPdfPassword(password) ⇒ Object

The password to open the encrypted PDF file.

  • password - The input PDF password.

  • Returns - The converter object.



6216
6217
6218
6219
# File 'lib/pdfcrowd.rb', line 6216

def setPdfPassword(password)
    @fields['pdf_password'] = password
    self
end

#setPrintPageRange(pages) ⇒ Object

Set the page range to print.

  • pages - A comma separated list of page numbers or ranges.

  • Returns - The converter object.



6225
6226
6227
6228
6229
6230
6231
6232
# File 'lib/pdfcrowd.rb', line 6225

def setPrintPageRange(pages)
    unless /^(?:\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*,\s*)*\s*(?:\d+|(?:\d*\s*\-\s*\d+)|(?:\d+\s*\-\s*\d*))\s*$/.match(pages)
        raise Error.new(Pdfcrowd.create_invalid_value_message(pages, "setPrintPageRange", "pdf-to-text", "A comma separated list of page numbers or ranges.", "set_print_page_range"), 470);
    end
    
    @fields['print_page_range'] = pages
    self
end

#setProxy(host, port, user_name, password) ⇒ Object

Specifies an HTTP proxy that the API client library will use to connect to the internet.

  • host - The proxy hostname.

  • port - The proxy port.

  • user_name - The username.

  • password - The password.

  • Returns - The converter object.



6513
6514
6515
6516
# File 'lib/pdfcrowd.rb', line 6513

def setProxy(host, port, user_name, password)
    @helper.setProxy(host, port, user_name, password)
    self
end

#setRemoveEmptyLines(value) ⇒ Object

Remove empty lines from the text output.

  • value - Set to true to remove empty lines.

  • Returns - The converter object.



6317
6318
6319
6320
# File 'lib/pdfcrowd.rb', line 6317

def setRemoveEmptyLines(value)
    @fields['remove_empty_lines'] = value
    self
end

#setRemoveHyphenation(value) ⇒ Object

Remove the hyphen character from the end of lines.

  • value - Set to true to remove hyphens.

  • Returns - The converter object.



6308
6309
6310
6311
# File 'lib/pdfcrowd.rb', line 6308

def setRemoveHyphenation(value)
    @fields['remove_hyphenation'] = value
    self
end

#setRetryCount(count) ⇒ Object

Specifies the number of automatic retries when the 502 or 503 HTTP status code is received. The status code indicates a temporary network issue. This feature can be disabled by setting to 0.

  • count - Number of retries.

  • Returns - The converter object.



6522
6523
6524
6525
# File 'lib/pdfcrowd.rb', line 6522

def setRetryCount(count)
    @helper.setRetryCount(count)
    self
end

#setTag(tag) ⇒ Object

Tag the conversion with a custom value. The tag is used in conversion statistics. A value longer than 32 characters is cut off.

  • tag - A string with the custom tag.

  • Returns - The converter object.



6447
6448
6449
6450
# File 'lib/pdfcrowd.rb', line 6447

def setTag(tag)
    @fields['tag'] = tag
    self
end

#setUseHttp(value) ⇒ Object

Specifies if the client communicates over HTTP or HTTPS with Pdfcrowd API. Warning: Using HTTP is insecure as data sent over HTTP is not encrypted. Enable this option only if you know what you are doing.

  • value - Set to true to use HTTP.

  • Returns - The converter object.



6483
6484
6485
6486
# File 'lib/pdfcrowd.rb', line 6483

def setUseHttp(value)
    @helper.setUseHttp(value)
    self
end

#setUserAgent(agent) ⇒ Object

Set a custom user agent HTTP header. It can be useful if you are behind a proxy or a firewall.

  • agent - The user agent string.

  • Returns - The converter object.



6501
6502
6503
6504
# File 'lib/pdfcrowd.rb', line 6501

def setUserAgent(agent)
    @helper.setUserAgent(agent)
    self
end