Class: SelectPdf::PdfToTextClient

Inherits:
ApiClient
  • Object
Defined in:
lib/selectpdf.rb

Overview

Pdf To Text Conversion with SelectPdf Online API.

Code Sample for PDF To Text

require 'selectpdf'

$stdout.sync = true

print "This is SelectPdf-#{SelectPdf::CLIENT_VERSION}\n"

test_url = 'https://selectpdf.com/demo/files/selectpdf.pdf'
test_pdf = 'Input.pdf'
local_file = 'Result.txt'
api_key = 'Your API key here'

begin
  client = SelectPdf::PdfToTextClient.new(api_key)

  # set parameters - see full list at https://selectpdf.com/pdf-to-text-api/
  client.start_page = 1 # start page (processing starts from here)
  client.end_page = 0 # end page (set 0 to process the file till the end)
  client.output_format = SelectPdf::OutputFormat::TEXT # set output format (Text or HTML)

  print "Starting pdf to text ...\n"

  # convert local pdf to local text file
  client.text_from_file_to_file(test_pdf, local_file)

  # extract text from local pdf to memory
  # text = client.text_from_file(test_pdf)
  # print text

  # convert pdf from public url to local text file
  # client.text_from_url_to_file(test_url, local_file)

  # extract text from pdf from public url to memory
  # text = client.text_from_url(test_url)
  # print text

  print "Finished! Number of pages processed: #{client.number_of_pages}.\n"

  # get API usage
  usage_client = SelectPdf::UsageClient.new(api_key)
  usage = usage_client.get_usage(false) # the FALSE constant was removed in Ruby 3.2
  print("Usage: #{usage}\n")
  print('Conversions remaining this month: ', usage['available'], "\n")
rescue SelectPdf::ApiException => e
  print("An error occurred: #{e}")
end

Code Sample for Search Pdf

require 'selectpdf'

$stdout.sync = true

print "This is SelectPdf-#{SelectPdf::CLIENT_VERSION}\n"

test_url = 'https://selectpdf.com/demo/files/selectpdf.pdf'
test_pdf = 'Input.pdf'
api_key = 'Your API key here'

begin
  client = SelectPdf::PdfToTextClient.new(api_key)

  # set parameters - see full list at https://selectpdf.com/pdf-to-text-api/
  client.start_page = 1 # start page (processing starts from here)
  client.end_page = 0 # end page (set 0 to process the file till the end)
  client.output_format = SelectPdf::OutputFormat::TEXT # set output format (Text or HTML)

  print "Starting search pdf ...\n"

  # search local pdf
  results = client.search_file(test_pdf, 'pdf')

  # search pdf from public url
  # results = client.search_url(test_url, 'pdf')

  print "Search results: #{results}.\nSearch results count: #{results.length}\n"

  print "Finished! Number of pages processed: #{client.number_of_pages}.\n"

  # get API usage
  usage_client = SelectPdf::UsageClient.new(api_key)
  usage = usage_client.get_usage(false) # the FALSE constant was removed in Ruby 3.2
  print("Usage: #{usage}\n")
  print('Conversions remaining this month: ', usage['available'], "\n")
rescue SelectPdf::ApiException => e
  print("An error occurred: #{e}")
end

Instance Attribute Summary

Attributes inherited from ApiClient

#api_async_endpoint, #api_endpoint, #api_web_elements_endpoint, #async_calls_max_pings, #async_calls_ping_interval, #number_of_pages

Instance Method Summary

Constructor Details

#initialize(api_key) ⇒ PdfToTextClient

Construct the Pdf To Text Client.

Parameters:

  • api_key

    API Key.


# File 'lib/selectpdf.rb', line 2163

def initialize(api_key)
  super()
  @api_endpoint = 'https://selectpdf.com/api2/pdftotext/'
  @parameters['key'] = api_key

  @file_idx = 0
end

Instance Method Details

#end_page=(end_page) ⇒ Object

Set End Page number. Default value is 0 (process till the last page of the document).

Parameters:

  • end_page

    End page number (1-based).


# File 'lib/selectpdf.rb', line 2574

def end_page=(end_page)
  @parameters['end_page'] = end_page
end

#output_format=(output_format) ⇒ Object

Set the output format. The default value is SelectPdf::OutputFormat::TEXT.

Parameters:

  • output_format

    The output format. Possible values: Text, Html. Use constants from SelectPdf::OutputFormat class.


# File 'lib/selectpdf.rb', line 2599

def output_format=(output_format)
  unless [0, 1].include?(output_format)
    raise ApiException.new('Allowed values for Output Format: 0 (Text), 1 (Html).'), 'Allowed values for Output Format: 0 (Text), 1 (Html).'
  end

  @parameters['output_format'] = output_format
end
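
The setter above rejects anything other than 0 or 1 before storing the parameter. A minimal standalone sketch of that guard follows. `OutputFormatDemo`, `DemoError`, and `checked_output_format` are hypothetical stand-ins so the snippet runs without the gem (the real constants live in SelectPdf::OutputFormat and the real error is ApiException), and the constant name `HTML` for value 1 is an assumption.

```ruby
# Hypothetical stand-ins so the sketch runs without the selectpdf gem.
module OutputFormatDemo
  TEXT = 0
  HTML = 1 # assumed constant name for value 1 (Html)
end

class DemoError < StandardError; end

# Mirrors the guard in output_format=: only 0 (Text) and 1 (Html) pass.
def checked_output_format(value)
  unless [OutputFormatDemo::TEXT, OutputFormatDemo::HTML].include?(value)
    raise DemoError, 'Allowed values for Output Format: 0 (Text), 1 (Html).'
  end

  value
end

params = {}
params['output_format'] = checked_output_format(OutputFormatDemo::HTML)

begin
  checked_output_format(2)
rescue DemoError => e
  puts "rejected: #{e.message}"
end
```

Passing the named constants rather than bare integers keeps calling code readable and makes an invalid value fail at assignment time instead of at the API.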

#search_file(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object

Search for specific text in a PDF document. The pages searched are controlled by the start_page and end_page settings.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • text_to_search

    Text to search.

  • case_sensitive (defaults to: FALSE)

    Whether the search is case sensitive.

  • whole_words_only (defaults to: FALSE)

    Whether the search matches whole words only.

Returns:

  • List with text positions in the current PDF document.


# File 'lib/selectpdf.rb', line 2400

def search_file(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
  if text_to_search.nil? || text_to_search.empty?
    raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
  end

  @parameters['async'] = 'False'
  @parameters['action'] = 'Search'
  @parameters.delete('url')
  @parameters['search_text'] = text_to_search
  @parameters['case_sensitive'] = case_sensitive
  @parameters['whole_words_only'] = whole_words_only

  @files = {}
  @files['inputPdf'] = input_pdf

  @headers['Accept'] = 'text/json'

  result = perform_post_as_multipart_formdata
  return [] if result.nil? || result.empty?

  JSON.parse(result)
end
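
The last three lines of search_file normalize the response: a nil or empty body becomes an empty list, anything else is parsed as JSON. A small sketch of that step in isolation is shown below. `parse_search_response` is a hypothetical helper, and the sample payload is made up for illustration; the field names actually returned by the API may differ.

```ruby
require 'json'

# Same normalization as the tail of search_file: nil or empty body => [].
def parse_search_response(raw)
  return [] if raw.nil? || raw.empty?

  JSON.parse(raw)
end

# Made-up payload; the real field names returned by the API may differ.
sample = '[{"page": 1, "text": "pdf"}, {"page": 3, "text": "pdf"}]'
hits = parse_search_response(sample)
puts "matches: #{hits.length}"
```

Because the method always returns an array, callers can use `results.length` or `results.each` without a nil check.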

#search_file_async(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object

Search for specific text in a PDF document with an asynchronous call. The pages searched are controlled by the start_page and end_page settings.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • text_to_search

    Text to search.

  • case_sensitive (defaults to: FALSE)

    Whether the search is case sensitive.

  • whole_words_only (defaults to: FALSE)

    Whether the search matches whole words only.

Returns:

  • List with text positions in the current PDF document.

Raises:

  • (ApiException) if the asynchronous call cannot be launched or does not finish in the expected timeframe.

# File 'lib/selectpdf.rb', line 2431

def search_file_async(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
  if text_to_search.nil? || text_to_search.empty?
    raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
  end

  @parameters['action'] = 'Search'
  @parameters.delete('url')
  @parameters['search_text'] = text_to_search
  @parameters['case_sensitive'] = case_sensitive
  @parameters['whole_words_only'] = whole_words_only

  @files = {}
  @files['inputPdf'] = input_pdf

  @headers['Accept'] = 'text/json'

  job_id = start_async_job_multipart_form_data

  if job_id.nil? || job_id.empty?
    raise ApiException.new('An error occurred launching the asynchronous call.'),
          'An error occurred launching the asynchronous call.'
  end

  no_pings = 0

  while no_pings < @async_calls_max_pings
    no_pings += 1

    # sleep for a few seconds before next ping
    sleep(@async_calls_ping_interval)

    async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
    async_job_client.api_endpoint = @api_async_endpoint

    result = async_job_client.result

    next if result.nil?

    @number_of_pages = async_job_client.number_of_pages
    return [] if result.empty?

    return JSON.parse(result)
  end

  raise ApiException.new('Asynchronous call did not finish in expected timeframe.'),
        'Asynchronous call did not finish in expected timeframe.'
end
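
The asynchronous methods all share the same polling loop: sleep for the ping interval, ask for the result, and give up after the maximum number of pings. The sketch below isolates that loop so it can be run without the API. `poll_for_result`, `PollTimeout`, and the simulated job are hypothetical names used only for illustration; the library itself polls through AsyncJobClient and raises ApiException.

```ruby
class PollTimeout < StandardError; end

# Polls `fetch` (any callable returning nil while the job is pending)
# up to max_pings times, sleeping `interval` seconds before each attempt.
def poll_for_result(fetch, max_pings:, interval:)
  pings = 0
  while pings < max_pings
    pings += 1
    sleep(interval)
    result = fetch.call
    return result unless result.nil?
  end
  raise PollTimeout, 'Asynchronous call did not finish in expected timeframe.'
end

# Simulated job that becomes ready on the third ping.
attempts = 0
fake_job = lambda do
  attempts += 1
  attempts >= 3 ? 'done' : nil
end

puts poll_for_result(fake_job, max_pings: 5, interval: 0.01)
```

The total wait is bounded by roughly max_pings times interval, which is why the inherited async_calls_max_pings and async_calls_ping_interval attributes are the settings to adjust for long-running jobs.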

#search_url(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object

Search for specific text in a PDF document. The pages searched are controlled by the start_page and end_page settings.

Parameters:

  • url

    Address of the PDF file.

  • text_to_search

    Text to search.

  • case_sensitive (defaults to: FALSE)

    Whether the search is case sensitive.

  • whole_words_only (defaults to: FALSE)

    Whether the search matches whole words only.

Returns:

  • List with text positions in the current PDF document.


# File 'lib/selectpdf.rb', line 2487

def search_url(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
  if text_to_search.nil? || text_to_search.empty?
    raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
  end

  @parameters['async'] = 'False'
  @parameters['action'] = 'Search'
  @parameters['search_text'] = text_to_search
  @parameters['case_sensitive'] = case_sensitive
  @parameters['whole_words_only'] = whole_words_only

  @files = {}
  @parameters['url'] = url

  @headers['Accept'] = 'text/json'

  result = perform_post_as_multipart_formdata
  return [] if result.nil? || result.empty?

  JSON.parse(result)
end

#search_url_async(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object

Search for specific text in a PDF document with an asynchronous call. The pages searched are controlled by the start_page and end_page settings.

Parameters:

  • url

    Address of the PDF file.

  • text_to_search

    Text to search.

  • case_sensitive (defaults to: FALSE)

    Whether the search is case sensitive.

  • whole_words_only (defaults to: FALSE)

    Whether the search matches whole words only.

Returns:

  • List with text positions in the current PDF document.

Raises:

  • (ApiException) if the asynchronous call cannot be launched or does not finish in the expected timeframe.

# File 'lib/selectpdf.rb', line 2517

def search_url_async(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
  if text_to_search.nil? || text_to_search.empty?
    raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
  end

  @parameters['action'] = 'Search'
  @parameters['search_text'] = text_to_search
  @parameters['case_sensitive'] = case_sensitive
  @parameters['whole_words_only'] = whole_words_only

  @files = {}
  @parameters['url'] = url

  @headers['Accept'] = 'text/json'

  job_id = start_async_job_multipart_form_data

  if job_id.nil? || job_id.empty?
    raise ApiException.new('An error occurred launching the asynchronous call.'),
          'An error occurred launching the asynchronous call.'
  end

  no_pings = 0

  while no_pings < @async_calls_max_pings
    no_pings += 1

    # sleep for a few seconds before next ping
    sleep(@async_calls_ping_interval)

    async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
    async_job_client.api_endpoint = @api_async_endpoint

    result = async_job_client.result

    next if result.nil?

    @number_of_pages = async_job_client.number_of_pages
    return [] if result.empty?

    return JSON.parse(result)
  end

  raise ApiException.new('Asynchronous call did not finish in expected timeframe.'),
        'Asynchronous call did not finish in expected timeframe.'
end

#set_custom_parameter(parameter_name, parameter_value) ⇒ Object

Set a custom parameter. Do not use this method unless advised by SelectPdf.

Parameters:

  • parameter_name

    Parameter name.

  • parameter_value

    Parameter value.


# File 'lib/selectpdf.rb', line 2619

def set_custom_parameter(parameter_name, parameter_value)
  @parameters[parameter_name] = parameter_value
end

#start_page=(start_page) ⇒ Object

Set Start Page number. Default value is 1 (first page of the document).

Parameters:

  • start_page

    Start page number (1-based).


# File 'lib/selectpdf.rb', line 2567

def start_page=(start_page)
  @parameters['start_page'] = start_page
end
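
Taken together, start_page and end_page describe a 1-based page range in which an end_page of 0 means "up to the last page". The sketch below works that rule out locally for a document of known length. `resolved_page_range` is a hypothetical helper that is not part of the library, and clamping an oversized end_page to the document length is an assumption about how the service behaves.

```ruby
# Resolves the 1-based page range described by start_page and end_page,
# treating end_page == 0 as "the last page of the document".
# Clamping end_page to total_pages is an assumption, not documented behavior.
def resolved_page_range(start_page, end_page, total_pages)
  last = end_page.zero? ? total_pages : [end_page, total_pages].min
  (start_page..last)
end

puts resolved_page_range(1, 0, 5).to_a.inspect # defaults: whole document
puts resolved_page_range(2, 3, 5).to_a.inspect # explicit sub-range
```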

#text_from_file(input_pdf) ⇒ Object

Get the text from the specified pdf.

Parameters:

  • input_pdf

    Path to a local PDF file.

Returns:

  • Extracted text.


# File 'lib/selectpdf.rb', line 2175

def text_from_file(input_pdf)
  @parameters['async'] = 'False'
  @parameters['action'] = 'Convert'
  @parameters.delete('url')

  @files = {}
  @files['inputPdf'] = input_pdf

  perform_post_as_multipart_formdata
end

#text_from_file_async(input_pdf) ⇒ Object

Get the text from the specified pdf with an asynchronous call.

Parameters:

  • input_pdf

    Path to a local PDF file.

Returns:

  • Extracted text.

Raises:

  • (ApiException) if the asynchronous call cannot be launched or does not finish in the expected timeframe.

# File 'lib/selectpdf.rb', line 2213

def text_from_file_async(input_pdf)
  @parameters['action'] = 'Convert'
  @parameters.delete('url')

  @files = {}
  @files['inputPdf'] = input_pdf

  job_id = start_async_job_multipart_form_data

  if job_id.nil? || job_id.empty?
    raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
  end

  no_pings = 0

  while no_pings < @async_calls_max_pings
    no_pings += 1

    # sleep for a few seconds before next ping
    sleep(@async_calls_ping_interval)

    async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
    async_job_client.api_endpoint = @api_async_endpoint

    result = async_job_client.result

    next if result.nil?

    @number_of_pages = async_job_client.number_of_pages

    return result
  end

  raise ApiException.new('Asynchronous call did not finish in expected timeframe.'), 'Asynchronous call did not finish in expected timeframe.'
end

#text_from_file_to_file(input_pdf, output_file_path) ⇒ Object

Get the text from the specified pdf and write it to the specified text file.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • output_file_path

    Path of the output file where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2190

def text_from_file_to_file(input_pdf, output_file_path)
  result = text_from_file(input_pdf)
  File.open(output_file_path, 'wb') do |file|
    file.write(result)
  end
rescue ApiException
  FileUtils.rm(output_file_path) if File.exist?(output_file_path)
  raise
end
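
The *_to_file methods follow one pattern: fetch the text, write it in binary mode, and on an API error delete whatever is at the output path before re-raising. A standalone sketch of that pattern is given below. `write_result_or_cleanup` and `DemoApiError` are hypothetical names, and the block stands in for the real extraction call.

```ruby
require 'fileutils'
require 'tmpdir'

class DemoApiError < StandardError; end

# Writes whatever the block returns to path. If the block (standing in for
# the API call) raises DemoApiError, any file at path is removed and the
# error is re-raised, as in text_from_file_to_file.
def write_result_or_cleanup(path)
  result = yield
  File.open(path, 'wb') { |file| file.write(result) }
rescue DemoApiError
  FileUtils.rm(path) if File.exist?(path)
  raise
end

Dir.mktmpdir do |dir|
  target = File.join(dir, 'Result.txt')
  write_result_or_cleanup(target) { 'extracted text' }
  puts File.read(target)
end
```

One consequence worth knowing: because the cleanup runs on any API error, a file that already existed at the output path before the call is removed as well when the conversion fails.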

#text_from_file_to_file_async(input_pdf, output_file_path) ⇒ Object

Get the text from the specified pdf with an asynchronous call and write it to the specified text file.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • output_file_path

    Path of the output file where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2253

def text_from_file_to_file_async(input_pdf, output_file_path)
  result = text_from_file_async(input_pdf)
  File.open(output_file_path, 'wb') do |file|
    file.write(result)
  end
rescue ApiException
  FileUtils.rm(output_file_path) if File.exist?(output_file_path)
  raise
end

#text_from_file_to_stream(input_pdf, stream) ⇒ Object

Get the text from the specified pdf and write it to the specified stream.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • stream

    The output stream where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2204

def text_from_file_to_stream(input_pdf, stream)
  result = text_from_file(input_pdf)
  stream.write(result)
end
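
Since the method only calls `stream.write`, any object that responds to `write` can serve as the stream, for example a File, `$stdout`, or a StringIO. The sketch below shows the same shape with an in-memory buffer. `fake_text_from_file` and `demo_text_to_stream` are hypothetical stand-ins so the snippet runs without the API.

```ruby
require 'stringio'

# Stand-in for the extraction call; the real method posts the PDF to the API.
def fake_text_from_file(_input_pdf)
  "Page 1 text\n"
end

# Same shape as text_from_file_to_stream: fetch the text, then stream.write it.
def demo_text_to_stream(input_pdf, stream)
  stream.write(fake_text_from_file(input_pdf))
end

buffer = StringIO.new
demo_text_to_stream('Input.pdf', buffer)
puts buffer.string
```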

#text_from_file_to_stream_async(input_pdf, stream) ⇒ Object

Get the text from the specified pdf with an asynchronous call and write it to the specified stream.

Parameters:

  • input_pdf

    Path to a local PDF file.

  • stream

    The output stream where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2267

def text_from_file_to_stream_async(input_pdf, stream)
  result = text_from_file_async(input_pdf)
  stream.write(result)
end

#text_from_url(url) ⇒ Object

Get the text from the specified pdf.

Parameters:

  • url

    Address of the PDF file.

Returns:

  • Extracted text.


# File 'lib/selectpdf.rb', line 2276

def text_from_url(url)
  if !url.downcase.start_with?('http://') && !url.downcase.start_with?('https://')
    raise ApiException.new('The supported protocols for the PDFs available online are http:// and https://.'),
          'The supported protocols for the PDFs available online are http:// and https://.'
  end

  if url.downcase.start_with?('http://localhost')
    raise ApiException.new('Cannot convert local urls via this method. Use getTextFromFile instead.'),
          'Cannot convert local urls via this method. Use text_from_file instead.'
  end

  @parameters['async'] = 'False'
  @parameters['action'] = 'Convert'

  @files = {}
  @parameters['url'] = url

  perform_post_as_multipart_formdata
end
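
Before any request is sent, the method checks the URL twice: it must start with http:// or https://, and it must not start with http://localhost. The sketch below reproduces those two checks on their own. `validate_pdf_url` and `DemoUrlError` are hypothetical names; the library raises ApiException inline instead.

```ruby
class DemoUrlError < StandardError; end

# Applies the same two checks as text_from_url and returns the URL if it passes.
def validate_pdf_url(url)
  lowered = url.downcase
  unless lowered.start_with?('http://', 'https://')
    raise DemoUrlError,
          'The supported protocols for the PDFs available online are http:// and https://.'
  end

  if lowered.start_with?('http://localhost')
    raise DemoUrlError, 'Cannot convert local urls via this method. Use text_from_file instead.'
  end

  url
end

puts validate_pdf_url('https://selectpdf.com/demo/files/selectpdf.pdf')
```

As written in the source, the second check only matches the literal prefix http://localhost, so other local addresses such as https://localhost or a loopback IP are not caught by it.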

#text_from_url_async(url) ⇒ Object

Get the text from the specified pdf with an asynchronous call.

Parameters:

  • url

    Address of the PDF file.

Returns:

  • Extracted text.

Raises:

  • (ApiException) if the asynchronous call cannot be launched or does not finish in the expected timeframe.

# File 'lib/selectpdf.rb', line 2323

def text_from_url_async(url)
  if !url.downcase.start_with?('http://') && !url.downcase.start_with?('https://')
    raise ApiException.new('The supported protocols for the PDFs available online are http:// and https://.'),
          'The supported protocols for the PDFs available online are http:// and https://.'
  end

  if url.downcase.start_with?('http://localhost')
    raise ApiException.new('Cannot convert local urls via this method. Use getTextFromFile instead.'),
          'Cannot convert local urls via this method. Use text_from_file_async instead.'
  end

  @parameters['action'] = 'Convert'

  @files = {}
  @parameters['url'] = url

  job_id = start_async_job_multipart_form_data

  if job_id.nil? || job_id.empty?
    raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
  end

  no_pings = 0

  while no_pings < @async_calls_max_pings
    no_pings += 1

    # sleep for a few seconds before next ping
    sleep(@async_calls_ping_interval)

    async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
    async_job_client.api_endpoint = @api_async_endpoint

    result = async_job_client.result

    next if result.nil?

    @number_of_pages = async_job_client.number_of_pages

    return result
  end

  raise ApiException.new('Asynchronous call did not finish in expected timeframe.'),
        'Asynchronous call did not finish in expected timeframe.'
end

#text_from_url_to_file(url, output_file_path) ⇒ Object

Get the text from the specified pdf and write it to the specified text file.

Parameters:

  • url

    Address of the PDF file.

  • output_file_path

    Path of the output file where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2300

def text_from_url_to_file(url, output_file_path)
  result = text_from_url(url)
  File.open(output_file_path, 'wb') do |file|
    file.write(result)
  end
rescue ApiException
  FileUtils.rm(output_file_path) if File.exist?(output_file_path)
  raise
end

#text_from_url_to_file_async(url, output_file_path) ⇒ Object

Get the text from the specified pdf with an asynchronous call and write it to the specified text file.

Parameters:

  • url

    Address of the PDF file.

  • output_file_path

    Path of the output file where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2373

def text_from_url_to_file_async(url, output_file_path)
  result = text_from_url_async(url)
  File.open(output_file_path, 'wb') do |file|
    file.write(result)
  end
rescue ApiException
  FileUtils.rm(output_file_path) if File.exist?(output_file_path)
  raise
end

#text_from_url_to_stream(url, stream) ⇒ Object

Get the text from the specified pdf and write it to the specified stream.

Parameters:

  • url

    Address of the PDF file.

  • stream

    The output stream where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2314

def text_from_url_to_stream(url, stream)
  result = text_from_url(url)
  stream.write(result)
end

#text_from_url_to_stream_async(url, stream) ⇒ Object

Get the text from the specified pdf with an asynchronous call and write it to the specified stream.

Parameters:

  • url

    Address of the PDF file.

  • stream

    The output stream where the resulting text will be written.


# File 'lib/selectpdf.rb', line 2387

def text_from_url_to_stream_async(url, stream)
  result = text_from_url_async(url)
  stream.write(result)
end

#text_layout=(text_layout) ⇒ Object

Set the text layout. The default value is SelectPdf::TextLayout::ORIGINAL.

Parameters:

  • text_layout

    The text layout. Possible values: Original, Reading. Use constants from SelectPdf::TextLayout class.


# File 'lib/selectpdf.rb', line 2588

def text_layout=(text_layout)
  unless [0, 1].include?(text_layout)
    raise ApiException.new('Allowed values for Text Layout: 0 (Original), 1 (Reading).'), 'Allowed values for Text Layout: 0 (Original), 1 (Reading).'
  end

  @parameters['text_layout'] = text_layout
end

#timeout=(timeout) ⇒ Object

Set the maximum amount of time (in seconds) for this job. The default value is 30 seconds. Use a larger value (up to 120 seconds allowed) for large documents.

Parameters:

  • timeout

    Timeout in seconds.


# File 'lib/selectpdf.rb', line 2611

def timeout=(timeout)
  @parameters['timeout'] = timeout
end

#user_password=(user_password) ⇒ Object

Set PDF user password.

Parameters:

  • user_password

    PDF user password.


# File 'lib/selectpdf.rb', line 2581

def user_password=(user_password)
  @parameters['user_password'] = user_password
end