Class: SelectPdf::PdfToTextClient
Defined in: lib/selectpdf.rb
Overview
Pdf To Text Conversion with SelectPdf Online API.
Code Sample for PDF To Text
    require 'selectpdf'

    $stdout.sync = true

    print "This is SelectPdf-#{SelectPdf::CLIENT_VERSION}\n"

    test_url = 'https://selectpdf.com/demo/files/selectpdf.pdf'
    test_pdf = 'Input.pdf'
    local_file = 'Result.txt'

    api_key = 'Your API key here'

    begin
      client = SelectPdf::PdfToTextClient.new(api_key)

      # set parameters - see full list at https://selectpdf.com/pdf-to-text-api/
      client.start_page = 1 # start page (processing starts from here)
      client.end_page = 0 # end page (set 0 to process the file until the end)
      client.output_format = SelectPdf::OutputFormat::TEXT # set output format (Text or HTML)

      print "Starting pdf to text ...\n"

      # convert local pdf to local text file
      client.text_from_file_to_file(test_pdf, local_file)

      # extract text from local pdf to memory
      # text = client.text_from_file(test_pdf)
      # print text

      # convert pdf from public url to local text file
      # client.text_from_url_to_file(test_url, local_file)

      # extract text from pdf from public url to memory
      # text = client.text_from_url(test_url)
      # print text

      print "Finished! Number of pages processed: #{client.number_of_pages}.\n"

      # get API usage
      # (lowercase false: the FALSE constant was removed in Ruby 3.2)
      usage_client = SelectPdf::UsageClient.new(api_key)
      usage = usage_client.get_usage(false)
      print("Usage: #{usage}\n")
      print('Conversions remaining this month: ', usage['available'], "\n")
    rescue SelectPdf::ApiException => e
      print("An error occurred: #{e}")
    end
Code Sample for Search Pdf
    require 'selectpdf'

    $stdout.sync = true

    print "This is SelectPdf-#{SelectPdf::CLIENT_VERSION}\n"

    test_url = 'https://selectpdf.com/demo/files/selectpdf.pdf'
    test_pdf = 'Input.pdf'

    api_key = 'Your API key here'

    begin
      client = SelectPdf::PdfToTextClient.new(api_key)

      # set parameters - see full list at https://selectpdf.com/pdf-to-text-api/
      client.start_page = 1 # start page (processing starts from here)
      client.end_page = 0 # end page (set 0 to process the file until the end)
      client.output_format = SelectPdf::OutputFormat::TEXT # set output format (Text or HTML)

      print "Starting search pdf ...\n"

      # search local pdf
      results = client.search_file(test_pdf, 'pdf')

      # search pdf from public url
      # results = client.search_url(test_url, 'pdf')

      print "Search results: #{results}.\nSearch results count: #{results.length}\n"

      print "Finished! Number of pages processed: #{client.number_of_pages}.\n"

      # get API usage
      # (lowercase false: the FALSE constant was removed in Ruby 3.2)
      usage_client = SelectPdf::UsageClient.new(api_key)
      usage = usage_client.get_usage(false)
      print("Usage: #{usage}\n")
      print('Conversions remaining this month: ', usage['available'], "\n")
    rescue SelectPdf::ApiException => e
      print("An error occurred: #{e}")
    end
Instance Attribute Summary
Attributes inherited from ApiClient
#api_async_endpoint, #api_endpoint, #api_web_elements_endpoint, #async_calls_max_pings, #async_calls_ping_interval, #number_of_pages
Instance Method Summary
- #end_page=(end_page) ⇒ Object
  Set End Page number.
- #initialize(api_key) ⇒ PdfToTextClient (constructor)
  Construct the Pdf To Text Client.
- #output_format=(output_format) ⇒ Object
  Set the output format.
- #search_file(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
  Search for a specific text in a PDF document.
- #search_file_async(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
  Search for a specific text in a PDF document with an asynchronous call.
- #search_url(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
  Search for a specific text in a PDF document.
- #search_url_async(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
  Search for a specific text in a PDF document with an asynchronous call.
- #set_custom_parameter(parameter_name, parameter_value) ⇒ Object
  Set a custom parameter.
- #start_page=(start_page) ⇒ Object
  Set Start Page number.
- #text_from_file(input_pdf) ⇒ Object
  Get the text from the specified pdf.
- #text_from_file_async(input_pdf) ⇒ Object
  Get the text from the specified pdf with an asynchronous call.
- #text_from_file_to_file(input_pdf, output_file_path) ⇒ Object
  Get the text from the specified pdf and write it to the specified text file.
- #text_from_file_to_file_async(input_pdf, output_file_path) ⇒ Object
  Get the text from the specified pdf with an asynchronous call and write it to the specified text file.
- #text_from_file_to_stream(input_pdf, stream) ⇒ Object
  Get the text from the specified pdf and write it to the specified stream.
- #text_from_file_to_stream_async(input_pdf, stream) ⇒ Object
  Get the text from the specified pdf with an asynchronous call and write it to the specified stream.
- #text_from_url(url) ⇒ Object
  Get the text from the specified pdf.
- #text_from_url_async(url) ⇒ Object
  Get the text from the specified pdf with an asynchronous call.
- #text_from_url_to_file(url, output_file_path) ⇒ Object
  Get the text from the specified pdf and write it to the specified text file.
- #text_from_url_to_file_async(url, output_file_path) ⇒ Object
  Get the text from the specified pdf with an asynchronous call and write it to the specified text file.
- #text_from_url_to_stream(url, stream) ⇒ Object
  Get the text from the specified pdf and write it to the specified stream.
- #text_from_url_to_stream_async(url, stream) ⇒ Object
  Get the text from the specified pdf with an asynchronous call and write it to the specified stream.
- #text_layout=(text_layout) ⇒ Object
  Set the text layout.
- #timeout=(timeout) ⇒ Object
  Set the maximum amount of time (in seconds) for this job.
- #user_password=(user_password) ⇒ Object
  Set PDF user password.
Constructor Details
#initialize(api_key) ⇒ PdfToTextClient
Construct the Pdf To Text Client.
    # File 'lib/selectpdf.rb', line 2163

    def initialize(api_key)
      super()
      @api_endpoint = 'https://selectpdf.com/api2/pdftotext/'
      @parameters['key'] = api_key
      @file_idx = 0
    end
Instance Method Details
#end_page=(end_page) ⇒ Object
Set End Page number. Default value is 0 (process till the last page of the document).
    # File 'lib/selectpdf.rb', line 2574

    def end_page=(end_page)
      @parameters['end_page'] = end_page
    end
#output_format=(output_format) ⇒ Object
Set the output format. The default value is SelectPdf::OutputFormat::TEXT.
    # File 'lib/selectpdf.rb', line 2599

    def output_format=(output_format)
      unless [0, 1].include?(output_format)
        raise ApiException.new('Allowed values for Output Format: 0 (Text), 1 (Html).'), 'Allowed values for Output Format: 0 (Text), 1 (Html).'
      end

      @parameters['output_format'] = output_format
    end
#search_file(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
Search for a specific text in a PDF document. The pages included in the search are set with start_page= and end_page=.
    # File 'lib/selectpdf.rb', line 2400

    def search_file(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
      if text_to_search.nil? || text_to_search.empty?
        raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
      end

      @parameters['async'] = 'False'
      @parameters['action'] = 'Search'
      @parameters.delete('url')
      @parameters['search_text'] = text_to_search
      @parameters['case_sensitive'] = case_sensitive
      @parameters['whole_words_only'] = whole_words_only

      @files = {}
      @files['inputPdf'] = input_pdf

      @headers['Accept'] = 'text/json'

      result = perform_post_as_multipart_formdata
      return [] if result.nil? || result.empty?

      JSON.parse(result)
    end
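The last lines of search_file decide what the caller receives: an empty array when the response body is nil or empty, otherwise the parsed JSON. A minimal sketch of that step in isolation follows. The helper name parse_search_response and the sample payload are assumptions made for illustration; the field names in particular are not taken from the SelectPdf response schema.

```ruby
require 'json'

# Hypothetical helper that mirrors the final lines of search_file.
def parse_search_response(body)
  # A nil or empty body means "no matches", not an error.
  return [] if body.nil? || body.empty?

  JSON.parse(body)
end

# Illustrative payload only; the real field names may differ.
sample = '[{"PageNumber": 1, "Text": "pdf"}]'
matches = parse_search_response(sample)

puts "Matches: #{matches.length}"
puts "Empty body gives: #{parse_search_response('').inspect}"
```

Because the return value is always an array, callers can use results.length or results.each without a nil check, as the search code sample does.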
#search_file_async(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
Search for a specific text in a PDF document with an asynchronous call. The pages included in the search are set with start_page= and end_page=.
    # File 'lib/selectpdf.rb', line 2431

    def search_file_async(input_pdf, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
      if text_to_search.nil? || text_to_search.empty?
        raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
      end

      @parameters['action'] = 'Search'
      @parameters.delete('url')
      @parameters['search_text'] = text_to_search
      @parameters['case_sensitive'] = case_sensitive
      @parameters['whole_words_only'] = whole_words_only

      @files = {}
      @files['inputPdf'] = input_pdf

      @headers['Accept'] = 'text/json'

      job_id = start_async_job_multipart_form_data

      if job_id.nil? || job_id.empty?
        raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
      end

      no_pings = 0

      while no_pings < @async_calls_max_pings
        no_pings += 1

        # sleep for a few seconds before next ping
        sleep(@async_calls_ping_interval)

        async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
        async_job_client.api_endpoint = @api_async_endpoint

        result = async_job_client.result

        next if result.nil?

        @number_of_pages = async_job_client.number_of_pages

        return [] if result.empty?

        return JSON.parse(result)
      end

      raise ApiException.new('Asynchronous call did not finish in expected timeframe.'), 'Asynchronous call did not finish in expected timeframe.'
    end
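All of the *_async methods share the polling loop shown in the listing: sleep, ask for the job result, and stop either when a non-nil result arrives or when the ping budget is used up. The sketch below isolates that loop so it can be run without the gem or network access. The method name poll_for_result and its keyword arguments are hypothetical stand-ins for async_calls_max_pings and async_calls_ping_interval, and the block stands in for AsyncJobClient#result; a plain RuntimeError replaces ApiException.

```ruby
# Generic polling loop in the style of the *_async methods.
# The block is called once per ping and returns nil while the job is running.
def poll_for_result(max_pings:, ping_interval:)
  pings = 0

  while pings < max_pings
    pings += 1

    # Wait before every ping, including the first one, as the library does.
    sleep(ping_interval)

    result = yield(pings)
    return result unless result.nil?
  end

  raise 'Asynchronous call did not finish in expected timeframe.'
end

# Simulated job that finishes on the third ping.
text = poll_for_result(max_pings: 5, ping_interval: 0.01) do |ping|
  ping >= 3 ? 'extracted text' : nil
end

puts text
```

In a loop of this shape, the time spent sleeping is bounded by the ping count multiplied by the ping interval, so raising either inherited attribute gives long-running jobs more time before the timeout error is raised.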
#search_url(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
Search for a specific text in a PDF document available at a public URL. The pages included in the search are set with start_page= and end_page=.
    # File 'lib/selectpdf.rb', line 2487

    def search_url(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
      if text_to_search.nil? || text_to_search.empty?
        raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
      end

      @parameters['async'] = 'False'
      @parameters['action'] = 'Search'
      @parameters['search_text'] = text_to_search
      @parameters['case_sensitive'] = case_sensitive
      @parameters['whole_words_only'] = whole_words_only

      @files = {}
      @parameters['url'] = url

      @headers['Accept'] = 'text/json'

      result = perform_post_as_multipart_formdata
      return [] if result.nil? || result.empty?

      JSON.parse(result)
    end
#search_url_async(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE) ⇒ Object
Search for a specific text in a PDF document available at a public URL, with an asynchronous call. The pages included in the search are set with start_page= and end_page=.
    # File 'lib/selectpdf.rb', line 2517

    def search_url_async(url, text_to_search, case_sensitive = FALSE, whole_words_only = FALSE)
      if text_to_search.nil? || text_to_search.empty?
        raise ApiException.new('Search text cannot be empty.'), 'Search text cannot be empty.'
      end

      @parameters['action'] = 'Search'
      @parameters['search_text'] = text_to_search
      @parameters['case_sensitive'] = case_sensitive
      @parameters['whole_words_only'] = whole_words_only

      @files = {}
      @parameters['url'] = url

      @headers['Accept'] = 'text/json'

      job_id = start_async_job_multipart_form_data

      if job_id.nil? || job_id.empty?
        raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
      end

      no_pings = 0

      while no_pings < @async_calls_max_pings
        no_pings += 1

        # sleep for a few seconds before next ping
        sleep(@async_calls_ping_interval)

        async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
        async_job_client.api_endpoint = @api_async_endpoint

        result = async_job_client.result

        next if result.nil?

        @number_of_pages = async_job_client.number_of_pages

        return [] if result.empty?

        return JSON.parse(result)
      end

      raise ApiException.new('Asynchronous call did not finish in expected timeframe.'), 'Asynchronous call did not finish in expected timeframe.'
    end
#set_custom_parameter(parameter_name, parameter_value) ⇒ Object
Set a custom parameter. Do not use this method unless advised by SelectPdf.
    # File 'lib/selectpdf.rb', line 2619

    def set_custom_parameter(parameter_name, parameter_value)
      @parameters[parameter_name] = parameter_value
    end
#start_page=(start_page) ⇒ Object
Set Start Page number. Default value is 1 (first page of the document).
    # File 'lib/selectpdf.rb', line 2567

    def start_page=(start_page)
      @parameters['start_page'] = start_page
    end
#text_from_file(input_pdf) ⇒ Object
Get the text from the specified pdf.
    # File 'lib/selectpdf.rb', line 2175

    def text_from_file(input_pdf)
      @parameters['async'] = 'False'
      @parameters['action'] = 'Convert'
      @parameters.delete('url')

      @files = {}
      @files['inputPdf'] = input_pdf

      perform_post_as_multipart_formdata
    end
#text_from_file_async(input_pdf) ⇒ Object
Get the text from the specified pdf with an asynchronous call.
    # File 'lib/selectpdf.rb', line 2213

    def text_from_file_async(input_pdf)
      @parameters['action'] = 'Convert'
      @parameters.delete('url')

      @files = {}
      @files['inputPdf'] = input_pdf

      job_id = start_async_job_multipart_form_data

      if job_id.nil? || job_id.empty?
        raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
      end

      no_pings = 0

      while no_pings < @async_calls_max_pings
        no_pings += 1

        # sleep for a few seconds before next ping
        sleep(@async_calls_ping_interval)

        async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
        async_job_client.api_endpoint = @api_async_endpoint

        result = async_job_client.result

        next if result.nil?

        @number_of_pages = async_job_client.number_of_pages

        return result
      end

      raise ApiException.new('Asynchronous call did not finish in expected timeframe.'), 'Asynchronous call did not finish in expected timeframe.'
    end
#text_from_file_to_file(input_pdf, output_file_path) ⇒ Object
Get the text from the specified pdf and write it to the specified text file.
    # File 'lib/selectpdf.rb', line 2190

    def text_from_file_to_file(input_pdf, output_file_path)
      result = text_from_file(input_pdf)
      File.open(output_file_path, 'wb') do |file|
        file.write(result)
      end
    rescue ApiException
      FileUtils.rm(output_file_path) if File.exist?(output_file_path)
      raise
    end
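Reading the listing above, the conversion runs before the output file is opened, and the rescue clause deletes whatever exists at output_file_path when an ApiException is raised. One consequence appears to be that a failed conversion also removes a file left there by an earlier successful run. The sketch below reproduces that control flow without the gem; write_text_or_clean_up and SketchApiError are hypothetical names standing in for the library method and SelectPdf::ApiException, and the block stands in for text_from_file.

```ruby
require 'fileutils'
require 'tmpdir'

# Stand-in for SelectPdf::ApiException so the sketch runs offline.
class SketchApiError < StandardError; end

# Same shape as text_from_file_to_file: convert, write, clean up on failure.
def write_text_or_clean_up(output_file_path)
  result = yield
  File.open(output_file_path, 'wb') { |file| file.write(result) }
rescue SketchApiError
  FileUtils.rm(output_file_path) if File.exist?(output_file_path)
  raise
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'Result.txt')

  # First run succeeds and creates the file.
  write_text_or_clean_up(path) { 'first run' }
  puts File.read(path)

  # Second run fails; the file from the first run is removed.
  begin
    write_text_or_clean_up(path) { raise SketchApiError, 'conversion failed' }
  rescue SketchApiError => e
    puts "Error: #{e.message}; file still exists? #{File.exist?(path)}"
  end
end
```

If the previous output needs to survive a failed run, one option is to pass a temporary path and rename the file only after the call returns.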
#text_from_file_to_file_async(input_pdf, output_file_path) ⇒ Object
Get the text from the specified pdf with an asynchronous call and write it to the specified text file.
    # File 'lib/selectpdf.rb', line 2253

    def text_from_file_to_file_async(input_pdf, output_file_path)
      result = text_from_file_async(input_pdf)
      File.open(output_file_path, 'wb') do |file|
        file.write(result)
      end
    rescue ApiException
      FileUtils.rm(output_file_path) if File.exist?(output_file_path)
      raise
    end
#text_from_file_to_stream(input_pdf, stream) ⇒ Object
Get the text from the specified pdf and write it to the specified stream.
    # File 'lib/selectpdf.rb', line 2204

    def text_from_file_to_stream(input_pdf, stream)
      result = text_from_file(input_pdf)
      stream.write(result)
    end
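The stream methods call only write on the stream argument, so any object that responds to write should work: an open File, $stdout, or a StringIO buffer. The sketch below shows the StringIO case. FakeTextClient is a hypothetical stand-in for PdfToTextClient that returns a fixed string instead of calling the API, so the example runs without an API key.

```ruby
require 'stringio'

# Stand-in for PdfToTextClient so the sketch runs without network access.
class FakeTextClient
  # Returns canned text instead of posting the PDF to the API.
  def text_from_file(_input_pdf)
    'Hello from a PDF'
  end

  # Same two-line body as the documented method.
  def text_from_file_to_stream(input_pdf, stream)
    result = text_from_file(input_pdf)
    stream.write(result)
  end
end

buffer = StringIO.new
FakeTextClient.new.text_from_file_to_stream('Input.pdf', buffer)

puts buffer.string
```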
#text_from_file_to_stream_async(input_pdf, stream) ⇒ Object
Get the text from the specified pdf with an asynchronous call and write it to the specified stream.
    # File 'lib/selectpdf.rb', line 2267

    def text_from_file_to_stream_async(input_pdf, stream)
      result = text_from_file_async(input_pdf)
      stream.write(result)
    end
#text_from_url(url) ⇒ Object
Get the text from the specified pdf.
    # File 'lib/selectpdf.rb', line 2276

    def text_from_url(url)
      if !url.downcase.start_with?('http://') && !url.downcase.start_with?('https://')
        raise ApiException.new('The supported protocols for the PDFs available online are http:// and https://.'), 'The supported protocols for the PDFs available online are http:// and https://.'
      end

      if url.downcase.start_with?('http://localhost')
        raise ApiException.new('Cannot convert local urls via this method. Use getTextFromFile instead.'), 'Cannot convert local urls via this method. Use text_from_file instead.'
      end

      @parameters['async'] = 'False'
      @parameters['action'] = 'Convert'

      @files = {}
      @parameters['url'] = url

      perform_post_as_multipart_formdata
    end
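text_from_url applies two client-side checks before sending anything: the URL must start with http:// or https:// (case-insensitive), and it must not start with http://localhost. A minimal sketch of those checks as a standalone predicate follows; acceptable_pdf_url? is a hypothetical name, not part of the gem, and it returns false where the library raises ApiException.

```ruby
# Hypothetical predicate reproducing the two checks in text_from_url.
def acceptable_pdf_url?(url)
  lower = url.downcase

  # Only http:// and https:// URLs are supported.
  return false unless lower.start_with?('http://', 'https://')

  # Local URLs are rejected; the library points to text_from_file instead.
  return false if lower.start_with?('http://localhost')

  true
end

[
  'https://selectpdf.com/demo/files/selectpdf.pdf',
  'ftp://example.com/file.pdf',
  'http://localhost/file.pdf'
].each do |url|
  puts "#{url}: #{acceptable_pdf_url?(url)}"
end
```

Note that, as written in the listing, only the http://localhost prefix is rejected; a URL such as https://localhost/... passes the client-side check and would be left for the service to handle.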
#text_from_url_async(url) ⇒ Object
Get the text from the specified pdf with an asynchronous call.
    # File 'lib/selectpdf.rb', line 2323

    def text_from_url_async(url)
      if !url.downcase.start_with?('http://') && !url.downcase.start_with?('https://')
        raise ApiException.new('The supported protocols for the PDFs available online are http:// and https://.'), 'The supported protocols for the PDFs available online are http:// and https://.'
      end

      if url.downcase.start_with?('http://localhost')
        raise ApiException.new('Cannot convert local urls via this method. Use getTextFromFile instead.'), 'Cannot convert local urls via this method. Use text_from_file_async instead.'
      end

      @parameters['action'] = 'Convert'

      @files = {}
      @parameters['url'] = url

      job_id = start_async_job_multipart_form_data

      if job_id.nil? || job_id.empty?
        raise ApiException.new('An error occurred launching the asynchronous call.'), 'An error occurred launching the asynchronous call.'
      end

      no_pings = 0

      while no_pings < @async_calls_max_pings
        no_pings += 1

        # sleep for a few seconds before next ping
        sleep(@async_calls_ping_interval)

        async_job_client = AsyncJobClient.new(@parameters['key'], @job_id)
        async_job_client.api_endpoint = @api_async_endpoint

        result = async_job_client.result

        next if result.nil?

        @number_of_pages = async_job_client.number_of_pages

        return result
      end

      raise ApiException.new('Asynchronous call did not finish in expected timeframe.'), 'Asynchronous call did not finish in expected timeframe.'
    end
#text_from_url_to_file(url, output_file_path) ⇒ Object
Get the text from the specified pdf and write it to the specified text file.
    # File 'lib/selectpdf.rb', line 2300

    def text_from_url_to_file(url, output_file_path)
      result = text_from_url(url)
      File.open(output_file_path, 'wb') do |file|
        file.write(result)
      end
    rescue ApiException
      FileUtils.rm(output_file_path) if File.exist?(output_file_path)
      raise
    end
#text_from_url_to_file_async(url, output_file_path) ⇒ Object
Get the text from the specified pdf with an asynchronous call and write it to the specified text file.
    # File 'lib/selectpdf.rb', line 2373

    def text_from_url_to_file_async(url, output_file_path)
      result = text_from_url_async(url)
      File.open(output_file_path, 'wb') do |file|
        file.write(result)
      end
    rescue ApiException
      FileUtils.rm(output_file_path) if File.exist?(output_file_path)
      raise
    end
#text_from_url_to_stream(url, stream) ⇒ Object
Get the text from the specified pdf and write it to the specified stream.
    # File 'lib/selectpdf.rb', line 2314

    def text_from_url_to_stream(url, stream)
      result = text_from_url(url)
      stream.write(result)
    end
#text_from_url_to_stream_async(url, stream) ⇒ Object
Get the text from the specified pdf with an asynchronous call and write it to the specified stream.
    # File 'lib/selectpdf.rb', line 2387

    def text_from_url_to_stream_async(url, stream)
      result = text_from_url_async(url)
      stream.write(result)
    end
#text_layout=(text_layout) ⇒ Object
Set the text layout. The default value is SelectPdf::TextLayout::ORIGINAL.
    # File 'lib/selectpdf.rb', line 2588

    def text_layout=(text_layout)
      unless [0, 1].include?(text_layout)
        raise ApiException.new('Allowed values for Text Layout: 0 (Original), 1 (Reading).'), 'Allowed values for Text Layout: 0 (Original), 1 (Reading).'
      end

      @parameters['text_layout'] = text_layout
    end
#timeout=(timeout) ⇒ Object
Set the maximum amount of time (in seconds) for this job. The default value is 30 seconds. Use a larger value (up to 120 seconds) for large documents.
    # File 'lib/selectpdf.rb', line 2611

    def timeout=(timeout)
      @parameters['timeout'] = timeout
    end
#user_password=(user_password) ⇒ Object
Set PDF user password.
    # File 'lib/selectpdf.rb', line 2581

    def user_password=(user_password)
      @parameters['user_password'] = user_password
    end