Module: Sluice::Storage::S3

Includes:
Contracts
Defined in:
lib/sluice/storage/s3/s3.rb,
lib/sluice/storage/s3/location.rb,
lib/sluice/storage/s3/manifest.rb,
lib/sluice/storage/s3/contracts.rb

Defined Under Namespace

Classes: Location, Manifest, ManifestScope

Constant Summary

CONCURRENCY = 10     # Threads
RETRIES = 3          # Attempts
RETRY_WAIT = 10      # Seconds
TIMEOUT_WAIT = 1800  # Seconds
FogStorage = Fog::Storage::AWS::Real  # Alias for Contracts
FogFile = Fog::Storage::AWS::File     # Alias for Contracts


Class Method Details

.copy_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object

Copies files between S3 locations concurrently

Parameters:

s3

A Fog::Storage s3 connection

from_files_or_loc

Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from

to_location

S3Location to copy files to

match_regex

a regex string to match the files to copy

alter_filename_lambda

lambda to alter the written filename

flatten

strips off any sub-folders below the from_location



# File 'lib/sluice/storage/s3/s3.rb', line 198

def self.copy_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)

  puts "  copying #{describe_from(from_files_or_loc)} to #{to_location}"
  process_files(:copy, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
end
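
A minimal usage sketch (bucket names, credentials and the rename lambda are illustrative; this assumes Location.new accepts a trailing-slash s3:// URI, per location.rb, and that alter_filename_lambda receives the matched filename):

require 'sluice'

s3 = Sluice::Storage::S3.new_fog_s3_from('us-east-1',
                                         ENV['AWS_ACCESS_KEY_ID'],
                                         ENV['AWS_SECRET_ACCESS_KEY'])
source = Sluice::Storage::S3::Location.new('s3://my-in-bucket/events/')
target = Sluice::Storage::S3::Location.new('s3://my-archive-bucket/events/')

# Copy only gzipped files, prefixing each written filename
add_prefix = lambda { |name| "archived-#{name}" }
Sluice::Storage::S3.copy_files(s3, source, target, '.+\.gz$', add_prefix)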

.copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object

Copies files between S3 locations in two different accounts

Implementation is as follows:

  1. Concurrent download of all files from S3 source to local tmpdir

  2. Concurrent upload of all files from local tmpdir to S3 target

In other words, the download and upload are not interleaved (which is inefficient, because upload speeds are much lower than download speeds).

Parameters:

from_s3

A Fog::Storage s3 connection for accessing the from S3Location

to_s3

A Fog::Storage s3 connection for accessing the to S3Location

from_location

S3Location to copy files from

to_location

S3Location to copy files to

match_regex

a regex string to match the files to move

alter_filename_lambda

lambda to alter the written filename

flatten

strips off any sub-folders below the from_location



# File 'lib/sluice/storage/s3/s3.rb', line 176

def self.copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)

  puts "  copying inter-account #{describe_from(from_location)} to #{to_location}"
  processed = []
  Dir.mktmpdir do |t|
    tmp = Sluice::Storage.trail_slash(t)
    processed = download_files(from_s3, from_location, tmp, match_regex)
    upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
  end

  processed
end
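
A sketch of an inter-account copy, assuming two sets of credentials (all names are placeholders):

from_s3 = Sluice::Storage::S3.new_fog_s3_from('us-east-1', src_key, src_secret)
to_s3   = Sluice::Storage::S3.new_fog_s3_from('eu-west-1', dst_key, dst_secret)

from = Sluice::Storage::S3::Location.new('s3://account-a-bucket/raw/')
to   = Sluice::Storage::S3::Location.new('s3://account-b-bucket/raw/')

Sluice::Storage::S3.copy_files_inter(from_s3, to_s3, from, to, '.+\.gz$')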

.copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object

Copies files between S3 locations maintaining a manifest to avoid copying a file which was copied previously.

Useful in scenarios such as:

  1. You would like to do a move but only have read permission on the source bucket

  2. You would like to do a move but some other process needs to use the files after you

Parameters:

s3

A Fog::Storage s3 connection

manifest

A Sluice::Storage::S3::Manifest object

from_files_or_loc

Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from

to_location

S3Location to copy files to

match_regex

a regex string to match the files to copy

alter_filename_lambda

lambda to alter the written filename

flatten

strips off any sub-folders below the from_location



# File 'lib/sluice/storage/s3/s3.rb', line 220

def self.copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)

  puts "  copying with manifest #{describe_from(from_files_or_loc)} to #{to_location}"
  ignore = manifest.get_entries(s3) # Files to leave untouched
  processed = process_files(:copy, s3, from_files_or_loc, ignore, match_regex, to_location, alter_filename_lambda, flatten)
  manifest.add_entries(s3, processed)

  processed
end
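
A sketch using a manifest so that re-runs skip previously copied files. The Manifest constructor is assumed here to take the location where the manifest file is kept plus a scope; check manifest.rb for the exact signature:

manifest_loc = Sluice::Storage::S3::Location.new('s3://my-archive-bucket/manifest/')
manifest = Sluice::Storage::S3::Manifest.new(manifest_loc, :filename) # assumed constructor

processed = Sluice::Storage::S3.copy_files_manifest(s3, manifest, source, target)
puts "Copied #{processed.length} new file(s)"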

.delete_files(s3, from_files_or_loc, match_regex = '.+') ⇒ Object

Deletes files from S3 locations concurrently

Parameters:

s3

A Fog::Storage s3 connection

from_files_or_loc

Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to delete files from

match_regex

a regex string to match the files to delete



# File 'lib/sluice/storage/s3/s3.rb', line 151

def self.delete_files(s3, from_files_or_loc, match_regex='.+')

  puts "  deleting #{describe_from(from_files_or_loc)}"
  process_files(:delete, s3, from_files_or_loc, [], match_regex)
end
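
For example, deleting only temporary files under a location (reusing the s3 connection and Location conventions above):

scratch = Sluice::Storage::S3::Location.new('s3://my-in-bucket/scratch/')
Sluice::Storage::S3.delete_files(s3, scratch, '.+\.tmp$')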

.download_file(s3, from_file, to_file) ⇒ Object

Downloads a single file to the exact path specified. Has no intelligence around file naming, but makes sure to create the path as needed.

Parameters:

s3

A Fog::Storage s3 connection

from_file

A Fog::Storage::AWS::File to download

to_file

A local file path



# File 'lib/sluice/storage/s3/s3.rb', line 318

def self.download_file(s3, from_file, to_file)

  FileUtils.mkdir_p(File.dirname(to_file))

  # TODO: deal with bug where Fog hangs indefinitely if network connection dies during download

  local_file = File.open(to_file, "w")
  local_file.write(from_file.body)
  local_file.close
end
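
A sketch of downloading one known key via Fog's directories/files API (bucket and key are placeholders):

file = s3.directories.get('my-in-bucket').files.get('events/2014-01-01.gz')
Sluice::Storage::S3.download_file(s3, file, '/tmp/events/2014-01-01.gz')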

.download_files(s3, from_files_or_loc, to_directory, match_regex = '.+') ⇒ Object

Downloads files from an S3 location to local storage concurrently

Parameters:

s3

A Fog::Storage s3 connection

from_files_or_loc

Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to download files from

to_directory

Local directory to copy files to

match_regex

a regex string to match the files to download



# File 'lib/sluice/storage/s3/s3.rb', line 139

def self.download_files(s3, from_files_or_loc, to_directory, match_regex='.+')

  puts "  downloading #{describe_from(from_files_or_loc)} to #{to_directory}"
  process_files(:download, s3, from_files_or_loc, [], match_regex, to_directory)
end
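
For example, pulling down all gzipped files under a location (source as in the copy_files example; a trailing slash on the local directory is assumed, matching the library's own tmpdir handling):

Sluice::Storage::S3.download_files(s3, source, '/tmp/events/', '.+\.gz$')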

.get_basename(path) ⇒ Object

Returns the basename (filename component) of an S3 path, or nil if the path is a folder.



# File 'lib/sluice/storage/s3/s3.rb', line 108

def self.get_basename(path)
  if is_folder?(path)
    nil
  else
    match = path.match('([^/]+)$')
    if match
      match[1]
    else
      nil
    end
  end
end
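
Behaviour, per the source above:

Sluice::Storage::S3.get_basename('events/2014/part-0001.gz') # => "part-0001.gz"
Sluice::Storage::S3.get_basename('events/2014/')             # => nil (folder)
Sluice::Storage::S3.get_basename('events_$folder$')          # => nil (EMR folder marker)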

.is_empty?(s3, location) ⇒ Boolean

Checks whether the given S3 location contains no files.

Returns:

  • (Boolean)


# File 'lib/sluice/storage/s3/s3.rb', line 127

def self.is_empty?(s3, location)
  list_files(s3, location).length == 0
end

.is_file?(path) ⇒ Boolean

Checks whether an S3 path points to a file rather than a folder.

Returns:

  • (Boolean)


# File 'lib/sluice/storage/s3/s3.rb', line 96

def self.is_file?(path)
  !is_folder?(path)
end

.is_folder?(path) ⇒ Boolean

Checks whether an S3 path is a folder, i.e. ends in '/' or in the EMR-created '_$folder$' marker.

Returns:

  • (Boolean)


# File 'lib/sluice/storage/s3/s3.rb', line 84

def self.is_folder?(path)
  (path.end_with?('_$folder$') || # EMR-created
    path.end_with?('/'))
end
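
The three predicates in action (per the sources above; is_empty? needs a live connection):

Sluice::Storage::S3.is_folder?('logs/')          # => true
Sluice::Storage::S3.is_folder?('logs_$folder$')  # => true (EMR-created marker)
Sluice::Storage::S3.is_file?('logs/2014.gz')     # => true
Sluice::Storage::S3.is_empty?(s3, source)        # => true if no files under the prefix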

.list_files(s3, location) ⇒ Object

Lists the files (excluding folders) at the given S3 location.



# File 'lib/sluice/storage/s3/s3.rb', line 65

def self.list_files(s3, location)
  files_and_dirs = s3.directories.get(location.bucket, prefix: location.dir_as_path).files

  files = [] # Can't use a .select because of Ruby deep copy issues (array of non-POROs)
  files_and_dirs.each { |f|
    if is_file?(f.key)
      files << f.dup
    end
  }
  files
end
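
For example, printing each file's key and size (content_length is a standard Fog::Storage::AWS::File attribute):

Sluice::Storage::S3.list_files(s3, source).each do |f|
  puts "#{f.key} (#{f.content_length} bytes)"
end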

.move_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object

Moves files between S3 locations concurrently

Parameters:

s3

A Fog::Storage s3 connection

from_files_or_loc

Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to move files from

to_location

S3Location to move files to

match_regex

a regex string to match the files to move

alter_filename_lambda

lambda to alter the written filename

flatten

strips off any sub-folders below the from_location



# File 'lib/sluice/storage/s3/s3.rb', line 270

def self.move_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)

  puts "  moving #{describe_from(from_files_or_loc)} to #{to_location}"
  process_files(:move, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
end
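
For example, moving matching files while discarding any sub-folder structure under the source (flatten = true; source and target as in the copy_files example):

Sluice::Storage::S3.move_files(s3, source, target, '.+\.gz$', nil, true)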

.move_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object

Moves files between S3 locations in two different accounts

Implementation is as follows:

  1. Concurrent download of all files from S3 source to local tmpdir

  2. Concurrent upload of all files from local tmpdir to S3 target

  3. Concurrent deletion of all files from S3 source

In other words, the three operations are not interleaved (which is inefficient because upload speeds are much lower than download speeds)

Parameters:

from_s3

A Fog::Storage s3 connection for accessing the from S3Location

to_s3

A Fog::Storage s3 connection for accessing the to S3Location

from_location

S3Location to move files from

to_location

S3Location to move files to

match_regex

a regex string to match the files to move

alter_filename_lambda

lambda to alter the written filename

flatten

strips off any sub-folders below the from_location



# File 'lib/sluice/storage/s3/s3.rb', line 247

def self.move_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)

  puts "  moving inter-account #{describe_from(from_location)} to #{to_location}"
  processed = []
  Dir.mktmpdir do |t|
    tmp = Sluice::Storage.trail_slash(t)
    processed = download_files(from_s3, from_location, tmp, match_regex)
    upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
    delete_files(from_s3, from_location, '.+') # Delete all files we downloaded
  end

  processed
end
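
Usage mirrors copy_files_inter, except the source files are deleted once uploaded (connections and locations as in that example):

processed = Sluice::Storage::S3.move_files_inter(from_s3, to_s3, from, to)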

.new_fog_s3_from(region, access_key_id, secret_access_key) ⇒ Object

Creates a new Fog::Storage connection to S3 for the given AWS region and credentials, syncing the client clock against S3 to avoid request-time-skew errors.



# File 'lib/sluice/storage/s3/s3.rb', line 46

def self.new_fog_s3_from(region, access_key_id, secret_access_key)
  fog = Fog::Storage.new({
    :provider => 'AWS',
    :region => region,
    :aws_access_key_id => access_key_id,
    :aws_secret_access_key => secret_access_key
  })
  fog.sync_clock
  fog
end
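
For example (credentials are placeholders; the sync_clock call guards against RequestTimeTooSkewed errors caused by local clock drift):

s3 = Sluice::Storage::S3.new_fog_s3_from('us-east-1',
                                         ENV['AWS_ACCESS_KEY_ID'],
                                         ENV['AWS_SECRET_ACCESS_KEY'])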

.upload_file(s3, from_file, to_bucket, to_file) ⇒ Object

Uploads a single file to the exact location specified. Has no intelligence around file naming.

Parameters:

s3

A Fog::Storage s3 connection

from_file

A local file path

to_bucket

Name of the S3 bucket to upload to (used as the key of the Fog::Directory)

to_file

The file path to upload to



# File 'lib/sluice/storage/s3/s3.rb', line 297

def self.upload_file(s3, from_file, to_bucket, to_file)

  local_file = File.open(from_file)

  dir = s3.directories.new(:key => to_bucket) # No request made
  file = dir.files.create(
    :key    => to_file,
    :body   => local_file
  )

  local_file.close
end
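
A sketch of uploading one local file to an exact key (bucket and paths are placeholders):

Sluice::Storage::S3.upload_file(s3, '/tmp/reports/daily.csv', 'my-out-bucket', 'reports/daily.csv')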

.upload_files(s3, from_files_or_dir, to_location, match_glob = '*') ⇒ Object

Uploads files to S3 locations concurrently

Parameters:

s3

A Fog::Storage s3 connection

from_files_or_dir

Local array of files or local directory to upload files from

to_location

S3Location to upload files to

match_glob

a filesystem glob to match the files to upload



# File 'lib/sluice/storage/s3/s3.rb', line 283

def self.upload_files(s3, from_files_or_dir, to_location, match_glob='*')

  puts "  uploading #{describe_from(from_files_or_dir)} to #{to_location}"
  process_files(:upload, s3, from_files_or_dir, [], match_glob, to_location)
end
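
For example, uploading every CSV file in a local directory (note that match_glob is a filesystem glob, not a regex):

target = Sluice::Storage::S3::Location.new('s3://my-out-bucket/reports/')
Sluice::Storage::S3.upload_files(s3, '/tmp/reports/', target, '*.csv')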