Module: Sluice::Storage::S3
Includes: Contracts
Defined in:
- lib/sluice/storage/s3/s3.rb
- lib/sluice/storage/s3/location.rb
- lib/sluice/storage/s3/manifest.rb
- lib/sluice/storage/s3/contracts.rb
Defined Under Namespace
Classes: Location, Manifest, ManifestScope
Constant Summary
- CONCURRENCY = 10 (threads)
- RETRIES = 3 (attempts)
- RETRY_WAIT = 10 (seconds)
- TIMEOUT_WAIT = 1800 (seconds)
- FogStorage = Fog::Storage::AWS::Real (alias for Contracts)
- FogFile = Fog::Storage::AWS::File (alias for Contracts)
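Constants like RETRIES and RETRY_WAIT are the usual knobs for wrapping flaky S3 calls. A minimal sketch of how such constants could drive a retry loop — an illustration only, not Sluice's internal implementation; `with_retries` is a hypothetical helper:

```ruby
RETRIES = 3      # Attempts
RETRY_WAIT = 10  # Seconds between attempts

# Retry the given block up to `retries` times, sleeping `wait` seconds
# between failed attempts; re-raises the last error if all attempts fail.
def with_retries(retries: RETRIES, wait: RETRY_WAIT)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    if attempts < retries
      sleep(wait)
      retry
    end
    raise
  end
end

# Succeeds on the second attempt (wait: 0 keeps the example fast)
result = with_retries(wait: 0) { |n| raise "transient" if n < 2; "ok after #{n}" }
```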
Class Method Summary
- .copy_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
  Copies files between S3 locations concurrently.
- .copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
  Copies files between S3 locations in two different accounts.
- .copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
  Copies files between S3 locations, maintaining a manifest to avoid copying a file which was copied previously.
- .delete_files(s3, from_files_or_loc, match_regex = '.+') ⇒ Object
  Deletes files from S3 locations concurrently.
- .download_file(s3, from_file, to_file) ⇒ Object
  Downloads a single file to the exact path specified; has no intelligence around filenaming.
- .download_files(s3, from_files_or_loc, to_directory, match_regex = '.+') ⇒ Object
  Downloads files from an S3 location to local storage, concurrently.
- .get_basename(path) ⇒ Object
- .is_empty?(s3, location) ⇒ Boolean
- .is_file?(path) ⇒ Boolean
- .is_folder?(path) ⇒ Boolean
- .list_files(s3, location) ⇒ Object
- .move_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
  Moves files between S3 locations concurrently.
- .move_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
  Moves files between S3 locations in two different accounts.
- .new_fog_s3_from(region, access_key_id, secret_access_key) ⇒ Object
- .upload_file(s3, from_file, to_bucket, to_file) ⇒ Object
  Uploads a single file to the exact location specified; has no intelligence around filenaming.
- .upload_files(s3, from_files_or_dir, to_location, match_glob = '*') ⇒ Object
  Uploads files to S3 locations concurrently.
Class Method Details
.copy_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
Copies files between S3 locations concurrently
Parameters:
- s3: A Fog::Storage s3 connection
- from_files_or_loc: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from
- to_location: S3Location to copy files to
- match_regex: a regex string to match the files to copy
- alter_filename_lambda: lambda to alter the written filename
- flatten: strips off any sub-folders below the from_location
# File 'lib/sluice/storage/s3/s3.rb', line 198

def self.copy_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)
  puts "  copying #{describe_from(from_files_or_loc)} to #{to_location}"
  process_files(:copy, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
end
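To make match_regex and alter_filename_lambda concrete, here is a pure-Ruby sketch of the selection and renaming they describe, applied to plain key strings (the key names are invented for illustration; Sluice applies the same ideas to S3 keys inside process_files):

```ruby
keys = ['events/2013-05-02-part-00', 'events/2013-05-02-part-01', 'events/_SUCCESS']

# match_regex: a regex string selecting which files take part in the copy
match_regex = 'part-\d+$'
selected = keys.select { |k| k =~ Regexp.new(match_regex) }

# alter_filename_lambda: rewrites the filename written at the destination
alter_filename_lambda = lambda { |basename| basename.sub('part-', 'chunk-') }
renamed = selected.map { |k| alter_filename_lambda.call(File.basename(k)) }
```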
.copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
Copies files between S3 locations in two different accounts
Implementation is as follows:
- Concurrent download of all files from S3 source to local tmpdir
- Concurrent upload of all files from local tmpdir to S3 target
In other words, the download and upload are not interleaved (which is inefficient because upload speeds are much lower than download speeds)
Parameters:
- from_s3: A Fog::Storage s3 connection for accessing the from S3Location
- to_s3: A Fog::Storage s3 connection for accessing the to S3Location
- from_location: S3Location to copy files from
- to_location: S3Location to copy files to
- match_regex: a regex string to match the files to copy
- alter_filename_lambda: lambda to alter the written filename
- flatten: strips off any sub-folders below the from_location
# File 'lib/sluice/storage/s3/s3.rb', line 176

def self.copy_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)
  puts "  copying inter-account #{describe_from(from_location)} to #{to_location}"
  processed = []
  Dir.mktmpdir do |t|
    tmp = Sluice::Storage.trail_slash(t)
    processed = download_files(from_s3, from_location, tmp, match_regex)
    upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
  end
  processed
end
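The tmpdir staging above relies on Sluice::Storage.trail_slash to normalise the staging directory path before filenames are appended to it. A hedged reimplementation of that helper, under the assumption that it simply ensures exactly one trailing slash:

```ruby
require 'tmpdir'

# Assumed behaviour of Sluice::Storage.trail_slash: append a trailing
# slash only when one is missing
def trail_slash(path)
  path.end_with?('/') ? path : path + '/'
end

staged = nil
Dir.mktmpdir do |t|
  tmp = trail_slash(t)          # e.g. a path always ending in "/"
  staged = tmp.end_with?('/')   # downstream code can safely append filenames
end
```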
.copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
Copies files between S3 locations maintaining a manifest to avoid copying a file which was copied previously.
Useful in scenarios such as:
- You would like to do a move but only have read permission on the source bucket
- You would like to do a move but some other process needs to use the files after you
Parameters:
- s3: A Fog::Storage s3 connection
- manifest: A Sluice::Storage::S3::Manifest object
- from_files_or_loc: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to copy files from
- to_location: S3Location to copy files to
- match_regex: a regex string to match the files to copy
- alter_filename_lambda: lambda to alter the written filename
- flatten: strips off any sub-folders below the from_location
# File 'lib/sluice/storage/s3/s3.rb', line 220

def self.copy_files_manifest(s3, manifest, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)
  puts "  copying with manifest #{describe_from(from_files_or_loc)} to #{to_location}"
  ignore = manifest.get_entries(s3) # Files to leave untouched
  processed = process_files(:copy, s3, from_files_or_loc, ignore, match_regex, to_location, alter_filename_lambda, flatten)
  manifest.add_entries(s3, processed)
  processed
end
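The manifest mechanics reduce to set arithmetic: keys already recorded are dropped from the work list, and newly processed keys are recorded afterwards. A pure-Ruby sketch of that idea — SimpleManifest and the key names are invented stand-ins; the real Sluice::Storage::S3::Manifest persists its entries in S3:

```ruby
# Invented in-memory stand-in for Sluice::Storage::S3::Manifest
class SimpleManifest
  def initialize; @entries = []; end
  def get_entries; @entries.dup; end
  def add_entries(keys); @entries.concat(keys); end
end

# Mirrors the shape of copy_files_manifest: skip known entries, record new ones
def copy_with_manifest(manifest, candidates)
  ignore = manifest.get_entries   # Files to leave untouched
  processed = candidates - ignore # "copy" only files not yet in the manifest
  manifest.add_entries(processed)
  processed
end

manifest = SimpleManifest.new
first  = copy_with_manifest(manifest, ['a.gz', 'b.gz'])
second = copy_with_manifest(manifest, ['a.gz', 'b.gz', 'c.gz'])
```

On the second run only the genuinely new file is processed, which is what makes repeated invocations against the same source bucket safe.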
.delete_files(s3, from_files_or_loc, match_regex = '.+') ⇒ Object
Deletes files from S3 locations concurrently
Parameters:
- s3: A Fog::Storage s3 connection
- from_files_or_loc: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to delete files from
- match_regex: a regex string to match the files to delete
# File 'lib/sluice/storage/s3/s3.rb', line 151

def self.delete_files(s3, from_files_or_loc, match_regex='.+')
  puts "  deleting #{describe_from(from_files_or_loc)}"
  process_files(:delete, s3, from_files_or_loc, [], match_regex)
end
.download_file(s3, from_file, to_file) ⇒ Object
Downloads a single file to the exact path specified. Has no intelligence around filenaming. Makes sure to create the path as needed.
Parameters:
- s3: A Fog::Storage s3 connection
- from_file: A Fog::Storage::AWS::File to download
- to_file: A local file path
# File 'lib/sluice/storage/s3/s3.rb', line 318

def self.download_file(s3, from_file, to_file)
  FileUtils.mkdir_p(File.dirname(to_file))

  # TODO: deal with bug where Fog hangs indefinitely if network connection dies during download
  local_file = File.open(to_file, "w")
  local_file.write(from_file.body)
  local_file.close
end
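The mkdir_p-then-write pattern above is worth noting: the destination directory tree is created before the file handle is opened, so callers may pass a path whose folders do not exist yet. A standalone demonstration of the same pattern, with a local string standing in for from_file.body:

```ruby
require 'fileutils'
require 'tmpdir'

body = "example bytes"   # stands in for from_file.body
written = nil
Dir.mktmpdir do |root|
  to_file = File.join(root, 'nested', 'dirs', 'events.gz')
  FileUtils.mkdir_p(File.dirname(to_file))      # create the path as needed
  File.open(to_file, "w") { |f| f.write(body) } # then write the payload
  written = File.read(to_file)
end
```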
.download_files(s3, from_files_or_loc, to_directory, match_regex = '.+') ⇒ Object
Downloads files from an S3 location to local storage, concurrently
Parameters:
- s3: A Fog::Storage s3 connection
- from_files_or_loc: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to download files from
- to_directory: Local directory to copy files to
- match_regex: a regex string to match the files to download
# File 'lib/sluice/storage/s3/s3.rb', line 139

def self.download_files(s3, from_files_or_loc, to_directory, match_regex='.+')
  puts "  downloading #{describe_from(from_files_or_loc)} to #{to_directory}"
  process_files(:download, s3, from_files_or_loc, [], match_regex, to_directory)
end
.get_basename(path) ⇒ Object
# File 'lib/sluice/storage/s3/s3.rb', line 108

def self.get_basename(path)
  if is_folder?(path)
    nil
  else
    match = path.match('([^/]+)$')
    if match
      match[1]
    else
      nil
    end
  end
end
.is_empty?(s3, location) ⇒ Boolean
# File 'lib/sluice/storage/s3/s3.rb', line 127

def self.is_empty?(s3, location)
  list_files(s3, location).length == 0
end
.is_file?(path) ⇒ Boolean
# File 'lib/sluice/storage/s3/s3.rb', line 96

def self.is_file?(path)
  !is_folder?(path)
end
.is_folder?(path) ⇒ Boolean
# File 'lib/sluice/storage/s3/s3.rb', line 84

def self.is_folder?(path)
  (path.end_with?('_$folder$') || # EMR-created
   path.end_with?('/'))
end
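Taken together, get_basename, is_file? and is_folder? classify S3 keys by shape alone: a trailing slash or an EMR-created `_$folder$` marker means "folder", everything else is a file. Their logic, restated as standalone functions from the source shown above (the sample keys are invented):

```ruby
# Restatement of is_folder?: trailing slash, or EMR's "_$folder$" marker
def folder?(path)
  path.end_with?('_$folder$') ||
    path.end_with?('/')
end

# Restatement of get_basename: last path segment of a file key, nil for folders
def basename(path)
  return nil if folder?(path)
  m = path.match('([^/]+)$')
  m ? m[1] : nil
end

name = basename('events/part-00')
```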
.list_files(s3, location) ⇒ Object
# File 'lib/sluice/storage/s3/s3.rb', line 65

def self.list_files(s3, location)
  files_and_dirs = s3.directories.get(location.bucket, prefix: location.dir_as_path).files

  files = [] # Can't use a .select because of Ruby deep copy issues (array of non-POROs)
  files_and_dirs.each { |f|
    if is_file?(f.key)
      files << f.dup
    end
  }
  files
end
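The comment in list_files about avoiding .select is a small design note: instead of returning references held by Fog's collection, each matching model is copied out with .dup. The filtering itself is ordinary Ruby, as this sketch over plain string keys shows (the keys are invented; `file_key?` is a stand-in for the is_file? test applied to f.key):

```ruby
# Stand-in for is_file?(f.key): anything that is not a folder-shaped key
def file_key?(key)
  !(key.end_with?('/') || key.end_with?('_$folder$'))
end

keys = ['logs/', 'logs_$folder$', 'logs/2013-05-02.gz', 'logs/2013-05-03.gz']

files = []
keys.each { |k|
  files << k.dup if file_key?(k)  # .dup mirrors the deep-copy workaround
}
```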
.move_files(s3, from_files_or_loc, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
Moves files between S3 locations concurrently
Parameters:
- s3: A Fog::Storage s3 connection
- from_files_or_loc: Array of filepaths or Fog::Storage::AWS::File objects, or S3Location to move files from
- to_location: S3Location to move files to
- match_regex: a regex string to match the files to move
- alter_filename_lambda: lambda to alter the written filename
- flatten: strips off any sub-folders below the from_location
# File 'lib/sluice/storage/s3/s3.rb', line 270

def self.move_files(s3, from_files_or_loc, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)
  puts "  moving #{describe_from(from_files_or_loc)} to #{to_location}"
  process_files(:move, s3, from_files_or_loc, [], match_regex, to_location, alter_filename_lambda, flatten)
end
.move_files_inter(from_s3, to_s3, from_location, to_location, match_regex = '.+', alter_filename_lambda = nil, flatten = false) ⇒ Object
Moves files between S3 locations in two different accounts
Implementation is as follows:
- Concurrent download of all files from S3 source to local tmpdir
- Concurrent upload of all files from local tmpdir to S3 target
- Concurrent deletion of all files from S3 source
In other words, the three operations are not interleaved (which is inefficient because upload speeds are much lower than download speeds)
Parameters:
- from_s3: A Fog::Storage s3 connection for accessing the from S3Location
- to_s3: A Fog::Storage s3 connection for accessing the to S3Location
- from_location: S3Location to move files from
- to_location: S3Location to move files to
- match_regex: a regex string to match the files to move
- alter_filename_lambda: lambda to alter the written filename
- flatten: strips off any sub-folders below the from_location
# File 'lib/sluice/storage/s3/s3.rb', line 247

def self.move_files_inter(from_s3, to_s3, from_location, to_location, match_regex='.+', alter_filename_lambda=nil, flatten=false)
  puts "  moving inter-account #{describe_from(from_location)} to #{to_location}"
  processed = []
  Dir.mktmpdir do |t|
    tmp = Sluice::Storage.trail_slash(t)
    processed = download_files(from_s3, from_location, tmp, match_regex)
    upload_files(to_s3, tmp, to_location, '**/*') # Upload all files we downloaded
    delete_files(from_s3, from_location, '.+')    # Delete all files we downloaded
  end
  processed
end
.new_fog_s3_from(region, access_key_id, secret_access_key) ⇒ Object
# File 'lib/sluice/storage/s3/s3.rb', line 46

def self.new_fog_s3_from(region, access_key_id, secret_access_key)
  fog = Fog::Storage.new({
    :provider => 'AWS',
    :region => region,
    :aws_access_key_id => access_key_id,
    :aws_secret_access_key => secret_access_key
  })
  fog.sync_clock
  fog
end
.upload_file(s3, from_file, to_bucket, to_file) ⇒ Object
Uploads a single file to the exact location specified. Has no intelligence around filenaming.
Parameters:
- s3: A Fog::Storage s3 connection
- from_file: A local file path
- to_bucket: The Fog::Directory to upload to
- to_file: The file path to upload to
# File 'lib/sluice/storage/s3/s3.rb', line 297

def self.upload_file(s3, from_file, to_bucket, to_file)
  local_file = File.open(from_file)

  dir = s3.directories.new(:key => to_bucket) # No request made
  file = dir.files.create(
    :key  => to_file,
    :body => local_file
  )

  local_file.close
end
.upload_files(s3, from_files_or_dir, to_location, match_glob = '*') ⇒ Object
Uploads files to S3 locations concurrently
Parameters:
- s3: A Fog::Storage s3 connection
- from_files_or_dir: Local array of files or local directory to upload files from
- to_location: S3Location to upload files to
- match_glob: a filesystem glob to match the files to upload
# File 'lib/sluice/storage/s3/s3.rb', line 283

def self.upload_files(s3, from_files_or_dir, to_location, match_glob='*')
  puts "  uploading #{describe_from(from_files_or_dir)} to #{to_location}"
  process_files(:upload, s3, from_files_or_dir, [], match_glob, to_location)
end
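Note that upload_files filters with a filesystem glob (match_glob) rather than a regex, unlike the regex-based download, copy and move methods. A quick illustration of glob matching with Dir.glob on a temporary directory (the filenames are invented):

```ruby
require 'tmpdir'
require 'fileutils'

matched = nil
Dir.mktmpdir do |root|
  ['part-00.gz', 'part-01.gz', 'notes.txt'].each do |name|
    FileUtils.touch(File.join(root, name))
  end
  # '*.gz' is a filesystem glob, not a regex like the '.+' default elsewhere
  matched = Dir.glob(File.join(root, '*.gz')).map { |p| File.basename(p) }.sort
end
```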