Class: MiGA::RemoteDataset
Defined in:
- lib/miga/remote_dataset.rb
- lib/miga/remote_dataset/base.rb
- lib/miga/remote_dataset/download.rb
Overview
MiGA representation of datasets with data in remote locations.
Defined Under Namespace
Constant Summary
Constants included from MiGA
CITATION, VERSION, VERSION_DATE, VERSION_NAME
Instance Attribute Summary
-
#db ⇒ Object
readonly
Database storing the dataset.
-
#ids ⇒ Object
readonly
Array of IDs of the entries composing the dataset.
-
#metadata ⇒ Object
readonly
Internal metadata hash.
-
#universe ⇒ Object
readonly
Universe of the dataset.
Class Method Summary
-
.download(universe, db, ids, format, file = nil, extra = [], obj = nil) ⇒ Object
Download data from the universe in the database db with IDs ids and in format.
-
.download_rest(opts) ⇒ Object
(also: download_net)
Download data using the REST method.
-
.download_url(url) ⇒ Object
Download the given url and return the result regardless of response code.
-
.ncbi_asm_acc2id(acc, retrials = 3) ⇒ Object
Translate an NCBI Assembly Accession (acc) to the corresponding internal NCBI ID, with up to retrials retries if the returned JSON document does not conform to the expected format.
-
.ncbi_asm_rest(opts) ⇒ Object
Download data from NCBI Assembly database using the REST method.
-
.ncbi_gb_rest(opts) ⇒ Object
Download data from NCBI GenBank (nuccore) database using the REST method.
-
.ncbi_map(id, dbfrom, db) ⇒ Object
Looks for the entry id in dbfrom, and returns the linked identifier in db (or nil).
-
.UNIVERSE ⇒ Object
Instance Method Summary
-
#get_gtdb_taxonomy ⇒ Object
Get GTDB taxonomy as MiGA::Taxonomy.
-
#get_metadata(metadata_def = {}) ⇒ Object
Get metadata from the remote location.
-
#get_ncbi_taxid ⇒ Object
Get NCBI Taxonomy ID.
-
#get_ncbi_taxonomy ⇒ Object
Get NCBI taxonomy as MiGA::Taxonomy.
-
#get_type_status(metadata) ⇒ Object
Get the type material status and return an (updated) metadata hash.
-
#initialize(ids, db, universe) ⇒ RemoteDataset
constructor
Initialize MiGA::RemoteDataset with ids in database db from universe.
-
#ncbi_asm_json_doc ⇒ Object
Get the JSON document describing an NCBI assembly entry.
-
#save_to(project, name = nil, is_ref = true, metadata_def = {}) ⇒ Object
Save dataset to the MiGA::Project project identified with name.
-
#update_metadata(dataset, metadata = {}) ⇒ Object
Updates the MiGA::Dataset dataset with the remotely available metadata, and optionally the Hash metadata.
Methods included from Download
Methods inherited from MiGA
CITATION, CITATION_ARRAY, DEBUG, DEBUG_OFF, DEBUG_ON, DEBUG_TRACE_OFF, DEBUG_TRACE_ON, FULL_VERSION, LONG_VERSION, VERSION, VERSION_DATE, #advance, debug?, debug_trace?, initialized?, #like_io?, #num_suffix, rc_path, #result_files_exist?, #say
Methods included from Common::Path
Methods included from Common::Format
#clean_fasta_file, #seqs_length, #tabulate
Methods included from Common::Net
#download_file_ftp, #known_hosts, #main_server, #remote_connection
Methods included from Common::SystemCall
Constructor Details
#initialize(ids, db, universe) ⇒ RemoteDataset
Initialize MiGA::RemoteDataset with ids in database db from universe.
# File 'lib/miga/remote_dataset.rb', line 67
def initialize(ids, db, universe)
  ids = [ids] unless ids.is_a? Array
  @ids = (ids.is_a?(Array) ? ids : [ids])
  @db = db.to_sym
  @universe = universe.to_sym
  @metadata = {}
  @metadata[:"#{universe}_#{db}"] = ids.join(',')
  @@UNIVERSE.keys.include?(@universe) or
    raise "Unknown Universe: #{@universe}. Try: #{@@UNIVERSE.keys}"
  @@UNIVERSE[@universe][:dbs].include?(@db) or
    raise "Unknown Database: #{@db}. Try: #{@@UNIVERSE[@universe][:dbs].keys}"
  @_ncbi_asm_json_doc = nil
  # FIXME: Part of the +map_to+ support:
  # unless @@UNIVERSE[@universe][:dbs][@db][:map_to].nil?
  #   MiGA::RemoteDataset.download
  # end
end
Instance Attribute Details
#db ⇒ Object (readonly)
Database storing the dataset.
# File 'lib/miga/remote_dataset.rb', line 57
def db
  @db
end
#ids ⇒ Object (readonly)
Array of IDs of the entries composing the dataset.
# File 'lib/miga/remote_dataset.rb', line 59
def ids
  @ids
end
#metadata ⇒ Object (readonly)
Internal metadata hash.
# File 'lib/miga/remote_dataset.rb', line 61
def metadata
  @metadata
end
#universe ⇒ Object (readonly)
Universe of the dataset.
# File 'lib/miga/remote_dataset.rb', line 55
def universe
  @universe
end
Class Method Details
.download(universe, db, ids, format, file = nil, extra = [], obj = nil) ⇒ Object
Download data from the universe in the database db with IDs ids and in format. If passed, it saves the result in file. Additional parameters specific to the download method can be passed using extra. Returns String. The obj can also be passed as MiGA::RemoteDataset or MiGA::Dataset.
# File 'lib/miga/remote_dataset/download.rb', line 14
def download(universe, db, ids, format, file = nil, extra = [], obj = nil)
  ids = [ids] unless ids.is_a? Array
  getter = @@UNIVERSE[universe][:dbs][db][:getter] || :download
  method = @@UNIVERSE[universe][:method]
  opts = {
    universe: universe,
    db: db,
    ids: ids,
    format: format,
    file: file,
    extra: extra,
    obj: obj
  }
  doc = send("#{getter}_#{method}", opts)

  unless opts[:file].nil?
    ofh = File.open(opts[:file], 'w')
    ofh.print doc.force_encoding('UTF-8')
    ofh.close
  end
  doc
end
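The dispatch step in .download builds a method name from the database's getter and the universe's method, then calls it with send. The sketch below illustrates that pattern in isolation. TOY_UNIVERSE and ToyDownloader are hypothetical stand-ins invented for this example, not the real @@UNIVERSE table or MiGA code.

```ruby
# Hypothetical registry mimicking the shape read by .download:
# each universe has a :method and each database an optional :getter.
TOY_UNIVERSE = {
  web: { method: :net, dbs: { text: {} } },
  ncbi: { method: :rest, dbs: { nuccore: { getter: :ncbi_gb } } }
}.freeze

module ToyDownloader
  module_function

  # Resolve "<getter>_<method>" and call it; the getter defaults to :download.
  def dispatch(universe, db, ids)
    ids = [ids] unless ids.is_a? Array
    getter = TOY_UNIVERSE[universe][:dbs][db][:getter] || :download
    method = TOY_UNIVERSE[universe][:method]
    send("#{getter}_#{method}", ids: ids)
  end

  # Stand-in for a generic downloader.
  def download_net(opts)
    "download_net(#{opts[:ids].join(',')})"
  end

  # Stand-in for a database-specific downloader.
  def ncbi_gb_rest(opts)
    "ncbi_gb_rest(#{opts[:ids].join(',')})"
  end
end

puts ToyDownloader.dispatch(:web, :text, 'a')
puts ToyDownloader.dispatch(:ncbi, :nuccore, %w[x y])
```

A database without a :getter falls back to the generic download_* method, which is presumably why download_rest is also aliased as download_net.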
.download_rest(opts) ⇒ Object Also known as: download_net
Download data using the REST method. Supported opts (Hash) include:
- universe (mandatory): Symbol
- db (mandatory): Symbol
- ids (mandatory): Array of String
- format: String
- extra: Array
# File 'lib/miga/remote_dataset/download.rb', line 83
def download_rest(opts)
  u = @@UNIVERSE[opts[:universe]]
  url = sprintf(
    u[:url], opts[:db], opts[:ids].join(','), opts[:format], *opts[:extra]
  )
  url = u[:api_key][url] unless u[:api_key].nil?
  download_url url
end
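To make the URL construction concrete, the following sketch fills a sprintf template in the same argument order as .download_rest. The TEMPLATE string, the host, and the identifiers are invented for illustration; the real templates live in the universe definitions and are not shown on this page.

```ruby
# Hypothetical URL template with placeholders for db, ids, format, and one extra.
TEMPLATE = 'https://example.org/%s/%s?format=%s&link=%s'

# Fill the template in the order used by .download_rest:
# db, comma-joined ids, format, then any extra arguments.
def build_url(template, db, ids, format, extra = [])
  sprintf(template, db, ids.join(','), format, *extra)
end

url = build_url(TEMPLATE, :nuccore, %w[A1 B2], :fasta, ['taxonomy'])
puts url
```

Symbols are interpolated through %s as their string form, so db and format can be passed as either Symbol or String.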
.download_url(url) ⇒ Object
Download the given url and return the result regardless of response code. Attempts the download up to three times, pausing five seconds between attempts, before re-raising the last error (for example, Net::ReadTimeout).
# File 'lib/miga/remote_dataset/download.rb', line 99
def download_url(url)
  doc = ''
  @timeout_try = 0
  begin
    DEBUG 'GET: ' + url
    URI.parse(url).open(read_timeout: 600) { |f| doc = f.read }
  rescue => e
    @timeout_try += 1
    raise e if @timeout_try >= 3

    sleep 5 # <- For: 429 Too Many Requests
    DEBUG "RETRYING after: #{e}"
    retry
  end
  doc
end
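The retry logic in .download_url can be isolated as a small helper. This is a minimal sketch of the same begin/rescue/retry pattern, assuming a configurable attempt limit and wait time; with_retries is a hypothetical name, and no network access is involved.

```ruby
# Run the block, retrying on any StandardError for up to max_tries attempts.
# The wait between attempts is configurable so the sketch runs instantly.
def with_retries(max_tries: 3, wait: 0)
  tries = 0
  begin
    yield
  rescue StandardError => e
    tries += 1
    raise e if tries >= max_tries

    sleep wait
    retry
  end
end

# Simulated download that fails twice before succeeding.
calls = 0
result = with_retries do
  calls += 1
  raise 'temporary failure' if calls < 3

  'payload'
end
puts "#{result} after #{calls} attempts"
```

Because the last error is re-raised unchanged, callers see the original exception class once the attempts are exhausted.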
.ncbi_asm_acc2id(acc, retrials = 3) ⇒ Object
Translate an NCBI Assembly Accession (acc) to the corresponding internal NCBI ID, with up to retrials retries if the returned JSON document does not conform to the expected format or the accession is not found.
# File 'lib/miga/remote_dataset.rb', line 19
def ncbi_asm_acc2id(acc, retrials = 3)
  return acc if acc =~ /^\d+$/

  search_doc = MiGA::Json.parse(
    download(:ncbi_search, :assembly, acc, :json),
    symbolize: false, contents: true
  )
  out = (search_doc['esearchresult']['idlist'] || []).first
  if out.nil?
    raise MiGA::RemoteDataMissingError.new(
      "NCBI Assembly Accession not found: #{acc}"
    )
  end
  return out
rescue JSON::ParserError, MiGA::RemoteDataMissingError => e
  # Note that +JSON::ParserError+ is being rescued because the NCBI backend
  # may in some cases return a malformed JSON response indicating that the
  # "Search Backend failed". The issue with the JSON payload is that it
  # includes two tab characters (\t\t) in the error message, which is not
  # allowed by the JSON specification and causes a parsing error
  # (see https://www.rfc-editor.org/rfc/rfc4627#page-4)
  if retrials <= 0
    raise e
  else
    MiGA::MiGA.DEBUG("#{self}.ncbi_asm_acc2id - RETRY #{retrials}")
    retrials -= 1
    retry
  end
end
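The lookup itself reduces to two steps: return numeric input unchanged, otherwise read the first entry of esearchresult.idlist. The sketch below reproduces only that part with the standard json library. first_id_from_search is a hypothetical helper, the sample response and accession are made up, and KeyError stands in for MiGA::RemoteDataMissingError.

```ruby
require 'json'

# Extract the first ID from an esearch-style JSON document.
# Numeric input is returned unchanged, as in .ncbi_asm_acc2id.
def first_id_from_search(acc, json_text)
  return acc if acc =~ /^\d+$/

  doc = JSON.parse(json_text)
  id = (doc['esearchresult']['idlist'] || []).first
  # KeyError is an assumption standing in for the library's own error class.
  raise KeyError, "Accession not found: #{acc}" if id.nil?

  id
end

# Made-up response and accession, for illustration only.
sample = '{"esearchresult": {"idlist": ["12345"]}}'
puts first_id_from_search('GCA_000000000.1', sample)
```

A response that is not valid JSON raises JSON::ParserError from JSON.parse, which is the condition the rescue clause above retries on.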
.ncbi_asm_rest(opts) ⇒ Object
Download data from NCBI Assembly database using the REST method. Supported opts (Hash) include:
- obj (mandatory): MiGA::RemoteDataset
- ids (mandatory): String or Array of String
- file: String, passed to download
- extra: Array, passed to download
- format: String, passed to download
# File 'lib/miga/remote_dataset/download.rb', line 44
def ncbi_asm_rest(opts)
  url_dir = opts[:obj].ncbi_asm_json_doc&.dig('ftppath_genbank')
  if url_dir.nil? || url_dir.empty?
    raise MiGA::RemoteDataMissingError.new(
      "Missing ftppath_genbank in NCBI Assembly JSON"
    )
  end
  url = '%s/%s_genomic.fna.gz' % [url_dir, File.basename(url_dir)]
  download(
    :web, :assembly_gz, url,
    opts[:format], opts[:file], opts[:extra], opts[:obj]
  )
end
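The file URL is derived from the directory URL alone: the last path component is repeated as the file name prefix. A short sketch of that step follows, using a made-up directory on example.org rather than a real ftppath_genbank value; genomic_fasta_url is a hypothetical helper name.

```ruby
# Build the URL of the gzipped genomic FASTA from an assembly directory URL.
def genomic_fasta_url(url_dir)
  '%s/%s_genomic.fna.gz' % [url_dir, File.basename(url_dir)]
end

# Hypothetical directory, not a real assembly path.
dir = 'ftp://example.org/genomes/GCA_000000000.1_ASM0v1'
url = genomic_fasta_url(dir)
puts url
```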
.ncbi_gb_rest(opts) ⇒ Object
Download data from NCBI GenBank (nuccore) database using the REST method. Supported opts (Hash) are the same as #download_rest and #ncbi_asm_rest.
# File 'lib/miga/remote_dataset/download.rb', line 62
def ncbi_gb_rest(opts)
  # Simply use defaults, but ensure that the URL can be properly formed
  o = download_rest(opts.merge(universe: :ncbi, db: :nuccore))
  return o unless o.strip.empty?

  MiGA::MiGA.DEBUG 'Empty sequence, attempting download from NCBI assembly'
  opts[:format] = :fasta_gz
  if opts[:file]
    File.unlink(opts[:file]) if File.exist? opts[:file]
    opts[:file] = "#{opts[:file]}.gz"
  end
  ncbi_asm_rest(opts)
end
.ncbi_map(id, dbfrom, db) ⇒ Object
Looks for the entry id in dbfrom, and returns the linked identifier in db (or nil).
# File 'lib/miga/remote_dataset/download.rb', line 119
def ncbi_map(id, dbfrom, db)
  doc = download(:ncbi_map, dbfrom, id, :json, nil, [db])
  return if doc.empty?

  tree = MiGA::Json.parse(doc, contents: true)
  [:linksets, 0, :linksetdbs, 0, :links, 0].each do |i|
    tree = tree[i]
    break if tree.nil?
  end
  tree
end
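The traversal at the end of .ncbi_map follows a fixed path of keys and indexes and gives up with nil at the first missing level. The sketch below shows that walk on a made-up document with symbol keys; walk is a hypothetical helper, and the comparison with dig is included only for orientation.

```ruby
# Follow a path of keys and indexes, stopping with nil at the first gap.
def walk(tree, path)
  path.each do |key|
    tree = tree[key]
    break if tree.nil?
  end
  tree
end

PATH = [:linksets, 0, :linksetdbs, 0, :links, 0].freeze

# Made-up document shaped like the structure .ncbi_map traverses.
doc = { linksets: [{ linksetdbs: [{ links: ['4321'] }] }] }

puts walk(doc, PATH)
p walk({ linksets: [] }, PATH)
```

For plain nested Hash and Array data such as this, doc.dig(*PATH) returns the same values.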
.UNIVERSE ⇒ Object
# File 'lib/miga/remote_dataset/base.rb', line 7
def UNIVERSE
  @@UNIVERSE
end
Instance Method Details
#get_gtdb_taxonomy ⇒ Object
Get GTDB taxonomy as MiGA::Taxonomy.
# File 'lib/miga/remote_dataset.rb', line 191
def get_gtdb_taxonomy
  gtdb_genome = @metadata[:gtdb_assembly] or return

  doc = MiGA::Json.parse(
    MiGA::RemoteDataset.download(
      :gtdb, :genome, gtdb_genome, 'taxon-history', nil, ['']
    ),
    contents: true
  )
  lineage = { ns: 'gtdb' }
  lineage.merge!(doc.first) # Get only the latest available classification
  release = lineage.delete(:release)
  @metadata[:gtdb_release] = release
  lineage.transform_values! { |v| v.gsub(/^\S__/, '') }
  MiGA.DEBUG "Got lineage from #{release}: #{lineage}"
  MiGA::Taxonomy.new(lineage)
end
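The lineage clean-up in #get_gtdb_taxonomy can be tried on its own. This sketch assumes a taxon-history entry with a release field and rank-prefixed names; the entry below, including its keys and release label, is made up, and real responses may carry other fields.

```ruby
# Made-up taxon-history entry; keys and values are illustrative assumptions.
entry = { release: 'R000', d: 'd__Bacteria', p: 'p__Pseudomonadota' }

lineage = { ns: 'gtdb' }
lineage.merge!(entry)
# The release label is kept separately and removed from the lineage.
release = lineage.delete(:release)
# Remove the leading rank prefix: one non-space character and two underscores.
lineage.transform_values! { |v| v.gsub(/^\S__/, '') }

puts release
p lineage
```

Values without a rank prefix, such as the namespace 'gtdb', pass through the substitution unchanged.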
#get_metadata(metadata_def = {}) ⇒ Object
Get metadata from the remote location.
# File 'lib/miga/remote_dataset.rb', line 126
def get_metadata(metadata_def = {})
  metadata_def.each { |k, v| @metadata[k] = v }
  case universe
  when :ebi, :ncbi, :web
    # Get taxonomy
    @metadata[:tax] = get_ncbi_taxonomy
  when :gtdb
    # Get taxonomy
    @metadata[:tax] = get_gtdb_taxonomy
  when :seqcode
    # Taxonomy already defined
    # Copy IDs over to allow additional metadata linked
    @metadata[:ncbi_asm] = @metadata[:seqcode_asm]
    @metadata[:ncbi_nuccore] = @metadata[:seqcode_nuccore]
  end

  if @metadata[:get_ncbi_taxonomy]
    tax = get_ncbi_taxonomy
    tax&.add_alternative(@metadata[:tax].dup, false) if @metadata[:tax]
    @metadata[:tax] = tax
  end
  @metadata[:get_ncbi_taxonomy] = nil
  @metadata = get_type_status(@metadata)
end
#get_ncbi_taxid ⇒ Object
Get NCBI Taxonomy ID.
# File 'lib/miga/remote_dataset.rb', line 154
def get_ncbi_taxid
  send("get_ncbi_taxid_from_#{universe}")
end
#get_ncbi_taxonomy ⇒ Object
Get NCBI taxonomy as MiGA::Taxonomy.
# File 'lib/miga/remote_dataset.rb', line 173
def get_ncbi_taxonomy
  tax_id = get_ncbi_taxid or return

  lineage = { ns: 'ncbi' }
  doc = MiGA::RemoteDataset.download(:ncbi, :taxonomy, tax_id, :xml)
  doc.scan(%r{<Taxon>(.*?)</Taxon>}m).map(&:first).each do |i|
    name = i.scan(%r{<ScientificName>(.*)</ScientificName>}).first.to_a.first
    rank = i.scan(%r{<Rank>(.*)</Rank>}).first.to_a.first
    rank = nil if rank == 'no rank' or rank.empty?
    rank = 'dataset' if lineage.empty? and rank.nil?
    lineage[rank] = name unless rank.nil? or rank.nil?
  end
  MiGA.DEBUG "Got lineage: #{lineage}"
  MiGA::Taxonomy.new(lineage)
end
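The regular-expression scan in #get_ncbi_taxonomy can be exercised on a small fragment. The sketch below uses made-up XML in the Taxon/ScientificName/Rank shape and a simplified filter (entries lacking a rank or a name are skipped); it is an illustration of the scanning approach, not a copy of the method's exact rules.

```ruby
# Made-up XML fragment in the Taxon/ScientificName/Rank shape.
xml = <<~XML
  <Taxon>
    <ScientificName>Escherichia coli</ScientificName>
    <Rank>species</Rank>
  </Taxon>
  <Taxon>
    <ScientificName>Escherichia</ScientificName>
    <Rank>genus</Rank>
  </Taxon>
  <Taxon>
    <ScientificName>cellular organisms</ScientificName>
    <Rank>no rank</Rank>
  </Taxon>
XML

lineage = {}
xml.scan(%r{<Taxon>(.*?)</Taxon>}m).map(&:first).each do |taxon|
  name = taxon[%r{<ScientificName>(.*)</ScientificName>}, 1]
  rank = taxon[%r{<Rank>(.*)</Rank>}, 1]
  # Simplified filter: skip entries without a usable rank or name.
  next if rank.nil? || rank == 'no rank' || name.nil?

  lineage[rank] = name
end

p lineage
```

The non-greedy, multiline pattern isolates each Taxon element, while the inner patterns match within a single line.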
#get_type_status(metadata) ⇒ Object
Get the type material status and return an (updated) metadata hash.
# File 'lib/miga/remote_dataset.rb', line 161
def get_type_status(metadata)
  if metadata[:ncbi_asm]
    get_type_status_ncbi_asm(metadata)
  elsif metadata[:ncbi_nuccore]
    get_type_status_ncbi_nuccore(metadata)
  else
    metadata
  end
end
#ncbi_asm_json_doc ⇒ Object
Get the JSON document describing an NCBI assembly entry.
# File 'lib/miga/remote_dataset.rb', line 211
def ncbi_asm_json_doc
  return @_ncbi_asm_json_doc unless @_ncbi_asm_json_doc.nil?

  if db == :assembly && %i[ncbi gtdb seqcode].include?(universe)
    @metadata[:ncbi_asm] ||= ids.first
  end
  return nil unless @metadata[:ncbi_asm]

  ncbi_asm_id = self.class.ncbi_asm_acc2id(@metadata[:ncbi_asm])
  txt = nil
  3.times do
    txt = self.class.download(:ncbi_summary, :assembly, ncbi_asm_id, :json)
    txt.empty? ? sleep(1) : break
  end
  doc = MiGA::Json.parse(txt, symbolize: false, contents: true)
  return if doc.nil? || doc['result'].nil? || doc['result'].empty?

  @_ncbi_asm_json_doc = doc['result'][ doc['result']['uids'].first ]
  url_dir = @_ncbi_asm_json_doc['ftppath_genbank']
  if url_dir
    @metadata[:web_assembly_gz] ||=
      '%s/%s_genomic.fna.gz' % [url_dir, File.basename(url_dir)]
  end
  @_ncbi_asm_json_doc
end
#save_to(project, name = nil, is_ref = true, metadata_def = {}) ⇒ Object
Save dataset to the MiGA::Project project identified with name. is_ref indicates whether it should be a reference dataset, and the saved dataset contains the metadata in metadata_def. If metadata_def includes metadata_only: true, no input data is downloaded.
# File 'lib/miga/remote_dataset.rb', line 90
def save_to(project, name = nil, is_ref = true, metadata_def = {})
  name ||= ids.join('_').miga_name
  project = MiGA::Project.new(project) if project.is_a? String
  MiGA::Dataset.exist?(project, name) and
    raise "Dataset #{name} exists in the project, aborting..."
  @metadata = get_metadata(metadata_def)
  udb = @@UNIVERSE[universe][:dbs][db]
  @metadata["#{universe}_#{db}"] = ids.join(',')
  unless @metadata[:metadata_only]
    respond_to?("save_#{udb[:stage]}_to", true) or
      raise "Unexpected error: Unsupported stage #{udb[:stage]} for #{db}."
    send "save_#{udb[:stage]}_to", project, name, udb
  end

  dataset = MiGA::Dataset.new(project, name, is_ref, @metadata)
  project.add_dataset(dataset.name)
  unless @metadata[:metadata_only]
    result = dataset.add_result(udb[:stage], true, is_clean: true)
    result.nil? and
      raise 'Empty dataset: seed result not added due to incomplete files.'
    result.clean!
    result.save
  end
  dataset
end
#update_metadata(dataset, metadata = {}) ⇒ Object
Updates the MiGA::Dataset dataset with the remotely available metadata, and optionally the Hash metadata.
# File 'lib/miga/remote_dataset.rb', line 118
def update_metadata(dataset, metadata = {})
  metadata = get_metadata(metadata)
  metadata.each { |k, v| dataset.metadata[k] = v }
  dataset.save
end