Class: GeneValidator::LengthClusterValidation

Inherits:
ValidationTest
Defined in:
lib/genevalidator/validation_length_cluster.rb

Overview

This class contains the methods needed to validate the length of a prediction by clustering the lengths of its BLAST hits.

Instance Attribute Summary

Attributes inherited from ValidationTest

#cli_name, #description, #header, #hits, #prediction, #running_time, #short_header, #type, #validation_report

Instance Method Summary

Constructor Details

#initialize(type, prediction, hits, filename) ⇒ LengthClusterValidation

Initializes the object. Params:

type

type of the predicted sequence (:nucleotide or :protein)

prediction

a Sequence object representing the BLAST query

hits

an array of Sequence objects (representing BLAST hits)

filename

String with the name of the FASTA file



# File 'lib/genevalidator/validation_length_cluster.rb', line 79

def initialize(type, prediction, hits, filename)
  super
  @filename     = filename
  @short_header = 'LengthCluster'
  @header       = 'Length Cluster'
  @description  = 'Check whether the prediction length fits most of the' \
                  ' BLAST hit lengths, by 1D hierarchical clusterization.' \
                  ' Meaning of the output displayed: Query_length' \
                  ' [Main Cluster Length Interval]'
  @cli_name     = 'lenc'
end

Instance Attribute Details

#clusters ⇒ Object (readonly)

Returns the value of attribute clusters.



# File 'lib/genevalidator/validation_length_cluster.rb', line 69

def clusters
  @clusters
end

#filename ⇒ Object (readonly)

Returns the value of attribute filename.



# File 'lib/genevalidator/validation_length_cluster.rb', line 68

def filename
  @filename
end

#max_density_cluster ⇒ Object (readonly)

Returns the value of attribute max_density_cluster.



# File 'lib/genevalidator/validation_length_cluster.rb', line 70

def max_density_cluster
  @max_density_cluster
end

Instance Method Details

#clusterization_by_length(_debug = false, lst = @hits, predicted_seq = @prediction) ⇒ Object

Clusters a list of sequences by length. Params:

debug (optional)

true to display debug information, false by default (the argument is named _debug and is not used in the method body)

lst

array of Sequence objects

predicted_seq

Sequence object

Output

output 1

array of Cluster objects

output 2

the index of the most dense cluster



# File 'lib/genevalidator/validation_length_cluster.rb', line 146

def clusterization_by_length(_debug = false,
                             lst = @hits,
                             predicted_seq = @prediction)
  fail TypeError unless lst[0].is_a?(Sequence) &&
                        predicted_seq.is_a?(Sequence)

  contents = lst.map { |x| x.length_protein.to_i }.sort { |a, b| a <=> b }

  hc = HierarchicalClusterization.new(contents)
  clusters = hc.hierarchical_clusterization

  max_density             = 0
  max_density_cluster_idx = 0
  clusters.each_with_index do |item, i|
    next unless item.density > max_density
    max_density             = item.density
    max_density_cluster_idx = i
  end

  [clusters, max_density_cluster_idx]

rescue TypeError => error
  error_location = error.backtrace[0].scan(%r{([^/]+:\d+):.*})[0][0]
  $stderr.puts "Type error at #{error_location}."
  $stderr.puts ' Possible cause: one of the arguments of the' \
               ' "clusterization_by_length" method has not the proper type.'
  exit 1
end
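The selection of the most dense cluster in the method above can be tried in isolation. What follows is a minimal sketch, not GeneValidator's own code: ToyCluster is a hypothetical stand-in for the library's Cluster class, and its density formula (number of hits divided by the width of the length interval) is an assumption made only for illustration. Only the index scan mirrors the listing above.

```ruby
# Hypothetical stand-in for GeneValidator's Cluster class.
# `lengths` maps a hit length to the number of hits with that length.
ToyCluster = Struct.new(:lengths) do
  # Assumed density: hits per unit of length interval (illustration only).
  def density
    span = lengths.keys.max - lengths.keys.min + 1
    lengths.values.sum.to_f / span
  end
end

# Same scan as in clusterization_by_length: keep the index of the first
# cluster whose density is strictly greater than all earlier ones.
def densest_cluster_index(clusters)
  max_density = 0
  max_idx     = 0
  clusters.each_with_index do |cluster, i|
    next unless cluster.density > max_density
    max_density = cluster.density
    max_idx     = i
  end
  max_idx
end

clusters = [
  ToyCluster.new({ 100 => 1, 140 => 1 }),           # 2 hits over 41 positions
  ToyCluster.new({ 300 => 3, 302 => 2, 305 => 1 }), # 6 hits over 6 positions
  ToyCluster.new({ 900 => 1, 950 => 1 })            # 2 hits over 51 positions
]

puts densest_cluster_index(clusters)
```

Because the comparison is strict, a later cluster with the same density as the current maximum does not replace it.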

#plot_histo_clusters(output = "#{@filename}_len_clusters.json", clusters = @clusters, max_density_cluster = @max_density_cluster, prediction = @prediction) ⇒ Object

Generates a JSON file containing the data used to plot the histogram of the length distribution, given a list of Cluster objects. Params:

output

filename where to save the graph

clusters

array of Cluster objects

max_density_cluster

index of the most dense cluster

prediction

Sequence object

Output

Plot object



# File 'lib/genevalidator/validation_length_cluster.rb', line 184

def plot_histo_clusters(output = "#{@filename}_len_clusters.json",
                      clusters = @clusters,
                      max_density_cluster = @max_density_cluster,
                      prediction = @prediction)

  f = File.open(output, 'w')
  f.write(clusters.each_with_index.map { |cluster, i|
    cluster.lengths.collect { |k, v|
      { 'key' => k, 'value' => v, 'main' => (i == max_density_cluster) }
    }
  }.to_json)
  f.close
  Plot.new(output.scan(%r{([^/]+)$})[0][0],
           :bars,
           'Length Cluster Validation: Distribution of BLAST hit lengths',
           'Query Sequence, black;Most Dense Cluster,red;Other Hits, blue',
           'Sequence Length',
           'Number of Sequences',
           prediction.length_protein)
end
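To make the shape of the generated JSON concrete, here is a small sketch that applies the same mapping to plain hashes. The cluster data and the index of the main cluster are invented for illustration; in the library, the hashes come from Cluster#lengths.

```ruby
require 'json'

# Hypothetical cluster data: each hash maps a hit length to a hit count.
cluster_lengths = [
  { 100 => 1, 140 => 1 },
  { 300 => 3, 302 => 2 }
]
main_cluster_idx = 1 # assumed index of the most dense cluster

# Same mapping as in plot_histo_clusters, applied to plain hashes:
# one inner array per cluster, one record per distinct length.
histogram = cluster_lengths.each_with_index.map do |lengths, i|
  lengths.map do |k, v|
    { 'key' => k, 'value' => v, 'main' => (i == main_cluster_idx) }
  end
end

json = histogram.to_json
puts json
```

The result is an array of arrays: every record carries a length ('key'), the number of hits with that length ('value'), and a flag ('main') that is true only for records of the most dense cluster.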

#plot_len_clusters(output = "#{@filename}_len.json", _hits = @hits) ⇒ Object

Generates a JSON file containing the data used to plot lines corresponding to the start and end hit offsets. Params:

output

filename where to save the graph

hits

array of Sequence objects (the argument is named _hits and is not used; the method body reads @hits)

Output

Plot object



# File 'lib/genevalidator/validation_length_cluster.rb', line 213

def plot_len_clusters(output = "#{@filename}_len.json", _hits = @hits)
  f = File.open(output, 'w')
  lst = @hits.sort { |a, b| a.length_protein <=> b.length_protein }

  no_lines = 100

  lst_less = lst[0..[no_lines, lst.length - 1].min]

  f.write((lst_less.each_with_index.map { |hit, i|
    { 'y' => i, 'start' => 0, 'stop' => hit.length_protein,
      'color' => 'gray' }
  } + lst_less.each_with_index.map { |hit, i|
    hit.hsp_list.map { |hsp|
      { 'y' => i, 'start' => hsp.hit_from, 'stop' => hsp.hit_to,
        'color' => 'red' }
    }
  }.flatten).to_json)

  f.close
  Plot.new(output.scan(%r{([^/]+)$})[0][0],
           :lines,
           '[Length Cluster] Matched regions in hits',
           'hit, gray;high-scoring segment pairs (hsp), red',
           'offset in the hit',
           'number of the hit',
           lst_less.length)
end
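The records written by this method can be illustrated with a short sketch. ToyHit and ToyHsp are hypothetical stand-ins for the library's Sequence and HSP objects, exposing only the attributes the listing above reads (length_protein, hsp_list, hit_from, hit_to); the values are invented.

```ruby
require 'json'

# Hypothetical stand-ins for the Sequence and HSP objects used above.
ToyHsp = Struct.new(:hit_from, :hit_to)
ToyHit = Struct.new(:length_protein, :hsp_list)

hits = [
  ToyHit.new(250, [ToyHsp.new(10, 120), ToyHsp.new(150, 240)]),
  ToyHit.new(180, [ToyHsp.new(5, 170)])
]

# Shortest hit first, so row 0 of the plot is the shortest hit.
sorted = hits.sort_by(&:length_protein)

# One gray record per hit, spanning the full hit length.
gray = sorted.each_with_index.map do |hit, i|
  { 'y' => i, 'start' => 0, 'stop' => hit.length_protein, 'color' => 'gray' }
end

# One red record per HSP, drawn on the same row as its hit.
red = sorted.each_with_index.flat_map do |hit, i|
  hit.hsp_list.map do |hsp|
    { 'y' => i, 'start' => hsp.hit_from, 'stop' => hsp.hit_to,
      'color' => 'red' }
  end
end

records = gray + red
puts JSON.pretty_generate(records)
```

Note that the listing above slices with an inclusive range, lst[0..[no_lines, lst.length - 1].min], so with no_lines = 100 it keeps up to 101 hits rather than 100.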

#run ⇒ Object

Validates the length of the predicted gene by comparing the length of the prediction to the most dense cluster. The most dense cluster is obtained by hierarchical clusterization. Plots are generated if required (see plot variable). At least 5 hits are required; with fewer, a 'Not enough evidence' warning report is produced instead. Output: LengthClusterValidationOutput object



# File 'lib/genevalidator/validation_length_cluster.rb', line 98

def run
  fail NotEnoughHitsError unless hits.length >= 5
  fail Exception unless prediction.is_a?(Sequence) &&
                        hits[0].is_a?(Sequence)

  start = Time.now
  # get [clusters, max_density_cluster_idx]
  clusterization = clusterization_by_length

  @clusters = clusterization[0]
  @max_density_cluster = clusterization[1]
  limits = @clusters[@max_density_cluster].get_limits
  query_length = @prediction.length_protein

  @validation_report = LengthClusterValidationOutput.new(@short_header,
                                                         @header,
                                                         @description,
                                                         query_length,
                                                         limits)
  plot1 = plot_histo_clusters
  @validation_report.plot_files.push(plot1)

  @validation_report.running_time = Time.now - start

  @validation_report

rescue NotEnoughHitsError
  @validation_report = ValidationReport.new('Not enough evidence', :warning,
                                            @short_header, @header,
                                            @description, @approach,
                                            @explanation, @conclusion)
rescue Exception
  @validation_report = ValidationReport.new('Unexpected error', :error,
                                            @short_header, @header,
                                            @description, @approach,
                                            @explanation, @conclusion)
  @validation_report.errors.push 'Unexpected Error'
end