Module: Statsample::Bivariate
- Defined in:
- lib/statsample/bivariate.rb,
lib/statsample/bivariate/pearson.rb
Overview
Diverse methods and classes to calculate bivariate relations Specific classes:
-
Statsample::Bivariate::Pearson : Pearson correlation coefficient ®
-
Statsample::Bivariate::Tetrachoric : Tetrachoric correlation
-
Statsample::Bivariate::Polychoric : Polychoric correlation (using joint, two-step and polychoric series)
Defined Under Namespace
Classes: Pearson
Class Method Summary collapse
-
.correlation_matrix(ds) ⇒ Object
Correlation matrix.
- .correlation_matrix_optimized(ds) ⇒ Object
- .correlation_matrix_pairwise(ds) ⇒ Object
-
.correlation_probability_matrix(ds, tails = :both) ⇒ Object
Matrix of correlation probabilities.
-
.covariance(v1, v2) ⇒ Object
Covariance between two vectors.
-
.covariance_matrix(ds) ⇒ Object
Covariance matrix.
- .covariance_matrix_optimized(ds) ⇒ Object
- .covariance_matrix_pairwise(ds) ⇒ Object
-
.covariance_slow(v1, v2) ⇒ Object
:nodoc:.
-
.gamma(matrix) ⇒ Object
Calculates Goodman and Kruskal’s gamma.
-
.maximum_likehood_dichotomic(pred, real) ⇒ Object
Estimate the ML between two dichotomic vectors.
-
.min_n_valid(ds) ⇒ Object
Report the minimum number of cases valid of a covariate matrix based on a dataset.
-
.n_valid_matrix(ds) ⇒ Object
Retrieves the n valid pairwise.
- .ordered_pairs(vector) ⇒ Object
-
.pairs(matrix) ⇒ Object
Calculate indexes for a matrix the rows and cols has to be ordered.
-
.partial_correlation(v1, v2, control) ⇒ Object
Correlation between v1 and v2, controling the effect of control on both.
-
.pearson(v1, v2) ⇒ Object
(also: correlation)
Calculate Pearson correlation coefficient ® between 2 vectors.
-
.pearson_slow(v1, v2) ⇒ Object
:nodoc:.
-
.point_biserial(dichotomous, continous) ⇒ Object
Calculate Point biserial correlation.
-
.prediction_optimized(vars, cases) ⇒ Object
Predicted time for optimized correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation.
-
.prediction_pairwise(vars, cases) ⇒ Object
Predicted time for pairwise correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation.
-
.prop_pearson(t, size, tails = :both) ⇒ Object
Retrieves the probability value (a la SPSS) for a given t, size and number of tails.
-
.residuals(from, del) ⇒ Object
Returns residual score after delete variance from another variable.
-
.spearman(v1, v2) ⇒ Object
Spearman ranked correlation coefficient (rho) between 2 vectors.
- .sum_of_squares(v1, v2) ⇒ Object
-
.t_pearson(v1, v2) ⇒ Object
Retrieves the value for t test for a pearson correlation between two vectors to test the null hipothesis of r=0.
-
.t_r(r, size) ⇒ Object
Retrieves the value for t test for a pearson correlation giving r and vector size Source : faculty.chass.ncsu.edu/garson/PA765/correl.htm.
-
.tau_a(v1, v2) ⇒ Object
Kendall Rank Correlation Coefficient (Tau a) Based on Hervé Adbi article.
-
.tau_b(matrix) ⇒ Object
Calculates Goodman and Kruskal’s Tau b correlation.
Class Method Details
.correlation_matrix(ds) ⇒ Object
Correlation matrix. Order of rows and columns depends on Dataset#fields order
191 192 193 194 195 196 197 198 199 200 201 |
# File 'lib/statsample/bivariate.rb', line 191 def correlation_matrix(ds) vars,cases=ds.fields.size,ds.cases if !ds.has_missing_data? and Statsample.has_gsl? and prediction_optimized(vars,cases) < prediction_pairwise(vars,cases) cm=correlation_matrix_optimized(ds) else cm=correlation_matrix_pairwise(ds) end cm.extend(Statsample::CovariateMatrix) cm.fields=ds.fields cm end |
.correlation_matrix_optimized(ds) ⇒ Object
203 204 205 206 207 208 209 210 211 212 |
# File 'lib/statsample/bivariate.rb', line 203 def correlation_matrix_optimized(ds) s=covariance_matrix_optimized(ds) sds=GSL::Matrix.diagonal(s.diagonal.sqrt.pow(-1)) cm=sds*s*sds # Fix diagonal s.row_size.times {|i| cm[i,i]=1.0 } cm end |
.correlation_matrix_pairwise(ds) ⇒ Object
213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 |
# File 'lib/statsample/bivariate.rb', line 213 def correlation_matrix_pairwise(ds) cache={} cm=ds.collect_matrix do |row,col| if row==col 1.0 elsif (ds[row].type!=:numeric or ds[col].type!=:numeric) nil else if cache[[col,row]].nil? r=pearson(ds[row],ds[col]) cache[[row,col]]=r r else cache[[col,row]] end end end end |
.correlation_probability_matrix(ds, tails = :both) ⇒ Object
Matrix of correlation probabilities. Order of rows and columns depends on Dataset#fields order
247 248 249 250 251 252 253 254 255 |
# File 'lib/statsample/bivariate.rb', line 247 def correlation_probability_matrix(ds, tails=:both) rows=ds.fields.collect do |row| ds.fields.collect do |col| v1a,v2a=Statsample.only_valid_clone(ds[row],ds[col]) (row==col or ds[row].type!=:numeric or ds[col].type!=:numeric) ? nil : prop_pearson(t_pearson(ds[row],ds[col]), v1a.size, tails) end end Matrix.rows(rows) end |
.covariance(v1, v2) ⇒ Object
Covariance between two vectors
13 14 15 16 17 18 19 20 21 |
# File 'lib/statsample/bivariate.rb', line 13 def covariance(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) return nil if v1a.size==0 if Statsample.has_gsl? GSL::Stats::covariance(v1a.gsl, v2a.gsl) else covariance_slow(v1a,v2a) end end |
.covariance_matrix(ds) ⇒ Object
Covariance matrix. Order of rows and columns depends on Dataset#fields order
155 156 157 158 159 160 161 162 163 164 165 166 |
# File 'lib/statsample/bivariate.rb', line 155 def covariance_matrix(ds) vars,cases=ds.fields.size,ds.cases if !ds.has_missing_data? and Statsample.has_gsl? and prediction_optimized(vars,cases) < prediction_pairwise(vars,cases) cm=covariance_matrix_optimized(ds) else cm=covariance_matrix_pairwise(ds) end cm.extend(Statsample::CovariateMatrix) cm.fields=ds.fields cm end |
.covariance_matrix_optimized(ds) ⇒ Object
141 142 143 144 145 146 147 148 149 150 |
# File 'lib/statsample/bivariate.rb', line 141 def covariance_matrix_optimized(ds) x=ds.to_gsl n=x.row_size m=x.column_size means=((1/n.to_f)*GSL::Matrix.ones(1,n)*x).row(0) centered=x-(GSL::Matrix.ones(n,m)*GSL::Matrix.diag(means)) ss=centered.transpose*centered s=((1/(n-1).to_f))*ss s end |
.covariance_matrix_pairwise(ds) ⇒ Object
169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 |
# File 'lib/statsample/bivariate.rb', line 169 def covariance_matrix_pairwise(ds) cache={} matrix=ds.collect_matrix do |row,col| if (ds[row].type!=:numeric or ds[col].type!=:numeric) nil elsif row==col ds[row].variance else if cache[[col,row]].nil? cov=covariance(ds[row],ds[col]) cache[[row,col]]=cov cov else cache[[col,row]] end end end matrix end |
.covariance_slow(v1, v2) ⇒ Object
:nodoc:
32 33 34 35 |
# File 'lib/statsample/bivariate.rb', line 32 def covariance_slow(v1,v2) # :nodoc: v1a,v2a=Statsample.only_valid(v1,v2) sum_of_squares(v1a,v2a) / (v1a.size-1) end |
.gamma(matrix) ⇒ Object
Calculates Goodman and Kruskal’s gamma.
Gamma is the surplus of concordant pairs over discordant pairs, as a percentage of all pairs ignoring ties.
Source: faculty.chass.ncsu.edu/garson/PA765/assocordinal.htm
305 306 307 308 |
# File 'lib/statsample/bivariate.rb', line 305 def gamma(matrix) v=pairs(matrix) (v['P']-v['Q']).to_f / (v['P']+v['Q']).to_f end |
.maximum_likehood_dichotomic(pred, real) ⇒ Object
Estimate the ML between two dichotomic vectors
23 24 25 26 27 28 29 30 |
# File 'lib/statsample/bivariate.rb', line 23 def maximum_likehood_dichotomic(pred,real) preda,reala=Statsample.only_valid_clone(pred,real) sum=0 preda.each_index{|i| sum+=(reala[i]*Math::log(preda[i])) + ((1-reala[i])*Math::log(1-preda[i])) } sum end |
.min_n_valid(ds) ⇒ Object
Report the minimum number of cases valid of a covariate matrix based on a dataset
373 374 375 376 377 378 379 380 381 382 |
# File 'lib/statsample/bivariate.rb', line 373 def min_n_valid(ds) min=ds.cases m=n_valid_matrix(ds) for x in 0...m.row_size for y in 0...m.column_size min=m[x,y] if m[x,y] < min end end min end |
.n_valid_matrix(ds) ⇒ Object
Retrieves the n valid pairwise.
233 234 235 236 237 238 239 240 241 242 |
# File 'lib/statsample/bivariate.rb', line 233 def n_valid_matrix(ds) ds.collect_matrix do |row,col| if row==col ds[row].valid_data.size else rowa,rowb=Statsample.only_valid_clone(ds[row],ds[col]) rowa.size end end end |
.ordered_pairs(vector) ⇒ Object
351 352 353 354 355 356 357 358 359 360 |
# File 'lib/statsample/bivariate.rb', line 351 def ordered_pairs(vector) d=vector.data a=[] (0...(d.size-1)).each{|i| ((i+1)...(d.size)).each {|j| a.push([d[i],d[j]]) } } a end |
.pairs(matrix) ⇒ Object
Calculate indexes for a matrix the rows and cols has to be ordered
310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 |
# File 'lib/statsample/bivariate.rb', line 310 def pairs(matrix) # calculate concordant #p matrix rs=matrix.row_size cs=matrix.column_size conc=disc=ties_x=ties_y=0 (0...(rs-1)).each do |x| (0...(cs-1)).each do |y| ((x+1)...rs).each do |x2| ((y+1)...cs).each do |y2| # #p sprintf("%d:%d,%d:%d",x,y,x2,y2) conc+=matrix[x,y]*matrix[x2,y2] end end end end (0...(rs-1)).each {|x| (1...(cs)).each{|y| ((x+1)...rs).each{|x2| (0...y).each{|y2| # #p sprintf("%d:%d,%d:%d",x,y,x2,y2) disc+=matrix[x,y]*matrix[x2,y2] } } } } (0...(rs-1)).each {|x| (0...(cs)).each{|y| ((x+1)...(rs)).each{|x2| ties_x+=matrix[x,y]*matrix[x2,y] } } } (0...rs).each {|x| (0...(cs-1)).each{|y| ((y+1)...(cs)).each{|y2| ties_y+=matrix[x,y]*matrix[x,y2] } } } {'P'=>conc,'Q'=>disc,'Y'=>ties_y,'X'=>ties_x} end |
.partial_correlation(v1, v2, control) ⇒ Object
Correlation between v1 and v2, controling the effect of control on both.
132 133 134 135 136 137 138 139 |
# File 'lib/statsample/bivariate.rb', line 132 def partial_correlation(v1,v2,control) v1a,v2a,cona=Statsample.only_valid_clone(v1,v2,control) rv1v2=pearson(v1a,v2a) rv1con=pearson(v1a,cona) rv2con=pearson(v2a,cona) (rv1v2-(rv1con*rv2con)).quo(Math::sqrt(1-rv1con**2) * Math::sqrt(1-rv2con**2)) end |
.pearson(v1, v2) ⇒ Object Also known as: correlation
Calculate Pearson correlation coefficient ® between 2 vectors
43 44 45 46 47 48 49 50 51 |
# File 'lib/statsample/bivariate.rb', line 43 def pearson(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) return nil if v1a.size ==0 if Statsample.has_gsl? GSL::Stats::correlation(v1a.gsl, v2a.gsl) else pearson_slow(v1a,v2a) end end |
.pearson_slow(v1, v2) ⇒ Object
:nodoc:
52 53 54 55 56 57 |
# File 'lib/statsample/bivariate.rb', line 52 def pearson_slow(v1,v2) # :nodoc: v1a,v2a=Statsample.only_valid_clone(v1,v2) # Calculate sum of squares ss=sum_of_squares(v1a,v2a) ss.quo(Math::sqrt(v1a.sum_of_squares) * Math::sqrt(v2a.sum_of_squares)) end |
.point_biserial(dichotomous, continous) ⇒ Object
Calculate Point biserial correlation. Equal to Pearson correlation, with one dichotomous value replaced by “0” and the other by “1”
265 266 267 268 269 270 271 272 273 |
# File 'lib/statsample/bivariate.rb', line 265 def point_biserial(dichotomous,continous) ds={'d'=>dichotomous,'c'=>continous}.to_dataset.dup_only_valid raise(TypeError, "First vector should be dichotomous") if ds['d'].factors.size!=2 raise(TypeError, "Second vector should be continous") if ds['c'].type!=:numeric f0=ds['d'].factors.sort[0] m0=ds.filter_field('c') {|c| c['d']==f0} m1=ds.filter_field('c') {|c| c['d']!=f0} ((m1.mean-m0.mean).to_f / ds['c'].sdp) * Math::sqrt(m0.size*m1.size.to_f / ds.cases**2) end |
.prediction_optimized(vars, cases) ⇒ Object
Predicted time for optimized correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation
111 112 113 |
# File 'lib/statsample/bivariate.rb', line 111 def prediction_optimized(vars,cases) ((4+0.018128*cases+0.246871*vars+0.001169*vars*cases)**2) / 100 end |
.prediction_pairwise(vars, cases) ⇒ Object
Predicted time for pairwise correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation
105 106 107 |
# File 'lib/statsample/bivariate.rb', line 105 def prediction_pairwise(vars,cases) ((-0.518111-0.000746*cases+1.235608*vars+0.000740*cases*vars)**2) / 100 end |
.prop_pearson(t, size, tails = :both) ⇒ Object
Retrieves the probability value (a la SPSS) for a given t, size and number of tails. Uses a second parameter
-
:both or 2 : for r!=0 (default)
-
:right, :positive or 1 : for r > 0
-
:left, :negative : for r < 0
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/statsample/bivariate.rb', line 83 def prop_pearson(t, size, tails=:both) tails=:both if tails==2 tails=:right if tails==1 or tails==:positive tails=:left if tails==:negative n_tails=case tails when :both then 2 else 1 end t=-t if t>0 and (tails==:both) cdf=Distribution::T.cdf(t, size-2) if(tails==:right) 1.0-(cdf*n_tails) else cdf*n_tails end end |
.residuals(from, del) ⇒ Object
Returns residual score after delete variance from another variable
117 118 119 120 121 122 123 124 125 126 127 128 129 |
# File 'lib/statsample/bivariate.rb', line 117 def residuals(from,del) r=Statsample::Bivariate.pearson(from,del) froms, dels = from.vector_standarized, del.vector_standarized nv=[] froms.data_with_nils.each_index do |i| if froms[i].nil? or dels[i].nil? nv.push(nil) else nv.push(froms[i]-r*dels[i]) end end nv.to_vector(:numeric) end |
.spearman(v1, v2) ⇒ Object
Spearman ranked correlation coefficient (rho) between 2 vectors
258 259 260 261 262 |
# File 'lib/statsample/bivariate.rb', line 258 def spearman(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) v1r,v2r=v1a.ranked(:numeric),v2a.ranked(:numeric) pearson(v1r,v2r) end |
.sum_of_squares(v1, v2) ⇒ Object
36 37 38 39 40 41 |
# File 'lib/statsample/bivariate.rb', line 36 def sum_of_squares(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) m1=v1a.mean m2=v2a.mean (v1a.size).times.inject(0) {|ac,i| ac+(v1a[i]-m1)*(v2a[i]-m2)} end |
.t_pearson(v1, v2) ⇒ Object
Retrieves the value for t test for a pearson correlation between two vectors to test the null hipothesis of r=0
61 62 63 64 65 66 67 68 69 |
# File 'lib/statsample/bivariate.rb', line 61 def t_pearson(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) r=pearson(v1a,v2a) if(r==1.0) 0 else t_r(r,v1a.size) end end |
.t_r(r, size) ⇒ Object
Retrieves the value for t test for a pearson correlation giving r and vector size Source : faculty.chass.ncsu.edu/garson/PA765/correl.htm
73 74 75 |
# File 'lib/statsample/bivariate.rb', line 73 def t_r(r,size) r * Math::sqrt(((size)-2).to_f / (1 - r**2)) end |
.tau_a(v1, v2) ⇒ Object
Kendall Rank Correlation Coefficient (Tau a) Based on Hervé Adbi article
276 277 278 279 280 281 282 283 284 |
# File 'lib/statsample/bivariate.rb', line 276 def tau_a(v1,v2) v1a,v2a=Statsample.only_valid_clone(v1,v2) n=v1.size v1r,v2r=v1a.ranked(:numeric),v2a.ranked(:numeric) o1=ordered_pairs(v1r) o2=ordered_pairs(v2r) delta= o1.size*2-(o2 & o1).size*2 1-(delta * 2 / (n*(n-1)).to_f) end |
.tau_b(matrix) ⇒ Object
Calculates Goodman and Kruskal’s Tau b correlation. Tb is an asymmetric P-R-E measure of association for nominal scales (Mielke, X)
Tau-b defines perfect association as strict monotonicity. Although it requires strict monotonicity to reach 1.0, it does not penalize ties as much as some other measures.
Reference
Mielke, P. GOODMAN–KRUSKAL TAU AND GAMMA. Source: faculty.chass.ncsu.edu/garson/PA765/assocordinal.htm
295 296 297 298 |
# File 'lib/statsample/bivariate.rb', line 295 def tau_b(matrix) v=pairs(matrix) ((v['P']-v['Q']).to_f / Math::sqrt((v['P']+v['Q']+v['Y'])*(v['P']+v['Q']+v['X'])).to_f) end |