Class: Gitlab::Database::PostgresHll::BatchDistinctCounter
- Inherits:
-
Object
- Object
- Gitlab::Database::PostgresHll::BatchDistinctCounter
- Defined in:
- lib/gitlab/database/postgres_hll/batch_distinct_counter.rb
Overview
HyperLogLog is an PROBABILISTIC algorithm that ESTIMATES distinct count of given attribute value for supplied relation Like all probabilistic algorithm is has ERROR RATE margin, that can affect values, for given implementation no higher value was reported (gitlab.com/gitlab-org/gitlab/-/merge_requests/45673#accuracy-estimation) than 5.3% for the most of a cases this value is lower. However, if the exact value is necessary other tools has to be used.
For large tables, PostgreSQL can take a long time to count rows due to MVCC. Implements a distinct batch counter based on HyperLogLog algorithm Needs indexes on the column below to calculate max, min and range queries For larger tables just set higher batch_size with index optimization
In order to not use a possible complex time consuming query when calculating min and max values, the start and finish can be sent specifically, start and finish should contain max and min values for PRIMARY KEY of relation (most cases ‘id` column) rather than counted attribute eg: estimate_distinct_count(start: ::Project.aimed_for_deletion.minimum(:id), finish: ::Project.aimed_for_deletion.maximum(:id))
Grouped relations are NOT supported yet.
Constant Summary collapse
- ERROR_RATE =
max encountered empirical error rate, used in tests
4.9
- MIN_REQUIRED_BATCH_SIZE =
750
- SLEEP_TIME_IN_SECONDS =
10 msec sleep
0.01
- MAX_DATA_VOLUME =
4_000_000_000
- DEFAULT_BATCH_SIZE =
Each query should take < 500ms gitlab.com/gitlab-org/gitlab/-/merge_requests/22705
10_000
- ZERO_OFFSET =
1
- BUCKET_ID_MASK =
(Buckets::TOTAL_BUCKETS - ZERO_OFFSET).to_s(2)
- BIT_31_MASK =
"B'0#{'1' * 31}'"
- BIT_32_NORMALIZED_BUCKET_ID_MASK =
"B'#{'0' * (32 - BUCKET_ID_MASK.size)}#{BUCKET_ID_MASK}'"
- WRONG_CONFIGURATION_ERROR =
Class.new(ActiveRecord::StatementInvalid)
Instance Method Summary collapse
-
#execute(batch_size: nil, start: nil, finish: nil) ⇒ Gitlab::Database::PostgresHll::Buckets
Executes counter that iterates over database source and return Gitlab::Database::PostgresHll::Buckets that can be used to estimation of number of uniq elements in analysed set.
-
#initialize(relation, column = nil) ⇒ BatchDistinctCounter
constructor
A new instance of BatchDistinctCounter.
Constructor Details
#initialize(relation, column = nil) ⇒ BatchDistinctCounter
Returns a new instance of BatchDistinctCounter.
47 48 49 50 |
# File 'lib/gitlab/database/postgres_hll/batch_distinct_counter.rb', line 47 def initialize(relation, column = nil) @relation = relation @column = column || relation.primary_key end |
Instance Method Details
#execute(batch_size: nil, start: nil, finish: nil) ⇒ Gitlab::Database::PostgresHll::Buckets
Executes counter that iterates over database source and return Gitlab::Database::PostgresHll::Buckets that can be used to estimation of number of uniq elements in analysed set
59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
# File 'lib/gitlab/database/postgres_hll/batch_distinct_counter.rb', line 59 def execute(batch_size: nil, start: nil, finish: nil) raise 'BatchCount can not be run inside a transaction' if transaction_open? batch_size ||= DEFAULT_BATCH_SIZE start = actual_start(start) finish = actual_finish(finish) raise WRONG_CONFIGURATION_ERROR if unwanted_configuration?(start, finish, batch_size) batch_start = start hll_buckets = Buckets.new while batch_start <= finish hll_buckets.merge_hash!(hll_buckets_for_batch(batch_start, batch_start + batch_size)) batch_start += batch_size sleep(SLEEP_TIME_IN_SECONDS) end hll_buckets end |