Hash collisions

Dimensions in Adobe Analytics collect string values. Sometimes these strings are hundreds of characters long, while other times they are short. To improve performance, these string values are not used directly in report-time processing. Instead, a hash is computed for each value, producing a uniform-size identifier. For most fields, the value is converted to lowercase before hashing, which reduces the total number of unique values. All reports run on these hashed values, which drastically increases their performance.
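The lowercase-then-hash step can be sketched in a few lines. This is only an illustration: Adobe does not document its hash function here, so CRC-32 from the Python standard library stands in as an assumed 32-bit hash, and the function name `dimension_hash` is hypothetical.

```python
import zlib


def dimension_hash(value: str, lowercase: bool = True) -> int:
    """Return a 32-bit identifier for a dimension value.

    CRC-32 is an assumed stand-in, not the hash Adobe Analytics uses.
    """
    if lowercase:
        # Lowercasing first means case variants share one identifier.
        value = value.lower()
    return zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF


# Two case variants of the same string map to the same identifier.
same = dimension_hash("Home Page") == dimension_hash("home page")
print(same)
```

Whatever the input length, the result fits in 32 bits, which is what makes the identifier uniform in size.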

Adobe Analytics maintains a separate hash table for each variable, and each table is rebuilt every month. Within any one of those tables, two different source values can occasionally produce the same hash. This is known as a hash collision.

Hash collisions can manifest in reports as follows:

  • If you view a report over time and see an unexpected spike, it is possible that multiple unique values for that variable use the same hash.
  • If you use a segment and see an unexpected value, it is possible that the unexpected dimension item uses the same hash as another dimension item that matched your segment.

Odds of a hash collision

Adobe Analytics uses 32-bit hashes for most dimensions, which means that there are 2^32 possible hash combinations (approximately 4.3 billion). The approximate odds of encountering a hash collision based on the number of unique values are as follows. These odds are based on a single dimension for a single month.

Unique values    Odds
1,000            0.01%
10,000           1%
50,000           26%
100,000          71%

Similar to the birthday paradox, the likelihood of hash collisions drastically increases as the number of unique values increases. At 1 million unique values, it is likely that there are at least 100 hash collisions for that dimension.
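The odds in the table can be approximated with the standard birthday-problem estimate. The sketch below assumes hashes are spread uniformly over 2^32 values, which is an idealization. The function names are hypothetical, and the article does not state the exact method behind its figures, so the results land near the table rather than exactly on it (roughly 25% at 50,000 and 69% at 100,000 unique values).

```python
import math

HASH_SPACE = 2 ** 32  # number of possible 32-bit hash values


def expected_colliding_pairs(unique_values: int, space: int = HASH_SPACE) -> float:
    """Expected number of value pairs that share a hash: n(n-1) / (2N)."""
    n = unique_values
    return n * (n - 1) / (2 * space)


def collision_probability(unique_values: int, space: int = HASH_SPACE) -> float:
    """Birthday approximation: P(at least one collision) = 1 - exp(-n(n-1)/(2N))."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-expected_colliding_pairs(unique_values, space))


for n in (1_000, 10_000, 50_000, 100_000):
    print(f"{n:>9,} unique values: {collision_probability(n):.2%}")

# At 1 million unique values the expected number of colliding pairs
# is a little over 100, in line with the paragraph above.
print(f"{expected_colliding_pairs(1_000_000):.0f} expected colliding pairs")
```

Because the pair count grows with the square of the number of unique values, a tenfold increase in cardinality produces about a hundredfold increase in expected collisions.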

Mitigating hash collisions

Hash collisions cannot be eliminated entirely, but their impact on reports can be mitigated. Most hash collisions happen between two uncommon values, which has no meaningful impact on reports. Even when a common value collides with an uncommon one, the effect is negligible. However, in the rare case where two popular values collide, the effect can be clearly visible. Adobe recommends the following to reduce the effect of hash collisions in reports:

  • Change the date range: Hash tables change each month. Changing the date range to span another month can give each value different hashes that don’t collide. It is usually the fastest way to clear a visible anomaly from a specific report.
  • Reduce the number of unique values: You can adjust your implementation or use Processing rules to help reduce the number of unique values that a dimension collects. For example, if your dimension collects a URL, you can strip query strings or protocol.
  • Use Data Warehouse or Data Feeds: These tools do not rely on hash tables.
  • Move to Customer Journey Analytics: Customer Journey Analytics has no hashing layer and no cardinality limits on dimensions. Consider moving to this product if hash collisions or Low-Traffic frequently affect your reports.
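As an illustration of reducing unique values, the URL example above could look like the sketch below if the cleanup is done in your own code before the value is sent to the dimension. This is not Processing rules syntax. The function name `normalize_url` is hypothetical, and the sketch assumes every input URL includes a scheme such as `https://`.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Drop protocol, query string, and fragment so URL variants collapse.

    Assumes the input carries a scheme (for example "https://").
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path.rstrip("/") or "/"  # treat "/products/" and "/products" alike
    return f"{host}{path}"


urls = [
    "https://example.com/products?utm_source=newsletter",
    "http://example.com/products/",
    "https://EXAMPLE.com/products#reviews",
]
# All three variants reduce to a single dimension value.
print({normalize_url(u) for u in urls})
```

Fewer distinct values per month means fewer candidate pairs in the hash table, which lowers the odds described earlier.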