An Efficient Similarity Digests Database Lookup - A Logarithmic Divide & Conquer Approach

Frank Breitinger, Christian Rathgeb, Harald Baier


Investigating seized devices within digital forensics represents a challenging task due to the increasing amount of data. Common procedures utilize automated file identification, which reduces the amount of data an investigator has to examine manually. In the past years the research field of approximate matching arises to detect similar data. However, if n denotes the number of similarity digests in a database, then the lookup for a single similarity digest is of complexity of O(n). This paper presents a concept to extend existing approximate matching algorithms, which reduces the lookup complexity from O(n) to O(log(n)). Our proposed approach is based on the well-known divide and conquer paradigm and builds a Bloom filter-based tree data structure in order to enable an efficient lookup of similarity digests. Further, it is demonstrated that the presented technique is highly scalable operating a trade-off between storage requirements and computational efficiency. We perform a theoretical assessment based on recently published results and reasonable magnitudes of input data, and show that the complexity reduction achieved by the proposed technique yields a 220-fold acceleration of look-up costs.


digital forensics; hashing; approximate matching; Bloom filter; mrsh-v2; sdhash; indexing

Full Text:



Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13 , 422--426.

Breitinger, F., & Baier, H. (2013). Similarity preserving hashing: Eligible properties and a new algorithm mrsh-v2. In M. Rogers & K. Seigfried-Spellar (Eds.), Digital forensics and cyber crime (Vol. 114, pp. 167--182). Springer Berlin Heidelberg. Retrieved from 978-3-642-39891-9 11 doi: 10.1007/978-3-642-39891-9 11

Breitinger, F., Baier, H., & White, D. (2014). On the database lookup problem of approximate matching. 1st Digital Forensics Research Conference EU (DFRWS-EU'14).

Breitinger, F., Liu, H., Winter, C., Baier, H., Rybalchenko, A., & Steinebach, M. (2013, Sept). Towards a process model for hash functions in digital forensics. 5th International Conference on Digital Forensics & Cyber Crime.

Breitinger, F., Stivaktakis, G., & Baier, H. (2013, August). FRASH: A framework to test algorithms of similarity hashing. In 13th Digital Forensics Research Conference (DFRWS'13). Monterey.

Breitinger, F., Stivaktakis, G., & Roussev, V. (2013, Sept). Evaluating detection error trade-offs for bytewise approximate matching algorithms. 5th ICST Conference on Digital Forensics & Cyber

Crime (ICDF2C).

Broder, A., & Mitzenmacher, M. (2005). Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1 (4), 485--509.

Gallagher, P., & Director, A. (1995). Secure Hash Standard (SHS) (Tech. Rep.). National Institute of Standards and Technologies, Federal Information Processing Standards Publication 180-1.

Kornblum, J. (2006, September). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3 , 91--97. Retrieved from doi:


Mullin, J. (1990, may). Optimal semijoins for distributed database systems. IEEE Transactions on Software Engineering, 16 (5), 558 -560. doi: 10.1109/32.52778

NIST Information Technology Laboratory. (2013). National Software Reference Library.

Noll, L. C. (1994--2012). Fnv hash.

Roussev, V. (2010). Data fingerprinting with similarity digests. In K.-P. Chow & S. Shenoi (Eds.), Advances in digital forensics vi (Vol. 337, pp. 207--226). Springer Berlin Heidelberg. Retrieved from 978-3-642-15506-2 15 doi: 10.1007/978-3-642-15506-2n 15

Roussev, V. (2011, August). An evaluation of forensic similarity hashes. Digital Investigation, 8 , 34--41. Retrieved from doi:


Roussev, V. (2012). Managing terabyte-scale investigations with similarity digests. In

G. Peterson & S. Shenoi (Eds.), Advances in digital forensics viii (Vol. 383, pp. 19--34). Springer Berlin Heidelberg. Retrieved from 2 doi: 10.1007/978-3-642-33962-2 2

Roussev, V., III, G. G. R., & Marziale, L. 2007, September). Multi-resolution similarity hashing. Digital Investigation, 4, 105--113. doi: 10.1016/j.diin.2007.06.011

Winter, C., Schneider, M., & Yannikos, Y. (2013). F2s2: Fast forensic similarity search through indexing piecewise hashsignatures. Digital Investigation, 0 , -. (Article in Press - no journal issue assigned by now) doi: 10.1016/j.diin.2013.08.003


  • There are currently no refbacks.

Copyright (c) 2014 Journal of Digital Forensics, Security and Law

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

(c) 2006-2015 Association of Digital Forensics, Security and Law