Difference between revisions of "LGenerics"
Line 76: | Line 76: | ||
* 128-bit integers (unit LGInt128) | * 128-bit integers (unit LGInt128) | ||
* JSON validator/parser/generator(unit lgJson) | * JSON validator/parser/generator(unit lgJson) | ||
+ | |||
+ | =Perfomance= | ||
+ | |||
+ | In 2022/03, a number of functions have been added that implement some algorithms on strings and sequences, regarding metrics and fuzzy matching, the list is annotated in the README of the project. The example/fuzz_bench folder contains several benchmarks for these functions. | ||
+ | |||
+ | It was curious to compare the performance of the SimRatioLevEx() function (which is inspired by FuzzyWuzzy) with some incarnations of the FuzzyWuzzy (listed here) on benchmark datasets. Disclamer: SimRatioLevEx() does not reproduce FuzzyWuzzy, but it does some things in a similar way, in particular, SimRatioLevEx() in smTokenSetEx mode and token_set_ratio() do roughly the same job. | ||
+ | |||
+ | It seems the C++ version only supports single-byte strings, so only compared to the single-byte version of SimRatioLevEx(): | ||
+ | <pre> | ||
+ | Dataset SimRatioLevEx() token_set_ratio() | ||
+ | |||
+ | Short/big_dist 1154 6440 | ||
+ | Short/small_dist 967 3020 | ||
+ | Medium/big_dist 811 3450 | ||
+ | Medium/small_dist 702 1470 | ||
+ | Long/big_dist 1966 15000 | ||
+ | Long/small_dist 1061 2250 | ||
+ | </pre> | ||
+ | |||
+ | The numbers indicate the run time in milliseconds; the C++ version was compiled with gcc-8.1.0 with options -O3 -m64; the Pascal version was compiled with FPC-3.3.1-9941-g8e6e9bbf33, -O3. The benchmark was run on a Windows x64 machine. | ||
+ | |||
+ | The Go version, on the contrary, always works with Unicode strings: | ||
+ | <pre> | ||
+ | Dataset SimRatioLevExUtf8() TokenSetRatio() | ||
+ | |||
+ | Short/big_dist 2143 18705 | ||
+ | Short/small_dist 1593 2224 | ||
+ | Medium/big_dist 1266 15062 | ||
+ | Medium/small_dist 853 1742 | ||
+ | Long/big_dist 3853 79851 | ||
+ | Long/small_dist 1269 3126 | ||
+ | </pre> | ||
+ | |||
+ | Go version: go1.10.4 linux/amd64; FPC-3.3.1-10683-g2a19e152b7 -O3. The benchmark was run on a virtual Linux machine. | ||
=Download= | =Download= | ||
GitHub: https://github.com/avk959/LGenerics | GitHub: https://github.com/avk959/LGenerics |
Revision as of 12:27, 9 March 2022
About
Collection of generic algorithms and data structures entirely written in/for FPC and Lazarus. Started as a self-education project, it now seems quite comfortable and fast.
Requires: FPC 3.2+, Lazarus 1.9+.
Author: A.Koverdyaev (avk)
License: Apache License 2.0
Features
Implemented primitives:
- stack (unit LGStack)
- queue (unit LGQueue)
- deque (unit LGDeque)
- vector (unit LGVector)
- vector of bits (unit LGVector)
- priority queue based on binary heap (unit LGPriorityQueue)
- priority queue with key update and melding based on pairing heap (unit LGPriorityQueue)
- sorted list (unit LGList)
- hashed list - array based list with the ability to fast search by key (unit LGList)
- hashset (unit LGHashSet)
- fine-grained concurrent hashset (unit LGHashSet)
- sorted set (unit LGTreeSet)
- set of arbitrary size (unit LGUtil, TGSet)
- hash multiset (unit LGHashMultiSet)
- fine-grained concurrent hashmultiset (unit LGHashMultiSet)
- sorted multiset (unit LGTreeMultiSet)
- hashmap (unit LGHashMap)
- fine-grained concurrent hashmap (unit LGHashMap)
- sorted map (unit LGTreeMap)
- hash multimap (unit LGMultiMap)
- tree multimap (unit LGMultiMap)
- list miltimap (unit LGMultiMap)
- bijective map (unit LGBiMap)
- sparse 2D table (unit LGTable2D)
- disjoint set (unit LGHashSet)
- AVL tree (unit LGAvlTree)
- red-black tree (unit LGRbTree)
- some treap variants (unit LGTreap)
- general rooted tree (unit LGRootTree)
- sparse labeled undirected graph (unit LGSimpleGraph)
- sparse labeled directed graph (unit LGSimpleDigraph)
Algorithms on arrays and vectors (mostly unit LGArrayHelpers):
- reverse, right/left cyclic shifts
- permutations
- binary search
- N-th order statistics
- inversion counting
- distinct values selection
- quicksort
- introsort
- dual pivot quicksort
- mergesort
- timsort (unit LGMiscUtils)
- counting sort
- radix sort
- translation of Orson Peters' PDQSort algorithm
- static segment tree
- ...
Algorithms on strings
- Boyer-Moore string matching algorithm(in Fast Search variant), case sensitive and case insensitive(unit lgStrHelpers)
- Boyer-Moore-Horspool-Raita algorithm(unit lgStrHelpers)
Other:
- non-cryptogarphic hashes (unit LGHash):
- Yann Collet's xxHash32, xxHash64
- Austin Appleby's MurmurHash2, MurmurHash2A, MurmurHash3_x86_32, MurmurHash64A
- brief and dirty implementation of futures concept (unit LGAsync)
- brief channel implementation (unit LGAsync)
- brief implementation of thread pool (unit LGAsync)
- 128-bit integers (unit LGInt128)
- JSON validator/parser/generator(unit lgJson)
Perfomance
In 2022/03, a number of functions have been added that implement some algorithms on strings and sequences, regarding metrics and fuzzy matching, the list is annotated in the README of the project. The example/fuzz_bench folder contains several benchmarks for these functions.
It was curious to compare the performance of the SimRatioLevEx() function (which is inspired by FuzzyWuzzy) with some incarnations of the FuzzyWuzzy (listed here) on benchmark datasets. Disclamer: SimRatioLevEx() does not reproduce FuzzyWuzzy, but it does some things in a similar way, in particular, SimRatioLevEx() in smTokenSetEx mode and token_set_ratio() do roughly the same job.
It seems the C++ version only supports single-byte strings, so only compared to the single-byte version of SimRatioLevEx():
Dataset SimRatioLevEx() token_set_ratio() Short/big_dist 1154 6440 Short/small_dist 967 3020 Medium/big_dist 811 3450 Medium/small_dist 702 1470 Long/big_dist 1966 15000 Long/small_dist 1061 2250
The numbers indicate the run time in milliseconds; the C++ version was compiled with gcc-8.1.0 with options -O3 -m64; the Pascal version was compiled with FPC-3.3.1-9941-g8e6e9bbf33, -O3. The benchmark was run on a Windows x64 machine.
The Go version, on the contrary, always works with Unicode strings:
Dataset SimRatioLevExUtf8() TokenSetRatio() Short/big_dist 2143 18705 Short/small_dist 1593 2224 Medium/big_dist 1266 15062 Medium/small_dist 853 1742 Long/big_dist 3853 79851 Long/small_dist 1269 3126
Go version: go1.10.4 linux/amd64; FPC-3.3.1-10683-g2a19e152b7 -O3. The benchmark was run on a virtual Linux machine.