Levenshtein distance and Jaro–Winkler similarity

07:35 28 Mar 2026

I'm trying to build a program that takes a large list of company name strings (often messy: legal suffixes, &, typos, abbreviations) and groups names that likely refer to the same or closely related entities, i.e. **near-duplicate / fuzzy clustering**.

**Example:**
Paypal Holding | Paypal holdings | Paypal Inc
Warner Bros. | Warnner Brothers

Note: The algorithm should handle hundreds of thousands of records.
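
To make the goal concrete, here is a minimal sketch of one possible pipeline: normalize each name, bucket names by a cheap blocking key so that only names in the same bucket are compared, score pairs with a length-normalized Levenshtein similarity, and merge matches with union-find. Everything in it is an assumption for illustration: the class name `NameClusterSketch`, the short suffix list, the `bros` to `brothers` expansion, the 3-character prefix as blocking key, and the 0.85 threshold. A prefix key misses typos in the first three characters, and a large bucket is still compared pairwise, so this shows the shape of the problem, not a tuned solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class NameClusterSketch {

    // Lowercase, map "&" to "and", drop punctuation, expand one abbreviation,
    // and strip a few legal suffixes. The word lists are illustrative only.
    public static String normalize(String name) {
        String s = name.toLowerCase(Locale.ROOT).replace("&", " and ");
        s = s.replaceAll("[^a-z0-9 ]", " ");
        s = s.replaceAll("\\bbros\\b", "brothers");
        s = s.replaceAll("\\b(inc|llc|ltd|corp|co|holding|holdings)\\b", " ");
        return s.trim().replaceAll("\\s+", " ");
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance scaled into [0, 1], where 1.0 means identical strings.
    public static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Union-find root lookup with path halving.
    private static int find(int[] parent, int i) {
        while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
        return i;
    }

    public static List<List<String>> cluster(List<String> names, double threshold) {
        int n = names.size();
        String[] norm = new String[n];
        int[] parent = new int[n];
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int i = 0; i < n; i++) {
            norm[i] = normalize(names.get(i));
            parent[i] = i;
            // Hypothetical blocking key: first 3 characters of the normalized name.
            String key = norm[i].substring(0, Math.min(3, norm[i].length()));
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        // Compare pairs only within a block, merging those above the threshold.
        for (List<Integer> block : blocks.values()) {
            for (int x = 0; x < block.size(); x++) {
                for (int y = x + 1; y < block.size(); y++) {
                    int i = block.get(x), j = block.get(y);
                    if (similarity(norm[i], norm[j]) >= threshold) {
                        parent[find(parent, i)] = find(parent, j);
                    }
                }
            }
        }
        // Collect the original names by cluster root, in input order.
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(names.get(i));
        }
        return new ArrayList<>(groups.values());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Paypal Holding", "Paypal holdings",
                "Paypal Inc", "Warner Bros.", "Warnner Brothers");
        for (List<String> group : cluster(names, 0.85)) {
            System.out.println(group);
        }
    }
}
```

On the five example names this yields two groups, one for the Paypal variants and one for the Warner variants. Swapping `similarity` for Jaro–Winkler would only change the scoring function; the blocking step is what keeps the number of comparisons manageable at larger scale.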

java data-structures jvm levenshtein-distance jaro-winkler
