Levenshtein distance and Jaro–Winkler similarity
I'm trying to build a program that takes a large list of company name strings (often messy: legal suffixes, &, typos, abbreviations) and groups names that likely refer to the same or closely related entities, i.e. **near-duplicate / fuzzy clustering**.
**Example:** Paypal Holding | Paypal holdings | Paypal Inc
Warner Bros. | Warnner Brothers
Note: The algorithm should handle hundreds of thousands of records.
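
To make the goal concrete, here is a rough sketch of the kind of pipeline I have in mind: normalize each name, put names into blocks so that not every pair is compared, score pairs inside a block with a normalized Levenshtein similarity, and merge matches with union-find. Everything named here is an assumption for illustration only: the `LEGAL_SUFFIXES` and `ABBREVIATIONS` lists, the `0.85` threshold, and the 3-character prefix blocking are all made up and would need tuning on real data. Jaro–Winkler could be dropped in as the `similarity` function instead.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lists; real data would need far longer ones.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "plc", "holding", "holdings"}
ABBREVIATIONS = {"bros": "brothers", "intl": "international"}


def normalize(name):
    """Lowercase, expand '&' and known abbreviations, drop trailing legal suffixes."""
    text = name.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = [ABBREVIATIONS.get(t, t) for t in text.split()]
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance scaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster(names, threshold=0.85, prefix_len=3):
    """Group names whose normalized forms are similar within the same block."""
    normalized = [normalize(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Blocking: only names sharing a short prefix are compared with each other.
    blocks = defaultdict(list)
    for idx, norm in enumerate(normalized):
        blocks[norm[:prefix_len]].append(idx)

    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                if similarity(normalized[i], normalized[j]) >= threshold:
                    parent[find(i)] = find(j)

    groups = defaultdict(list)
    for idx, name in enumerate(names):
        groups[find(idx)].append(name)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["Paypal Holding", "Paypal holdings", "Paypal Inc",
              "Warner Bros.", "Warnner Brothers"]
    for group in cluster(sample):
        print(group)
```

Prefix blocking is the crudest possible choice and misses typos in the first few characters; the pairwise loop inside a block is still quadratic, so very large blocks would need a finer blocking key or an approximate nearest-neighbour index.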