Find Fuzzy Duplicates in CSV Files

Regular duplicate detection only catches exact matches. But what about typos? "Jon Smith" and "John Smith" are very likely the same person, but an exact comparison won't flag them as duplicates.

Fuzzy duplicate detection finds rows that are similar but not identical.

What is Fuzzy Matching?

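Fuzzy matching uses algorithms to calculate how similar two strings are. The most common is Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. In the table below, similarity is that distance scaled by the length of the longer string, so one edit across four characters gives 75%.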

String A	String B	Distance	Similarity
John	Jon	1	75%
Smith	Smyth	1	80%
Microsoft	Microsft	1	89%
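
To make the table concrete, here is a minimal sketch of Levenshtein distance in plain Python. The similarity formula (one minus the distance divided by the longer string's length) is an assumption that happens to match the figures above; other tools may scale the score differently. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] is the distance between the previous prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Distance scaled to 0..1 by the longer string's length (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(a, b) / longest


for a, b in [("John", "Jon"), ("Smith", "Smyth"), ("Microsoft", "Microsft")]:
    print(f"{a} / {b}: distance {levenshtein(a, b)}, similarity {similarity(a, b):.0%}")
```

Run as-is, this prints the same distances and percentages as the three rows of the table.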

Find Fuzzy Duplicates Online

-> Fuzzy Duplicate Finder

1. Upload your CSV
2. Select the column to compare
3. Set similarity threshold (80% recommended)
4. Download grouped results

Understanding the Output

The tool adds two columns:

_DUPLICATE_GROUP - Number identifying which rows are similar
_SIMILARITY - How closely each row matches the group, as a fraction from 0 to 1 (0.89 corresponds to 89%)

Example output (shown here for a threshold of 70% or lower, since the last row scores 0.72):

Name,Email,_DUPLICATE_GROUP,_SIMILARITY
John Smith,john@email.com,1,1
Jon Smith,jon@email.com,1,0.89
Jonathan Smith,jonathan@email.com,1,0.72
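
For readers curious how columns like these could be produced, the following is a rough sketch, not the tool's actual algorithm. It assumes a greedy strategy (each row joins the first group whose first member it resembles) and uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in similarity score, so the numbers will not match the example above exactly. The function name and the "Acme Corp" row are hypothetical.

```python
from difflib import SequenceMatcher


def add_duplicate_groups(rows, column, threshold=0.8):
    """Return copies of rows with _DUPLICATE_GROUP and _SIMILARITY added.

    Greedy sketch: a row joins the first group whose representative
    (the row that started the group) scores at or above the threshold.
    """
    representatives = []  # representatives[g - 1] is the text that started group g
    result = []
    for row in rows:
        text = row[column]
        group, score = None, 1.0  # a row that starts a group matches itself fully
        for number, rep in enumerate(representatives, start=1):
            ratio = SequenceMatcher(None, rep.lower(), text.lower()).ratio()
            if ratio >= threshold:
                group, score = number, ratio
                break
        if group is None:
            representatives.append(text)
            group = len(representatives)
        result.append({**row, "_DUPLICATE_GROUP": group, "_SIMILARITY": round(score, 2)})
    return result


rows = [
    {"Name": "John Smith", "Email": "john@email.com"},
    {"Name": "Jon Smith", "Email": "jon@email.com"},
    {"Name": "Acme Corp", "Email": "info@acme.example"},  # hypothetical unrelated row
]
for row in add_duplicate_groups(rows, "Name"):
    print(row)
```

With this input, the two Smith rows share group 1 and the unrelated row starts group 2. A greedy pass depends on row order; real tools may cluster differently.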

Choosing the Right Threshold

Threshold	Catches	Risk
90%+	Minor typos only	Few false positives
80%	Common variations	Good balance
70%	Significant differences	More false positives
60%	Very loose matching	Many false positives

Recommended: Start with 80%. Raise the threshold if unrelated rows end up grouped together, and lower it if known duplicates are being missed.
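
One way to pick a threshold is to try several on a sample of your data and see how many pairs each one flags. The sketch below does that with the standard library's `difflib.SequenceMatcher` ratio, which is a stand-in for whatever score your tool uses, so the exact cut-off points are an assumption. `matching_pairs` is an illustrative helper.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["John Smith", "Jon Smith", "Jonathan Smith"]


def matching_pairs(names, threshold):
    """Return every pair whose similarity ratio is at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


# Lower thresholds can only add pairs, never remove them
for threshold in (0.9, 0.8, 0.7, 0.6):
    pairs = matching_pairs(names, threshold)
    print(f"{threshold:.0%}: {len(pairs)} pair(s) {pairs}")
```

Reading the flagged pairs at each level shows where real duplicates stop and false positives start for your particular data.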

Common Use Cases

Contact Deduplication

"Robert Johnson" vs "Rob Johnson"
"Mary O'Brien" vs "Mary OBrien"

Product Matching

"iPhone 15 Pro" vs "iPhone15 Pro"
"Samsung Galaxy S24" vs "Samsung Galaxy S 24"

Address Cleanup

"123 Main St" vs "123 Main Street"
"New York, NY" vs "New York NY"

Company Name Normalization

"Microsoft Corp" vs "Microsoft Corporation"
"Apple Inc." vs "Apple"
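
Abbreviations and suffixes are harder than typos: "Apple Inc." and "Apple" differ by 5 of 10 characters, which is only 50% under an edit-distance score scaled by the longer string. A light normalization pass before comparing can help. This is a minimal sketch; the suffix list and the `normalize_company` name are assumptions for illustration, and the list should be adapted to your data.

```python
import re

# Illustrative suffix list (an assumption); extend it for your own data
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


for a, b in [("Microsoft Corp", "Microsoft Corporation"), ("Apple Inc.", "Apple")]:
    same = normalize_company(a) == normalize_company(b)
    print(f"{a!r} vs {b!r}: same after normalization = {same}")
```

After normalization both example pairs become exact matches, so fuzzy matching only has to handle the remaining typos.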

Python Alternative

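from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz  # the maintained successor is thefuzz (same API)

df = pd.read_csv('data.csv')

# Drop missing names and cast to str so fuzz.ratio always receives strings
names = df['Name'].dropna().astype(str)

# Compare each unordered pair exactly once
for name1, name2 in combinations(names, 2):
    score = fuzz.ratio(name1, name2)  # integer score from 0 to 100
    if score >= 80:
        print(f"Similar: {name1} ~ {name2} ({score}%)")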

Note: This approach compares every pair of rows, so it is O(n²): 10,000 rows means roughly 50 million comparisons, which is slow for large datasets. Our tool is optimized for performance.
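
A common way to reduce the number of comparisons in your own scripts is blocking: only compare rows that share a cheap key. The sketch below uses the first letter as the key and the standard library's `difflib.SequenceMatcher` as the scorer. Both choices are assumptions for illustration and say nothing about how any particular tool is implemented.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs_blocked(names, threshold=0.8):
    """Compare only names that share a first letter (a simple blocking key)."""
    blocks = defaultdict(list)
    for name in names:
        key = name.strip()[:1].lower()  # empty string for blank names
        blocks[key].append(name)

    pairs = []
    for block in blocks.values():
        # Pairwise comparison is still quadratic, but only within each block
        for a, b in combinations(block, 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs


names = ["John Smith", "Jon Smith", "Mary O'Brien", "Mary OBrien"]
print(similar_pairs_blocked(names))
```

The trade-off is recall: a typo in the first character puts two duplicates in different blocks, so they are never compared.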

Catch the typos. HappyCSV finds fuzzy duplicates that exact matching misses.
