By HappyCSV Team

Find Fuzzy Duplicates in CSV Files

Find near-duplicate rows using similarity matching. Catch typos like 'Jon Smith' vs 'John Smith'. Free online tool.

Regular duplicate detection only catches exact matches. But what about typos? "Jon Smith" and "John Smith" are clearly the same person, but they won't show up as duplicates.

Fuzzy duplicate detection finds rows that are similar but not identical.

What is Fuzzy Matching?

Fuzzy matching uses algorithms to calculate how similar two strings are. The most common measure is Levenshtein distance: the number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another.

String A    String B   Distance  Similarity
John        Jon        1         75%
Smith       Smyth      1         80%
Microsoft   Microsft   1         89%
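
To make the numbers concrete, here is a minimal sketch of Levenshtein distance in plain Python. The similarity formula (1 minus the distance divided by the length of the longer string) is an assumption: it reproduces the percentages in the table, but the tool's exact formula isn't documented here. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


for left, right in [("John", "Jon"), ("Smith", "Smyth"), ("Microsoft", "Microsft")]:
    print(left, right, levenshtein(left, right), f"{similarity(left, right):.0%}")
```

Each pair is one edit apart, so the shorter the strings, the more that single edit lowers the similarity.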

Find Fuzzy Duplicates Online

-> Fuzzy Duplicate Finder

  1. Upload your CSV
  2. Select the column to compare
  3. Set similarity threshold (80% recommended)
  4. Download grouped results

Understanding the Output

The tool adds two columns:

  • _DUPLICATE_GROUP - Number identifying which rows are similar
  • _SIMILARITY - How closely each row matches the group

Example output:

Name,Email,_DUPLICATE_GROUP,_SIMILARITY
John Smith,john@email.com,1,1
Jon Smith,jon@email.com,1,0.89
Jonathan Smith,jonathan@email.com,1,0.72
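
Once you have the grouped file, you will usually want to review each group and decide which row to keep. The sketch below reads the example output with Python's standard csv module and groups rows by _DUPLICATE_GROUP. Treating the row with the highest _SIMILARITY as the one to keep is an assumption made for illustration, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Sample output copied from the example above; in practice, open the downloaded file.
output = """Name,Email,_DUPLICATE_GROUP,_SIMILARITY
John Smith,john@email.com,1,1
Jon Smith,jon@email.com,1,0.89
Jonathan Smith,jonathan@email.com,1,0.72
"""


def group_rows(csv_text):
    """Map each _DUPLICATE_GROUP value to the list of rows that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["_DUPLICATE_GROUP"]].append(row)
    return groups


def pick_canonical(rows):
    """Assumption: keep the row with the highest _SIMILARITY in its group."""
    return max(rows, key=lambda row: float(row["_SIMILARITY"]))


for group_id, rows in group_rows(output).items():
    keep = pick_canonical(rows)
    print(f"Group {group_id}: keep {keep['Name']!r}, review {len(rows) - 1} other row(s)")
```

For real data it is safer to review the candidates by hand than to delete the lower-scoring rows automatically.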

Choosing the Right Threshold

Threshold  Catches                  Risk
90%+       Minor typos only         Few false positives
80%        Common variations        Good balance
70%        Significant differences  More false positives
60%        Very loose matching      Many false positives

Recommended: Start with 80% and adjust based on results.
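
To get a feel for how the threshold changes the results, you can count matching pairs at several cutoffs. This sketch uses difflib.SequenceMatcher from Python's standard library as a stand-in scorer. It is not Levenshtein-based, so its scores will differ from the tool's, but the trend is the same: lower thresholds match more pairs. The function name matches_at is illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def matches_at(names, threshold):
    """Return (a, b, score) for every pair scoring at or above threshold (0 to 1)."""
    found = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            found.append((a, b, round(score, 2)))
    return found


names = ["John Smith", "Jon Smith", "Jonathan Smith"]
for threshold in (0.9, 0.8, 0.7, 0.6):
    print(f"{threshold:.0%}: {len(matches_at(names, threshold))} matching pair(s)")
```

Running the same check on a sample of your own data is a quick way to see where false positives start to appear.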

Common Use Cases

Contact Deduplication

  • "Robert Johnson" vs "Rob Johnson"
  • "Mary O'Brien" vs "Mary OBrien"

Product Matching

  • "iPhone 15 Pro" vs "iPhone15 Pro"
  • "Samsung Galaxy S24" vs "Samsung Galaxy S 24"

Address Cleanup

  • "123 Main St" vs "123 Main Street"
  • "New York, NY" vs "New York NY"

Company Name Normalization

  • "Microsoft Corp" vs "Microsoft Corporation"
  • "Apple Inc." vs "Apple"
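
Some of these pairs differ only in case, punctuation, or spacing. A light cleanup step before comparing can turn them into exact matches and leave the fuzzy matcher to handle genuine spelling differences. The sketch below is one possible approach, not a description of what the tool does, and the normalize helper is hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.casefold()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation such as ' , .
    return re.sub(r"\s+", " ", text).strip()


pairs = [
    ("Mary O'Brien", "Mary OBrien"),
    ("New York, NY", "New York NY"),
    ("iPhone 15 Pro", "iPhone15 Pro"),
]
for left, right in pairs:
    same = normalize(left) == normalize(right)
    print(f"{left!r} vs {right!r}: {'exact match after cleanup' if same else 'still needs fuzzy matching'}")
```

The first two pairs become identical after cleanup. The iPhone pair still differs by a space, so it is left for similarity matching.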

Python Alternative

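from itertools import combinations

import pandas as pd
# fuzzywuzzy has been renamed thefuzz; `from thefuzz import fuzz` is a drop-in replacement
from fuzzywuzzy import fuzz

df = pd.read_csv('data.csv')

# Skip missing values and make sure every name is a string
names = df['Name'].dropna().astype(str).tolist()

# Compare each unordered pair exactly once
for name1, name2 in combinations(names, 2):
    score = fuzz.ratio(name1, name2)  # similarity from 0 to 100
    if score > 80:
        print(f"Similar: {name1} ~ {name2} ({score}%)")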

Note: This approach compares every pair of rows, so it runs in O(n²) time: 10,000 rows need about 50 million comparisons. Our tool is optimized for performance.
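
If you stay in Python, one common way to cut the number of comparisons is blocking: only compare rows that share a cheap key. The sketch below uses the first letter of the name as the key, which is an assumption chosen for simplicity. It will miss duplicates whose first letters differ, and it is not a description of how our tool works. The function name blocked_pairs is illustrative.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(names, key=lambda name: name[:1].lower()):
    """Yield candidate pairs only from names that share the same blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[key(name)].append(name)
    for group in blocks.values():
        yield from combinations(group, 2)


names = ["John Smith", "Jon Smith", "Mary OBrien", "Mary O'Brien", "Robert Johnson"]
candidates = list(blocked_pairs(names))
all_pairs = len(names) * (len(names) - 1) // 2
print(f"{len(candidates)} candidate pairs instead of {all_pairs}")
for left, right in candidates:
    print(left, "~", right)
```

Each candidate pair can then be scored with the similarity function of your choice.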


Catch the typos. HappyCSV finds fuzzy duplicates that exact matching misses.
