Find Fuzzy Duplicates in CSV Files
Find near-duplicate rows using similarity matching. Catch typos like 'Jon Smith' vs 'John Smith'. Free online tool.
Regular duplicate detection only catches exact matches. But what about typos? "Jon Smith" and "John Smith" are clearly the same person, but they won't show up as duplicates.
Fuzzy duplicate detection finds rows that are similar but not identical.
What is Fuzzy Matching?
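Fuzzy matching uses algorithms to calculate how similar two strings are. The most common is Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. In the table below, similarity is the distance expressed relative to the longer string's length.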
| String A | String B | Distance | Similarity |
|---|---|---|---|
| John | Jon | 1 | 75% |
| Smith | Smyth | 1 | 80% |
| Microsoft | Microsft | 1 | 89% |
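The distances and percentages in the table can be reproduced with a short sketch. It assumes similarity is computed as 1 - distance / length of the longer string, which matches the table's numbers but is not confirmed as the tool's exact formula. The function names `levenshtein` and `similarity` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed formula: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(a, b) / longest


for a, b in [("John", "Jon"), ("Smith", "Smyth"), ("Microsoft", "Microsft")]:
    print(a, b, levenshtein(a, b), f"{similarity(a, b):.0%}")
```

Other similarity measures (for example, the ratio used by `fuzzywuzzy`) normalize differently, so their scores for the same pair may not match these percentages.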
Find Fuzzy Duplicates Online
- Upload your CSV
- Select the column to compare
- Set similarity threshold (80% recommended)
- Download grouped results
Understanding the Output
The tool adds two columns:
- _DUPLICATE_GROUP - Number identifying which rows are similar
- _SIMILARITY - How closely each row matches the group
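Example output (similarity is on a 0 to 1 scale; the 0.72 row would only be grouped at a threshold of 70% or lower):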
Name,Email,_DUPLICATE_GROUP,_SIMILARITY
John Smith,john@email.com,1,1
Jon Smith,jon@email.com,1,0.89
Jonathan Smith,jonathan@email.com,1,0.72
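One way to produce columns like these is a greedy pass that compares each value against the first member of every existing group. This is a hypothetical sketch, not the tool's actual algorithm. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, so its scores will differ from the example output above. The function name `group_fuzzy_duplicates` is an assumption.

```python
from difflib import SequenceMatcher


def group_fuzzy_duplicates(values, threshold=0.8):
    """Return (group_ids, similarities), one entry per input value."""
    representatives = []  # first value seen in each group
    groups, scores = [], []
    for value in values:
        best_group, best_score = None, 0.0
        for group_id, rep in enumerate(representatives, start=1):
            score = SequenceMatcher(None, value.lower(), rep.lower()).ratio()
            if score >= threshold and score > best_score:
                best_group, best_score = group_id, score
        if best_group is None:
            # No group is close enough: this value starts a new group.
            representatives.append(value)
            best_group, best_score = len(representatives), 1.0
        groups.append(best_group)
        scores.append(round(best_score, 2))
    return groups, scores


names = ["John Smith", "Jon Smith", "Alice Wong"]
groups, scores = group_fuzzy_duplicates(names)
for name, group, score in zip(names, groups, scores):
    print(f"{name},{group},{score}")
```

Because each value is compared only to a group's first member, the result depends on row order; a different grouping strategy could assign borderline rows differently.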
Choosing the Right Threshold
| Threshold | Catches | Risk |
|---|---|---|
| 90%+ | Minor typos only | Few false positives |
| 80% | Common variations | Good balance |
| 70% | Significant differences | More false positives |
| 60% | Very loose matching | Many false positives |
Recommended: Start with 80% and adjust based on results.
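To see how the threshold changes what gets flagged, it can help to count matching pairs at several thresholds on a small sample. The sketch below uses made-up sample names and `difflib.SequenceMatcher` as an assumed similarity measure; the helper `pairs_at_threshold` is illustrative and not part of any tool.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairs_at_threshold(values, threshold):
    """All unordered pairs whose similarity ratio meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if SequenceMatcher(None, a, b).ratio() >= threshold
    ]


# Made-up sample data for illustration only.
names = ["John Smith", "Jon Smith", "Jonathan Smith", "Joan Smythe"]

counts = {t: len(pairs_at_threshold(names, t)) for t in (0.9, 0.8, 0.7, 0.6)}
for threshold, count in counts.items():
    print(f"threshold {threshold:.0%}: {count} pair(s) flagged")
```

Lowering the threshold can only add pairs, never remove them, which is why looser settings bring more false positives along with the extra matches.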
Common Use Cases
Contact Deduplication
- "Robert Johnson" vs "Rob Johnson"
- "Mary O'Brien" vs "Mary OBrien"
Product Matching
- "iPhone 15 Pro" vs "iPhone15 Pro"
- "Samsung Galaxy S24" vs "Samsung Galaxy S 24"
Address Cleanup
- "123 Main St" vs "123 Main Street"
- "New York, NY" vs "New York NY"
Company Name Normalization
- "Microsoft Corp" vs "Microsoft Corporation"
- "Apple Inc." vs "Apple"
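Several of the pairs above differ only in case, punctuation, or a common abbreviation, so a light normalization pass before comparing can make them match exactly. The sketch below is one possible approach; the `ABBREVIATIONS` map and the `normalize` function are hypothetical and would need extending for real data.

```python
import re

# Hypothetical abbreviation map; extend it for your own data.
ABBREVIATIONS = {"st": "street", "corp": "corporation"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, expand abbreviations."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation such as ' , .
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize("Mary O'Brien") == normalize("Mary OBrien"))
print(normalize("123 Main St") == normalize("123 Main Street"))
print(normalize("New York, NY") == normalize("New York NY"))
```

Pairs such as "iPhone 15 Pro" vs "iPhone15 Pro" or "Apple Inc." vs "Apple" still differ after this step, so fuzzy matching remains necessary for them.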
Python Alternative
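```python
from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz  # the package was later renamed: `from thefuzz import fuzz`

df = pd.read_csv('data.csv')

# Skip missing values and make sure every name is a string
names = df['Name'].dropna().astype(str).tolist()

# Compare each unordered pair exactly once
for name1, name2 in combinations(names, 2):
    score = fuzz.ratio(name1, name2)  # similarity score from 0 to 100
    if score > 80:
        print(f"Similar: {name1} ~ {name2} ({score}%)")
```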
Note: This approach compares every pair of rows, so its running time grows as O(n²) and it becomes slow on large datasets. Our tool is optimized for performance.
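A common way to reduce the quadratic cost in your own scripts is blocking: only compare rows that share a cheap key. The sketch below is an illustration under stated assumptions, not a description of how the tool works. The blocking key (lowercased first character), the function name `similar_pairs_blocked`, and the use of `difflib.SequenceMatcher` are all choices made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs_blocked(names, threshold=0.8):
    """Find similar pairs, comparing only names that share a blocking key."""
    # Assumed blocking key: the lowercased first character of the name.
    blocks = defaultdict(list)
    for name in names:
        if name:  # skip empty strings
            blocks[name[0].lower()].append(name)

    pairs = []
    for members in blocks.values():
        # Pairwise comparison is still quadratic, but only within each block.
        for a, b in combinations(members, 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


sample = ["John Smith", "Jon Smith", "Alice Wong", "Alicia Wong"]
for a, b, score in similar_pairs_blocked(sample):
    print(f"Similar: {a} ~ {b} ({score})")
```

The trade-off is recall: a typo in the first character (say, "Kohn Smith") puts the row in a different block, so that pair is never compared.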
Catch the typos. HappyCSV finds fuzzy duplicates that exact matching misses.