Experiment 1: TIN Normalization & Duplicate Detection

Experiment Objective

This experiment tests whether normalizing Tax Identification Numbers (TINs) reduces duplicate supplier entries and improves dataset consistency.

Working Logic

  • Input Format: CSV with fields TIN, Company Name, Address
  • Normalization: Remove hyphens, enforce 9–12 digit format
  • Duplicate Detection: Match cleaned TINs across records
  • Partition: Separate into Unique Records and Duplicates
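
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the experiment's actual implementation. The function names normalize_tin and partition, the separate "invalid" bucket, the placeholder addresses, and the rule that the first occurrence of a cleaned TIN counts as unique while later occurrences count as duplicates are all assumptions made for the example.

```python
import csv
import io
import re


def normalize_tin(raw):
    """Remove hyphens; return the cleaned TIN if it is 9-12 digits, else None."""
    cleaned = raw.strip().replace("-", "")
    if re.fullmatch(r"\d{9,12}", cleaned):
        return cleaned
    return None


def partition(rows):
    """Split rows into unique, duplicate, and invalid lists by cleaned TIN.

    Assumption: the first record seen for a cleaned TIN is treated as unique,
    and every later record with the same cleaned TIN is treated as a duplicate.
    """
    seen = set()
    unique, duplicates, invalid = [], [], []
    for row in rows:
        cleaned = normalize_tin(row["TIN"])
        if cleaned is None:
            invalid.append(row)
        elif cleaned in seen:
            duplicates.append({**row, "Cleaned TIN": cleaned})
        else:
            seen.add(cleaned)
            unique.append({**row, "Cleaned TIN": cleaned})
    return unique, duplicates, invalid


# Tiny in-memory CSV with the documented fields; addresses are placeholders.
sample = io.StringIO(
    "TIN,Company Name,Address\n"
    "002-040-000-001,San Miguel Corporation,Address A\n"
    "002040000001,San Miguel Corporation,Address A\n"
    "721523532,UnionBank,Address B\n"
)
unique, duplicates, invalid = partition(csv.DictReader(sample))
print(len(unique), len(duplicates), len(invalid))
```

Keeping invalid TINs in their own list is a design choice for the sketch: it prevents malformed values from being counted as either unique or duplicate.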

Data Source

  • Default Dataset: companies.csv (201 records)
  • Custom Upload: User option for testing external datasets

Validation Progress

  • Status: ✅ Experiment Completed
  • Total Records: 201
  • Processed Records: 201
  • Found Duplicates: 20

Results Statistics

  • Normalization Success: 95%
  • Unique Records Identified: 181
  • Duplicate Records Detected: 20

Detailed Results

Unique Records Examples (original TIN → cleaned TIN):

  • UnionBank — 721523532 → 721523532
  • Robinsons Retail — 402393699 → 402393699

Duplicate Records Example:

  • San Miguel Corporation
    • TIN Variants: 002-040-000-001, 002040000001
    • Normalized: 002040000001
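
To illustrate how the TIN variants above collapse to one normalized value, the following sketch groups raw TINs by their hyphen-free form. It assumes a simplified normalize_tin helper that only removes hyphens, and the records list is built from the examples in this section rather than read from companies.csv.

```python
from collections import defaultdict


def normalize_tin(raw):
    """Simplified, hypothetical helper: strip hyphens only."""
    return raw.replace("-", "")


# (company name, raw TIN) pairs taken from the examples in this section.
records = [
    ("San Miguel Corporation", "002-040-000-001"),
    ("San Miguel Corporation", "002040000001"),
    ("UnionBank", "721523532"),
    ("Robinsons Retail", "402393699"),
]

# Collect every raw variant under its normalized TIN.
variants = defaultdict(list)
for name, tin in records:
    variants[normalize_tin(tin)].append(tin)

# A normalized TIN with more than one raw record is a duplicate group.
duplicate_groups = {k: v for k, v in variants.items() if len(v) > 1}
print(duplicate_groups)
```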

Conclusion

On the default dataset of 201 records, TIN normalization identified 181 unique records and 20 duplicates. Matching on the cleaned TIN grouped records for the same supplier even when the TIN was entered in different formats, such as with and without hyphens.
