Experiment 1: Data Loading and Analysis
Experiment 1: TIN Normalization & Duplicate Detection
Experiment Objective
This experiment validates the effectiveness of normalizing Tax Identification Numbers (TINs) to reduce duplicate supplier entries and improve dataset consistency.
Working Logic
- Input Format: CSV with fields TIN, Company Name, Address
- Normalization: Remove hyphens, then require the cleaned value to be 9–12 digits
- Duplicate Detection: Match cleaned TINs across records
- Partition: Separate into Unique Records and Duplicates
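The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual implementation: the function names `normalize_tin` and `partition_records` are hypothetical, and two behaviors are assumptions because the report does not specify them. First, the first record seen for each cleaned TIN is treated as unique and later matches as duplicates. Second, records whose cleaned TIN is not 9–12 digits are set aside rather than matched.

```python
import re
from collections import defaultdict

def normalize_tin(raw):
    """Strip hyphens; return the cleaned TIN, or None if it is not 9-12 digits."""
    cleaned = raw.strip().replace("-", "")
    return cleaned if re.fullmatch(r"\d{9,12}", cleaned) else None

def partition_records(records):
    """Split records into (unique, duplicates, invalid) by cleaned TIN.

    Assumption: the first record per cleaned TIN counts as unique,
    and every later record with the same cleaned TIN is a duplicate.
    """
    groups = defaultdict(list)
    invalid = []
    for rec in records:
        cleaned = normalize_tin(rec["TIN"])
        if cleaned is None:
            invalid.append(rec)  # failed the 9-12 digit check
        else:
            groups[cleaned].append({**rec, "Cleaned TIN": cleaned})

    unique, duplicates = [], []
    for group in groups.values():
        unique.append(group[0])
        duplicates.extend(group[1:])
    return unique, duplicates, invalid
```

Grouping by the cleaned TIN in a dictionary keeps the matching step linear in the number of records, instead of comparing every pair.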
Data Source
- Default Dataset: companies.csv (201 records)
- Custom Upload: User option for testing external datasets
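Reading either data source could look like the following sketch. It assumes the CSV header uses exactly the field names listed under Working Logic (TIN, Company Name, Address). The helper `load_records` is hypothetical, and the in-memory sample stands in for companies.csv with placeholder addresses.

```python
import csv
import io

REQUIRED_FIELDS = ("TIN", "Company Name", "Address")

def load_records(source):
    """Read supplier rows from an open text stream; fail early if a column is missing."""
    reader = csv.DictReader(source)
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)

# In-memory stand-in for companies.csv; the addresses are placeholders.
sample = io.StringIO(
    "TIN,Company Name,Address\n"
    "721523532,UnionBank,(placeholder)\n"
    "002-040-000-001,San Miguel Corporation,(placeholder)\n"
)
records = load_records(sample)
```

For a file on disk, the same helper would be called as `load_records(f)` inside `with open("companies.csv", newline="", encoding="utf-8") as f:`.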
Validation Progress
- Status: ✅ Experiment Completed
- Total Records: 201
- Processed Records: 201
- Found Duplicates: 20
Results Statistics
- Normalization Success: 95%
- Unique Records Identified: 181
- Duplicate Records Detected: 20
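As a quick consistency check on the figures above, the two partitions should add up to the total number of processed records. The sketch below uses only the reported counts; the variable names are illustrative.

```python
total_records = 201
unique_records = 181
duplicate_records = 20

# The two partitions should account for every processed record.
assert unique_records + duplicate_records == total_records

duplicate_rate = duplicate_records / total_records
print(f"Duplicate rate: {duplicate_rate:.2%}")
```

With the reported counts, the duplicate rate comes out to roughly 9.95% of the dataset.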
Detailed Results
Unique Records Example:
- UnionBank — 721523532 → 721523532
- Robinsons Retail — 402393699 → 402393699
Duplicate Records Example:
- San Miguel Corporation
  - TIN Variants: 002-040-000-001, 002040000001
  - Normalized: 002040000001
Conclusion
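TIN normalization reduced formatting inconsistencies in the dataset: of the 201 records processed, 20 were flagged as duplicates and 181 were kept as unique. Matching on the cleaned TIN reconciled supplier records whose TINs differed only in formatting, such as the hyphenated and unhyphenated variants of the same number.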