Created and published the “Natural Unit Conversion Dataset” on Hugging Face, comprising 830K+ samples for Named-Entity Recognition (NER) in unit conversion tasks.
Designed a structured annotation schema of custom, spaCy-compatible entity labels (UNIT_VALUE, FROM_UNIT, TO_UNIT, etc.) to enable precise extraction of unit conversion data from natural language.
Implemented stratified sampling to keep entity distributions balanced across the training (583K), validation (100K), and test (150K) splits.
Curated a diverse dataset covering 100+ unit types across multiple domains, including length, mass, volume, temperature, time, and energy.
Optimized dataset loading and integration with the Hugging Face datasets library, facilitating seamless access for researchers and developers.
Released the dataset under the CC0-1.0 license, making it freely available for research and commercial applications.
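To illustrate the annotation schema, here is a minimal sketch of one record in the character-offset style that spaCy training data commonly uses. The example sentence and the `annotate` helper are hypothetical and not taken from the dataset; only the label names come from the schema described above.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def annotate(text: str, parts: List[Tuple[str, str]]) -> Tuple[str, dict]:
    """Build a spaCy-style (text, {"entities": [...]}) record.

    Each (substring, label) pair is located left to right, so repeated
    substrings are matched in order of appearance.
    """
    entities: List[Span] = []
    cursor = 0
    for substring, label in parts:
        start = text.index(substring, cursor)  # raises ValueError if absent
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return text, {"entities": entities}


# Hypothetical sentence; the labels are the ones named in the schema.
example = annotate(
    "Convert 5 kilometers to miles",
    [("5", "UNIT_VALUE"), ("kilometers", "FROM_UNIT"), ("miles", "TO_UNIT")],
)
```

Storing spans as character offsets rather than token indices keeps the records independent of any particular tokenizer.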
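The stratified split can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the stratification key (a hypothetical `domain` field), the `stratified_split` helper, and the fractions (approximated from the 583K / 100K / 150K split sizes) are all assumptions.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    records: Sequence[dict],
    key: Callable[[dict], Hashable],
    fractions: Tuple[float, float, float] = (0.70, 0.12, 0.18),  # assumed ratios
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Split records so each stratum contributes the same share to every split."""
    rng = random.Random(seed)

    # Group records by their stratum (for example, unit domain).
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    splits: Dict[str, List[dict]] = {"train": [], "validation": [], "test": []}
    for stratum in sorted(groups, key=repr):  # fixed order for reproducibility
        items = groups[stratum]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_val = round(len(items) * fractions[1])
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]  # remainder goes to test
    return splits


# Toy data: 100 records in each of two hypothetical strata.
toy = [{"id": i, "domain": d} for d in ("length", "mass") for i in range(100)]
parts = stratified_split(toy, key=lambda r: r["domain"])
```

Splitting within each stratum, rather than over the shuffled whole, is what keeps rare categories represented in the validation and test sets.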
Natural Unit Conversion Dataset
