Introduction
Semantic Textual Similarity (STS) is a core NLP task, but judgments often depend on which aspect of the sentences is being compared. To address this, the Conditional STS (C-STS) task was proposed, where sentence pairs are evaluated under explicit conditions (e.g., number of people, color of objects).
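To make the task concrete, the sketch below shows what a single C-STS instance might look like; the field names and scores are illustrative, not the dataset's exact schema.

```python
# A single C-STS instance: the same sentence pair can receive very
# different similarity scores depending on the condition.
# Field names and values are illustrative, not the dataset's exact schema.
example = {
    "sentence1": "Two men are playing guitars on a stage.",
    "sentence2": "A man plays an acoustic guitar in a park.",
    "condition": "The number of people",  # aspect under which to compare
    "label": 2,                           # low similarity: two people vs. one
}

# Under a different condition, the same pair scores much higher.
example_alt = {
    **example,
    "condition": "The type of instrument",
    "label": 4,  # high similarity: both sentences involve guitars
}
```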
However, the original C-STS dataset contained ambiguous and inconsistent condition statements, as well as noisy similarity ratings, both of which limited model performance.
Our Work
In this paper (EMNLP 2025), we present a large-scale re-annotated C-STS dataset created with the assistance of Large Language Models (LLMs).
- We first refine the condition statements to resolve ambiguity and fix grammatical issues.
- Then we use GPT-4o and Claude-3.7-Sonnet to re-annotate similarity scores, combining their judgments with the original human labels (a rough sketch of this step follows the list).
- The resulting dataset is more accurate, balanced, and reliable for training C-STS models.
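As a rough illustration of the re-annotation step, the sketch below queries one LLM for a conditional similarity rating and averages the judgments with the original human label. The prompt wording, the OpenAI-style client, and the plain-average combination rule are assumptions for illustration, not the paper's exact procedure; Claude-3.7-Sonnet would be queried analogously through Anthropic's API.

```python
# Minimal sketch of LLM re-annotation (assumptions: an OpenAI-style chat API,
# a 1-5 rating prompt, and plain averaging; not the paper's exact recipe).
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Rate the similarity of the two sentences from 1 (dissimilar) to 5 "
    "(equivalent), considering ONLY the given condition.\n"
    "Sentence 1: {s1}\nSentence 2: {s2}\nCondition: {cond}\n"
    "Answer with a single integer."
)

def llm_score(s1: str, s2: str, cond: str, model: str = "gpt-4o") -> int:
    """Ask one LLM for a conditional similarity rating on a 1-5 scale."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(s1=s1, s2=s2, cond=cond)}],
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 3  # fall back to the midpoint

def combine(llm_scores: list[int], human_label: float) -> float:
    """Combine LLM judgments with the original human label (simple average)."""
    return (sum(llm_scores) + human_label) / (len(llm_scores) + 1)
```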
Key Results
- Training on our re-annotated dataset yields a 5.4% improvement in Spearman correlation over training on the original annotations.
- Models trained on our dataset reach 73.9% correlation with human-labeled test data (computed as in the evaluation sketch after this list).
- This resource substantially improves robustness and consistency in conditional similarity measurement.
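For reference, scores like those above are Spearman rank correlations between model predictions and gold human labels; the minimal sketch below shows the standard computation with placeholder numbers, not results from the paper.

```python
# Standard Spearman evaluation: rank-correlate model predictions against
# gold human labels. The numbers here are placeholders, not paper results.
from scipy.stats import spearmanr

gold = [1, 4, 2, 5, 3]            # human similarity labels on the test set
pred = [1.2, 3.8, 2.9, 4.9, 2.7]  # model-predicted similarity scores

rho, _ = spearmanr(gold, pred)
print(f"Spearman correlation: {100 * rho:.1f}%")  # reported as a percentage
```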
Download
The re-annotated dataset is released to support further research in conditional semantic similarity.