This dataset contains naturally-occurring English sentences that feature non-trivial noun-verb ambiguity.
English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite having achieved 97%+ accuracy on the WSJ Penn Treebank since 2002. These mistakes have been difficult to quantify, and they make taggers less useful for downstream tasks such as translation and text-to-speech synthesis.
The dataset contains sentences in CoNLL format. Each sentence has a single token that has been manually annotated as either VERB or NON-VERB. The sentences come from multiple domains. Where applicable, the URL of the source page for the sentence is included in a comment line before the sentence.
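
The following Python sketch shows one way such a file might be read. It is not part of the dataset, and it rests on several assumptions that should be checked against the actual files: columns are tab-separated, sentences are separated by blank lines, comment lines begin with "#", the word form sits in the second column, and the label appears in some column either on its own or after an "=" sign. The names read_sentences and find_annotated_token are hypothetical helpers introduced for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumed label strings, taken from the description above.
LABELS = ("VERB", "NON-VERB")

Sentence = Tuple[Optional[str], List[List[str]]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (comment, token_rows) pairs from CoNLL-style lines.

    Assumes blank lines separate sentences and '#' starts a comment line.
    """
    comment: Optional[str] = None
    tokens: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens:
                yield comment, tokens
                comment, tokens = None, []
        elif line.startswith("#"):
            # Keep the comment text (for example, a source URL).
            comment = line.lstrip("#").strip()
        else:
            # Assumes tab-separated columns.
            tokens.append(line.split("\t"))
    if tokens:
        yield comment, tokens


def find_annotated_token(
    tokens: List[List[str]], form_col: int = 1
) -> Optional[Tuple[str, str]]:
    """Return (word form, label) for the first token carrying a label.

    Assumes the label is a whole column value or follows an '=' sign,
    and that the word form is in column index form_col.
    """
    for cols in tokens:
        for value in cols:
            label = value.split("=")[-1]
            if label in LABELS:
                return cols[form_col], label
    return None
```

To use it, open a dataset file in text mode and pass the file object to read_sentences, then call find_annotated_token on each sentence's token rows. If the files place the word form or the label in different columns, adjust form_col or the matching rule accordingly.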