A Large Self-Annotated Corpus for Sarcasm

Khodak, Mikhail; Saunshi, Nikunj; Vodrahalli, Kiran

(Submitted on 19 Apr 2017 (v1), last revised 21 Apr 2017 (this version, v2))

Abstract: We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in regimes of both balanced and unbalanced labels. Each statement is furthermore self-annotated -- sarcasm is labeled by the author and not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, compare it to previous related corpora, and provide baselines for the task of sarcasm detection.

Comments:	5 pages, 5 figures. In submission
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Learning (cs.LG)
Cite as:	arXiv:1704.05579 [cs.CL]
	(or arXiv:1704.05579v2 [cs.CL] for this version)

Submission history

From: Kiran Vodrahalli
[v1] Wed, 19 Apr 2017 02:01:39 GMT (226kb,D)
[v2] Fri, 21 Apr 2017 01:25:08 GMT (226kb,D)
