ForePLay: The Polish Dataset of Erotic Content

We present ForePLay, a large-scale annotated corpus of Polish-language content consisting of 24,583 sentences. The sentences were systematically sampled from two primary sources:

User-generated content from online fiction repositories,
Polish literary works, including translations of world literature and LGBTQ+ literature.

Annotation Process

The annotation process involved a gender-balanced team of 6 annotators. Each sentence was labeled by three annotators, and the final label was chosen by majority vote. When all three annotators disagreed, a superannotation process resolved the discrepancy.
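The label-resolution rule described above can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the function name `resolve_label` and the `superannotate` callback are hypothetical, and the three label names are taken from the dataset composition listed in this README.

```python
from collections import Counter

# Label names as they appear in the dataset composition.
LABELS = {"erotic", "ambiguous", "neutral"}


def resolve_label(votes, superannotate=None):
    """Return the majority label among three votes.

    If all three annotators disagree, defer to `superannotate`, a
    hypothetical callback standing in for the superannotation step.
    Returns None when no callback is supplied and there is no majority.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three votes per sentence")
    unknown = set(votes) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        # At least two annotators agree, so the majority decides.
        return label
    if superannotate is None:
        return None
    # Complete disagreement: hand the sentence to the superannotator.
    return superannotate(votes)
```

With three annotators and three labels, a sentence either has a label chosen by at least two annotators or receives three different labels, so the fallback is triggered only in the complete-disagreement case.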

ForePLay Dataset Composition

6,361 sentences labeled as erotic
1,344 sentences labeled as ambiguous
16,878 sentences labeled as neutral

For detailed annotation guidelines, refer to the accompanying publication:

Behind Closed Words: Creating and Investigating the ForePLay Annotated Dataset for Polish Erotic Discourse

Please note that the released dataset does not include the minor classes outlined in the original data framework that pertain to sexual violence and socially unacceptable behaviors. Due to ethical considerations, we have chosen not to publish potentially harmful data.

Release 1.0

A subset of erotic and ambiguous sentences, totaling 3,704 samples, has been released.

ForePLay Dataset Release 1.0 Composition

2,728 sentences labeled as erotic
976 sentences labeled as ambiguous
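As a quick consistency check on the figures above, the following sketch recomputes the totals and derives the class shares and the fraction of each class included in Release 1.0. The counts are copied from this README; the variable names are illustrative only.

```python
# Sentence counts for the full annotated corpus, as listed above.
full = {"erotic": 6361, "ambiguous": 1344, "neutral": 16878}
# Sentence counts for Release 1.0 (no neutral sentences were released).
release = {"erotic": 2728, "ambiguous": 976}

full_total = sum(full.values())
release_total = sum(release.values())

# Share of each class in the full corpus, in percent.
class_share = {label: 100 * n / full_total for label, n in full.items()}

# Fraction of each annotated class that is part of Release 1.0, in percent.
released_share = {label: 100 * n / full[label] for label, n in release.items()}

if __name__ == "__main__":
    print(f"full corpus: {full_total} sentences")
    print(f"release 1.0: {release_total} sentences")
    for label, pct in class_share.items():
        print(f"{label}: {pct:.1f}% of the full corpus")
    for label, pct in released_share.items():
        print(f"{label}: {pct:.1f}% of the class released")
```

The totals reproduce the 24,583 and 3,704 sentences stated in the text.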

This subset has undergone additional copyright verification and was made available following legal consultations. The released data and its license comply with new legal regulations that came into effect after the data collection and annotation process had been completed. The published dataset will be gradually expanded.

The data are stored in a password-protected .zip file. The file, along with its password, can be found here.
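One possible way to read the archive from Python is sketched below. It makes several assumptions: the helper name `read_archive` is hypothetical, the names and formats of the files inside the archive are not documented here (so the sketch returns raw bytes), and the encryption scheme is unknown. The standard-library `zipfile` module can only decrypt legacy ZIP encryption; if the archive uses AES, a third-party tool is needed instead.

```python
import zipfile


def read_archive(zip_path, password):
    """Return a dict mapping each member name to its raw bytes.

    Assumes the archive uses legacy ZIP encryption, the only scheme
    the standard-library zipfile module can decrypt.
    """
    pwd = password.encode("utf-8")  # zipfile expects the password as bytes
    with zipfile.ZipFile(zip_path) as archive:
        return {
            info.filename: archive.read(info, pwd=pwd)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        }
```

A call such as `read_archive("data/archive.zip", "the-password")` (both arguments are placeholders) would return the decrypted contents, which can then be decoded according to whatever file format the archive actually contains.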

⚠️ Content Warning:
This repository contains a dataset that includes erotic and potentially sensitive textual content intended strictly for research purposes. It is recommended for use by individuals aged 18 and older.
The content has been annotated and curated in accordance with legal and ethical guidelines, and it is intended to support research on content moderation, natural language processing, and harmful content detection.

Disclaimer

We have made every effort to comply with license requirements and respect the rights associated with all source materials included in the dataset. If you have any concerns or questions regarding the content, please feel free to contact the Department of Linguistic Engineering and Text Analysis. We will review and address them promptly.

Citation

If you make use of this dataset, please cite the following paper:

Kołos, A., Lorenc, K., Wiśnios, E., Karlińska, A. Behind Closed Words: Creating and Investigating the ForePLay Annotated Dataset for Polish Erotic Discourse. 2024. arXiv:2412.17533.

License

The dataset and code in the repository are made available under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
