Datasets and Preprocessing for Discourse Data

Setup

You can easily install discopy-data by using pip:

pip install git+https://github.com/rknaebel/discopy-data

Alternatively, clone the repository and install discopy-data from the local checkout in editable mode with pip:

pip install -e path/to/discopy-data

Usage

Discopy-data is the discopy backend that handles data structures, preprocessing, and dataset extraction.

Sample preparation of a text file, including constituency parse trees

The first script uses trankit for tokenization, tagging, and dependency parsing. The second script then adds constituency trees with the supar parser. To have supar add dependency trees as well, pass the flag -d.

discopy-tokenize -i /some/examples/wsj_0336 | discopy-add-parses -c
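The same two-stage pipe can be driven from Python, for example to process several files in a loop. The following is a minimal sketch, not part of discopy-data: the helper names build_stages and run_stages are hypothetical, and only the commands and flags shown above (-i, -c, -d) are used. It assumes both scripts are installed and on PATH.

```python
import subprocess
from typing import List


def build_stages(input_path: str, add_dependencies: bool = False) -> List[List[str]]:
    """Return the argument lists for the two-stage pipeline shown above."""
    tokenize = ["discopy-tokenize", "-i", input_path]
    add_parses = ["discopy-add-parses", "-c"]
    if add_dependencies:
        # -d additionally requests dependency trees from supar
        add_parses.append("-d")
    return [tokenize, add_parses]


def run_stages(stages: List[List[str]]) -> bytes:
    """Run the stages connected by pipes, like `a | b` in a shell.

    Returns the raw stdout of the last stage and raises
    CalledProcessError if any stage exits with a non-zero status.
    """
    processes = []
    previous = None
    for argv in stages:
        stdin = previous.stdout if previous is not None else None
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if previous is not None:
            # Close our copy so the upstream stage notices if downstream exits.
            previous.stdout.close()
        processes.append(proc)
        previous = proc
    output, _ = processes[-1].communicate()
    for proc in processes:
        if proc.wait() != 0:
            raise subprocess.CalledProcessError(proc.returncode, proc.args)
    return output
```

Calling run_stages(build_stages("/some/examples/wsj_0336")) would then correspond to the shell command above. The output is returned as raw bytes because this sketch makes no assumption about the output format.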

Tokenize raw text without tagging or parsing

This can be useful for a neural pipeline that does not rely on linguistic features.

cat /some/text | discopy-tokenize --tokenize-only
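When the text is already held in memory, it can be sent to the tokenizer on standard input instead of going through cat. This is a minimal sketch under the assumption that discopy-tokenize is on PATH. The function tokenize_only and its command parameter are hypothetical names introduced for illustration, and the output is returned unparsed.

```python
import subprocess
from typing import Sequence

# Command copied from the shell example above.
TOKENIZE_ONLY = ("discopy-tokenize", "--tokenize-only")


def tokenize_only(text: str, command: Sequence[str] = TOKENIZE_ONLY) -> str:
    """Send `text` to the tokenizer on stdin and return its stdout unchanged."""
    result = subprocess.run(
        list(command),
        input=text,           # written to the child's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return result.stdout
```

The command parameter exists only so that the helper can be tried with a stand-in program before discopy-data is installed.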

Preparation of full datasets

This is still experimental. The supported datasets are listed in cli/extract.py.

discopy-extract pdtb /data/discourse/conll2016/ --use-gpu --limit 2 | discopy-add-annotations pdtb /data/discourse/conll2016/ --simple-connectives --sense-level 2 | discopy-update-parses --dependency-parser ''
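Because the command above is long, it can help to assemble it from parts and print it before running it. The sketch below only builds and renders the command line and does not execute anything. The helper names are hypothetical, the flags are copied verbatim from the command above, and whether values other than those shown are accepted for --limit and --sense-level is an assumption.

```python
import shlex
from typing import List


def build_extract_pipeline(data_dir: str, limit: int = 2, sense_level: int = 2) -> List[List[str]]:
    """Return one argument list per stage of the dataset pipeline shown above."""
    return [
        ["discopy-extract", "pdtb", data_dir, "--use-gpu", "--limit", str(limit)],
        ["discopy-add-annotations", "pdtb", data_dir,
         "--simple-connectives", "--sense-level", str(sense_level)],
        # The empty string is passed on purpose, as in the original command.
        ["discopy-update-parses", "--dependency-parser", ""],
    ]


def render_pipeline(stages: List[List[str]]) -> str:
    """Render the stages as a single shell pipeline with safe quoting."""
    return " | ".join(shlex.join(argv) for argv in stages)


print(render_pipeline(build_extract_pipeline("/data/discourse/conll2016/")))
```

Keeping each stage as an argument list preserves the empty --dependency-parser value, which is easy to lose when a command string is edited by hand.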

