Note that LxGrTgr is currently being beta tested and should not be used in research. Once the beta testing concludes, this message will change.
LxGrTgr was developed using spaCy (version 3.5; en_core_web_trf model). Users will need to follow the instructions on the spaCy website to install spaCy and the en_core_web_trf model for their specific system. Note that if you are new to Python, we suggest starting with the Anaconda distribution of Python and using Visual Studio Code as your code editor.
Before installing LxGrTgr, it is helpful to verify that spaCy is functioning correctly. You can do so by running the following code in Python:
import spacy
nlp = spacy.load("en_core_web_trf") #load model
doc = nlp("This is a sample sentence.") #process a sentence
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
Running this code should result in the following output:
This this PRON nsubj
is be AUX ROOT
a a DET det
sample sample NOUN compound
sentence sentence NOUN attr
. . PUNCT punct
Once you have successfully installed spaCy and downloaded the en_core_web_trf model, you can use LxGrTgr. To install LxGrTgr, use pip:
pip install lxgrtgr
In addition to using the code below, a demo web app (which uses a faster but slightly less accurate NLP backend) is also available.
First, import LxGrTgr:
import lxgrtgr as lxgr
Then, strings can be tagged and printed:
sample1 = lxgr.tag("This is a very important opportunity that only comes once in a lifetime.")
lxgr.printer(sample1) #by default, the format is: token_id,text,lemma,complexity_tag
#sentid = 0
0 This this None
1 is be None
2 a a None
3 very very rb+jjrbmod
4 important important attr+npremod
5 opportunity opportunity None
6 that that None
7 only only rb+advl
8 comes come finitecls+rel
9 once once rb+advl
10 in in None
11 a a None
12 lifetime lifetime None
13 . . None
These commands can also be combined for efficiency's sake:
lxgr.printer(lxgr.tag("This is a very important opportunity that only comes once in a lifetime."))
Output can also be written to a file:
lxgr.writer("sample_results/sample1.tsv",sample1)
sample2 = lxgr.tag("I like pizza. I also enjoy eating it because it gives me a reason to drink a tasty beverage.")
lxgr.writer("sample_results/sample2.tsv",sample2)
Corpora come in all shapes and sizes. By default, LxGrTgr assumes that each corpus file is a UTF-8 text file and that all corpus files are in the same folder/directory.
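For example, a corpus folder meeting these expectations can be prepared with plain Python. This is just an illustrative sketch; the folder name and filenames are hypothetical, and the sentences reuse the sample texts from above:

```python
from pathlib import Path

# hypothetical corpus folder name
corpus_dir = Path("folderWithCorpusFiles")
corpus_dir.mkdir(exist_ok=True)

# one UTF-8 .txt file per document (filenames are illustrative)
texts = {
    "essay_001.txt": "This is a very important opportunity that only comes once in a lifetime.",
    "essay_002.txt": "I like pizza. I also enjoy eating it because it gives me a reason to drink a tasty beverage.",
}
for name, text in texts.items():
    (corpus_dir / name).write_text(text, encoding="utf-8")
```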
To tag a corpus with LxGrTgr, simply use the tagFolder() function.
tagFolder(targetDir,outputDir,suff = ".txt")
targetDir is the folder/directory where your corpus files are located. outputDir is the folder where the tagged versions of your corpus files will be written. An additional optional argument (suff) can also be used. By default, suff = ".txt". If your corpus filenames end in something other than ".txt", be sure to include the suff argument with the correct filename ending.
lxgr.tagFolder("folderWithCorpusFiles/","folderWhereTaggedVersionsWillBeWritten/")
Next, tagging should be checked and edited as appropriate.
After checking and editing the tags in your corpus, use the countTagsFolder() function to get tag counts for each document in your corpus.
countTagsFolder(targetDir,tagList = None,suff = ".txt")
By default, complexity tags are counted. The countTagsFolder() function returns a dictionary with filenames as keys and feature counts as values.
sampleCountDictionary = lxgr.countTagsFolder("folderWhereTaggedVersionsWereWritten/")
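The exact contents of the dictionary depend on your corpus. Assuming the values are {tag: count} mappings (an assumption for illustration, with made-up filenames and counts), the result can be inspected with ordinary dictionary operations:

```python
# hypothetical stand-in for a countTagsFolder() result; the {tag: count}
# value structure is an assumption, not LxGrTgr's documented format
exampleCounts = {
    "essay_001.txt": {"rb+advl": 2, "finitecls+rel": 1},
    "essay_002.txt": {"rb+advl": 1},
}

# total tagged features per document
for filename, counts in exampleCounts.items():
    print(filename, sum(counts.values()))
```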
The writeCounts() function can be used to write the results to a file. By default, counts are normed as the incidence per 10,000 words, though this can be changed using the norming argument. Raw counts can be obtained by including normed = False.
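Norming here is standard incidence scaling: the raw count is divided by the document's word count and multiplied by the norming value. A quick sketch of the arithmetic (the helper function and numbers are illustrative, not part of LxGrTgr's API):

```python
def norm_count(raw_count, total_words, norming=10000):
    """Incidence per `norming` words (per 10,000 by default)."""
    return raw_count / total_words * norming

# e.g., 12 occurrences of a feature in a 3,000-word document
print(norm_count(12, 3000))  # 40.0 occurrences per 10,000 words
```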
writeCounts(outputD, outName, tagList = None, sep = "\t", normed = True, norming = 10000)
If the default options are desired, the writeCounts() function only needs two arguments: a dictionary of filenames and index counts, and a filename for the spreadsheet file:
lxgr.writeCounts(sampleCountDictionary,"sampleOutputFile.txt")
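Because the output is tab-separated by default, it can be loaded back into Python with the standard csv module. The snippet below parses a stand-in string rather than a real output file, and the column layout (a header row with tag names, then one row per corpus file) is an assumption for illustration:

```python
import csv
import io

# stand-in for the contents of a writeCounts() output file (assumed layout)
tsv_data = "filename\trb+advl\tfinitecls+rel\nessay_001.txt\t40.0\t20.0\n"

rows = list(csv.DictReader(io.StringIO(tsv_data), delimiter="\t"))
for row in rows:
    print(row["filename"], row["rb+advl"])
```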
More functions for random sampling and tag-fixing will be added in future releases.
We are currently developing tag descriptions and detailed annotation guidelines for complexity features. Click here to access the document (updated/revised weekly).
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.