Commit a0d4f80

Dev (#34)
### New features

+ Add support of PTB POS tagset for `nsca-lca`
+ Add `--text` option for `nsca-lca`

### Improvements

+ Add formulae in README for LCA measures

### Bug fixes

+ Fix not correctly using Chinese JDK mirror
1 parent 2b3b1bb commit a0d4f80

35 files changed: +1505 -164 lines changed

CHANGELOG.md

Lines changed: 15 additions & 0 deletions
@@ -1,5 +1,20 @@
 <div align="center"><h1>Changelog</h1></div>
 
+## [0.0.52](https://github.com/tanloong/neosca/releases/tag/0.0.52) (1 September 2023)
+
+### New features
+
++ Add support of PTB POS tagset for `nsca-lca`
++ Add `--text` option for `nsca-lca`
+
+### Improvements
+
++ Add formulae in README for LCA measures
+
+### Bug fixes
+
++ Fix not correctly using Chinese JDK mirror
+
 ## [0.0.51](https://github.com/tanloong/neosca/releases/tag/0.0.51) (23 August 2023)
 
 ### Bug fixes

README.md

Lines changed: 55 additions & 8 deletions
@@ -13,8 +13,8 @@
 
 <!-- ![](img/testing-on-Windows.gif) -->
 
-[简体中文](https://github.com/tanloong/neosca/blob/master/README_zh_cn.md) |
-[繁體中文](https://github.com/tanloong/neosca/blob/master/README_zh_tw.md) |
+[简体中文](https://github.com/tanloong/neosca/blob/master/README_zh_cn.md)|
+[繁體中文](https://github.com/tanloong/neosca/blob/master/README_zh_tw.md)|
 English
 
 NeoSCA is a fork of [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)'s [L2 Syntactic Complexity Analyzer](http://personal.psu.edu/xxl13/downloads/l2sca.html) (L2SCA), with added support for Windows and an improved command-line interface for easier usage. NeoSCA is written by Tan, Long (谭龙)。It accepts written English texts and computes the following measures:
@@ -252,7 +252,7 @@ There are two approaches to define a structure: using `tregex_pattern` or `value
 + [Powerpoint tutorial](https://nlp.stanford.edu/software/tregex/The_Wonderful_World_of_Tregex.ppt) for Tregex by Galen Andrew
 + [The TregexPattern javadoc page](http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html)
 
-`value_source` specifies an arithmetic operation on values of other structures to calculate the value of the a structure. `value_source` can include names of other structures, integers, decimals, `+`, `-`, `*`, `/`, `(` and `)`. `value_source` are tokenized using Python's standard library `tokenize`, which is specifically designed for Python source code. The name of a structure that is refered to in a `value_source` should adhere to the naming convention of Python variables (composed of *letters*, *numbers*, and *underscores*, cannot start with a *number*; *letters* refer to those characters defined in the Unicode character database as "Letter", such as English letters and Chinese characters), or otherwise the name will not be correctly recognized.
+`value_source` specifies an arithmetic operation on values of other structures to calculate the value of the structure being defined. `value_source` can include names of other structures, integers, decimals, `+`, `-`, `*`, `/`, `(` and `)`. `value_source` are tokenized using Python's standard library `tokenize`, which is specifically designed for Python source code. The name of a structure that is refered to in a `value_source` should adhere to the naming convention of Python variables (composed of *letters*, *numbers*, and *underscores*, cannot start with a *number*; *letters* refer to those characters defined in the Unicode character database as "Letter", such as English letters and Chinese characters), or otherwise the name will not be correctly recognized.
 
 The `value_source` definition can be nested, which means that dependant structures in turn can also be defined through `value_source` and rely on others, forming a tree-like relationship. But the terminal structures must be defined by `tregex_pattern` to avoid circular definition.
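Reviewer note: the naming rule in the `value_source` paragraph of this hunk follows from how Python's standard `tokenize` module splits text. The sketch below is not NeoSCA's implementation; the helper `structure_names` and the sample expressions are hypothetical and only illustrate why a name that is not a valid Python identifier would not be recognized as one structure.

```python
import io
import tokenize


def structure_names(value_source: str) -> list[str]:
    """Return the NAME tokens that Python's tokenizer finds in an expression.

    Hypothetical helper for illustration; NeoSCA's own parsing may differ.
    """
    readline = io.StringIO(value_source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.NAME
    ]


# Names that follow Python's identifier rules come back as single tokens.
print(structure_names("(VP1 + VP2) / T"))

# A hyphenated name is read as two names joined by a minus operator.
print(structure_names("C-NOM / 2"))
```

Under these assumptions, a structure named with a hyphen would be seen as a subtraction of two other names, which matches the warning in the README text that such names "will not be correctly recognized".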

@@ -395,7 +395,54 @@ NeoSCA has a Tregex command line interface `nsca-tregex`, which behaves similar
 
 #### Lexical complexity analysis
 
-NeoSCA provides an `nsca-lca` command to do the lexical complexity analysis mirroring the functionality of [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/).
+NeoSCA provides an `nsca-lca` command to do the lexical complexity analysis mirroring the functionality of [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/). Below are the available measures:
+
+<!-- {{{ LCA measures -->
+<details>
+<summary>
+Measures of Lexical Density and Sophistication
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Lexical Density|![Formula](/img/ld.svg "the ratio of the number of lexical words to the number of words")|
+|Lexical Sophistication-I|![Formula](/img/ls1.svg "the ratio of the number of sophisticated lexical words to the total number of lexical words")|
+|Lexical Sophistication-II|![Formula](/img/ls2.svg "the ratio of the number of sophisticated word types to the total number of word types")|
+|Verb Sophistication-I|![Formula](/img/vs1.svg "the ratio of the number of sophisticated verb types to the total number of verbs")|
+|Corrected Verb Sophistication-I|![Formula](/img/cvs1.svg "the ratio of the number of sophisticated verb types to the square root of two times the number of verbs")|
+|Verb Sophistication-II|![Formula](/img/vs2.svg "the ratio of the number of sophisticated verb types squared to the number of verbs")|
+
+</details>
+
+<details>
+<summary>
+Measures of Lexical Variation
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Number of Different Words|![Formula](/img/ndw.svg "the number of word types")|
+|Number of Different Words (first 50 words)|![Formula](/img/ndw-50.svg "no hover text for this formula")|
+|Number of Different Words (expected random 50)|![Formula](/img/ndw-er50.svg "no hover text for this formula")|
+|Number of Different Words (expected sequence 50)|![Formula](/img/ndw-es50.svg "no hover text for this formula")|
+|Type-Token Ratio|![Formula](/img/ttr.svg "the ratio of the number of word types to the number of words")|
+|Mean Segmental Type-Token Ratio (50)|![Formula](/img/msttr-50.svg "divide a sample into successive 50-word segments, discard the remaining text with fewer words than 50, and then calculate the average TTR of all segments")|
+|Corrected Type-Token Ratio|![Formula](/img/cttr.svg "the ratio of the number of word types to the square root of two times the total number of words")|
+|Root Type-Token Ratio|![Formula](/img/rttr.svg "the ratio of the number of word types to the square root of the number of words")|
+|Bilogarithmic Type-Token Ratio|![Formula](/img/logttr.svg "no hover text for this formula")|
+|Uber Index|![Formula](/img/uber.svg "no hover text for this formula")|
+|Lexical Word Variation|![Formula](/img/lv.svg "the ratio of the number of lexical word types to the total number of lexical words")|
+|Verb Variation-I|![Formula](/img/vv1.svg "the ratio of the number of verb types to the total number of verbs")|
+|Squared Verb Variation-I|![Formula](/img/svv1.svg "the ratio of the number of verb types squared to the number of verbs")|
+|Corrected Verb Variation-I|![Formula](/img/cvv1.svg "the ratio of the number of verb types to the square root of two times the total number of verbs")|
+|Verb Variation-II|![Formula](/img/vv2.svg "the ratio of the number of verb types to the number of lexical words")|
+|Noun Variation|![Formula](/img/nv.svg "the ratio of the number of noun types to the number of lexical words")|
+|Adjective Variation|![Formula](/img/adjv.svg "the ratio of the number of adjective types to the number of lexical words")|
+|Adverb Variation|![Formula](/img/advv.svg "the ratio of the number of adverb types to the number of lexical words")|
+|Modifier Variation|![Formula](/img/modv.svg "the ratio of the number of modifier (both adjective and adverb) types to the number of lexical words")|
+
+</details>
+<!-- }}} -->
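Reviewer note: the hover texts added in this hunk define several type-token measures in words. A minimal Python sketch of four of them (TTR, CTTR, RTTR, and MSTTR-50) is given below for readers who want to check the definitions by hand. It assumes the input is an already tokenized list of words; the function names are hypothetical, and the way `nsca-lca` itself tokenizes, lemmatizes, or filters words is not shown in this diff.

```python
import math


def ttr(words):
    """Type-Token Ratio: word types divided by words."""
    return len(set(words)) / len(words)


def cttr(words):
    """Corrected TTR: word types divided by the square root of two times the words."""
    return len(set(words)) / math.sqrt(2 * len(words))


def rttr(words):
    """Root TTR: word types divided by the square root of the words."""
    return len(set(words)) / math.sqrt(len(words))


def msttr(words, segment_size=50):
    """Mean Segmental TTR: average TTR over successive full segments.

    The trailing segment with fewer than `segment_size` words is discarded.
    Returns None when the sample is shorter than one segment.
    """
    starts = range(0, len(words) - segment_size + 1, segment_size)
    segments = [words[i:i + segment_size] for i in starts]
    if not segments:
        return None
    return sum(ttr(segment) for segment in segments) / len(segments)


sample = ["a", "b", "a", "c"]
print(ttr(sample), cttr(sample), rttr(sample))
print(msttr(["a", "a", "b", "b", "c"], segment_size=2))
```

The segment size is a parameter here only so that short samples can be tried; the README row fixes it at 50.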
 
 ```sh
 nsca-lca sample.txt # single input file
@@ -414,7 +461,7 @@ BibTeX
 
 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.51},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.52},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -429,7 +476,7 @@ year = {2022}
 APA (7th edition)
 </summary>
 
-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.51) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.52) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
 
 </details>
 
@@ -439,7 +486,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>
 
-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.51, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.52, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
 
 </details>
 
@@ -487,7 +534,7 @@ MLA (9th edition)
 
 </details>
 
-If you used the lexical complexity analyzing feature, please cite Xiaofei's article about LCA.
+If you use the lexical complexity analyzing feature, please cite Xiaofei's article about LCA.
 
 <details>

README_zh_cn.md

Lines changed: 51 additions & 4 deletions
@@ -385,7 +385,54 @@ CN/C: complex nominals per clause
 
 #### Lexical complexity analysis
 
-The `nsca-lca` command analyzes the lexical complexity of input texts, with functionality similar to [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/).
+The `nsca-lca` command analyzes the lexical complexity of input texts, with the same functionality as [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/). The measures are as follows:
+
+<!-- {{{ LCA measures -->
+<details>
+<summary>
+Measures of Lexical Density and Sophistication
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Lexical Density|![Formula](/img/ld.svg "the ratio of the number of lexical words to the number of words")|
+|Lexical Sophistication-I|![Formula](/img/ls1.svg "the ratio of the number of sophisticated lexical words to the total number of lexical words")|
+|Lexical Sophistication-II|![Formula](/img/ls2.svg "the ratio of the number of sophisticated word types to the total number of word types")|
+|Verb Sophistication-I|![Formula](/img/vs1.svg "the ratio of the number of sophisticated verb types to the total number of verbs")|
+|Corrected Verb Sophistication-I|![Formula](/img/cvs1.svg "the ratio of the number of sophisticated verb types to the square root of two times the number of verbs")|
+|Verb Sophistication-II|![Formula](/img/vs2.svg "the ratio of the number of sophisticated verb types squared to the number of verbs")|
+
+</details>
+
+<details>
+<summary>
+Measures of Lexical Variation
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Number of Different Words|![Formula](/img/ndw.svg "the number of word types")|
+|Number of Different Words (first 50 words)|![Formula](/img/ndw-50.svg "no hover text for this formula")|
+|Number of Different Words (expected random 50)|![Formula](/img/ndw-er50.svg "no hover text for this formula")|
+|Number of Different Words (expected sequence 50)|![Formula](/img/ndw-es50.svg "no hover text for this formula")|
+|Type-Token Ratio|![Formula](/img/ttr.svg "the ratio of the number of word types to the number of words")|
+|Mean Segmental Type-Token Ratio (50)|![Formula](/img/msttr-50.svg "divide a sample into successive 50-word segments, discard the remaining text with fewer words than 50, and then calculate the average TTR of all segments")|
+|Corrected Type-Token Ratio|![Formula](/img/cttr.svg "the ratio of the number of word types to the square root of two times the total number of words")|
+|Root Type-Token Ratio|![Formula](/img/rttr.svg "the ratio of the number of word types to the square root of the number of words")|
+|Bilogarithmic Type-Token Ratio|![Formula](/img/logttr.svg "no hover text for this formula")|
+|Uber Index|![Formula](/img/uber.svg "no hover text for this formula")|
+|Lexical Word Variation|![Formula](/img/lv.svg "the ratio of the number of lexical word types to the total number of lexical words")|
+|Verb Variation-I|![Formula](/img/vv1.svg "the ratio of the number of verb types to the total number of verbs")|
+|Squared Verb Variation-I|![Formula](/img/svv1.svg "the ratio of the number of verb types squared to the number of verbs")|
+|Corrected Verb Variation-I|![Formula](/img/cvv1.svg "the ratio of the number of verb types to the square root of two times the total number of verbs")|
+|Verb Variation-II|![Formula](/img/vv2.svg "the ratio of the number of verb types to the number of lexical words")|
+|Noun Variation|![Formula](/img/nv.svg "the ratio of the number of noun types to the number of lexical words")|
+|Adjective Variation|![Formula](/img/adjv.svg "the ratio of the number of adjective types to the number of lexical words")|
+|Adverb Variation|![Formula](/img/advv.svg "the ratio of the number of adverb types to the number of lexical words")|
+|Modifier Variation|![Formula](/img/modv.svg "the ratio of the number of modifier (both adjective and adverb) types to the number of lexical words")|
+
+</details>
+<!-- }}} -->
 
 ```sh
 nsca-lca sample.txt # single-file analysis
@@ -404,7 +451,7 @@ BibTeX
 
 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.51},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.52},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -419,7 +466,7 @@ year = {2022}
 APA (7th edition)
 </summary>
 
-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.51) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.52) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
 
 </details>
 
@@ -429,7 +476,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>
 
-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.51, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.52, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
 
 </details>

README_zh_tw.md

Lines changed: 51 additions & 4 deletions
@@ -385,7 +385,54 @@ CN/C: complex nominals per clause
 
 #### Lexical complexity analysis
 
-The `nsca-lca` command analyzes the lexical complexity of input texts, with functionality similar to [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/).
+The `nsca-lca` command analyzes the lexical complexity of input texts, with the same functionality as [LCA (Lexical Complexity Analyzer)](https://sites.psu.edu/xxl13/lca/). The measures are as follows:
+
+<!-- {{{ LCA measures -->
+<details>
+<summary>
+Measures of Lexical Density and Sophistication
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Lexical Density|![Formula](/img/ld.svg "the ratio of the number of lexical words to the number of words")|
+|Lexical Sophistication-I|![Formula](/img/ls1.svg "the ratio of the number of sophisticated lexical words to the total number of lexical words")|
+|Lexical Sophistication-II|![Formula](/img/ls2.svg "the ratio of the number of sophisticated word types to the total number of word types")|
+|Verb Sophistication-I|![Formula](/img/vs1.svg "the ratio of the number of sophisticated verb types to the total number of verbs")|
+|Corrected Verb Sophistication-I|![Formula](/img/cvs1.svg "the ratio of the number of sophisticated verb types to the square root of two times the number of verbs")|
+|Verb Sophistication-II|![Formula](/img/vs2.svg "the ratio of the number of sophisticated verb types squared to the number of verbs")|
+
+</details>
+
+<details>
+<summary>
+Measures of Lexical Variation
+</summary>
+
+|Measure|Formula|
+|-|-|
+|Number of Different Words|![Formula](/img/ndw.svg "the number of word types")|
+|Number of Different Words (first 50 words)|![Formula](/img/ndw-50.svg "no hover text for this formula")|
+|Number of Different Words (expected random 50)|![Formula](/img/ndw-er50.svg "no hover text for this formula")|
+|Number of Different Words (expected sequence 50)|![Formula](/img/ndw-es50.svg "no hover text for this formula")|
+|Type-Token Ratio|![Formula](/img/ttr.svg "the ratio of the number of word types to the number of words")|
+|Mean Segmental Type-Token Ratio (50)|![Formula](/img/msttr-50.svg "divide a sample into successive 50-word segments, discard the remaining text with fewer words than 50, and then calculate the average TTR of all segments")|
+|Corrected Type-Token Ratio|![Formula](/img/cttr.svg "the ratio of the number of word types to the square root of two times the total number of words")|
+|Root Type-Token Ratio|![Formula](/img/rttr.svg "the ratio of the number of word types to the square root of the number of words")|
+|Bilogarithmic Type-Token Ratio|![Formula](/img/logttr.svg "no hover text for this formula")|
+|Uber Index|![Formula](/img/uber.svg "no hover text for this formula")|
+|Lexical Word Variation|![Formula](/img/lv.svg "the ratio of the number of lexical word types to the total number of lexical words")|
+|Verb Variation-I|![Formula](/img/vv1.svg "the ratio of the number of verb types to the total number of verbs")|
+|Squared Verb Variation-I|![Formula](/img/svv1.svg "the ratio of the number of verb types squared to the number of verbs")|
+|Corrected Verb Variation-I|![Formula](/img/cvv1.svg "the ratio of the number of verb types to the square root of two times the total number of verbs")|
+|Verb Variation-II|![Formula](/img/vv2.svg "the ratio of the number of verb types to the number of lexical words")|
+|Noun Variation|![Formula](/img/nv.svg "the ratio of the number of noun types to the number of lexical words")|
+|Adjective Variation|![Formula](/img/adjv.svg "the ratio of the number of adjective types to the number of lexical words")|
+|Adverb Variation|![Formula](/img/advv.svg "the ratio of the number of adverb types to the number of lexical words")|
+|Modifier Variation|![Formula](/img/modv.svg "the ratio of the number of modifier (both adjective and adverb) types to the number of lexical words")|
+
+</details>
+<!-- }}} -->
 
 ```sh
 nsca-lca sample.txt # single-file analysis
@@ -404,7 +451,7 @@ BibTeX
 
 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.51},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.52},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -419,7 +466,7 @@ year = {2022}
 APA (7th edition)
 </summary>
 
-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.51) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.52) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
 
 </details>
 
@@ -429,7 +476,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>
 
-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.51, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.52, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
 
 </details>
