Commit bd2a5bb

Dev (#31)

### Bug fixes

+ Fix `--reserve-matched` not working since 0.0.48
1 parent 9fa7e71 commit bd2a5bb

17 files changed, with 256 additions and 114 deletions

CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,12 @@
 <div align="center"><h1>Changelog</h1></div>

-## [0.0.49](https://github.com/tanloong/neosca/releases/tag/0.0.48) (19 August 2023)
+## [0.0.50](https://github.com/tanloong/neosca/releases/tag/0.0.50) (23 August 2023)
+
+### Bug fixes
+
++ Fix `--reserve-matched` not working since 0.0.48
+
+## [0.0.49](https://github.com/tanloong/neosca/releases/tag/0.0.49) (19 August 2023)

 ### Bug fixes

README.md

Lines changed: 8 additions & 8 deletions
@@ -17,7 +17,7 @@
 [繁體中文](https://github.com/tanloong/neosca/blob/master/README_zh_tw.md) |
 English

-NeoSCA is a fork of [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)'s [L2 Syntactic Complexity Analyzer](http://personal.psu.edu/xxl13/downloads/l2sca.html) (L2SCA), with added support for Windows and an improved command-line interface for easier usage. NeoSCA accepts written English texts and computes the following measures:
+NeoSCA is a fork of [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)'s [L2 Syntactic Complexity Analyzer](http://personal.psu.edu/xxl13/downloads/l2sca.html) (L2SCA), with added support for Windows and an improved command-line interface for easier usage. NeoSCA is written by Tan, Long (谭龙)。It accepts written English texts and computes the following measures:

 <details>

@@ -168,8 +168,8 @@ This ensures that the entire filename including the spaces, is interpreted as a
 Specify the input directory after `nsca`.

 ```
-nsca samples/ # analyze every txt/docx file under the "samples/" directory
-nsca samples/ --ftype txt # analyze only txt files under "samples/"
+nsca samples/ # analyze every txt/docx file under the "samples/" directory
+nsca samples/ --ftype txt # analyze only txt files under "samples/"
 nsca samples/ --ftype docx # analyze only docx files under "samples/"
 ```

@@ -184,8 +184,8 @@ You can also use [wildcards](https://www.gnu.org/savannah-checkouts/gnu/clisp/im

 ```sh
 cd ./samples/
-nsca sample*.txt # every file whose name starts with "sample" and ends with ".txt"
-nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
+nsca sample*.txt # every file whose name starts with "sample" and ends with ".txt"
+nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
 nsca sample10[1-9].txt sample1[1-9][0-9].txt sample200.txt # sample101.txt -- sample200.txt
 ```

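To see which file names the bracket patterns in the hunk above select, the following sketch checks them with Python's `fnmatch` module rather than a shell. The file names `sample1.txt` through `sample12.txt` are made up for illustration; a real shell expands the patterns against whatever is in the current directory before `nsca` ever sees them.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical directory listing: sample1.txt ... sample12.txt
names = [f"sample{i}.txt" for i in range(1, 13)]


def select(pattern: str) -> List[str]:
    """Return the names that a shell-style pattern matches."""
    return [n for n in names if fnmatchcase(n, pattern)]


# [1-9] matches exactly one character, so two-digit names are excluded.
one_to_nine = select("sample[1-9].txt")
# A literal name matches only itself, which is why sample10.txt is listed separately.
ten = select("sample10.txt")
# * matches any run of characters, including several digits.
everything = select("sample*.txt")

print(one_to_nine + ten)
```

Because `[1-9]` stands for a single character, covering `sample1.txt` to `sample10.txt` takes two arguments, which is what the README example does.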
@@ -414,7 +414,7 @@ BibTeX

 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.49},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.50},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -429,7 +429,7 @@ year = {2022}
 APA (7th edition)
 </summary>

-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.49) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.50) [Computer software]. Github. https://github.com/tanloong/neosca</pre>

 </details>

@@ -439,7 +439,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>

-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.49, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.50, GitHub, 2022, https://github.com/tanloong/neosca.</pre>

 </details>

README_zh_cn.md

Lines changed: 8 additions & 8 deletions
@@ -17,7 +17,7 @@
 [繁體中文](https://github.com/tanloong/neosca/blob/master/README_zh_tw.md) |
 [English](https://github.com/tanloong/neosca#readme)

-NeoSCA 是 [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)[L2 Syntactic Complexity Analyzer (L2SCA)](http://personal.psu.edu/xxl13/downloads/l2sca.html) 的复刻版,添加了对 Windows 的支持和更多的命令行选项。NeoSCA 对英文语料统计以下内容:
+NeoSCA 是 [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)[L2 Syntactic Complexity Analyzer (L2SCA)](http://personal.psu.edu/xxl13/downloads/l2sca.html) 的复刻版,添加了对 Windows 的支持和更多的命令行选项,作者谭龙。NeoSCA 对英文语料统计以下内容:

 <details>

@@ -161,8 +161,8 @@ nsca "./samples/sample 1.txt"
 `nsca` 的右边指定输入文件夹。

 ```
-nsca samples/ # 分析 samples/ 文件夹下所有的 txt 和 docx 文件
-nsca samples/ --ftype txt # 只分析 txt 文件
+nsca samples/ # 分析 samples/ 文件夹下所有的 txt 和 docx 文件
+nsca samples/ --ftype txt # 只分析 txt 文件
 nsca samples/ --ftype docx # 只分析 docx 文件
 ```

@@ -177,8 +177,8 @@ nsca sample1.txt sample2.txt

 ```sh
 cd ./samples/
-nsca sample*.txt # 指定所有文件名以 “sample” 开头并且以 “.txt” 结尾的文件
-nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
+nsca sample*.txt # 指定所有文件名以 “sample” 开头并且以 “.txt” 结尾的文件
+nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
 nsca sample10[1-9].txt sample1[1-9][0-9].txt sample200.txt # sample101.txt -- sample200.txt
 ```

@@ -404,7 +404,7 @@ BibTeX

 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.49},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.50},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -419,7 +419,7 @@ year = {2022}
 APA (7th edition)
 </summary>

-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.49) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.50) [Computer software]. Github. https://github.com/tanloong/neosca</pre>

 </details>

@@ -429,7 +429,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>

-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.49, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.50, GitHub, 2022, https://github.com/tanloong/neosca.</pre>

 </details>

README_zh_tw.md

Lines changed: 8 additions & 8 deletions
@@ -17,7 +17,7 @@
 [繁體中文](https://github.com/tanloong/neosca/blob/master/README_zh_tw.md) |
 [English](https://github.com/tanloong/neosca#readme)

-NeoSCA 是 [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)[L2 Syntactic Complexity Analyzer (L2SCA)](http://personal.psu.edu/xxl13/downloads/l2sca.html) 的復刻版,添加了對 Windows 的支持和更多的命令行選項。NeoSCA 對英文語料統計以下內容:
+NeoSCA 是 [Xiaofei Lu](http://personal.psu.edu/xxl13/index.html)[L2 Syntactic Complexity Analyzer (L2SCA)](http://personal.psu.edu/xxl13/downloads/l2sca.html) 的復刻版,添加了對 Windows 的支持和更多的命令行選項,作者譚龍。NeoSCA 對英文語料統計以下內容:

 <details>

@@ -161,8 +161,8 @@ nsca "./samples/sample 1.txt"
 `nsca` 的右邊指定輸入文件夾。

 ```
-nsca samples/ # 分析 samples/ 文件夾下所有的 txt 和 docx 文件
-nsca samples/ --ftype txt # 只分析 txt 文件
+nsca samples/ # 分析 samples/ 文件夾下所有的 txt 和 docx 文件
+nsca samples/ --ftype txt # 只分析 txt 文件
 nsca samples/ --ftype docx # 只分析 docx 文件
 ```

@@ -177,8 +177,8 @@ nsca sample1.txt sample2.txt

 ```sh
 cd ./samples/
-nsca sample*.txt # 指定所有文件名以 「sample」 開頭並且以 「.txt」 結尾的文件
-nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
+nsca sample*.txt # 指定所有文件名以 「sample」 開頭並且以 「.txt」 結尾的文件
+nsca sample[1-9].txt sample10.txt # sample1.txt -- sample10.txt
 nsca sample10[1-9].txt sample1[1-9][0-9].txt sample200.txt # sample101.txt -- sample200.txt
 ```

@@ -404,7 +404,7 @@ BibTeX

 ```BibTeX
 @misc{tan2022neosca,
-title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.49},
+title = {NeoSCA: A Fork of L2 Syntactic Complexity Analyzer, version 0.0.50},
 author = {Long Tan},
 howpublished = {\url{https://github.com/tanloong/neosca}},
 year = {2022}
@@ -419,7 +419,7 @@ year = {2022}
 APA (7th edition)
 </summary>

-<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.49) [Computer software]. Github. https://github.com/tanloong/neosca</pre>
+<pre>Tan, L. (2022). <i>NeoSCA</i> (version 0.0.50) [Computer software]. Github. https://github.com/tanloong/neosca</pre>

 </details>

@@ -429,7 +429,7 @@ APA (7th edition)
 MLA (9th edition)
 </summary>

-<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.49, GitHub, 2022, https://github.com/tanloong/neosca.</pre>
+<pre>Tan, Long. <i>NeoSCA</i>. version 0.0.50, GitHub, 2022, https://github.com/tanloong/neosca.</pre>

 </details>

neosca/about.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 #!/usr/bin/env python3
 # -*- coding=utf-8 -*-

-__version__ = "0.0.49"
+__version__ = "0.0.50"

neosca/depends_installer.py

Lines changed: 18 additions & 17 deletions
@@ -4,6 +4,7 @@
 import logging
 import lzma
 import os
+import os.path as os_path
 import re
 import shutil
 import subprocess
@@ -120,7 +121,7 @@ def _get_normalized_archive_ext(self, file: str) -> str:
         raise ValueError(f"Error: {file} has unexpected extension.")

     def _extract_files(self, file: str, file_ending: str, destination_folder: str) -> str:
-        if not os.path.isfile(file):
+        if not os_path.isfile(file):
             raise ValueError(f"Error: {file} is not a regular file.")

         start_listing = set(os.listdir(destination_folder))
@@ -142,26 +143,26 @@ def _extract_files(self, file: str, file_ending: str, destination_folder: str) -
         end_listing = set(os.listdir(destination_folder))
         unzipped_directory = end_listing.difference(start_listing).pop()

-        return os.path.join(destination_folder, unzipped_directory)
+        return os_path.join(destination_folder, unzipped_directory)

     def _path_parse(self, file_path: str) -> _Path:
-        dirname = os.path.dirname(file_path)
-        base = os.path.basename(file_path)
-        name, ext = os.path.splitext(base)
+        dirname = os_path.dirname(file_path)
+        base = os_path.basename(file_path)
+        name, ext = os_path.splitext(base)
         return _Path(dir=dirname, base=base, name=name, ext=ext)

     def _unpack_jars(self, fs_path: str, java_bin_path: str) -> None:
-        if os.path.isdir(fs_path):
+        if os_path.isdir(fs_path):
             for f in os.listdir(fs_path):
-                current_path = os.path.join(fs_path, f)
+                current_path = os_path.join(fs_path, f)
                 self._unpack_jars(current_path, java_bin_path)
             return
-        elif os.path.isfile(fs_path):
-            file_ext = os.path.splitext(fs_path)[-1]
+        elif os_path.isfile(fs_path):
+            file_ext = os_path.splitext(fs_path)[-1]
             if file_ext.endswith("pack"):
                 p = self._path_parse(fs_path)
-                name = os.path.join(p.dir, p.name)
-                tool_path = os.path.join(java_bin_path, _UNPACK200)
+                name = os_path.join(p.dir, p.name)
+                tool_path = os_path.join(java_bin_path, _UNPACK200)
                 try:
                     subprocess.run(
                         [tool_path, _UNPACK200_ARGS, f"{name}.pack", f"{name}.jar"],
@@ -178,15 +179,15 @@ def _decompress_archive(
         self, archive_path: str, file_extension: str, target_dir: str
     ) -> str:
         logging.info(f"Decompressing {archive_path} to {target_dir}...")
-        if not os.path.isdir(target_dir):
+        if not os_path.isdir(target_dir):
             os.makedirs(target_dir)

-        archive_path = os.path.normpath(archive_path)
+        archive_path = os_path.normpath(archive_path)

-        if os.path.isfile(archive_path):
+        if os_path.isfile(archive_path):
             unzipped_directory = self._extract_files(archive_path, file_extension, target_dir)
             return unzipped_directory
-        elif os.path.isdir(archive_path):
+        elif os_path.isdir(archive_path):
             return archive_path
         else:
             raise ValueError(f"Error: {archive_path} is neither a directory not a file.")
@@ -252,7 +253,7 @@ def _download(self, download_url: str, name: str) -> str:
         else:
             filename = urllib.parse.urlparse(download_url).path.rpartition("/")[-1]
             # e.g. stanford-tregex-4.2.0.zip, stanford-parser-4.2.0.zip
-        filename = os.path.join(tempfile.gettempdir(), filename) # type: ignore
+        filename = os_path.join(tempfile.gettempdir(), filename) # type: ignore
         try:
             opener = urllib.request.build_opener()
             opener.addheaders = list(self.headers.items())
@@ -314,7 +315,7 @@ def install_java(
         jdk_archive = self._download(url, name=JAVA)
         jdk_ext = self._get_normalized_archive_ext(jdk_archive)
         jdk_dir = self._decompress_archive(jdk_archive, jdk_ext, target_dir)
-        jdk_bin = os.path.join(jdk_dir, "bin")
+        jdk_bin = os_path.join(jdk_dir, "bin")
         self._unpack_jars(jdk_dir, jdk_bin)
         if jdk_archive:
             os.remove(jdk_archive)

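As a rough illustration of what `_path_parse` and the `.pack` handling above do with a path, here is a standalone sketch using the same `os_path` alias. `PathParts` is a hypothetical stand-in for the project's `_Path` type, whose definition is not shown in this diff, and the path `jdk/lib/rt.pack` is invented.

```python
import os.path as os_path
from typing import NamedTuple


class PathParts(NamedTuple):
    """Hypothetical stand-in for the commit's _Path container."""

    dir: str
    base: str
    name: str
    ext: str


def path_parse(file_path: str) -> PathParts:
    # Same three calls as in the diff: directory, file name, then stem and extension.
    dirname = os_path.dirname(file_path)
    base = os_path.basename(file_path)
    name, ext = os_path.splitext(base)
    return PathParts(dir=dirname, base=base, name=name, ext=ext)


parts = path_parse("jdk/lib/rt.pack")
# _unpack_jars joins dir and name to build the "<stem>.pack" / "<stem>.jar" pair.
stem = os_path.join(parts.dir, parts.name)
print(parts, stem)
```

The alias only changes how the module is spelled at each call site; `os_path.join` and `os.path.join` are the same function.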
neosca/lca/lca.py

Lines changed: 6 additions & 1 deletion
@@ -155,7 +155,12 @@ def run_on_ifile(

                 if lemma not in easy_words:
                     slex_count_map[lemma] = slex_count_map.get(lemma, 0) + 1
-            elif pos == "VERB" and lemma not in ("be", "have"):
+            # Don't have to filter auxiliary verbs, because the VERB tag covers
+            # main verbs (content verbs) but it does not cover auxiliary verbs
+            # and verbal copulas (in the narrow sense), for which there is the
+            # AUX tag.
+            # https://universaldependencies.org/u/pos/VERB.html
+            elif pos == "VERB":
                 verb_count_map[lemma] = verb_count_map.get(lemma, 0) + 1
                 lex_count_map[lemma] = lex_count_map.get(lemma, 0) + 1

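The effect of dropping the lemma filter can be sketched on a hand-tagged token list. The `(lemma, pos)` pairs below are written by hand following the Universal Dependencies guideline linked in the comment, for the sentence "She has been reading and has a book"; they are an assumption for illustration, not output from spaCy, and `count_verbs_old` / `count_verbs_new` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# Hand-written tags: auxiliary "has"/"been" are AUX, lexical "has" is VERB.
tokens: List[Tuple[str, str]] = [
    ("she", "PRON"),
    ("have", "AUX"),
    ("be", "AUX"),
    ("read", "VERB"),
    ("and", "CCONJ"),
    ("have", "VERB"),
    ("a", "DET"),
    ("book", "NOUN"),
]


def count_verbs_old(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Condition before the commit: VERB tag plus a lemma blacklist."""
    counts: Dict[str, int] = {}
    for lemma, pos in pairs:
        if pos == "VERB" and lemma not in ("be", "have"):
            counts[lemma] = counts.get(lemma, 0) + 1
    return counts


def count_verbs_new(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Condition after the commit: the VERB tag alone."""
    counts: Dict[str, int] = {}
    for lemma, pos in pairs:
        if pos == "VERB":
            counts[lemma] = counts.get(lemma, 0) + 1
    return counts


old_counts = count_verbs_old(tokens)
new_counts = count_verbs_new(tokens)
print(old_counts, new_counts)
```

Under these assumed tags the auxiliaries are skipped either way, since they never carry the VERB tag. The one visible difference is that lexical "have" is now counted, whereas the old condition excluded the lemma regardless of its tag.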
neosca/lca/main.py

Lines changed: 2 additions & 1 deletion
@@ -76,7 +76,7 @@ def install_spacy(self) -> SCAProcedureResult:
         import subprocess
         from subprocess import CalledProcessError

-        command = [sys.executable, "-m", "pip", "install", "spacy"]
+        command = [sys.executable, "-m", "pip", "install", "-U", "spacy"]
         try:
             subprocess.run(command, check=True, capture_output=False)
         except CalledProcessError as e:
@@ -94,6 +94,7 @@ def check_spacy(self):
         try:
             logging.info("Trying to load spaCy...")
             import spacy # type: ignore # noqa: F401 'en_core_web_sm' imported but unused
+            import en_core_web_sm # type: ignore # noqa: F401 'en_core_web_sm' imported but unused
         except ModuleNotFoundError:
             is_install = get_yes_or_no(
                 "Running LCA requires spaCy. Do you want me to install it for you?"

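The command list changed in `install_spacy` can be looked at in isolation. The sketch below only builds the argument list and does not run it; `build_pip_install` is a hypothetical helper, not a function from the project.

```python
import sys
from typing import List


def build_pip_install(package: str, upgrade: bool = True) -> List[str]:
    """Build a pip invocation bound to the interpreter that is currently running."""
    # "python -m pip" targets the environment of sys.executable, which avoids
    # picking up a "pip" script that belongs to a different Python installation.
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        # -U is pip's short form of --upgrade.
        command.append("-U")
    command.append(package)
    return command


command = build_pip_install("spacy")
# Passing this list to subprocess.run(command, check=True) would perform the install.
print(command)
```

With `-U`, pip upgrades a spaCy that is already installed instead of leaving the existing version in place.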
neosca/neosca.py

Lines changed: 6 additions & 5 deletions
@@ -1,6 +1,7 @@
 import json
 import logging
 import os
+import os.path as os_path
 import sys
 from typing import Dict, List, Optional, Set, Tuple

@@ -85,10 +86,10 @@ def ensure_stanford_parser_initialized(self) -> None:

     def already_parsed(self, ofile_parsed: str, ifile: str) -> bool:
         has_been_parsed = False
-        is_exist = os.path.exists(ofile_parsed)
+        is_exist = os_path.exists(ofile_parsed)
         if is_exist:
-            is_not_empty = os.path.getsize(ofile_parsed) > 0
-            is_parsed_newer_than_input = os.path.getmtime(ofile_parsed) > os.path.getmtime(ifile)
+            is_not_empty = os_path.getsize(ofile_parsed) > 0
+            is_parsed_newer_than_input = os_path.getmtime(ofile_parsed) > os_path.getmtime(ifile)
             if is_not_empty and is_parsed_newer_than_input:
                 has_been_parsed = True
         return has_been_parsed
@@ -125,7 +126,7 @@ def parse_ifile(self, ifile: str) -> Optional[str]:
             # assume input as parse trees
             return self.io.read_txt(ifile, is_guess_encoding=False)

-        ofile_parsed = os.path.splitext(ifile)[0] + ".parsed"
+        ofile_parsed = os_path.splitext(ifile)[0] + ".parsed"
         has_been_parsed = self.already_parsed(ofile_parsed=ofile_parsed, ifile=ifile)
         if has_been_parsed:
             logging.info(
@@ -142,7 +143,7 @@ def parse_ifile(self, ifile: str) -> Optional[str]:
         try:
             trees = self.parse_text(text, ofile_parsed)
         except KeyboardInterrupt:
-            if os_path.exists(ofile_parsed):
+            if os_path.exists(ofile_parsed):
                 os.remove(ofile_parsed)
             sys.exit(1)
         else:

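The freshness check in `already_parsed` can be tried outside the class. The sketch below is a standalone function with the same three conditions (exists, non-empty, newer than the input); the temporary files, their contents, and the explicit timestamps are all made up for the demonstration.

```python
import os
import os.path as os_path
import tempfile


def already_parsed(ofile_parsed: str, ifile: str) -> bool:
    """Return True if a non-empty .parsed file exists and is newer than its input."""
    if not os_path.exists(ofile_parsed):
        return False
    is_not_empty = os_path.getsize(ofile_parsed) > 0
    is_parsed_newer_than_input = os_path.getmtime(ofile_parsed) > os_path.getmtime(ifile)
    return is_not_empty and is_parsed_newer_than_input


with tempfile.TemporaryDirectory() as tmp:
    ifile = os_path.join(tmp, "sample.txt")
    # Same naming rule as parse_ifile: replace the extension with ".parsed".
    ofile = os_path.splitext(ifile)[0] + ".parsed"

    with open(ifile, "w", encoding="utf-8") as f:
        f.write("This is a test.")
    missing = already_parsed(ofile, ifile)  # no .parsed file yet

    with open(ofile, "w", encoding="utf-8") as f:
        f.write("(ROOT placeholder)")  # placeholder text, not a real parse tree
    # Set explicit mtimes so the comparison does not depend on clock resolution.
    os.utime(ifile, (1000, 1000))
    os.utime(ofile, (2000, 2000))
    fresh = already_parsed(ofile, ifile)

    os.utime(ifile, (3000, 3000))  # simulate editing the input after parsing
    stale = already_parsed(ofile, ifile)

print(missing, fresh, stale)
```

Comparing modification times means an edited input is parsed again, while an untouched one reuses the cached `.parsed` file.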