|
1 | | -<!-- START doctoc generated TOC please keep comment here to allow auto update --> |
2 | | -<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> |
3 | | -**Table of Contents** |
4 | | - |
5 | | -- [cntext: A Chinese Text Analysis Toolkit for Social Science Research](#cntext%E9%9D%A2%E5%90%91%E7%A4%BE%E4%BC%9A%E7%A7%91%E5%AD%A6%E7%A0%94%E7%A9%B6%E7%9A%84%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E6%9E%90%E5%B7%A5%E5%85%B7%E5%BA%93) |
6 | | -- [Installing cntext](#%E5%AE%89%E8%A3%85-cntext) |
7 | | -- [Feature Modules](#%E5%8A%9F%E8%83%BD%E6%A8%A1%E5%9D%97) |
8 | | -- [QuickStart](#quickstart) |
9 | | -- [1. IO Module](#%E4%B8%80io-%E6%A8%A1%E5%9D%97) |
10 | | - - [1.1 get_dict_list()](#11-get_dict_list) |
11 | | - - [1.2 Built-in yaml dictionaries](#12-%E5%86%85%E7%BD%AE-yaml-%E8%AF%8D%E5%85%B8) |
12 | | - - [1.3 read_dict_yaml()](#13-read_dict_yaml) |
13 | | - - [1.4 detect_encoding()](#14-detect_encoding) |
14 | | - - [1.5 get_files(fformat)](#15-get_filesfformat) |
15 | | - - [1.6 read_pdf](#16-read_pdf) |
16 | | - - [1.7 read_docx](#17-read_docx) |
17 | | - - [1.8 read_file()](#18-read_file) |
18 | | - - [1.9 read_files()](#19-read_files) |
19 | | - - [1.10 extract_mda](#110-extract_mda) |
20 | | - - [1.11 traditional2simple()](#111-traditional2simple) |
21 | | - - [1.12 fix_text()](#112-fix_text) |
22 | | - - [1.13 fix_contractions(text)](#113-fix_contractionstext) |
23 | | -- [2. Stats Module](#%E4%BA%8Cstats-%E6%A8%A1%E5%9D%97) |
24 | | - - [2.1 word_count()](#21-word_count) |
25 | | - - [2.2 readability()](#22-readability) |
26 | | - - [2.3 sentiment(text, diction, lang)](#23-sentimenttext-diction-lang) |
27 | | - - [2.4 sentiment_by_valence()](#24-sentiment_by_valence) |
28 | | - - [2.5 word_in_context()](#25-word_in_context) |
29 | | - - [2.6 epu()](#26-epu) |
30 | | - - [2.7 fepu()](#27-fepu) |
31 | | - - [2.8 semantic_brand_score()](#28-semantic_brand_score) |
32 | | - - [2.9 Text similarity](#29-%E6%96%87%E6%9C%AC%E7%9B%B8%E4%BC%BC%E5%BA%A6) |
33 | | - - [2.10 word_hhi](#210-word_hhi) |
34 | | -- [3. Plot Module](#%E4%B8%89plot-%E6%A8%A1%E5%9D%97) |
35 | | - - [3.1 matplotlib_chinese()](#31-matplotlib_chinese) |
36 | | - - [3.2 lexical_dispersion_plot1()](#32-lexical_dispersion_plot1) |
37 | | - - [3.3 lexical_dispersion_plot2()](#33-lexical_dispersion_plot2) |
38 | | -- [4. Model Module](#%E5%9B%9Bmodel-%E6%A8%A1%E5%9D%97) |
39 | | - - [4.1 Word2Vec()](#41-word2vec) |
40 | | - - [4.2 GloVe()](#42-glove) |
41 | | -- [4.3 evaluate_similarity()](#43-evaluate_similarity) |
42 | | -- [4.4 evaluate_analogy()](#44-evaluate_analogy) |
43 | | -- [4.5 SoPmi()](#45-sopmi) |
44 | | -- [4.6 load_w2v()](#46-load_w2v) |
45 | | - - [4.7 glove2word2vec()](#47-glove2word2vec) |
46 | | - - [Note](#%E6%B3%A8%E6%84%8F) |
47 | | - - [4.8 expand_dictionary()](#48-expand_dictionary) |
48 | | -- [5. Mind Module](#%E4%BA%94mind-%E6%A8%A1%E5%9D%97) |
49 | | - - [5.1 semantic_centroid(wv, words)](#51-semantic_centroidwv-words) |
50 | | -- [5.2 generate_concept_axis(wv, poswords, negwords)](#52-generate_concept_axiswv-poswords-negwords) |
51 | | - - [5.3 sematic_distance()](#53-sematic_distance) |
52 | | - - [5.4 sematic_projection()](#54-sematic_projection) |
53 | | -- [5.5 project_word](#55-project_word) |
54 | | - - [5.6 project_text()](#56-project_text) |
55 | | - - [5.7 divergent_association_task()](#57-divergent_association_task) |
56 | | - - [5.8 discursive_diversity_score()](#58-discursive_diversity_score) |
57 | | - - [5.8 procrustes_align()](#58-procrustes_align) |
58 | | -- [6. LLM Module](#%E5%85%ADllm-%E6%A8%A1%E5%9D%97) |
59 | | - - [6.1 ct.llm()](#61-ctllm) |
60 | | - - [6.2 Built-in prompts](#62-%E5%86%85%E7%BD%AEprompt) |
61 | | -- [Usage Statement](#%E4%BD%BF%E7%94%A8%E5%A3%B0%E6%98%8E) |
62 | | - - [apalike](#apalike) |
63 | | - - [bibtex](#bibtex) |
64 | | - - [endnote](#endnote) |
65 | | - |
66 | | -<!-- END doctoc generated TOC please keep comment here to allow auto update --> |
67 | | - |
68 | 1 |
|
69 | 2 |
|
70 | 3 | ## cntext: A Chinese Text Analysis Toolkit for Social Science Research |
@@ -166,7 +99,6 @@ cntext includes five modules: io, model, stats, mind |
166 | 99 | | **mind** | `sematic_projection(wv, words, poswords, negwords, return_full=False, cosine=False)` | Measure semantic projection | |
167 | 100 | | **mind** | `ct.project_word(wv, a, b, cosine=False)` | Compute the projection of word a onto word b | |
168 | 101 | | **mind** | `ct.project_text(wv, text, axis, lang='chinese', cosine=False)` | Compute the projection of the text `text` onto the concept axis vector `axis` | |
169 | | -| **mind** | `ct.project_text(wv, text, axis, lang='chinese', cosine=False)` | Compute the projection of the text `text` onto the concept axis vector `axis` | |
170 | 102 | | **mind** | `ct.sematic_distance(wv, words1, words2)` | Measure semantic distance | |
171 | 103 | | **mind** | `ct.divergent_association_task(wv, words)` | Measure divergent thinking (creativity) | |
172 | 104 | | **mind** | `ct.discursive_diversity_score(wv, words)` | Measure discursive diversity (cognitive diversity) | |
@@ -1449,7 +1381,7 @@ Output Saved To: output/三体-GloVe.50.15.bin |
1449 | 1381 |
|
1450 | 1382 | <br> |
1451 | 1383 |
|
1452 | | -## 4.3 evaluate_similarity() |
| 1384 | +### 4.3 evaluate_similarity() |
1453 | 1385 |
|
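1454 | 1386 | Evaluates a word-embedding model's performance on semantic similarity. Spearman's rank correlation coefficient is the metric; it ranges over [-1, 1], where 1 means perfect positive correlation, -1 perfect negative correlation, and 0 no correlation. |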
1455 | 1387 |
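As a rough illustration of the metric itself (not of how `evaluate_similarity()` is implemented internally), Spearman's coefficient is the Pearson correlation computed on ranks. The sketch below uses only the standard library; the helper names `average_ranks` and `spearman` and the two score lists are hypothetical and do not come from cntext.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank sequences."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: human similarity ratings vs. model cosine similarities
# for the same five word pairs.
human_scores = [9.1, 7.4, 5.0, 2.3, 0.8]
model_scores = [0.83, 0.61, 0.65, 0.20, 0.05]
rho = spearman(human_scores, model_scores)
print(rho)
```

A value close to 1 would suggest that the model orders word pairs much as the human raters do, even when the raw scores are on different scales.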
|
@@ -1512,7 +1444,7 @@ Processing Similarity Test: 100%|██████████| 537/537 [00:00< |
1512 | 1444 |
|
1513 | 1445 | <br> |
1514 | 1446 |
|
1515 | | -## 4.4 evaluate_analogy() |
| 1447 | +### 4.4 evaluate_analogy() |
1516 | 1448 |
|
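1517 | 1449 | A function for evaluating a word-embedding model on an analogy test. It reads the specified analogy test file, computes how accurately the model predicts word relations, and reports each category's accuracy, the number of words found, the number of words not found, and the average rank. |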
1518 | 1450 |
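To make the reported metrics concrete, here is a minimal sketch of a standard analogy test (a : b :: c : ?), assuming the usual vector-offset method `b - a + c` with cosine ranking. It is not cntext's implementation: the function names, the toy vectors, and the choice to count found/not-found per question rather than per word are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def analogy_rank(vectors, a, b, c, expected):
    """1-based rank of `expected` among candidates for a : b :: c : ?"""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate list.
    candidates = [w for w in vectors if w not in (a, b, c)]
    candidates.sort(key=lambda w: cosine(vectors[w], target), reverse=True)
    return candidates.index(expected) + 1


def evaluate(vectors, questions):
    """Accuracy, found/not-found counts, and average rank over all questions."""
    found, not_found, ranks = 0, 0, []
    for a, b, c, expected in questions:
        if any(w not in vectors for w in (a, b, c, expected)):
            not_found += 1  # question skipped: a word is out of vocabulary
            continue
        found += 1
        ranks.append(analogy_rank(vectors, a, b, c, expected))
    correct = sum(1 for r in ranks if r == 1)
    return {
        "accuracy": correct / found if found else 0.0,
        "found": found,
        "not_found": not_found,
        "avg_rank": sum(ranks) / len(ranks) if ranks else None,
    }


# Hypothetical 3-dimensional toy vectors.
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "woman", "king", "banana"),  # "banana" is out of vocabulary
]
report = evaluate(toy_vectors, questions)
print(report)
```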
|
@@ -1597,7 +1529,7 @@ Processing Analogy Test: 100%|█████████████| 1198/1198 |
1597 | 1529 |
|
1598 | 1530 | <br> |
1599 | 1531 |
|
1600 | | -## 4.5 SoPmi() |
| 1532 | +### 4.5 SoPmi() |
1601 | 1533 |
|
1602 | 1534 | ```python |
1603 | 1535 | ct.SoPmi(corpus_file, seed_file)  # manually annotated initial seed words
@@ -1630,7 +1562,7 @@ Finish! used 19.74 s |
1630 | 1562 |
|
1631 | 1563 | <br> |
1632 | 1564 |
|
1633 | | -## 4.6 load_w2v() |
| 1565 | +### 4.6 load_w2v() |
1634 | 1566 |
|
1635 | 1567 | Loads a word2vec model .txt file pretrained with cntext2.x |
1636 | 1568 |
|
@@ -1808,7 +1740,7 @@ array([ 0.15567462, -0.05117003, -0.18534171, 0.20808656, -0.01133028, |
1808 | 1740 |
|
1809 | 1741 | <br> |
1810 | 1742 |
|
1811 | | -## 5.2 generate_concept_axis(wv, poswords, negwords) |
| 1743 | +### 5.2 generate_concept_axis(wv, poswords, negwords) |
1812 | 1744 |
|
1813 | 1745 | Generates a concept axis vector. |
1814 | 1746 |
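One common way to build such an axis, assumed here for illustration and not confirmed as cntext's exact method, is to subtract the mean vector of the negative-pole words from the mean vector of the positive-pole words and normalize the result. The helper `concept_axis` and the toy vectors below are hypothetical; a plain dict stands in for the `wv` object.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def concept_axis(wv, poswords, negwords):
    """Unit vector pointing from the negative pole to the positive pole.

    Assumption: axis = normalize(mean(pos vectors) - mean(neg vectors)).
    """
    pos_center = mean_vector([wv[w] for w in poswords])
    neg_center = mean_vector([wv[w] for w in negwords])
    diff = [p - n for p, n in zip(pos_center, neg_center)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


# Hypothetical 2-dimensional toy embeddings.
toy_wv = {
    "rich": [2.0, 1.0],
    "wealthy": [2.0, 3.0],
    "poor": [-1.0, 2.0],
    "needy": [-1.0, 2.0],
}
axis = concept_axis(toy_wv, ["rich", "wealthy"], ["poor", "needy"])
print(axis)
```

Averaging several words per pole tends to make the axis less sensitive to the quirks of any single word vector.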
|
@@ -2000,7 +1932,7 @@ Run |
2000 | 1932 |
|
2001 | 1933 | <br> |
2002 | 1934 |
|
2003 | | -## 5.5 project_word |
| 1935 | +### 5.5 project_word |
2004 | 1936 |
|
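2005 | 1937 | In the vector space, computes the projection (or cosine similarity) of word a onto word b. By default, the projection value is returned. |
2006 | 1938 | If cosine=True, the cosine similarity between word a and word b is returned instead. |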
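The difference between the two return values can be sketched with plain vectors. This assumes the default is the scalar projection `a·b / |b|`, which is the textbook definition; whether cntext uses exactly this formula is not shown in the section. The function names and the vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def scalar_projection(a, b):
    """Length of a's shadow on b: depends on the magnitude of a."""
    return dot(a, b) / math.sqrt(dot(b, b))


def cosine_similarity(a, b):
    """Angle-only measure: invariant to the magnitudes of a and b."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical vectors standing in for wv[a] and wv[b].
vec_a = [3.0, 4.0]
vec_b = [2.0, 0.0]
proj = scalar_projection(vec_a, vec_b)
cos = cosine_similarity(vec_a, vec_b)
print(proj, cos)
```

Scaling `vec_a` changes the projection but leaves the cosine similarity unchanged, which is the practical reason to choose one over the other.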
|