Commit ffe730e

feat: adds compare information and work metadata caching mechanism with MongoDB (#229)
- Adds logic for saving works' metadata and compare information to MongoDB and reading them back.
- Adds a Docker Compose file for running MongoDB 8.0, plus instructions for starting the MongoDB container and setting up the connection from the codeplag CLI.
- Adds new arguments to the command "codeplag settings modify": "--mongo-port", "--mongo-user", "--mongo-pass", and "--mongo-host", which configure the connection to MongoDB.
- Adds unit tests and autotests for the logic described above.
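
The four connection options can be pictured as the parts of a standard MongoDB connection string. The following is a minimal sketch under that assumption, not code from this commit: `build_mongo_uri` is a hypothetical helper, the values `root`, `example`, and `27017` are the defaults from docker/compose.yml, and `localhost` is an assumed host. The user name and password are percent-encoded because characters such as `@`, `:`, and `/` are reserved in the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host: str, port: int, user: str, password: str) -> str:
    """Combine host, port, user and password into a mongodb:// URI.

    Hypothetical helper for illustration only; it is not part of codeplag.
    Credentials are percent-encoded so that reserved characters in them
    cannot be mistaken for URI delimiters.
    """
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


if __name__ == "__main__":
    # Sample values: compose defaults for user, password and port;
    # "localhost" is an assumption about where MongoDB is reachable.
    print(build_mongo_uri("localhost", 27017, "root", "example"))
```

When codeplag itself runs inside a container started with `--add-host=host.docker.internal:host-gateway`, the host part would presumably be `host.docker.internal` rather than `localhost`.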
1 parent 715f43c commit ffe730e

29 files changed: +1722 -211 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-UTIL_VERSION := 0.5.22
+UTIL_VERSION := 0.5.23
 UTIL_NAME := codeplag
 PWD := $(shell pwd)

README.md

Lines changed: 18 additions & 0 deletions
@@ -42,6 +42,10 @@ Program for finding plagiarism in the source code written in Python 3, C, and C+
 ```
 $ docker run --rm --tty --interactive --volume <absolute_local_path_with_data>:/usr/src/works "artanias/codeplag-ubuntu22.04:latest" /bin/bash
 ```
+or if Mongo is used on localhost
+```
+$ docker run --rm --tty --interactive --volume <absolute_local_path_with_data>:/usr/src/works --add-host=host.docker.internal:host-gateway "artanias/codeplag-ubuntu22.04:latest" /bin/bash
+```

 ### 1.3 Install with package manager apt-get

@@ -58,6 +62,16 @@ Program for finding plagiarism in the source code written in Python 3, C, and C+
 $ sudo apt-get install <path_to_the_package>/<package_name>.deb
 ```

+### 1.4 MongoDB cache
+
+If you want to use MongoDB cache for saving reports and works metadata, complete steps:
+
+- Run MongoDB (you can configure DB params in [compose](docker/compose.yml))
+
+```
+$ docker compose --file docker/compose.yml up --detach
+```
+
 ## 2. Tests

 ### 2.1. Pre-commit
@@ -128,6 +142,10 @@ Program for finding plagiarism in the source code written in Python 3, C, and C+
 # Path to environment variables '/usr/src/works/.env'
 $ codeplag settings modify --threshold 70 --language en --show_progress 1 --reports_extension csv --reports /usr/src/works --environment /usr/src/works/.env --ngrams-length 2 --workers 4
 ```
+- If you use MongoDB with custom settings configure util (don't forget to provide password (`example` by default))
+```
+$ codeplag settings modify --mongo-pass <mongo-pass> --mongo-port <mongo-port> --mongo-user <mongo-user> --mongo-host <mongo-host>
+```
 - Python analyzer:
 ```
 $ codeplag check --extension py --files src/codeplag/pyplag/astwalkers.py --directories src/codeplag/pyplag

docker/compose.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+services:
+  mongodb:
+    image: mongo:8.0
+    restart: unless-stopped
+    container_name: mongodb
+    environment:
+      MONGO_INITDB_ROOT_USERNAME: root
+      MONGO_INITDB_ROOT_PASSWORD: example
+    volumes:
+      - mongodb_data:/data/db
+    ports:
+      - "127.0.0.1:27017:27017"
+    networks:
+      - codeplag-network
+    healthcheck:
+      test: [ "CMD", "mongosh", "--eval", "db.adminCommand('ping')" ]
+      interval: 5s
+      timeout: 5s
+      retries: 3
+      start_period: 5s
+
+volumes:
+  mongodb_data: {}
+
+networks:
+  codeplag-network:
+    driver: bridge

docker/docker.mk

Lines changed: 3 additions & 0 deletions
@@ -39,6 +39,7 @@ docker-test-image: docker-base-image
 docker-test: docker-test-image
 	docker run --rm \
 		--volume $(PWD)/test:/usr/src/$(UTIL_NAME)/test \
+		--volume /var/run/docker.sock:/var/run/docker.sock \
 		"$(TEST_DOCKER_TAG)"

 docker-autotest: docker-test-image docker-build-package
@@ -48,6 +49,7 @@ docker-autotest: docker-test-image docker-build-package
 	else \
 		docker run --rm \
 			--volume $(PWD)/$(DEBIAN_PACKAGES_PATH):/usr/src/$(UTIL_NAME)/$(DEBIAN_PACKAGES_PATH) \
+			--volume /var/run/docker.sock:/var/run/docker.sock \
 			--volume $(PWD)/test:/usr/src/$(UTIL_NAME)/test \
 			--env-file .env \
 			"$(TEST_DOCKER_TAG)" bash -c \
@@ -86,6 +88,7 @@ docker-run: docker-image
 	@touch .env
 	docker run --rm --tty --interactive \
 		--env-file .env \
+		--add-host=host.docker.internal:host-gateway \
 		"$(DOCKER_TAG)"

 docker-rmi:

locales/codeplag.pot

Lines changed: 71 additions & 48 deletions
@@ -5,8 +5,8 @@
 #, fuzzy
 msgid ""
 msgstr ""
-"Project-Id-Version: codeplag 0.5.21\n"
-"POT-Creation-Date: 2025-05-04 14:26+0300\n"
+"Project-Id-Version: codeplag 0.5.23\n"
+"POT-Creation-Date: 2025-05-27 18:24+0300\n"
 "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
 "Last-Translator: Artyom Semidolin\n"
 "Language-Team: LANGUAGE <[email protected]>\n"
@@ -15,214 +15,237 @@ msgstr ""
 "Content-Transfer-Encoding: 8bit\n"
 "Generated-By: Babel 2.15.0\n"

-#: src/codeplag/codeplagcli.py:47
+#: src/codeplag/codeplagcli.py:49
 msgid "You cannot specify the same value multiple times. You provided '{values}'."
 msgstr ""

-#: src/codeplag/codeplagcli.py:61
+#: src/codeplag/codeplagcli.py:66
+msgid "Enter MongoDB password: "
+msgstr ""
+
+#: src/codeplag/codeplagcli.py:76
 msgid "Directory '{path}' not found or not a directory."
 msgstr ""

-#: src/codeplag/codeplagcli.py:74
+#: src/codeplag/codeplagcli.py:89
 msgid "File '{path}' not found or not a file."
 msgstr ""

-#: src/codeplag/codeplagcli.py:86
+#: src/codeplag/codeplagcli.py:101
 msgid "Modifies and shows static settings of the '{util_name}' util."
 msgstr ""

-#: src/codeplag/codeplagcli.py:92
+#: src/codeplag/codeplagcli.py:107
 msgid "Settings commands of the '{util_name}' util."
 msgstr ""

-#: src/codeplag/codeplagcli.py:101
+#: src/codeplag/codeplagcli.py:116
 msgid "Manage the '{util_name}' util settings."
 msgstr ""

-#: src/codeplag/codeplagcli.py:106
+#: src/codeplag/codeplagcli.py:121
 msgid "Path to the environment file with GitHub access token."
 msgstr ""

-#: src/codeplag/codeplagcli.py:112
+#: src/codeplag/codeplagcli.py:128
 msgid ""
 "If defined, then saves reports about suspect works into provided file or "
 "directory. If directory by provided path doesn't exists than saves "
 "reports as a file."
 msgstr ""

-#: src/codeplag/codeplagcli.py:123
-msgid "Extension of saved report files."
+#: src/codeplag/codeplagcli.py:139
+msgid ""
+"When provided 'csv' saves similar works compare info into csv file. When "
+"provided 'mongo' saves similar works compare info and works metadata into"
+" MongoDB."
 msgstr ""

-#: src/codeplag/codeplagcli.py:130
+#: src/codeplag/codeplagcli.py:149
 msgid "Show progress of searching plagiarism."
 msgstr ""

-#: src/codeplag/codeplagcli.py:137
+#: src/codeplag/codeplagcli.py:157
 msgid ""
 "When provided '0' show all check works results in the stdout. When "
 "provided '1' show only new found check works results in the stdout. When "
 "provided '2' do not show check works result in the stdout."
 msgstr ""

-#: src/codeplag/codeplagcli.py:148
+#: src/codeplag/codeplagcli.py:168
 msgid ""
 "Threshold of analyzer which classifies two work as same. If this number "
 "is too large, such as 99, then completely matching jobs will be found. "
 "Otherwise, if this number is small, such as 50, then all work with "
 "minimal similarity will be found."
 msgstr ""

-#: src/codeplag/codeplagcli.py:162
+#: src/codeplag/codeplagcli.py:181
 msgid "The maximum depth of the AST structure which play role in calculations."
 msgstr ""

-#: src/codeplag/codeplagcli.py:169
+#: src/codeplag/codeplagcli.py:189
 msgid ""
 "The length of N-grams generated to calculate the Jakkar coefficient. A "
 "long length of N-grams reduces the Jakkar coefficient because there are "
 "fewer equal sequences of two works."
 msgstr ""

-#: src/codeplag/codeplagcli.py:180
+#: src/codeplag/codeplagcli.py:199
 msgid "The language of help messages, generated reports, errors."
 msgstr ""

-#: src/codeplag/codeplagcli.py:186
+#: src/codeplag/codeplagcli.py:206
 msgid ""
 "Sets the threshold for the '{util_name}' util loggers'. Logging messages "
 "that are less severe than the level will be ignored."
 msgstr ""

-#: src/codeplag/codeplagcli.py:196
+#: src/codeplag/codeplagcli.py:215
 msgid "The maximum number of processes that can be used to compare works."
 msgstr ""

-#: src/codeplag/codeplagcli.py:204
+#: src/codeplag/codeplagcli.py:222
+msgid "The host address of the MongoDB server."
+msgstr ""
+
+#: src/codeplag/codeplagcli.py:228
+msgid "The port of the MongoDB."
+msgstr ""
+
+#: src/codeplag/codeplagcli.py:236
+msgid "The username for connecting to the MongoDB server."
+msgstr ""
+
+#: src/codeplag/codeplagcli.py:242
+msgid "The password for connecting to the MongoDB server. If empty - hide input."
+msgstr ""
+
+#: src/codeplag/codeplagcli.py:251
 msgid "Show the '{util_name}' util settings."
 msgstr ""

-#: src/codeplag/codeplagcli.py:208
+#: src/codeplag/codeplagcli.py:255
 msgid "Start searching similar works."
 msgstr ""

-#: src/codeplag/codeplagcli.py:214
+#: src/codeplag/codeplagcli.py:261
 msgid "Absolute or relative path to a local directories with project files."
 msgstr ""

-#: src/codeplag/codeplagcli.py:224
+#: src/codeplag/codeplagcli.py:271
 msgid "Absolute or relative path to files on a computer."
 msgstr ""

-#: src/codeplag/codeplagcli.py:231
+#: src/codeplag/codeplagcli.py:279
 msgid ""
 "Choose one of the following modes of searching plagiarism. The "
 "'many_to_many' mode may require more free memory."
 msgstr ""

-#: src/codeplag/codeplagcli.py:242
+#: src/codeplag/codeplagcli.py:290
 msgid ""
 "A regular expression for filtering checked works by name. Used with "
 "options 'directories', 'github-user' and 'github-project-folders'."
 msgstr ""

-#: src/codeplag/codeplagcli.py:251
+#: src/codeplag/codeplagcli.py:298
 msgid "Ignore the threshold when checking of works."
 msgstr ""

-#: src/codeplag/codeplagcli.py:258
+#: src/codeplag/codeplagcli.py:305
 msgid "Extension responsible for the analyzed programming language."
 msgstr ""

-#: src/codeplag/codeplagcli.py:268
+#: src/codeplag/codeplagcli.py:315
 msgid "Searching in all branches."
 msgstr ""

-#: src/codeplag/codeplagcli.py:275
+#: src/codeplag/codeplagcli.py:322
 msgid "A regular expression to filter searching repositories on GitHub."
 msgstr ""

-#: src/codeplag/codeplagcli.py:282
+#: src/codeplag/codeplagcli.py:329
 msgid "URL to file in a GitHub repository."
 msgstr ""

-#: src/codeplag/codeplagcli.py:288
+#: src/codeplag/codeplagcli.py:335
 msgid "GitHub organization/user name."
 msgstr ""

-#: src/codeplag/codeplagcli.py:295
+#: src/codeplag/codeplagcli.py:342
 msgid "URL to a GitHub project folder."
 msgstr ""

-#: src/codeplag/codeplagcli.py:305
+#: src/codeplag/codeplagcli.py:353
 msgid ""
 "Handling generated by the {util_name} reports as creating html report "
 "file or show it on console."
 msgstr ""

-#: src/codeplag/codeplagcli.py:313
+#: src/codeplag/codeplagcli.py:360
 msgid "Report commands of the '{util_name}' util."
 msgstr ""

-#: src/codeplag/codeplagcli.py:322
+#: src/codeplag/codeplagcli.py:369
 msgid "Generate general report from created some time ago report files."
 msgstr ""

-#: src/codeplag/codeplagcli.py:327
+#: src/codeplag/codeplagcli.py:375
 msgid ""
 "Path to save generated report. If it's a directory, then create a file in"
 " it."
 msgstr ""

-#: src/codeplag/codeplagcli.py:336
+#: src/codeplag/codeplagcli.py:383
 msgid "Type of the created report file."
 msgstr ""

-#: src/codeplag/codeplagcli.py:344
+#: src/codeplag/codeplagcli.py:392
 msgid ""
 "Path to first compared works. Can be path to directory or URL to the "
 "project folder."
 msgstr ""

-#: src/codeplag/codeplagcli.py:354
+#: src/codeplag/codeplagcli.py:402
 msgid ""
 "Path to second compared works. Can be path to directory or URL to the "
 "project folder."
 msgstr ""

-#: src/codeplag/codeplagcli.py:366
+#: src/codeplag/codeplagcli.py:414
 msgid ""
 "Program help to find similar parts of source codes for the different "
 "languages."
 msgstr ""

-#: src/codeplag/codeplagcli.py:373
+#: src/codeplag/codeplagcli.py:420
 msgid "Print current version number and exit."
 msgstr ""

-#: src/codeplag/codeplagcli.py:379
+#: src/codeplag/codeplagcli.py:426
 msgid "Commands help."
 msgstr ""

-#: src/codeplag/codeplagcli.py:394
+#: src/codeplag/codeplagcli.py:441
 msgid "No command is provided; please choose one from the available (--help)."
 msgstr ""

-#: src/codeplag/codeplagcli.py:405
+#: src/codeplag/codeplagcli.py:452
 msgid "There is nothing to modify; please provide at least one argument."
 msgstr ""

-#: src/codeplag/codeplagcli.py:409
+#: src/codeplag/codeplagcli.py:456
 msgid "The'repo-regexp' option requires the provided 'github-user' option."
 msgstr ""

-#: src/codeplag/codeplagcli.py:417
+#: src/codeplag/codeplagcli.py:465
 msgid ""
 "The'path-regexp' option requires the provided 'directories', 'github-"
 "user', or 'github-project-folder' options."
 msgstr ""

-#: src/codeplag/codeplagcli.py:428 src/codeplag/handlers/report.py:97
+#: src/codeplag/codeplagcli.py:475 src/codeplag/handlers/report.py:97
 msgid "All paths must be provided."
 msgstr ""
