Draft
157 commits
72308ae
first working validate.py
ellepannitto Sep 1, 2025
f1cbee3
add folder to keep logs
ellepannitto Sep 1, 2025
963f022
ignoring actual logs
ellepannitto Sep 1, 2025
8f2b692
add logging utils
ellepannitto Sep 1, 2025
78b8e55
add dotenv to dependencies
ellepannitto Sep 1, 2025
f6ad3f2
add regex and unicodedata as dependencies
ellepannitto Sep 1, 2025
4aaafae
add basic logger
ellepannitto Sep 1, 2025
c6034fc
add args pretty print
ellepannitto Sep 1, 2025
e67205e
move test cases into new tests folder
harisont Sep 1, 2025
e6997da
fix import
harisont Sep 1, 2025
d289ad5
pytest infrastructure
harisont Sep 1, 2025
30aaf44
Merge branch 'infrastructure' into tests
ellepannitto Sep 1, 2025
e1853d7
Merge pull request #120 from UniversalDependencies/tests
ellepannitto Sep 1, 2025
4084968
minor changes
ellepannitto Sep 1, 2025
96b75e6
Merge branch 'infrastructure' of https://github.com/UniversalDependen…
ellepannitto Sep 1, 2025
a21c0c6
semi-auto-generated docs
harisont Sep 1, 2025
06f4255
add files for modularization
ellepannitto Sep 1, 2025
81f7c1a
started refactoring regex
ellepannitto Sep 1, 2025
3e15718
utils module
harisont Sep 1, 2025
840fa2d
micro whitespace changes
harisont Sep 1, 2025
d493fc3
move compiled regex to dedicated module
ellepannitto Sep 1, 2025
4455b3d
WIP tests for utils
harisont Sep 1, 2025
f99265c
minor changes
harisont Sep 1, 2025
47e1de9
Merge pull request #121 from UniversalDependencies/utils
harisont Sep 1, 2025
17ddf0f
Merge branch 'infrastructure' into regex
ellepannitto Sep 1, 2025
3f63f51
Merge pull request #122 from UniversalDependencies/regex
ellepannitto Sep 1, 2025
133020a
use the new crex module
harisont Sep 2, 2025
e51507f
minor comments
harisont Sep 2, 2025
9a4f7d0
Merge pull request #123 from UniversalDependencies/utils
harisont Sep 2, 2025
79d217f
rm outdated notes file
harisont Sep 2, 2025
65981d0
mv messages to dedicated modules
harisont Sep 2, 2025
9840b82
add missing import
harisont Sep 2, 2025
7c661ee
docstrings for messages
harisont Sep 2, 2025
edd13fd
remove useless files
ellepannitto Sep 2, 2025
c488c20
add notes for ludovica
ellepannitto Sep 2, 2025
7758fac
update notes
ellepannitto Sep 2, 2025
bfcf9fc
move loading functions to specific module
ellepannitto Sep 2, 2025
55cd4ab
add option to CLI for data folder
ellepannitto Sep 2, 2025
3826d0e
more tests and docstrings for utils
harisont Sep 2, 2025
80da1ff
todo
harisont Sep 2, 2025
770a090
Merge pull request #124 from UniversalDependencies/utils
ellepannitto Sep 2, 2025
ca3d51f
Merge pull request #125 from UniversalDependencies/loaders
ellepannitto Sep 2, 2025
812ba92
merge infrastructure into msg
harisont Sep 2, 2025
99b1691
merge infrastructure into msg
harisont Sep 2, 2025
24b9e0e
add specs object in validator
ellepannitto Sep 2, 2025
3223029
Merge branch 'infrastructure' of https://github.com/UniversalDependen…
harisont Sep 2, 2025
f209509
make loaders and specs work together with messages
harisont Sep 2, 2025
fd5c063
Merge pull request #126 from UniversalDependencies/msg
harisont Sep 2, 2025
3abff87
comment notes
harisont Sep 2, 2025
da4cf0a
more comment notes
harisont Sep 2, 2025
0f1653f
classes for test functions, errors and warnings
harisont Sep 2, 2025
553f084
working Test class
ellepannitto Sep 2, 2025
d83ad20
merge changes that make test callable
harisont Sep 2, 2025
8a42f4e
change of mind
harisont Sep 2, 2025
f42a0b1
minor change to incident.py
ellepannitto Sep 2, 2025
ff71a1a
Merge branch 'infrastructure' of https://github.com/UniversalDependen…
ellepannitto Sep 2, 2025
7da2d6a
rm tests
harisont Sep 2, 2025
400c22e
merge remove state from incident
harisont Sep 2, 2025
a10d3c3
fix docstring
harisont Sep 2, 2025
79259d5
add test for validate functions
ellepannitto Sep 2, 2025
09dbbdf
notes
ellepannitto Sep 2, 2025
357ae17
add some really dumb helper functions for quick testing
harisont Sep 2, 2025
d5dfd43
rename validator in validator_tmp temporarily
ellepannitto Sep 2, 2025
e9868b4
use futils in test_cases
harisont Sep 2, 2025
2206a3b
WIP reimplementation of validate_token_ranges
harisont Sep 2, 2025
53b63c4
finish v_mwt_token_ranges
harisont Sep 3, 2025
8767f22
no more misleading failed tests
harisont Sep 3, 2025
c9e76fc
Merge branch 'infrastructure' of https://github.com/UniversalDependen…
ellepannitto Sep 3, 2025
0dd84e3
Merge branch 'infrastructure' of https://github.com/UniversalDependen…
harisont Sep 3, 2025
8faa63b
finally done with validate_token_ranges
harisont Sep 3, 2025
169ec91
get rid of futile tests
harisont Sep 3, 2025
4dff58b
refactor validate_id_sequence & add missing test cases
harisont Sep 3, 2025
1501d0c
refactor validate_id_references
harisont Sep 3, 2025
81c40c5
add validate_x: mwt_empty_vals, empty_node_empty_vals, character_cons…
ellepannitto Sep 3, 2025
2f7f34e
refactor validate_tree
harisont Sep 3, 2025
77e034a
add test for multiple roots
harisont Sep 3, 2025
1ec7f3d
add 'validate_deps'
ellepannitto Sep 3, 2025
ec7f4cc
fix testclasses (use enum)
harisont Sep 3, 2025
1a5926c
remove lineno=-1 cause it's the default (might regret that it's the d…
harisont Sep 3, 2025
184259b
start long-term todo list
harisont Sep 3, 2025
b71e39b
add 'validate_misc'
ellepannitto Sep 3, 2025
0c7a581
Merge branch 'block-valfuns' of https://github.com/UniversalDependenc…
harisont Sep 3, 2025
e257950
add 'validate_deps_all_or_none'
ellepannitto Sep 3, 2025
4d1e63d
add 'validate_newlines'
ellepannitto Sep 3, 2025
7b47806
refactor meta tests
harisont Sep 3, 2025
f1c3fc9
fix parameter based on #127
harisont Sep 4, 2025
0ea009f
Merge pull request #128 from UniversalDependencies/meta-valfuns
harisont Sep 4, 2025
2ba49c8
file renaming
ellepannitto Sep 4, 2025
b26437d
Merge branch 'infrastructure' into validate-basics
harisont Sep 4, 2025
7e90141
Merge pull request #129 from UniversalDependencies/validate-basics
harisont Sep 4, 2025
18bc571
attempt to sketch engine and lots of subsequent random stuff. sorry Dan
harisont Sep 4, 2025
581a7e6
fix a few engine and check functions bugs, including schrödinger's li…
harisont Sep 4, 2025
3f8a1f2
example json format
ellepannitto Sep 5, 2025
1210113
infrastructure for json output
ellepannitto Sep 5, 2025
b81d850
todos
ellepannitto Sep 5, 2025
201ac03
tentative config format
harisont Sep 5, 2025
c721cee
Merge branch 'engine' of https://github.com/UniversalDependencies/too…
ellepannitto Sep 5, 2025
2ada0e2
minor fix
ellepannitto Sep 5, 2025
ad182f1
basic infrastructure for reading config
ellepannitto Sep 5, 2025
0908027
Bruno's wish is our command
harisont Sep 5, 2025
16ed511
notes about updated engine functionalities
ellepannitto Sep 5, 2025
7781b4e
rename refactored functions
ellepannitto Sep 5, 2025
a8594b7
minor changes
ellepannitto Sep 5, 2025
021621b
Merge branch 'engine' of https://github.com/UniversalDependencies/too…
ellepannitto Sep 5, 2025
58280d9
mv content of messages.py to output_utils.py, with minor refactoring …
harisont Sep 5, 2025
3324610
add dependencies to configuration
ellepannitto Sep 5, 2025
ca4a35f
update validator logic with dependencies -- not fully tested!
ellepannitto Sep 5, 2025
39af501
basic json output & minor improvements
harisont Sep 5, 2025
03d340f
override str of testclass
harisont Sep 5, 2025
764208a
fix indentation of explain_xxx
harisont Sep 5, 2025
659739f
add 'validate_deprels'
ellepannitto Sep 5, 2025
b0e18b6
file rename
ellepannitto Sep 5, 2025
02fa9e4
explanation index
ellepannitto Sep 5, 2025
51cd6ac
merge validate-tabular into infrastructure
harisont Sep 5, 2025
13c7f26
merge json
harisont Sep 5, 2025
501f9bd
merge config
harisont Sep 5, 2025
b5806a9
Merge branch 'master' of https://github.com/UniversalDependencies/too…
harisont Sep 5, 2025
b31877a
merge validate.py with validator.py
harisont Sep 5, 2025
e166d51
Co-authored-by: Ludovica <[email protected]>
harisont Sep 5, 2025
cedf93b
minor updates to config
harisont Sep 5, 2025
6ae4b80
whitespace changes
harisont Sep 5, 2025
5605ca0
fix post-merge (?) mess
harisont Sep 5, 2025
5e3cccc
rm draft of the draft pr text (added accidentally)
harisont Sep 7, 2025
ff0051c
refactor validate->check_root
harisont Sep 7, 2025
4a1a4ee
fix indent
harisont Sep 7, 2025
271ecff
fix more indentation and forgotten return
harisont Sep 7, 2025
a0b570f
refactor validate->check_enhanced_orphan
harisont Sep 7, 2025
05c6952
refactor validate->check_enhanced_orphan
harisont Sep 7, 2025
9672200
fix docstring
harisont Sep 7, 2025
91d69f8
refactor validate->check_words_with_spaces
harisont Sep 7, 2025
7fbd6b1
rename lcode->lang everywhere in validate for consistency
harisont Sep 7, 2025
fff6258
minor fix(Incident->Error)
harisont Sep 7, 2025
00d76b5
refactor validate->check_features_level4
harisont Sep 7, 2025
a997c57
refactor lv 5 checks
harisont Sep 7, 2025
c27c0bd
WIP validate_annotation and the myriad of functions it calls (top-dow…
harisont Sep 7, 2025
8111834
add table with description of checks
ellepannitto Sep 8, 2025
acdb792
minor fixes
ellepannitto Sep 8, 2025
d845b78
update table with description of checks
ellepannitto Sep 8, 2025
9a1b2ac
minor fixes
ellepannitto Sep 8, 2025
0a45061
add minimal test case
ellepannitto Sep 9, 2025
5f8baa9
add minimal test case
ellepannitto Sep 9, 2025
ba2f89a
update table with description of checks
ellepannitto Sep 9, 2025
22aee51
better error representation
ellepannitto Sep 9, 2025
57fe163
fix check_invalid_lines and check_columns_format
ellepannitto Sep 9, 2025
770e9f6
fix validate behaviour
ellepannitto Sep 9, 2025
700ec8e
add minimal testing scenarios
ellepannitto Sep 9, 2025
52ab7bc
update table with description of checks
ellepannitto Sep 9, 2025
aede17e
done refactoring misplaced-comment
ellepannitto Sep 9, 2025
df789cd
better handling of line number
ellepannitto Sep 9, 2025
6a344aa
tests for pseudo-empty-line and extra-empty-line
ellepannitto Sep 10, 2025
e3ab7b3
add tests for 'unicode-normalization', 'mwt-empty-vals' and 'empty-no…
ellepannitto Sep 11, 2025
ef835c2
finish level 1 tests
ellepannitto Sep 11, 2025
056b225
test for 'check_sent_id' and support for kwargs
ellepannitto Sep 12, 2025
1f89f4b
minor changes
ellepannitto Sep 12, 2025
7117c9a
'check_parallel_id' and 'check_test_meta' + add dataclass for state
ellepannitto Sep 16, 2025
c119cc7
minor fix
ellepannitto Sep 16, 2025
39c61d5
pull master validator for testing purposes
ellepannitto Sep 16, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -55,3 +55,7 @@ docs/_build/

# PyBuilder
target/

# Validator
validator/logs/*.log
validator/logs/*.err
489 changes: 0 additions & 489 deletions test-cases/test.log

This file was deleted.

1,009 changes: 557 additions & 452 deletions validate.py

Large diffs are not rendered by default.

File renamed without changes.
32 changes: 32 additions & 0 deletions validator/docs/checks_table.md

Large diffs are not rendered by default.

48 changes: 48 additions & 0 deletions validator/docs/example_config.yaml
@@ -0,0 +1,48 @@
file:
  check_newlines:
    level: 1

block:
  check_extra_empty_line:
    level: 1
  check_misplaced_comment:
    level: 1
  check_invalid_lines:
    level: 1

line:
  check_unicode_normalization:
    level: 1
  check_pseudo_empty_line:
    level: 1

token_lines:
  check_columns_format:
    level: 1

comment_lines:

cols:
  check_id_sequence:
    level: 1
  check_token_ranges:
    level: 1
  check_tree:
    level: 2
    depends_on:
      - invalid-word-id
      - invalid-word-interval
      - misplaced-word-interval
      - misplaced-empty-node
      - word-id-sequence
      - reversed-word-interval
      - word-interval-out
      - invalid-head
      - unknown-head
      - invalid-deps
      - invalid-ehead
      - unknown-ehead

tree:

node:
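The `depends_on` lists above suggest that a check such as `check_tree` should only run when none of the listed `testid`s has already been reported. The following is a minimal sketch of that gating, not the engine's actual logic: the semantics of `depends_on` are an assumption, `runnable_checks` is a hypothetical helper, and the config is written as an already-parsed dict so that no YAML library is needed.

```python
# Hypothetical, already-parsed excerpt of the "cols" scope from the config above.
COLS_SCOPE = {
    "check_id_sequence": {"level": 1},
    "check_token_ranges": {"level": 1},
    "check_tree": {
        "level": 2,
        "depends_on": ["invalid-word-id", "word-id-sequence", "invalid-head"],
    },
}


def runnable_checks(scope_config, raised_testids):
    """Return the checks of one scope whose dependencies have not fired.

    Assumed semantics: a check is skipped as soon as any testid listed
    under its depends_on key has already been reported.
    """
    raised = set(raised_testids)
    selected = []
    # An empty scope such as "tree:" parses to None in YAML, hence the "or {}".
    for name, options in (scope_config or {}).items():
        blockers = set((options or {}).get("depends_on", []))
        if blockers & raised:
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    print(runnable_checks(COLS_SCOPE, set()))
    print(runnable_checks(COLS_SCOPE, {"invalid-head"}))
```

With no incidents raised, all three checks are selected; once `invalid-head` has been reported, `check_tree` is dropped under this reading of `depends_on`.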
36 changes: 36 additions & 0 deletions validator/docs/example_output.json
@@ -0,0 +1,36 @@
{
    "filename_1.conllu": [
        {
            "level": 1,
            "testclass": "FORMAT",
            "testid": "trailing-whitespace",
            "message": "short error description",
            "sent_id": "3456",
            "line_no": 1245,
            "line": "CONLL-U LINE CONTENT"
        },
        {
            "level": 3,
            "testclass": "MORPHO",
            "testid": "unknown-feature",
            "message": "short error description",
            "sent_id": "234",
            "line_no": 500
        },
        {
            ...
        }

    ],
    "filename_2.conllu": [
        {
            ...
        },
        {
            ...
        },
        {
            ...
        }
    ]
}
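One way to produce a report of this shape is to group incident records by file name and serialize them with the standard `json` module. The sketch below is illustrative only: `ReportedIncident` and `build_report` are hypothetical names, not classes or functions from this PR, and dropping the `line` key when it is unset is an assumption based on the second entry in the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ReportedIncident:
    """Hypothetical record mirroring the keys of the example output."""
    level: int
    testclass: str
    testid: str
    message: str
    sent_id: str
    line_no: int
    line: Optional[str] = None  # raw CoNLL-U line, omitted when unknown


def build_report(incidents_by_file):
    """Map each file name to a list of plain dicts, dropping unset fields."""
    return {
        filename: [
            {key: value for key, value in asdict(incident).items() if value is not None}
            for incident in incidents
        ]
        for filename, incidents in incidents_by_file.items()
    }


if __name__ == "__main__":
    incidents = {
        "filename_1.conllu": [
            ReportedIncident(1, "FORMAT", "trailing-whitespace",
                             "short error description", "3456", 1245,
                             "CONLL-U LINE CONTENT"),
            ReportedIncident(3, "MORPHO", "unknown-feature",
                             "short error description", "234", 500),
        ]
    }
    print(json.dumps(build_report(incidents), indent=4))
```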
29 changes: 29 additions & 0 deletions validator/docs/example_working.yaml
@@ -0,0 +1,29 @@
file:
  check_newlines:
    level: 1

block:
  check_extra_empty_line:
    level: 1
  check_misplaced_comment:
    level: 1
  check_invalid_lines:
    level: 1

line:
  check_unicode_normalization:
    level: 1
  check_pseudo_empty_line:
    level: 1

token_lines:
  check_columns_format:
    level: 1

comment_lines:

cols:

tree:

node:
2 changes: 2 additions & 0 deletions validator/docs/explanations.txt
@@ -0,0 +1,2 @@
explain_deprel > unknown-deprel
explain_edeprel > unknown-edeprel
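Each line of this index appears to pair an explanation function with the `testid` it explains, separated by `>`. A minimal sketch of reading the file into a lookup table follows; `parse_explanations` is a hypothetical helper, and keying the table by `testid` is an assumption about how the index would be used.

```python
def parse_explanations(text):
    """Parse 'function > testid' lines into a {testid: function} dict."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines
        function_name, separator, testid = line.partition(">")
        if not separator:
            continue  # ignore lines that do not follow the 'a > b' shape
        mapping[testid.strip()] = function_name.strip()
    return mapping


if __name__ == "__main__":
    index = "explain_deprel > unknown-deprel\nexplain_edeprel > unknown-edeprel\n"
    print(parse_explanations(index))
```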
153 changes: 153 additions & 0 deletions validator/docs/extra.py
@@ -0,0 +1,153 @@

# def load_set(f_name_ud, lang, validate_langspec=False, validate_enhanced=False):
#     """
#     Loads a list of values from the two files, and returns their
#     set. If lang doesn't exist, loads nothing and returns
#     None (ie this taglist is not checked for the given language). If lang
#     is None, only loads the UD one. This is probably only useful for CPOS which doesn't
#     allow language-specific extensions. Set validate_langspec=True when loading basic dependencies.
#     That way the language specific deps will be checked to be truly extensions of UD ones.
#     Set validate_enhanced=True when loading enhanced dependencies. They will be checked to be
#     truly extensions of universal relations, too; but a more relaxed regular expression will
#     be checked because enhanced relations may contain stuff that is forbidden in the basic ones.
#     """
#     res = load_file(os.path.join(g.THISDIR, 'data', f_name_ud))
#     # Now res holds UD
#     # Next load and optionally check the langspec extensions
#     if lang is not None and lang != f_name_ud:
#         l_spec = load_file(os.path.join(g.THISDIR,"data","tokens_w_space.json"), lang)
#         for v in l_spec:
#             if validate_enhanced:
#                 # We are reading the list of language-specific dependency relations in the enhanced representation
#                 # (i.e., the DEPS column, not DEPREL). Make sure that they match the regular expression that
#                 # restricts enhanced dependencies.
#                 if not g.edeprel_re.match(v):
#                     testlevel = 4
#                     testclass = 'Enhanced'
#                     testid = 'edeprel-def-regex'
#                     testmessage = f"Spurious language-specific enhanced relation '{v}' - it does not match the regular expression that restricts enhanced relations."
#                     warn(testmessage, testclass, testlevel, testid, lineno=-1)
#                     continue
#             elif validate_langspec:
#                 # We are reading the list of language-specific dependency relations in the basic representation
#                 # (i.e., the DEPREL column, not DEPS). Make sure that they match the regular expression that
#                 # restricts basic dependencies. (In particular, that they do not contain extensions allowed in
#                 # enhanced dependencies, which should be listed in a separate file.)
#                 if not re.match(r"^[a-z]+(:[a-z]+)?$", v):
#                     testlevel = 4
#                     testclass = 'Syntax'
#                     testid = 'deprel-def-regex'
#                     testmessage = f"Spurious language-specific relation '{v}' - in basic UD, it must match '^[a-z]+(:[a-z]+)?'."
#                     warn(testmessage, testclass, testlevel, testid, lineno=-1)
#                     continue
#             if validate_langspec or validate_enhanced:
#                 try:
#                     parts=v.split(':')
#                     if parts[0] not in res and parts[0] != 'ref':
#                         testlevel = 4
#                         testclass = 'Syntax'
#                         testid = 'deprel-def-universal-part'
#                         testmessage = f"Spurious language-specific relation '{v}' - not an extension of any UD relation."
#                         warn(testmessage, testclass, testlevel, testid, lineno=-1)
#                         continue
#                 except:
#                     testlevel = 4
#                     testclass = 'Syntax'
#                     testid = 'deprel-def-universal-part'
#                     testmessage = f"Spurious language-specific relation '{v}' - not an extension of any UD relation."
#                     warn(testmessage, testclass, testlevel, testid, lineno=-1)
#                     continue
#             res.add(v)
#     return res



# def load_feat_set(filename_langspec, lcode):
#     """
#     Loads the list of permitted feature-value pairs and returns it as a set.
#     """
#     with open(os.path.join(g.THISDIR, 'data', filename_langspec), 'r', encoding='utf-8') as f:
#         all_features_0 = json.load(f)
#     g.featdata = all_features_0['features']
#     featset = get_featdata_for_language(lcode)
#     # Prepare a global message about permitted features and values. We will add
#     # it to the first error message about an unknown feature. Note that this
#     # global information pertains to the default validation language and it
#     # should not be used with code-switched segments in alternative languages.
#     msg = ''
#     if not lcode in g.featdata:
#         msg += f"No feature-value pairs have been permitted for language [{lcode}].\n"
#         msg += "They can be permitted at the address below (if the language has an ISO code and is registered with UD):\n"
#         msg += "https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_feature.pl\n"
#         g.warn_on_undoc_feats = msg
#     else:
#         # Identify feature values that are permitted in the current language.
#         for f in featset:
#             for e in featset[f]['errors']:
#                 msg += f"ERROR in _{lcode}/feat/{f}.md: {e}\n"
#         res = set()
#         for f in featset:
#             if featset[f]['permitted'] > 0:
#                 for v in featset[f]['uvalues']:
#                     res.add(f+'='+v)
#                 for v in featset[f]['lvalues']:
#                     res.add(f+'='+v)
#         sorted_documented_features = sorted(res)
#         msg += f"The following {len(sorted_documented_features)} feature values are currently permitted in language [{lcode}]:\n"
#         msg += ', '.join(sorted_documented_features) + "\n"
#         msg += "If a language needs a feature that is not documented in the universal guidelines, the feature must\n"
#         msg += "have a language-specific documentation page in a prescribed format.\n"
#         msg += "See https://universaldependencies.org/contributing_language_specific.html for further guidelines.\n"
#         msg += "All features including universal must be specifically turned on for each language in which they are used.\n"
#         msg += "See https://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/langspec/specify_feature.pl for details.\n"
#         # Store on the shared globals module, as in the branch above (a bare local assignment would be lost).
#         g.warn_on_undoc_feats = msg
#     return featset

# def get_featdata_for_language(lcode):
#     """
#     Searches the previously loaded database of feature-value combinations.
#     Returns the lists for a given language code. For most CoNLL-U files,
#     this function is called only once at the beginning. However, some files
#     contain code-switched data and we may temporarily need to validate
#     another language.
#     """
#     ###!!! If lcode is 'ud', we should permit all universal feature-value pairs,
#     ###!!! regardless of language-specific documentation.
#     # Do not crash if the user asks for an unknown language.
#     if not lcode in g.featdata:
#         return {} ###!!! or None?
#     return g.featdata[lcode]

# def get_auxdata_for_language(lcode):
#     """
#     Searches the previously loaded database of auxiliary/copula lemmas. Returns
#     the AUX and COP lists for a given language code. For most CoNLL-U files,
#     this function is called only once at the beginning. However, some files
#     contain code-switched data and we may temporarily need to validate
#     another language.
#     """
#     auxdata = g.auxdata
#     # If any of the functions of the lemma is other than cop.PRON, it counts as an auxiliary.
#     # If any of the functions of the lemma is cop.*, it counts as a copula.
#     auxlist = []
#     coplist = []
#     if lcode == 'shopen':
#         for lcode1 in auxdata.keys():
#             lemmalist = auxdata[lcode1].keys()
#             auxlist = auxlist + [x for x in lemmalist
#                                  if len([y for y in auxdata[lcode1][x]['functions']
#                                          if y['function'] != 'cop.PRON']) > 0]
#             coplist = coplist + [x for x in lemmalist
#                                  if len([y for y in auxdata[lcode1][x]['functions']
#                                          if re.match(r"^cop\.", y['function'])]) > 0]
#     else:
#         lemmalist = auxdata.get(lcode, {}).keys()
#         auxlist = [x for x in lemmalist
#                    if len([y for y in auxdata[lcode][x]['functions']
#                            if y['function'] != 'cop.PRON']) > 0]
#         coplist = [x for x in lemmalist
#                    if len([y for y in auxdata[lcode][x]['functions']
#                            if re.match(r"^cop\.", y['function'])]) > 0]
#     return auxlist, coplist


46 changes: 46 additions & 0 deletions validator/docs/full-validation.yaml
@@ -0,0 +1,46 @@
file:
  check_newlines:
    level: 1

block:
  check_extra_empty_line:
    level: 1
  check_misplaced_comment:
    level: 1

line:
  check_invalid_lines:
    level: 1
  check_columns_format:
    level: 1
    depends_on:
      - invalid-line
  check_pseudo_empty_line:
    level: 1
  check_unicode_normalization:
    level: 1


token_lines:

comment_lines:
  check_sent_id:
    level: 2

cols:
  check_mwt_empty_vals:
    level: 2
  check_empty_node_empty_vals:
    level: 2

tokens_cols:
  check_id_sequence:
    level: 1
  check_token_ranges:
    level: 1
    depends_on:
      - invalid-word-id

tree:

node:
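Since every check in this configuration carries a `level`, a validation run limited to a given level could select its checks by walking the scopes in order. The sketch below only illustrates that idea: `checks_up_to_level` is a hypothetical helper, the config is an already-parsed excerpt written as a plain dict, and treating a missing `level` as 1 is an assumption.

```python
# Hypothetical, already-parsed excerpt of the configuration above.
CONFIG_EXCERPT = {
    "file": {"check_newlines": {"level": 1}},
    "comment_lines": {"check_sent_id": {"level": 2}},
    "cols": {"check_mwt_empty_vals": {"level": 2}},
    "tree": None,  # an empty scope parses to None in YAML
}


def checks_up_to_level(config, max_level):
    """Return (scope, check) pairs whose level does not exceed max_level."""
    selected = []
    for scope, checks in config.items():
        for name, options in (checks or {}).items():
            # Assumption: a check without an explicit level counts as level 1.
            level = (options or {}).get("level", 1)
            if level <= max_level:
                selected.append((scope, name))
    return selected


if __name__ == "__main__":
    print(checks_up_to_level(CONFIG_EXCERPT, 1))
    print(checks_up_to_level(CONFIG_EXCERPT, 2))
```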
24 changes: 24 additions & 0 deletions validator/docs/invalid-line.yaml
@@ -0,0 +1,24 @@
file:

block:
  check_misplaced_comment:
    level: 1

line:
  check_invalid_lines:
    level: 1
  check_columns_format:
    level: 1
    depends_on:
      - 'invalid-line'

token_lines:

comment_lines:

cols:


tree:

node:
3 changes: 3 additions & 0 deletions validator/docs/long_term_todo.md
@@ -0,0 +1,3 @@
- organize test cases so that they match `testid`s
- add a metadata comment to each test case stating which tests should fail
- try to have a 1:1 mapping between test functions (`validate_xxx`) and `testid`s/incidents, or at least modularize test functions further (e.g. `validate_tree` is conceptually composed of 3 tests)