
Add new statistics #683

Open
karinaA7 wants to merge 1 commit into Desbordante:main from karinaA7:add-new-statistics

Conversation

@karinaA7 karinaA7 commented Feb 25, 2026

New statistics:

whitespaceOnlyCount - the number of rows that consist only of whitespace characters (space and tab)

firstCharFrequency / lastCharFrequency - the most common first/last character among the column's rows and the number of its occurrences

leadingWhitespaceCount / trailingWhitespaceCount - the number of rows that begin or end with whitespace

specialCharsCount - the number of rows containing special characters
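
To make the whitespaceOnlyCount definition concrete, here is a minimal sketch outside of DataStats. The function name and the use of std::vector<std::string> as the column type are assumptions made for illustration; the PR itself works on mo::TypedColumnData.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts rows that are non-empty and contain nothing but spaces and tabs.
// std::vector<std::string> is a hypothetical stand-in for the column type.
inline std::size_t WhitespaceOnlyCount(std::vector<std::string> const& rows) {
    std::size_t count = 0;
    for (std::string const& row : rows) {
        if (row.empty()) continue;  // empty values are skipped
        // find_first_not_of returns npos when every character is a space or tab
        if (row.find_first_not_of(" \t") == std::string::npos) ++count;
    }
    return count;
}
```

With the rows "  ", "\t ", " a" and "b", only the first two would be counted.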

Collaborator

xJoskiy commented Feb 26, 2026

Since you've added new statistic methods to DataStats, they are now printed whenever data_stats.get_all_statistics_as_string() is called in data_stats.py. To pass CI, copy the output of data_stats.py into the corresponding snapshot in snap_test_examples_pytest.py.

Comment on lines +959 to +969

for (char c : str) {
if (special_chars.find(c) != std::string::npos) {
has_special = true;
break;
}
}

if (has_special) {
count++;
}
Collaborator

First, the has_special flag is unnecessary:

Suggested change
for (char c : str) {
if (special_chars.find(c) != std::string::npos) {
has_special = true;
break;
}
}
if (has_special) {
count++;
}
for (char c : str) {
if (special_chars.find(c) != std::string::npos) {
count++;
break;
}
}

Collaborator

Also, finding a special character in a string takes O(N * M) time, where N is the string length and M is the number of special characters. You could store a 256-element array (one entry per possible value of char) and check whether the current character of the string is marked in it.

Suggested change
for (char c : str) {
if (special_chars.find(c) != std::string::npos) {
has_special = true;
break;
}
}
if (has_special) {
count++;
}
static std::array<bool, 256> map = {};
for (char c : special_chars) {
map[static_cast<unsigned char>(c)] = true;
}
for (char c : str) {
if (map[static_cast<unsigned char>(c)]) {
count++;
break;
}
}
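
The lookup-table idea above can be sketched as a standalone pair of functions. The names MakeSpecialTable and CountRowsWithSpecial, and the std::vector<std::string> column, are hypothetical and chosen only for this example. Building the table once brings the per-row cost down to O(N).

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Builds a 256-entry table: table[b] is true when byte b is a special character.
inline std::array<bool, 256> MakeSpecialTable(std::string_view special_chars) {
    std::array<bool, 256> table{};  // value-initialized, so every entry starts as false
    for (char c : special_chars) {
        table[static_cast<unsigned char>(c)] = true;
    }
    return table;
}

// Counts rows that contain at least one special character.
inline std::size_t CountRowsWithSpecial(std::vector<std::string> const& rows,
                                        std::string_view special_chars) {
    std::array<bool, 256> const table = MakeSpecialTable(special_chars);  // built once
    std::size_t count = 0;
    for (std::string const& row : rows) {
        for (char c : row) {
            if (table[static_cast<unsigned char>(c)]) {
                ++count;
                break;  // one hit is enough for this row
            }
        }
    }
    return count;
}
```

The cast to unsigned char matters: indexing with a plain char could produce a negative index on platforms where char is signed.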

Collaborator

@xJoskiy xJoskiy Feb 28, 2026

Non-ASCII Unicode symbols won't affect the special-character count, since they start after 0x7F.
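
A small check can illustrate this point, assuming the column holds UTF-8 text (the thread does not state the encoding). In UTF-8 every byte of a non-ASCII character is above 0x7F, so none of them can hit a table entry for an ASCII special character. The helper name AllBytesAboveAscii is hypothetical.

```cpp
#include <algorithm>
#include <string_view>

// True when every byte of s lies outside the ASCII range (that is, above 0x7F).
inline bool AllBytesAboveAscii(std::string_view s) {
    return std::all_of(s.begin(), s.end(), [](char c) {
        return static_cast<unsigned char>(c) > 0x7F;
    });
}
```

For example, the UTF-8 encoding of "é" is the two bytes 0xC3 0xA9, both above 0x7F.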


mo::TypedColumnData const& col = col_data_[index];
if (col.GetTypeId() != +mo::TypeId::kString) return {};
std::string const special_chars = "@#$%^&!?*_+=~'-\"";
Collaborator

Suggested change
std::string const special_chars = "@#$%^&!?*_+=~'-\"";
static const auto constexpr special_chars = "@#$%^&!?*_+=~'-\"";

char most_frequent = '\0';
size_t max_count = 0;

for (auto const& pair : freq_map) {
Collaborator

Suggested change
for (auto const& pair : freq_map) {
for (auto [c, freq] : freq_map) {

return Statistic(res, &int_type, false);
}

Statistic DataStats::GetFirstCharFrequency(size_t index) const {
Collaborator

GetFirstCharFrequency and GetLastCharFrequency are nearly identical; it would be better to move the common part into a separate function.
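
One possible shape for that shared part is a single function that takes the position as a parameter. The sketch below reuses the CharPosition enum that appears later in this PR; the function name CharFrequencies, the std::map result, and the std::vector<std::string> column are assumptions for illustration only.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class CharPosition { kFirst, kLast };

// Counts how often each character occurs at the given position across non-empty rows.
inline std::map<char, std::size_t> CharFrequencies(std::vector<std::string> const& rows,
                                                   CharPosition pos) {
    std::map<char, std::size_t> freq;
    for (std::string const& row : rows) {
        if (row.empty()) continue;  // front()/back() are undefined on empty strings
        char const c = (pos == CharPosition::kFirst) ? row.front() : row.back();
        ++freq[c];
    }
    return freq;
}
```

The two public getters would then differ only in the CharPosition value they pass.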

@karinaA7 karinaA7 force-pushed the add-new-statistics branch 2 times, most recently from 9dab0de to 742d809 Compare February 28, 2026 10:21
@github-actions github-actions bot left a comment

clang-tidy made some suggestions


mo::TypedColumnData const& col = col_data_[index];
if (col.GetTypeId() != +mo::TypeId::kString) return {};
static constexpr std::string_view const special_chars = "@#$%^&!?*_+=~'-\"";

warning: invalid case style for static constant 'special_chars' [readability-identifier-naming]

Suggested change
static constexpr std::string_view const special_chars = "@#$%^&!?*_+=~'-\"";
static constexpr std::string_view const kSpecialChars = "@#$%^&!?*_+=~'-\"";

src/core/algorithms/statistics/data_stats.cpp:959:

-         for (char c : special_chars) {
+         for (char c : kSpecialChars) {

if (col.IsNullOrEmpty(i)) continue;

auto const& str = mo::Type::GetValue<std::string>(col.GetValue(i));
bool has_special = false;

warning: unused variable 'has_special' [clang-diagnostic-unused-variable]

        bool has_special = false;
             ^

@github-actions github-actions bot left a comment

clang-tidy made some suggestions


mo::TypedColumnData const& col = col_data_[index];
if (col.GetTypeId() != +mo::TypeId::kString) return {};
static constexpr std::string_view const special_chars = "@#$%^&!?*_+=~'-\"";

warning: invalid case style for static constant 'special_chars' [readability-identifier-naming]

Suggested change
static constexpr std::string_view const special_chars = "@#$%^&!?*_+=~'-\"";
static constexpr std::string_view const kSpecialChars = "@#$%^&!?*_+=~'-\"";

src/core/algorithms/statistics/data_stats.cpp:958:

-         for (char c : special_chars) {
+         for (char c : kSpecialChars) {

Comment on lines +958 to +961
static std::array<bool, 256> map = {0};
for (char c : special_chars) {
map[static_cast<unsigned char>(c)] = true;
}
Collaborator

@xJoskiy xJoskiy Feb 28, 2026

It would be better to move this outside the loop so that it is not initialized on each iteration. The initialization can also be made constexpr with a lambda function.

Suggested change
static std::array<bool, 256> map = {0};
for (char c : special_chars) {
map[static_cast<unsigned char>(c)] = true;
}
static constexpr std::array<bool, 256> map = []() constexpr {
std::array<bool, 256> map = {0};
for (char c : special_chars) {
map[static_cast<unsigned char>(c)] = true;
}
return map;
}();
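
As a self-contained illustration of the constexpr-lambda initialization, the sketch below builds the table at compile time. It assumes C++17 or later; the names kSpecialChars and kIsSpecial and the namespace-scope placement are choices made for this example, not necessarily what the PR ends up with.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

inline constexpr std::string_view kSpecialChars = "@#$%^&!?*_+=~'-\"";

// Built once at compile time by an immediately invoked constexpr lambda.
// No capture is needed because kSpecialChars has static storage duration.
inline constexpr std::array<bool, 256> kIsSpecial = []() constexpr {
    std::array<bool, 256> table{};  // every entry starts as false
    for (char c : kSpecialChars) {
        table[static_cast<unsigned char>(c)] = true;
    }
    return table;
}();

// Compile-time checks that the table was filled as intended.
static_assert(kIsSpecial[static_cast<unsigned char>('@')]);
static_assert(!kIsSpecial[static_cast<unsigned char>('a')]);
```

Because the table is a constant, nothing is recomputed per row or per call.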

Comment on lines +1000 to +1005
for (auto const& [c, freq] : freq_map) {
if (freq > max_count) {
max_count = freq;
most_frequent = c;
}
}
Collaborator

@xJoskiy xJoskiy Feb 28, 2026

Suggested change
for (auto const& [c, freq] : freq_map) {
if (freq > max_count) {
max_count = freq;
most_frequent = c;
}
}
auto const& [c, max_freq] = *std::max_element(freq_map.begin(), freq_map.end(), [](const auto& lhs, const auto& rhs){ return lhs.second < rhs.second; });

Also handle the empty-container case so that the end() iterator is never dereferenced.

Collaborator

Also, we might want to pick the lexicographically smallest character when frequencies are equal. This can be done with std::tie.

Suggested change
for (auto const& [c, freq] : freq_map) {
if (freq > max_count) {
max_count = freq;
most_frequent = c;
}
}
auto const& [c, max_freq] = *std::max_element(freq_map.begin(), freq_map.end(), [](const auto& lhs, const auto& rhs) {
// frequencies are compared first; on a tie the smaller character ranks higher (note the swapped character operands)
return std::tie(lhs.second, rhs.first) < std::tie(rhs.second, lhs.first);
});
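
Putting the two remarks together (guarding against an empty container and breaking ties toward the smaller character), a hedged standalone sketch might look like this. The function name MostFrequent, the std::optional return type, and the assumption that freq_map is a std::map<char, std::size_t> are illustrative; the PR may use a different container.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <tuple>
#include <utility>

// Returns the most frequent character and its count, or std::nullopt for an empty map.
// When counts are equal, the smaller character wins.
inline std::optional<std::pair<char, std::size_t>> MostFrequent(
        std::map<char, std::size_t> const& freq_map) {
    if (freq_map.empty()) return std::nullopt;  // never dereference end()
    auto const it = std::max_element(
            freq_map.begin(), freq_map.end(), [](auto const& lhs, auto const& rhs) {
                // A higher count ranks higher. On a tie the smaller character ranks
                // higher, which is why the character operands are swapped.
                return std::tie(lhs.second, rhs.first) < std::tie(rhs.second, lhs.first);
            });
    return std::make_pair(it->first, it->second);
}
```

Swapping the character operands is what turns "largest on ties" into "smallest on ties" while still using std::max_element.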

@xJoskiy xJoskiy self-assigned this Feb 28, 2026
Comment on lines +964 to +969
for (char c : str) {
if (map[static_cast<unsigned char>(c)]) {
count++;
break;
}
}
Collaborator

Consider std::any_of here.
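
For reference, the loop with a break can be expressed with std::any_of roughly as follows. The helper name HasSpecial and passing the table as a parameter are assumptions for this sketch.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Returns true when str contains at least one byte marked in the table.
inline bool HasSpecial(std::string const& str, std::array<bool, 256> const& table) {
    return std::any_of(str.begin(), str.end(), [&table](char c) {
        return table[static_cast<unsigned char>(c)];
    });
}
```

std::any_of stops at the first match, so it keeps the early-exit behavior of the original loop.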

Comment on lines +881 to +886
for (char c : str) {
if (c != ' ' && c != '\t') {
only_space_or_tab = false;
break;
}
}
Collaborator

Why didn't you use std::isspace?
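
Note that the two checks are not equivalent: in the default "C" locale std::isspace also accepts newline, carriage return, vertical tab and form feed, so switching to it widens the statistic beyond the "space and tab" definition given in the PR description. A minimal comparison, with hypothetical helper names:

```cpp
#include <cctype>

// The original check: only space and horizontal tab.
inline bool MatchesSpaceOrTab(char c) {
    return c == ' ' || c == '\t';
}

// The std::isspace check; the cast avoids undefined behavior for negative char values.
inline bool MatchesIsspace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

Both agree on ' ' and '\t', but only MatchesIsspace accepts '\n'.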

if (col.IsNullOrEmpty(i)) continue;

auto const& str = mo::Type::GetValue<std::string>(col.GetValue(i));
static std::array<bool, 256> map = []() constexpr {
Collaborator

@xJoskiy xJoskiy Feb 28, 2026

Make it constexpr, and also move it outside of the loop.

@karinaA7 karinaA7 force-pushed the add-new-statistics branch from c5b7a1b to 03a3e42 Compare March 2, 2026 15:36
Comment on lines +881 to +886
for (char c : str) {
if (!std::isspace(static_cast<unsigned char>(c))) {
only_space_or_tab = false;
break;
}
}
Collaborator

Suggested change
for (char c : str) {
if (!std::isspace(static_cast<unsigned char>(c))) {
only_space_or_tab = false;
break;
}
}
if (str.empty() or std::none_of(...))
count++;

Author

The test doesn't pass with none_of; I suggest leaving it as is.
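
The thread does not say why the test failed. One possible pitfall, offered here only as a guess, is the predicate's polarity: "whitespace-only" corresponds to all_of with the whitespace predicate, or to none_of with its negation, whereas none_of with the un-negated predicate answers a different question. The helper names below are hypothetical, and empty strings are excluded on the assumption that they are skipped earlier, as in the PR.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

inline bool IsSpace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Whitespace-only: every character is whitespace.
inline bool WhitespaceOnlyAllOf(std::string const& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(), IsSpace);
}

// The same condition via none_of, with the predicate negated.
inline bool WhitespaceOnlyNoneOf(std::string const& s) {
    return !s.empty() &&
           std::none_of(s.begin(), s.end(), [](char c) { return !IsSpace(c); });
}

// none_of with the un-negated predicate means "contains no whitespace at all".
inline bool HasNoWhitespace(std::string const& s) {
    return std::none_of(s.begin(), s.end(), IsSpace);
}
```

If the failing version passed std::isspace directly to std::none_of, it would have counted the rows without any whitespace instead.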

Comment on lines +898 to +942
Statistic DataStats::GetLeadingWhitespaceCount(size_t index) const {
if (all_stats_[index].leading_whitespace_count.HasValue())
return all_stats_[index].leading_whitespace_count;

mo::TypedColumnData const& col = col_data_[index];
if (col.GetTypeId() != +mo::TypeId::kString) return {};

size_t count = 0;

for (size_t i = 0; i < col.GetNumRows(); i++) {
if (col.IsNullOrEmpty(i)) continue;

auto const& str = mo::Type::GetValue<std::string>(col.GetValue(i));
if (!str.empty() && std::isspace(static_cast<unsigned char>(str[0]))) {
count++;
}
}

mo::IntType int_type;
std::byte const* res = int_type.MakeValue(count);
return Statistic(res, &int_type, false);
}

Statistic DataStats::GetTrailingWhitespaceCount(size_t index) const {
if (all_stats_[index].trailing_whitespace_count.HasValue())
return all_stats_[index].trailing_whitespace_count;

mo::TypedColumnData const& col = col_data_[index];
if (col.GetTypeId() != +mo::TypeId::kString) return {};

size_t count = 0;

for (size_t i = 0; i < col.GetNumRows(); i++) {
if (col.IsNullOrEmpty(i)) continue;

auto const& str = mo::Type::GetValue<std::string>(col.GetValue(i));
if (!str.empty() && std::isspace(static_cast<unsigned char>(str.back()))) {
count++;
}
}

mo::IntType int_type;
std::byte const* res = int_type.MakeValue(count);
return Statistic(res, &int_type, false);
}
Collaborator

These two functions are almost identical.
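
A sketch of how the duplication could be folded into one helper, in the spirit of the GetWhitespaceCount(index, CharPosition) call that appears in a later revision of this PR. The free-function form, the names, and the std::vector<std::string> column are assumptions for illustration; caching in all_stats_ is left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class CharPosition { kFirst, kLast };

// Counts non-empty rows whose first or last character is whitespace.
inline std::size_t WhitespaceCount(std::vector<std::string> const& rows, CharPosition pos) {
    std::size_t count = 0;
    for (std::string const& row : rows) {
        if (row.empty()) continue;
        char const c = (pos == CharPosition::kFirst) ? row.front() : row.back();
        if (std::isspace(static_cast<unsigned char>(c))) ++count;
    }
    return count;
}

// Thin wrappers keep the two public entry points.
inline std::size_t LeadingWhitespaceCount(std::vector<std::string> const& rows) {
    return WhitespaceCount(rows, CharPosition::kFirst);
}

inline std::size_t TrailingWhitespaceCount(std::vector<std::string> const& rows) {
    return WhitespaceCount(rows, CharPosition::kLast);
}
```

Only the choice between front() and back() differs, so everything else lives in one place.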

static constexpr std::string_view const kSpecialChars = "@#$%^&!?*_+=~'-\"";
size_t count = 0;

std::array<bool, 256> map = []() constexpr {
Collaborator

Suggested change
std::array<bool, 256> map = []() constexpr {
static constexpr std::array<bool, 256> map = []() constexpr {

@karinaA7 karinaA7 force-pushed the add-new-statistics branch from 03a3e42 to b52df00 Compare March 3, 2026 16:07
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

static constexpr std::string_view const kSpecialChars = "@#$%^&!?*_+=~'-\"";
size_t count = 0;

static constexpr std::array<bool, 256> map = []() constexpr {

warning: invalid case style for static constant 'map' [readability-identifier-naming]

Suggested change
static constexpr std::array<bool, 256> map = []() constexpr {
static constexpr std::array<bool, 256> kMap = []() constexpr {

src/core/algorithms/statistics/data_stats.cpp:960:

-                         [](char c) { return map[static_cast<unsigned char>(c)]; })) {
+                         [](char c) { return kMap[static_cast<unsigned char>(c)]; })) {

// Returns the most frequent last character.
Statistic GetLastCharFrequency(size_t index) const;
enum class CharPosition { kFirst, kLast };
enum class WhitespacePosition { Leading, Trailing };

warning: invalid case style for enum constant 'Leading' [readability-identifier-naming]

Suggested change
enum class WhitespacePosition { Leading, Trailing };
enum class WhitespacePosition { kLeading, Trailing };

// Returns the most frequent last character.
Statistic GetLastCharFrequency(size_t index) const;
enum class CharPosition { kFirst, kLast };
enum class WhitespacePosition { Leading, Trailing };

warning: invalid case style for enum constant 'Trailing' [readability-identifier-naming]

Suggested change
enum class WhitespacePosition { Leading, Trailing };
enum class WhitespacePosition { Leading, kTrailing };

@karinaA7 karinaA7 force-pushed the add-new-statistics branch from b52df00 to fe1f418 Compare March 3, 2026 16:27
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

static constexpr std::string_view const kSpecialChars = "@#$%^&!?*_+=~'-\"";
size_t count = 0;

static constexpr std::array<bool, 256> kmap = []() constexpr {

warning: invalid case style for static constant 'kmap' [readability-identifier-naming]

Suggested change
static constexpr std::array<bool, 256> kmap = []() constexpr {
static constexpr std::array<bool, 256> kMap = []() constexpr {

src/core/algorithms/statistics/data_stats.cpp:960:

-                         [](char c) { return kmap[static_cast<unsigned char>(c)]; })) {
+                         [](char c) { return kMap[static_cast<unsigned char>(c)]; })) {

@karinaA7 karinaA7 force-pushed the add-new-statistics branch 2 times, most recently from 9237253 to d848454 Compare March 3, 2026 20:33
Comment on lines +181 to +182
enum class CharPosition { kFirst, kLast };
enum class WhitespacePosition { kLeading, kTrailing };
Collaborator

It seems these two enums serve the same purpose. You could keep only the CharPosition enum and move it higher up, closer to the DataStats() ctor, rather than leaving it in the middle of the method declarations.

@karinaA7 karinaA7 force-pushed the add-new-statistics branch from d848454 to 0e8419f Compare March 5, 2026 11:07
return GetWhitespaceCount(index, CharPosition::kLast);
}

Statistic DataStats::GetSpecialCharsCount(size_t index) const {
Collaborator

The name does not reflect the actual purpose of this function. It would be better to name it something like GetNumberOfRowsWithSpecialChars.

Comment on lines +923 to +927
Statistic DataStats::GetLeadingWhitespaceCount(size_t index) const {
return GetWhitespaceCount(index, CharPosition::kFirst);
}

Statistic DataStats::GetTrailingWhitespaceCount(size_t index) const {
Collaborator

@xJoskiy xJoskiy Mar 7, 2026

The same applies to these functions, e.g. GetNumberOfRowsWithTrailingWhitespaces.


2 participants