@CarloMariaProietti CarloMariaProietti commented Nov 10, 2025

FIXES #658
This PR makes it possible to compare two DataFrames using the Myers difference algorithm, whose cost is O((M+N)*D).
Here M is the length of dfA, N is the length of dfB, and D is the length of the shortest edit script that turns A into B.
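
To make the cost bound concrete, here is a minimal sketch of the greedy search from the Myers paper that computes only D, the length of the shortest edit script. It is not the code from this PR: the function name `shortestEditDistance` is hypothetical, and it compares plain lists rather than DataFrame rows (assuming rows would be compared by equality in the same way).

```kotlin
// Hypothetical helper, not the PR's implementation: returns D, the number of
// insertions plus deletions in a shortest edit script that turns `a` into `b`.
fun <T> shortestEditDistance(a: List<T>, b: List<T>): Int {
    val sizeA = a.size
    val sizeB = b.size
    val max = sizeA + sizeB
    if (max == 0) return 0
    // furthestX[k + max] is the furthest x reached so far on diagonal k = x - y.
    val furthestX = IntArray(2 * max + 1)
    for (d in 0..max) {
        for (k in -d..d step 2) {
            // Reach diagonal k either from k + 1 (step down: take an element of b)
            // or from k - 1 (step right: drop an element of a), whichever gets further.
            var x = if (k == -d || (k != d && furthestX[k - 1 + max] < furthestX[k + 1 + max])) {
                furthestX[k + 1 + max]
            } else {
                furthestX[k - 1 + max] + 1
            }
            var y = x - k
            // Follow the diagonal "snake": equal elements cost no edits.
            while (x < sizeA && y < sizeB && a[x] == b[y]) {
                x++
                y++
            }
            furthestX[k + max] = x
            if (x >= sizeA && y >= sizeB) return d
        }
    }
    return max // not reached: d = max always suffices
}
```

The outer loop runs D + 1 times and each pass touches at most M + N + 1 diagonals, which is where the O((M+N)*D) bound comes from. For the pair "ABCABBA" and "CBABAC" used in the paper, `shortestEditDistance("ABCABBA".toList(), "CBABAC".toList())` returns 5.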

It returns a DataFrame<ComparisonDescription>, where ComparisonDescription is a schema created specifically for this use case.

A test case is included.

About the Myers difference algorithm:
https://neil.fraser.name/writing/diff/myers.pdf

@CarloMariaProietti CarloMariaProietti marked this pull request as ready for review November 16, 2025 19:05
import org.jetbrains.kotlinx.dataframe.api.emptyDataFrame
import org.jetbrains.kotlinx.dataframe.nrow

internal class ComparisonDescription(
Collaborator

Data schemas are created with @DataSchema, not with : DataRowSchema.

Contributor Author

Thank you for reviewing! I added @DataSchema; however, with the current implementation, : DataRowSchema
is still necessary because of lines 41-42.

Development

Successfully merging this pull request may close these issues.

Comparing two data frame

2 participants