Skip to content

[FIX] Warning for discrete variable with >100 values in OWFile#4120

Merged
janezd merged 2 commits intobiolab:masterfrom
PrimozGodec:limit-discrete-variable
Oct 24, 2019
Merged

[FIX] Warning for discrete variable with >100 values in OWFile#4120
janezd merged 2 commits intobiolab:masterfrom
PrimozGodec:limit-discrete-variable

Conversation

@PrimozGodec
Copy link
Contributor

@PrimozGodec PrimozGodec commented Oct 18, 2019

Issue

Fixes #4094

Description of changes

As @janezd suggested the variable with more than 100 different values should not be DiscreteVariable. This PR:

  • changes guessing mechanism such that the variable with more than 100 different values is not a StringVariable
  • add tests for guess_data_type function
  • OWFile warns the user who set the variable with more than 100 different values as a DiscreteVariable
Includes
  • Code changes
  • Tests
  • Documentation

@janezd
Copy link
Contributor

janezd commented Oct 19, 2019

As you noted, there are still many ways to force variables to have >100 values. Having a warning in DiscreteVariable.__init__ doesn't do any good (nobody sees it), and raising an exception would be worse (nobody would catch it).

I wrote a comment in #4094 - maybe the file widget would just give a warning?

Automated detection should of course recognize such variables as strings. Wasn't it already?

@ajdapretnar
Copy link
Contributor

Automated detection should of course recognize such variables as strings. Wasn't it already?

It did, I am sure of it.

I also remember some users asking to increase the number of allowed discrete variables, mostly biologists, geologists and physicists.

@PrimozGodec
Copy link
Contributor Author

Automated detection should, of course, recognize such variables as strings. Wasn't it already?

Automated detection recognizes variables that have int(round(len(values)**.7)) < num_unique_variables as string where len(values) is number of items in the table. This threshold is dependent on the number of items size (which looks ok). Last time we had a case when we had about 30000 items and for this data table, this threshold is 1361. Orange recognized a variable that had a few hundred unique values as a discrete and the box-plot freeze. I know this case is very unusual but it can happen.

Should we put a threshold of 100 hundred unique values besides the current threshold for variable recognition or we leave it as it is?

I agree with the warning in the File widget.

@PrimozGodec PrimozGodec force-pushed the limit-discrete-variable branch from fc9982d to 16636c3 Compare October 22, 2019 12:18
@PrimozGodec PrimozGodec force-pushed the limit-discrete-variable branch from 16636c3 to 905898e Compare October 22, 2019 12:29
@codecov
Copy link

codecov bot commented Oct 22, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@1fb5f8a). Click here to learn what that means.
The diff coverage is 98.27%.

@@            Coverage Diff            @@
##             master    #4120   +/-   ##
=========================================
  Coverage          ?   85.68%           
=========================================
  Files             ?      390           
  Lines             ?    69726           
  Branches          ?        0           
=========================================
  Hits              ?    59743           
  Misses            ?     9983           
  Partials          ?        0

@PrimozGodec PrimozGodec changed the title [WIP][FIX] Guess the variable with more than 100 different values as a StringVariable [FIX] Guess the variable with more than 100 different values as a StringVariable Oct 22, 2019
@PrimozGodec
Copy link
Contributor Author

I removed the warning from the Discrete variable and added the Warning on the File widget. Since now there is a warning on the widget for discrete variables with more than 100 values I suggest that we keep the same rule for guessing (see the previous comment).

@janezd
Copy link
Contributor

janezd commented Oct 23, 2019

Domain editor is a component and as such it shouldn't expect the class into which it is inserted to have some specific warning that the component will trigger.

  • I think the best solution would be to raise the warning from within OWFile widget. Currently it warns only if the user manually changes the type to categorical. The warning should also be shown in .tab files if the column is specified as discrete and contains too many values, not just if the user overrides the guessing heuristic.

  • If we keep the current solution, I believe it would be better if domain editor itself declared the warning (we therefore treat it as a kind of "mixin"). You can do this by removing
    performance_warning = widget.Msg("Categorical variables with >100 values may decrease performance.") from widget's warnings and instead add

        widget.Warning.add_message(
            "performance_warning",
            "Categorical variables with >100 values may decrease performance.")

to domain editor's __init__.

I believe that the former is better, though.

@PrimozGodec PrimozGodec force-pushed the limit-discrete-variable branch from 905898e to a480971 Compare October 24, 2019 12:01
@PrimozGodec
Copy link
Contributor Author

I moved the warning triggering to the file widget and it shows the warning if the column in the file is marked as a discrete.

@PrimozGodec PrimozGodec changed the title [FIX] Guess the variable with more than 100 different values as a StringVariable [FIX] Warning for discrete variable with >100 values in OWFile Oct 24, 2019
@janezd janezd merged commit 32b2bd4 into biolab:master Oct 24, 2019
@PrimozGodec PrimozGodec deleted the limit-discrete-variable branch October 31, 2019 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Box plot freezes when feature has many values

3 participants