Merged · 55 commits (changes shown from 10 commits)
a563694
listmaking WIP
thcrock Feb 20, 2019
9750c3e
forgot migration
thcrock Feb 20, 2019
360f8f9
WIP
tweddielin Feb 21, 2019
999a46f
alembic add label_value to list_predictions table
tweddielin Feb 26, 2019
372d9c8
add docstrings
tweddielin Feb 28, 2019
16645bc
move risklist a layer above
tweddielin Mar 13, 2019
914ad76
create risklist module
tweddielin Mar 13, 2019
c92bd8b
__init__.py
tweddielin Mar 13, 2019
d3c3ba9
fix alembic reversion and replace metta.generate_uuid with filename_f…
tweddielin Mar 13, 2019
0e92fb0
Fix down revision of production schema migration
thcrock Apr 12, 2019
dbd4578
Fix alembic revisions
thcrock Jan 6, 2021
f7d49e5
Enable github checks on this branch too
thcrock Jan 6, 2021
dee930f
Closer to getting tests to run
thcrock Jan 6, 2021
1769b00
Add CLI for risklist
thcrock Jan 8, 2021
52c9ff0
Risklist docs stub
thcrock Jan 8, 2021
173167a
Break up data gathering into experiment and matrix, use pytest fixtur…
thcrock Jan 9, 2021
f6b2d02
Modify schema for list prediction metadata
thcrock Jan 9, 2021
acffa67
fix conflicts and add helper functions for getting imputed features
tweddielin Jan 9, 2021
43c1919
Handle other imputation flag cases, fix tracking indentation error
thcrock Jan 10, 2021
7dfb7e1
Add more tests, fill out doc page
thcrock Jan 11, 2021
cc9fe4a
Fix exception name typo
thcrock Jan 11, 2021
5951565
use timechop and planner to create matrix_metadata for production
tweddielin Jan 15, 2021
537f6c8
retrain and predict forward
tweddielin Apr 15, 2021
b429540
rename to retrain_definition
tweddielin Apr 15, 2021
0045aa5
reusing random seeds from existing models
shaycrk May 8, 2021
9dc3697
fix tests (write experiment to test db)
shaycrk May 10, 2021
da870d5
unit test for reusing model random seeds
shaycrk May 10, 2021
6768ee5
add docstring
shaycrk May 10, 2021
7d6a420
only store random seed in experiment runs
shaycrk May 20, 2021
b8fe6d8
DB migration to remove random seed from experiments table
shaycrk May 20, 2021
8207fcd
debugging
shaycrk May 20, 2021
45c9d68
debug model trainer tests
shaycrk May 21, 2021
a665e7e
debug catwalk utils tests
shaycrk May 21, 2021
ead882b
debug catwalk integration test
shaycrk May 21, 2021
de85f10
use public method
tweddielin May 30, 2021
ad860cd
Merge remote-tracking branch 'origin/kit_rand_seed' into list_making
tweddielin May 31, 2021
40466d5
alembic merge
tweddielin May 31, 2021
83c7385
reuse random seed
tweddielin May 31, 2021
f97089b
use timechop for getting retrain information
tweddielin Jun 30, 2021
6f0af1c
create retrain model hash in retrain level instead of model_trainer l…
tweddielin Jun 30, 2021
42bccaa
move util functions to utils
tweddielin Jun 30, 2021
3ec377f
fix cli and docs
tweddielin Jul 1, 2021
1c4da24
update docs
tweddielin Jul 1, 2021
35bd978
use reconstructed feature dict
tweddielin Jul 1, 2021
9f5a099
add RetrainModel and Retrain
tweddielin Jul 29, 2021
ba84822
remove break point
tweddielin Jul 29, 2021
83e0f66
change experiment_runs to triage_runs
tweddielin Aug 21, 2021
d6f14f5
get retrain_config
tweddielin Aug 22, 2021
d76359b
explicitly include run_type in joins to triage_runs
shaycrk Aug 26, 2021
9698500
DB migration updates
shaycrk Aug 26, 2021
a8a29f1
update argument name in docs
shaycrk Aug 26, 2021
694edcc
ensure correct temporal config is used for predicting forward
shaycrk Aug 27, 2021
583e9bd
debug
shaycrk Aug 27, 2021
815a258
debug
shaycrk Aug 27, 2021
5e183fe
Merge branch 'master' into list_making
shaycrk Aug 27, 2021
30 changes: 30 additions & 0 deletions src/tests/test_risklist.py
@@ -0,0 +1,30 @@
from triage.risklist import generate_risk_list
from tests.utils import sample_config, populate_source_data
from triage.experiments import SingleThreadedExperiment
from triage.validation_primitives import table_should_have_data


def test_risklist(db_engine, project_storage):
    # given a model id and an as-of-date <= today,
    # where the model is trained and linked to an experiment with feature and cohort config,
    # generate records in production.list_predictions;
    # the number of records should equal the size of the cohort for that date
    populate_source_data(db_engine)
    SingleThreadedExperiment(
        sample_config(),
        db_engine=db_engine,
        project_path=project_storage.project_path
    ).run()

    model_id = 1
    as_of_date = '2013-01-01'
    generate_risk_list(
        db_engine=db_engine,
        matrix_storage_engine=project_storage.matrix_storage_engine(),
        model_storage_engine=project_storage.model_storage_engine(),
        model_id=model_id,
        as_of_date=as_of_date)
    table_should_have_data(
        db_engine=db_engine,
        table_name="production.list_predictions",
    )

Comment (Contributor): This simple assertion was a good start but we should go further. Does it make sense to make assertions about the size of the table? How about the contents? We'd expect all of these rows to have the same date/model id and stuff like that, right?

Comment (Contributor): Also, if I recall correctly the model_metadata.matrices table will also get a row since we used the MatrixBuilder; we should make sure that row looks reasonable.
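Picking up the first reviewer's suggestion, stronger assertions would check the table's contents, not just its existence. A minimal, self-contained sketch of that pattern against an in-memory SQLite stand-in (the sample rows and cohort size are invented; the real test would query production.list_predictions through the test db_engine):

```python
import sqlite3

# Stand-in for production.list_predictions with two invented cohort rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE list_predictions "
    "(model_id INTEGER, entity_id INTEGER, as_of_date TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO list_predictions VALUES (?, ?, ?, ?)",
    [(1, 10, "2013-01-01", 0.7), (1, 11, "2013-01-01", 0.2)],
)
rows = conn.execute(
    "SELECT model_id, entity_id, as_of_date, score FROM list_predictions"
).fetchall()

# Beyond "table has data": every row shares the requested model id and date,
# there is one row per entity, and scores look like probabilities
assert len(rows) == 2                            # cohort size for that date
assert {r[0] for r in rows} == {1}               # single model_id
assert {r[2] for r in rows} == {"2013-01-01"}    # single as_of_date
assert len({r[1] for r in rows}) == len(rows)    # no duplicate entities
assert all(0.0 <= r[3] <= 1.0 for r in rows)     # scores in [0, 1]
```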
59 changes: 34 additions & 25 deletions src/triage/component/architect/builders.py
@@ -29,6 +29,7 @@ def __init__(
self.replace = replace
self.include_missing_labels_in_train_as = include_missing_labels_in_train_as
self.run_id = run_id
self.includes_labels = 'labels_table_name' in self.db_config

@property
def sessionmaker(self):
@@ -134,7 +135,7 @@ def make_entity_date_table(
"""

as_of_time_strings = [str(as_of_time) for as_of_time in as_of_times]
if matrix_type == "test" or self.include_missing_labels_in_train_as is not None:
if matrix_type == "test" or matrix_type == "production" or self.include_missing_labels_in_train_as is not None:
indices_query = self._all_valid_entity_dates_query(
as_of_time_strings=as_of_time_strings, state=state
)
@@ -253,17 +254,19 @@ def build_matrix(
if self.run_id:
errored_matrix(self.run_id, self.db_engine)
return
if not table_has_data(
"{}.{}".format(
self.db_config["labels_schema_name"],
self.db_config["labels_table_name"],
),
self.db_engine,
):
logging.warning("labels table is not populated, cannot build matrix")

if self.includes_labels:
if not table_has_data(
"{}.{}".format(
self.db_config["labels_schema_name"],
self.db_config["labels_table_name"],
),
self.db_engine,
):
logging.warning("labels table is not populated, cannot build matrix")
return
if self.run_id:
errored_matrix(self.run_id, self.db_engine)
return

matrix_store = self.matrix_storage_engine.get_store(matrix_uuid)
if not self.replace and matrix_store.exists:
@@ -287,7 +290,7 @@ def build_matrix(
matrix_metadata["state"],
matrix_type,
matrix_uuid,
matrix_metadata["label_timespan"],
matrix_metadata.get("label_timespan", None),
)
except ValueError as e:
logging.warning(
@@ -305,25 +308,31 @@
as_of_times, feature_dictionary, entity_date_table_name, matrix_uuid
)
logging.info(f"Feature data extracted for matrix {matrix_uuid}")
logging.info(
"Extracting label data from database into file for " "matrix %s",
matrix_uuid,
)
labels_df = self.load_labels_data(
label_name,
label_type,
entity_date_table_name,
matrix_uuid,
matrix_metadata["label_timespan"],
)
dataframes.insert(0, labels_df)

logging.info(f"Label data extracted for matrix {matrix_uuid}")
# dataframes add label_name

if self.includes_labels:
logging.info(
"Extracting label data from database into file for " "matrix %s",
matrix_uuid,
)
labels_df = self.load_labels_data(
label_name,
label_type,
entity_date_table_name,
matrix_uuid,
matrix_metadata["label_timespan"],
)
dataframes.insert(0, labels_df)
logging.info(f"Label data extracted for matrix {matrix_uuid}")
else:
labels_df = pandas.DataFrame(index=dataframes[0].index, columns=[label_name])
dataframes.insert(0, labels_df)

# stitch together the csvs
logging.info("Merging feature files for matrix %s", matrix_uuid)
output = self.merge_feature_csvs(dataframes, matrix_uuid)
logging.info(f"Features data merged for matrix {matrix_uuid}")

matrix_store.metadata = matrix_metadata
# store the matrix
labels = output.pop(matrix_store.label_column_name)
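The else branch above fabricates an all-null label column so downstream matrix-saving code keeps working when no labels exist. A minimal sketch of that trick in isolation (column names and entity ids invented):

```python
import pandas as pd

# Invented feature frame standing in for the merged feature dataframes
feature_df = pd.DataFrame(
    {"f1": [0.1, 0.2], "f2": [3.0, 4.0]},
    index=pd.Index([10, 11], name="entity_id"),
)
label_name = "outcome"

# Same move as the diff: an empty (all-NaN) label column aligned to the
# feature index, so the matrix schema is unchanged for production matrices
labels_df = pd.DataFrame(index=feature_df.index, columns=[label_name])

matrix = feature_df.join(labels_df)
assert list(matrix.columns) == ["f1", "f2", "outcome"]
assert matrix["outcome"].isna().all()
```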
8 changes: 5 additions & 3 deletions src/triage/component/architect/feature_generators.py
@@ -635,7 +635,7 @@ def _generate_agg_table_tasks_for(self, aggregation):

return table_tasks

def _generate_imp_table_tasks_for(self, aggregation, drop_preagg=True):
def _generate_imp_table_tasks_for(self, aggregation, impute_cols=None, nonimpute_cols=None, drop_preagg=True):
"""Generate SQL statements for preparing, populating, and
finalizing imputations, for each feature group table in the
given aggregation.
@@ -685,8 +685,10 @@ def _generate_imp_table_tasks_for(self, aggregation, drop_preagg=True):
with self.db_engine.begin() as conn:
results = conn.execute(aggregation.find_nulls())
null_counts = results.first().items()
impute_cols = [col for (col, val) in null_counts if val > 0]
nonimpute_cols = [col for (col, val) in null_counts if val == 0]
if impute_cols is None:
impute_cols = [col for (col, val) in null_counts if val > 0]
if nonimpute_cols is None:
nonimpute_cols = [col for (col, val) in null_counts if val == 0]

# table tasks for imputed aggregation table, most of the work is done here
# by collate's get_impute_create()
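The new impute_cols/nonimpute_cols arguments let a caller (such as risk-list code reusing a trained model's feature set) bypass the null-count scan; only missing lists are derived from the counts. The fallback logic reduces to this standalone sketch (function name and counts invented):

```python
def split_impute_cols(null_counts, impute_cols=None, nonimpute_cols=None):
    # null_counts: (column, null count) pairs, as produced by the
    # aggregation.find_nulls() query in the diff above
    if impute_cols is None:
        impute_cols = [col for (col, val) in null_counts if val > 0]
    if nonimpute_cols is None:
        nonimpute_cols = [col for (col, val) in null_counts if val == 0]
    return impute_cols, nonimpute_cols

# Default behavior: derive both lists from the scan
imp, nonimp = split_impute_cols([("age_avg", 3), ("zip_mode", 0)])
assert imp == ["age_avg"]
assert nonimp == ["zip_mode"]

# A caller that already knows its imputed columns overrides the scan entirely
imp2, _ = split_impute_cols([("age_avg", 3)], impute_cols=["age_avg", "zip_mode"])
assert imp2 == ["age_avg", "zip_mode"]
```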
16 changes: 14 additions & 2 deletions src/triage/component/catwalk/storage.py
@@ -12,6 +12,7 @@
TrainEvaluation,
TestPrediction,
TrainPrediction,
ListPrediction
)
from triage.util.pandas import downcast_matrix

@@ -435,7 +436,7 @@ def columns(self, include_label=False):
if include_label:
return columns
else:
return [col for col in columns if col != self.metadata["label_name"]]
return [col for col in columns if col != self.metadata.get("label_name", None)]

@property
def label_column_name(self):
@@ -479,6 +480,8 @@ def matrix_type(self):
return TrainMatrixType
elif self.metadata["matrix_type"] == "test":
return TestMatrixType
elif self.metadata["matrix_type"] == "production":
return ProductionMatrixType
else:
raise Exception(
"""matrix metadata for matrix {} must contain 'matrix_type'
@@ -525,7 +528,10 @@ def matrix_with_sorted_columns(self, columns):

@property
def full_matrix_for_saving(self):
return self.design_matrix.assign(**{self.label_column_name: self.labels})
if self.labels is not None:
return self.design_matrix.assign(**{self.label_column_name: self.labels})
else:
return self.design_matrix

def load_metadata(self):
"""Load metadata from storage"""
@@ -644,3 +650,9 @@ class TrainMatrixType(object):
evaluation_obj = TrainEvaluation
prediction_obj = TrainPrediction
is_test = False


class ProductionMatrixType(object):
string_name = "production"
    prediction_obj = ListPrediction

Comment (Member): verbose_name or matrix_name might be more exact
Comment (Contributor):
Since we're introducing a new matrix type, I think is_test has to be set in some way. How does is_test get used for the other ones? Depending on how it's used, I could see either False or True making more sense for ProductionMatrixType. is_test might be a poor name, serving as a proxy for some other type of behavior, and we should rename it in a way that makes a value for ProductionMatrixType more obvious.

Comment (Contributor Author):
I think is_test is only used in ModelEvaluator to decide if it should use testing_metric_groups or training_metric_groups. For ProductionMatrixType, at least for generating risk list, there's no evaluation involved. I'm not sure why it needs is_test regardless of good or poor naming.

Comment (Contributor):
Hmmm... would it make sense to remove the is_test boolean entirely and simply check if matrix_type.string_name == 'test' in evaluation.py? As it is, it seems a little redundant with the string name, especially if it doesn't fit well with the production matrix type here.

Comment (Contributor):
I'm not fond of comparisons against magic strings. What about making it more specifically about the metric groups? Like what if the field was metric_config_key, and for train it was 'training_metric_groups', for test it was 'testing_metric_groups', and for production it was None?

Comment (Contributor):
Yeah, that's a fair point about avoiding magic strings -- having it keep track of the appropriate metric_groups types it needs to use makes sense to me if that's the only place is_test is getting used downstream regardless.

Comment (not-the-fish, Contributor, Aug 1, 2021):
Sorry, just coming into this discussion quite late. It seems to me that the discussion here is exposing decisions we committed to early on and haven't tried to address.

I feel like we always saw the matrix_type as a bad pattern but implemented it for expediency to flag (poorly) the different behavior of label imputation in train/test for the inspection problem type. Is my memory accurate? There were conceptual issues with this all along because matrix_type was never a good name and it prevented us from reusing matrices from testing for training in the EWS problem type.

It also seems like which metrics to use is not properly a property of the matrix but of the task being performed. Perhaps we should be passing the metrics groups to the evaluator instead of encoding in the matrix what the evaluator should do to it.

I think maybe we can move forward with this as a way to get the feature we want now, but this discussion is making the illogic of these matrix attributes more apparent and we may have some deeper technical debt to service here rather than trying to make the current situation totally coherent.

Comment (not-the-fish, Contributor, Aug 7, 2021):
I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v 5.0.0. The first is probably easier to implement, but I think it is the less good solution:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Include missing labels as missing in all matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
  • The Architect writes matrices without label imputation, allowing them to be reused for any purpose
  • Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we had to decide on the label imputation at matrix-building time initially was because we had not yet conceived of the train_results schema. Now that Triage can write train predictions, it can write train labels to the database, and the matrix_type concept is no longer needed.
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • Separate matrix storage, rather than separate matrix types, is introduced for Production and Development

An alternative:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
  • Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate as in the first proposal) duplication of storage in the EWS case.
  • We introduce new metadata tables for Labels and experiment-label pairs, and add a labels_id column to many of the places where we also have a matrix_id column (models, predictions, evaluations, etc.). This increases db storage a little but not as much as having to store all the train labels in the db.
  • Include all entities in the cohort in all design matrices
  • The replace pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
  • Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is False
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • You can still skip writing train results to conserve db storage
  • Production/predict-forward/retrain-and-predict uses design matrices but not labels

Comment (Contributor):
Moved the above comment to #855
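One way to read the metric_config_key suggestion from the thread above as code — a hypothetical sketch of the alternative, not what this PR implements:

```python
# Each matrix type carries the config key naming its metric groups;
# "production" carries None, so the evaluator naturally computes nothing
# for risk lists, with no is_test flag and no magic-string comparison
class TrainMatrixType:
    string_name = "train"
    metric_config_key = "training_metric_groups"

class TestMatrixType:
    string_name = "test"
    metric_config_key = "testing_metric_groups"

class ProductionMatrixType:
    string_name = "production"
    metric_config_key = None

def metric_groups_for(matrix_type, evaluation_config):
    # Replaces branching on is_test inside the ModelEvaluator
    key = matrix_type.metric_config_key
    return evaluation_config.get(key, []) if key else []

config = {"testing_metric_groups": ["precision@"], "training_metric_groups": []}
assert metric_groups_for(TestMatrixType, config) == ["precision@"]
assert metric_groups_for(TrainMatrixType, config) == []
assert metric_groups_for(ProductionMatrixType, config) == []
```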

New migration file (revision 1b990cbc04e4):
@@ -0,0 +1,26 @@
"""empty message

Revision ID: 1b990cbc04e4
Revises: 0bca1ba9706e
Create Date: 2019-02-20 16:41:22.810452

"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '1b990cbc04e4'
down_revision = 'cfd5c3386014'
branch_labels = None
depends_on = None


def upgrade():
op.execute("CREATE SCHEMA IF NOT EXISTS production")
op.execute("ALTER TABLE model_metadata.list_predictions SET SCHEMA production;")


def downgrade():
op.execute("ALTER TABLE production.list_predictions SET SCHEMA model_metadata;")
op.execute("DROP SCHEMA IF EXISTS production")
New migration file (revision 264786a9fe85):
@@ -0,0 +1,54 @@
"""add label_value to production table

Revision ID: 264786a9fe85
Revises: 1b990cbc04e4
Create Date: 2019-02-26 13:17:05.365654

"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '264786a9fe85'
down_revision = '1b990cbc04e4'
branch_labels = None
depends_on = None


def upgrade():
op.drop_table("list_predictions", schema="production")
op.create_table(
"list_predictions",
sa.Column("model_id", sa.Integer(), nullable=False),
sa.Column("entity_id", sa.BigInteger(), nullable=False),
sa.Column("as_of_date", sa.DateTime(), nullable=False),
sa.Column("score", sa.Numeric(), nullable=True),
sa.Column('label_value', sa.Integer, nullable=True),
sa.Column("rank_abs", sa.Integer(), nullable=True),
sa.Column("rank_pct", sa.Float(), nullable=True),
sa.Column("matrix_uuid", sa.Text(), nullable=True),
sa.Column("test_label_window", sa.Interval(), nullable=True),
sa.ForeignKeyConstraint(["model_id"], ["model_metadata.models.model_id"]),
sa.PrimaryKeyConstraint("model_id", "entity_id", "as_of_date"),
schema="production",
)


def downgrade():
op.drop_table("list_predictions", schema="production")
op.create_table(
"list_predictions",
sa.Column("model_id", sa.Integer(), nullable=False),
sa.Column("entity_id", sa.BigInteger(), nullable=False),
sa.Column("as_of_date", sa.DateTime(), nullable=False),
sa.Column("score", sa.Numeric(), nullable=True),
sa.Column("rank_abs", sa.Integer(), nullable=True),
sa.Column("rank_pct", sa.Float(), nullable=True),
sa.Column("matrix_uuid", sa.Text(), nullable=True),
sa.Column("test_label_window", sa.Interval(), nullable=True),
sa.ForeignKeyConstraint(["model_id"], ["results.models.model_id"]),
Comment (thcrock, Contributor, Apr 4, 2019):
This schema should be model_metadata (Alena hit this error)

Comment (Contributor):
I actually ended up doing this when I was fixing conflicts (I had to fix the Alembic revisions to get the tests working after conflict fixing)

sa.PrimaryKeyConstraint("model_id", "entity_id", "as_of_date"),
schema="results",
)

4 changes: 3 additions & 1 deletion src/triage/component/results_schema/schema.py
@@ -31,6 +31,7 @@
"CREATE SCHEMA IF NOT EXISTS model_metadata;"
" CREATE SCHEMA IF NOT EXISTS test_results;"
" CREATE SCHEMA IF NOT EXISTS train_results;"
" CREATE SCHEMA IF NOT EXISTS production;"
)

event.listen(Base.metadata, "before_create", DDL(schemas))
@@ -86,14 +87,15 @@ class ModelGroup(Base):
class ListPrediction(Base):

__tablename__ = "list_predictions"
Comment (Contributor):
I wonder if it might be a little more natural to make this production.predictions to keep consistent with the train_results and test_results tables?

Comment (Contributor):
I agree

__table_args__ = {"schema": "model_metadata"}
__table_args__ = {"schema": "production"}

model_id = Column(
Integer, ForeignKey("model_metadata.models.model_id"), primary_key=True
)
entity_id = Column(BigInteger, primary_key=True)
as_of_date = Column(DateTime, primary_key=True)
score = Column(Numeric)
label_value = Column(Integer)
rank_abs = Column(Integer)
rank_pct = Column(Float)
matrix_uuid = Column(Text)
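As a usage note, a downstream consumer of the relocated table might pull a top-k list per model and date via rank_abs. A hypothetical query built from the schema above (table and column names come from the diff; the parameter style assumes a psycopg2-like driver):

```python
def top_k_list_query():
    # model_id, as_of_date, and k are bound by the caller at execution time
    return (
        "SELECT entity_id, score, rank_abs, rank_pct "
        "FROM production.list_predictions "
        "WHERE model_id = %(model_id)s AND as_of_date = %(as_of_date)s "
        "ORDER BY rank_abs ASC "
        "LIMIT %(k)s"
    )

query = top_k_list_query()
assert "production.list_predictions" in query
assert "ORDER BY rank_abs" in query
```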