
Conversation

@VincentAuriau
Collaborator

No description provided.

VincentAuriau and others added 30 commits June 13, 2024 09:39
* Update README.md

* FIX: lint

---------

Co-authored-by: VincentAuriau <auriau.vincent@gmail.com>
* ADD: small enhancements

* FIX: typo

* ENH/FIX: plots
* ENH: few modifications in doc

* ENH: Documentation Landing page
ADD: Custom KeyErrors with FeaturesStorage
ADD: ArrayStorageIndexer to follow common structure

ENH: ChoiceDataset no longer automatically checks all IDs of FeaturesStorage; the check is moved to a dedicated function that can be called manually

FIX: ArrayStorage when directly instantiated from ndarray
* ENH: diverse improvements in the documentation

* ENH: small enhancements in README
* ENH: small enhancements in README & Doc

* ENH: higher TF & TFP version

* ADD: few SimpleMNL tests
FIX:
- ChoiceDataset with FeaturesStorage for availabilities
- a FeaturesStorage placed in the middle of other features caused the last features to be ignored

ENH:
- Faster OneHotStorage batching

ADD:
- Tests corresponding to fixes
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* FIX: rumnet regularization & evaluate

* FIX: exact NLL now uses a custom re-implementation
* ENH: description of tqdm bars with losses

GH Actions now automatically comments a PyTest coverage report on pull requests:

          cat pytest-coverage.txt
      - name: Pytest coverage comment
        uses: VincentAuriau/pytest-coverage-comment@main

Check warning

Code scanning / CodeQL

Unpinned tag for a non-immutable Action in workflow (Medium)

Unpinned 3rd party Action in the 'Build' step: uses 'VincentAuriau/pytest-coverage-comment' with ref 'main', not a pinned commit hash.
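The fix CodeQL asks for is to pin the third-party action to a full, immutable commit SHA instead of the mutable `main` ref. A minimal sketch of the pinned form (the 40-character SHA below is a placeholder, not a real commit of that repository):

```yaml
      - name: Pytest coverage comment
        # Pin to an audited commit SHA; the trailing comment records the
        # human-readable tag the SHA corresponds to. Replace the placeholder
        # SHA with the actual commit you have reviewed.
        uses: VincentAuriau/pytest-coverage-comment@0123456789abcdef0123456789abcdef01234567  # vX.Y.Z
```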
Comment on lines +10 to +29
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Build draft PDF
        uses: ./.github/actions/build-draft
        with:
          journal: joss
          paper-path: docs/paper/paper.md

      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: docs/paper/paper.pdf

Check warning

Code scanning / CodeQL

Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 1 month ago

In general, fix this category of problem by explicitly setting the permissions key either at the workflow root (applies to all jobs that don’t override it) or under each job, granting only the scopes required (e.g., contents: read). This prevents the GITHUB_TOKEN from inheriting broader default permissions from the repository or organization.

For this workflow, the minimal non-breaking fix is to add a permissions block with contents: read. The natural place is at the top level of the workflow (between name: and on:), so all jobs inherit it. Alternatively, we could attach it directly under jobs.paper, but top-level keeps things simpler and still scoped to least privilege. No existing steps need to change: actions/checkout@v4 functions with contents: read, and actions/upload-artifact@v4 does not require repository write permissions because it works with artifacts, not repo contents. The only required edit is to .github/workflows/draft_paper.yml, inserting the permissions mapping.

Suggested changeset 1
.github/workflows/draft_paper.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/draft_paper.yml b/.github/workflows/draft_paper.yml
--- a/.github/workflows/draft_paper.yml
+++ b/.github/workflows/draft_paper.yml
@@ -1,5 +1,8 @@
 name: Paper draft to PDF
 
+permissions:
+  contents: read
+
 on:
   push:
     branches:
EOF
Comment on lines +12 to +46
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Python, choice-learn & run tests
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install from TestPyPI & run tests with installed package
        id: install
        run: |
          python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ choice-learn
          cd ../
          echo ${{ github.event.release.tag_name }}
          python choice-learn/tests/manual_run.py
          cd choice-learn
          echo "b"
          git fetch --all
          BRANCH=$(git branch --list -r "origin/release_*" | tr '*' ' ')
          echo $BRANCH
          BRANCH="${BRANCH:9}"
          echo $BRANCH
          echo "BRANCH=$BRANCH" >> $GITHUB_OUTPUT
      - name: publish to PyPI
        uses: ./.github/actions/publish
        with:
          ACCESS_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PACKAGE_DIRECTORY: "./choice_learn/"
          PYTHON_VERSION: "3.9"
          PUBLISH_REGISTRY_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
          PUBLISH_REGISTRY_USERNAME: ${{ secrets.PYPI_USERNAME }}
          UPDATE_CODE_VERSION: false
          BRANCH: ${{ steps.install.outputs.BRANCH }}

Check warning

Code scanning / CodeQL

Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 1 month ago

In general, to fix this issue you explicitly declare a permissions block at the workflow or job level, tightening the GITHUB_TOKEN to the minimal scopes needed. For most read-only CI/test jobs, permissions: contents: read is sufficient. For jobs that publish releases, create tags, or modify pull requests, you selectively grant write on only the specific scopes required (e.g., contents: write, packages: write, or pull-requests: write).

For this specific workflow in .github/workflows/release_pypi.yaml, the job checks out code, installs from TestPyPI, runs tests, then calls a local publish composite action with ACCESS_TOKEN: ${{ secrets.GITHUB_TOKEN }}. The composite action is responsible for publishing to PyPI and, quite possibly, for interacting with the GitHub repository (e.g., pushing tags or commits). To avoid breaking existing functionality, we should not assume it needs only read access. The safest non-breaking change, while still following the recommendation, is to introduce an explicit permissions block that matches the current effective behavior. A common choice that mirrors legacy defaults is permissions: contents: write, which allows repository content modifications while remaining explicit. Since CodeQL’s suggested “minimal starting point” is contents: read, we can start from that and elevate just contents to write to preserve any potential write operations in the composite action.

Concretely, add a permissions block at the top level of the workflow (so it applies to all jobs) between the on: block and jobs: block:

permissions:
  contents: write

No additional imports or dependencies are needed; this is a pure YAML configuration change within .github/workflows/release_pypi.yaml.

Suggested changeset 1
.github/workflows/release_pypi.yaml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/release_pypi.yaml b/.github/workflows/release_pypi.yaml
--- a/.github/workflows/release_pypi.yaml
+++ b/.github/workflows/release_pypi.yaml
@@ -7,6 +7,9 @@
       - completed
   workflow_dispatch:
 
+permissions:
+  contents: write
+
 jobs:
   test-and-publish:
     runs-on: ubuntu-latest
EOF
Comment on lines +10 to +23
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Publish choice-learn on TestPyPI
        uses: ./.github/actions/publish
        with:
          ACCESS_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PACKAGE_DIRECTORY: "./choice_learn/"
          PYTHON_VERSION: "3.9"
          PUBLISH_REGISTRY_PASSWORD: ${{ secrets.TEST_PYPI_PASSWORD }}
          PUBLISH_REGISTRY_USERNAME: ${{ secrets.TEST_PYPI_USERNAME }}
          PUBLISH_REGISTRY: "https://test.pypi.org/legacy/"
          UPDATE_CODE_VERSION: true
          PUSH_BRANCH: release_${{ github.event.release.tag_name }}

Check warning

Code scanning / CodeQL

Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

AI about 1 month ago

To fix this, explicitly declare GITHUB_TOKEN permissions in the workflow. At a minimum, add a permissions: block at the top level (alongside name: and on:) to set contents: read as the default for all jobs, following the principle of least privilege. Then, since this release workflow appears to push a branch (PUSH_BRANCH is provided to the publish action), the job that performs the publish should be given contents: write so it can push commits/tags as intended.

Concretely:

  • Edit .github/workflows/release_test_pypi.yaml.
  • Add a top‑level permissions: section after the name: line that sets contents: read. This both satisfies CodeQL’s recommendation and documents the default minimal access.
  • Inside the publish-service-client-package job, add a permissions: section that overrides the default and grants contents: write, which is likely required for pushing the release_${{ github.event.release.tag_name }} branch. Keep all existing steps and behavior unchanged.

No additional imports or external dependencies are required; only YAML changes to the workflow file.

Suggested changeset 1
.github/workflows/release_test_pypi.yaml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/release_test_pypi.yaml b/.github/workflows/release_test_pypi.yaml
--- a/.github/workflows/release_test_pypi.yaml
+++ b/.github/workflows/release_test_pypi.yaml
@@ -1,4 +1,6 @@
 name: Build and publish choice-learn on TestPyPI
+permissions:
+  contents: read
 
 on:
   release:
@@ -8,6 +10,8 @@
 jobs:
   publish-service-client-package:
     runs-on: ubuntu-latest
+    permissions:
+      contents: write
     steps:
       - uses: actions/checkout@v4
       - name: Publish choice-learn on TestPyPI
EOF
@gemini-code-assist
Contributor

Summary of Changes

Hello @VincentAuriau, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a major release, transforming a private choice modeling codebase into a public, feature-rich Python library. It consolidates extensive work into a well-structured project, offering a diverse set of discrete choice models, standardized data handling, and robust development tooling. The changes aim to provide researchers and practitioners with a powerful and user-friendly platform for formulating, estimating, and deploying choice models, complete with comprehensive documentation and examples.

Highlights

  • Project Open-Sourcing and Restructuring: This pull request marks a significant transition, effectively open-sourcing a private project named 'choice-learn-private' into the public 'choice-learn' repository. This involves renaming internal references, updating documentation, and establishing new contribution guidelines.
  • Comprehensive Choice Modeling Library: The PR introduces a full-fledged Python library for discrete choice modeling, encompassing a wide array of models (MNL, Conditional Logit, Nested Logit, RUMnet, TasteNet, ResLogit, LearningMNL, HaloMNL, AleaCarta, Shopper, SelfAttention), data handling utilities, and optimization tools.
  • Enhanced Development Workflow: New GitHub Actions for PDF generation and PyPI publishing have been added, alongside updated pre-commit hooks (ruff, pyupgrade, bandit) and detailed contributing guidelines, streamlining the development and release process.
  • Rich Dataset Integration: The library now includes loaders for numerous academic datasets (SwissMetro, ModeCanada, Heating, Electricity, Train, Car Preferences, HC, London Passenger Mode Choice, Expedia, TaFeng, Bakery) and a synthetic data generator, making it easier for users to get started and test models.
  • Advanced Model Implementations: Several advanced models, including latent class models, attention-based models, and residual network logit models, are introduced, catering to complex choice modeling scenarios and research.
  • Assortment and Pricing Optimization Tools: New toolbox functionalities for MNL and Latent Class assortment and pricing optimization using solvers like Gurobi and OR-Tools are integrated, providing practical applications for the developed models.


Ignored Files
  • Ignored by pattern: .github/workflows/** (5)
    • .github/workflows/ci.yaml
    • .github/workflows/deploy_docs.yaml
    • .github/workflows/draft_paper.yml
    • .github/workflows/release_pypi.yaml
    • .github/workflows/release_test_pypi.yaml

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a substantial amount of new functionality, establishing the core of the choice-learn library. It includes various choice models, data handling utilities, dataset loaders, and CI/CD configurations. The overall structure is well-thought-out, with a clear separation of concerns. However, there are several areas that need attention, particularly regarding CI/CD configuration correctness, performance of data processing code, and potential bugs in data handling. I've identified some critical issues in the GitHub Actions and pre-commit configurations that need to be addressed. Additionally, there are opportunities to significantly improve the performance of data preprocessing and negative sampling routines.

      args: ["-x", "tests", --recursive, choice_learn]
      exclude: ^(.svn|CVS|.bzr|.hg|.git|__pycache__|.tox|.ipynb_checkpoints|assets|tests/assets/|venv/|.venv/)

critical

The exclude key is incorrectly indented. It should be a top-level key in the .pre-commit-config.yaml file, at the same level as repos and ci, for it to apply globally to all hooks. As it is now, it's part of the bandit repository configuration, which is not the correct syntax and will not work as intended. Please remove this line and add it at the root level of the file.
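A sketch of the corrected layout, with `exclude` moved to the root of `.pre-commit-config.yaml` so it applies to every hook (the bandit `repo` URL and `rev` shown here are illustrative assumptions; the pattern is copied from the file under review):

```yaml
# Top-level exclude: applies globally to all hooks.
exclude: ^(.svn|CVS|.bzr|.hg|.git|__pycache__|.tox|.ipynb_checkpoints|assets|tests/assets/|venv/|.venv/)

repos:
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.8  # assumed version, pin to whatever the project actually uses
    hooks:
      - id: bandit
        args: ["-x", "tests", --recursive, choice_learn]
```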

Comment on lines +170 to +180
Shape must be (batch_size,)
price_batch: np.ndarray
Batch of prices (integers) for each purchased item
Shape must be (batch_size,)
available_item_batch: np.ndarray
Batch of availability matrices (indicating the availability (1) or not (0)
of the products) (arrays) for each purchased item
Shape must be (batch_size, n_items)
Returns
-------

high

When concatenating datasets, the assortment index for trips from the other dataset needs to be updated. Since you are appending other.available_items to self.available_items, the indices for assortments from other will be shifted. You should iterate through other.trips and add len(self.available_items) to each trip's assortment index before concatenating the trip lists.
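The offset logic described above can be sketched as follows. This is a hypothetical illustration, not the library's actual API: trips are modeled here as plain dicts with an "assortment" key indexing into a list of availability matrices.

```python
def concatenate_availabilities(self_avail, self_trips, other_avail, other_trips):
    """Concatenate two datasets' availability matrices and trip lists.

    Sketch of the suggested fix: shift the assortment indices of `other`'s
    trips by the number of matrices already stored in `self`, so each trip
    still points at the correct availability row after concatenation.
    """
    offset = len(self_avail)  # rows already occupied by self's matrices
    shifted_trips = [dict(trip, assortment=trip["assortment"] + offset) for trip in other_trips]
    return self_avail + other_avail, self_trips + shifted_trips


avail_a = [[1, 1, 0], [1, 0, 1]]
trips_a = [{"assortment": 0}, {"assortment": 1}]
avail_b = [[0, 1, 1]]
trips_b = [{"assortment": 0}]

merged_avail, merged_trips = concatenate_availabilities(avail_a, trips_a, avail_b, trips_b)
print(merged_trips[-1]["assortment"])  # the trip from `other` now points at row 2
```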

Comment on lines +753 to +766
negative_samples = tf.reshape(
tf.transpose(
tf.reshape(
tf.concat(
[
self.get_negative_samples(
available_items=available_item_batch[idx],
purchased_items=basket_batch[idx],
future_purchases=future_batch[idx],
next_item=item_batch[idx],
n_samples=self.n_negative_samples,
)
for idx in range(batch_size)
],

high

The use of a list comprehension to generate negative samples forces eager execution for each element in the batch. This prevents this part of the code from being compiled into an efficient TensorFlow graph and can cause significant performance bottlenecks, especially with large batch sizes. To ensure graph compatibility and improve performance, you should use tf.map_fn.

# Example of how to use tf.map_fn
negative_samples = tf.map_fn(
    fn=lambda i: self.get_negative_samples(
        available_items=available_item_batch[i],
        purchased_items=basket_batch[i],
        future_purchases=future_batch[i],
        next_item=item_batch[i],
        n_samples=self.n_negative_samples,
    ),
    elems=tf.range(batch_size, dtype=tf.int64),
    fn_output_signature=tf.TensorSpec(shape=(self.n_negative_samples,), dtype=tf.int32)
)

Comment on lines +589 to +599
negative_samples = tf.stack(
[
self.get_negative_samples(
available_items=available_item_batch[idx],
purchased_items=basket_batch[idx],
next_item=item_batch[idx],
n_samples=self.n_negative_samples,
)
for idx in range(batch_size)
],
axis=0,

high

The use of a list comprehension to generate negative samples forces eager execution for each element in the batch. This prevents this part of the code from being compiled into an efficient TensorFlow graph and can cause significant performance bottlenecks, especially with large batch sizes. To ensure graph compatibility and improve performance, you should use tf.map_fn.

# Example of how to use tf.map_fn
negative_samples = tf.map_fn(
    fn=lambda i: self.get_negative_samples(
        available_items=available_item_batch[i],
        purchased_items=basket_batch[i],
        next_item=item_batch[i],
        n_samples=self.n_negative_samples,
    ),
    elems=tf.range(tf.shape(item_batch)[0], dtype=tf.int64),
    fn_output_signature=tf.TensorSpec(shape=(self.n_negative_samples,), dtype=tf.int32)
)

Comment on lines +268 to +343
# 2. Approximate the price of the items not in the trip with
# the price of the same item in the previous or next trip
for item_id in range(n_items):
if prices[item_id] == -1:
found_price = False
step = 1
while not found_price:
# Proceed step by step to find the price of the item
# in the k-th previous or the k-th next trip
prev_session_id, prev_session_data = None, None
next_session_id, next_session_data = None, None

if trip_idx - step >= 0:
prev_session_id, prev_session_data = grouped_sessions[trip_idx - step]
if trip_idx + step < len(grouped_sessions):
next_session_id, next_session_data = grouped_sessions[trip_idx + step]

if (
prev_session_data is not None
and item_id in prev_session_data["item_id"].tolist()
):
# If item_id is in the previous trip, take the
# price of the item in the previous trip
if isinstance(
dataset.set_index(["item_id", "session_id"]).loc[
(item_id, prev_session_id)
]["price"],
pd.Series,
):
# Then the price is a Pandas series (same value repeated)
prices[item_id] = (
dataset.set_index(["item_id", "session_id"])
.loc[(item_id, prev_session_id)]["price"]
.to_numpy()[0]
)
else:
# Then the price is a scalar
prices[item_id] = dataset.set_index(["item_id", "session_id"]).loc[
(item_id, prev_session_id)
]["price"]
found_price = True

elif (
next_session_data is not None
and item_id in next_session_data["item_id"].tolist()
):
# If item_id is in the next session, take the
# price of the item in the next trip
if isinstance(
dataset.set_index(["item_id", "session_id"]).loc[
(item_id, next_session_id)
]["price"],
pd.Series,
):
# Then the price is a Pandas series (same value repeated)
prices[item_id] = (
dataset.set_index(["item_id", "session_id"])
.loc[(item_id, next_session_id)]["price"]
.to_numpy()[0]
)
else:
# Then the price is a scalar
prices[item_id] = dataset.set_index(["item_id", "session_id"]).loc[
(item_id, next_session_id)
]["price"]
found_price = True

if trip_idx - step < 0 and trip_idx + step >= len(grouped_sessions):
# Then we have checked all possible previous and next trips
break

step += 1

if not found_price:
prices[item_id] = 1 # Or another default value > 0


high

The price imputation logic involves iterating through sessions and items, which is highly inefficient and will be very slow for large datasets. This can be significantly optimized by using vectorized pandas operations. You can group by item_id, sort by session (assuming session IDs are sequential in time), and then use ffill() and bfill() to propagate prices forward and backward. This will be orders of magnitude faster.

Example:

# Assuming 'dataset' is sorted by session_id
dataset['price'] = dataset.groupby('item_id')['price'].transform(lambda x: x.ffill().bfill())
# Handle remaining NaNs if any (e.g., items with no price at all)
dataset['price'] = dataset['price'].fillna(1)
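A self-contained toy run of the vectorized approach suggested above (the DataFrame contents are made up for the demo; column names follow the snippet):

```python
import pandas as pd

# Item 7 has gaps in its price history; item 8 never has a price.
dataset = pd.DataFrame(
    {
        "session_id": [0, 1, 2, 3, 0, 1],
        "item_id": [7, 7, 7, 7, 8, 8],
        "price": [None, 2.0, None, 3.0, None, None],
    }
).sort_values(["item_id", "session_id"])

# Forward/backward-fill each item's price across sessions...
dataset["price"] = dataset.groupby("item_id")["price"].transform(lambda x: x.ffill().bfill())
# ...and fall back to a default for items with no observed price at all.
dataset["price"] = dataset["price"].fillna(1)

print(dataset["price"].tolist())  # [2.0, 2.0, 2.0, 3.0, 1.0, 1.0]
```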

          vname=${vname:1}
          echo $vname
          sed -i -r 's/__version__ *= *".*"/__version__ = "'"$vname"'"/g' ${{ inputs.PACKAGE_DIRECTORY }}__init__.py
          sed -i '0,/version =.*/s//version = "'"$vname"'"/' ./pyproject.toml

medium

Using sed to update the version in pyproject.toml can be fragile. If the file's formatting changes, this command might fail. A more robust and idiomatic approach is to use Poetry's built-in version command.

          poetry version "$vname"

Comment on lines +1 to +8
---
name: Bug report
about: Report a bug you have encountered
title: "[BUG]"
labels: bug
assignees: ''

---

medium

This file contains two frontmatter sections (the parts enclosed in ---). This is invalid for GitHub issue templates and will likely cause parsing issues. The second frontmatter block seems more complete. Please remove the first, redundant block. The same issue exists in feature_request.md and question.md.

Suggested change
---
name: Bug report
about: Report a bug you have encountered
title: "[BUG]"
labels: bug
assignees: ''
---
---
name: 🐛 Bug report
about: If something isn't working 🔧
title: ''
labels: bug
assignees:
---

Comment on lines +1 to +8
---
name: Question
about: Any question about Choice-Learn?
title: ''
labels: question
assignees: ''

---

medium

This file contains two frontmatter sections (the parts enclosed in ---). This is invalid for GitHub issue templates and will likely cause parsing issues. The second frontmatter block seems more complete. Please remove the first, redundant block.

Suggested change
---
name: Question
about: Any question about Choice-Learn?
title: ''
labels: question
assignees: ''
---
---
name: Question
about: Any question about Choice-Learn?
title: ''
labels: question
assignees:
---

Comment on lines +1 to +8
---
name: Feature request
about: Suggest an idea to improve Choice-Learn
title: "[ADD]"
labels: new feature
assignees: ''

---

medium

This file contains two frontmatter sections (the parts enclosed in ---). This is invalid for GitHub issue templates and will likely cause parsing issues. The second frontmatter block seems more complete. Please remove the first, redundant block.

Suggested change
---
name: Feature request
about: Suggest an idea to improve Choice-Learn
title: "[ADD]"
labels: new feature
assignees: ''
---
---
name: 🚀 Feature request
about: Suggest an idea for this project 🏖
title: ''
labels: enhancement
assignees:
---

Comment on lines +1 to +275
"""Different classes to optimize RAM usage with repeated features over time."""

import numpy as np

from choice_learn.data.indexer import OneHotStoreIndexer, StoreIndexer


class Store:
"""Class to keep OneHotStore and FeaturesStore with same parent."""

def __init__(self, indexes=None, values=None, sequence=None, name=None, indexer=StoreIndexer):
"""Build the store.
Parameters
----------
indexes : array_like or None
list of indexes of features to store. If None is given, indexes are created from
apparition order of values
values : array_like
list of values of features to store
sequence : array_like
sequence of apparitions of the features
name: string, optional
name of the features store -- not used at the moment
"""
if indexes is None:
indexes = list(range(len(values)))
self.store = {k: v for (k, v) in zip(indexes, values)}
self.sequence = np.array(sequence)
self.name = name

if sequence is not None and values is not None:
try:
width = len(values[0])
except TypeError:
width = 1
self.shape = (len(sequence), width)

self.indexer = indexer(self)

def _get_store_element(self, index):
"""Getter method over self.sequence.
Returns the features stored at index index. Compared to __getitem__, it does take
the index-th element of sequence but the index-th element of the store.
Parameters
----------
index : (int, list, slice)
index argument of the feature
Returns
-------
array_like
features corresponding to the index index in self.store
"""
if isinstance(index, list):
return [self.store[i] for i in index]
# else:
return self.store[index]

def __len__(self):
"""Return the length of the sequence of apparition of the features."""
return len(self.sequence)

@property
def batch(self):
"""Indexing attribute."""
return self.indexer


class FeaturesStore(Store):
"""Base class to store features and a sequence of apparitions.
Mainly useful when features are repeated frequently over the sequence.
An example would be to store the features of a customers (supposing that the same customers come
several times over the work sequence) and to save which customer is concerned for each choice.
Attributes
----------
store : dict
Dictionary stocking features that can be called from indexes: {index: features}
shape : tuple
shape of the features store: (sequence_length, features_number)
sequence : array_like
List of elements of indexes representing the sequence of apparitions of the features
name: string, optional
name of the features store -- not used at the moment
dtype: type
type of the features
"""

@classmethod
def from_dict(cls, values_dict, sequence):
"""Instantiate the FeaturesStore from a dictionary of values.
Parameters
----------
values_dict : dict
dictionary of values to store, {index: value}
sequence : array_like
sequence of apparitions of the features
Returns
-------
FeaturesStore created from the values in the dictionnary
"""
# Check uniform shape of values
return cls(
indexes=list(values_dict.keys()), values=list(values_dict.values()), sequence=sequence
)

@classmethod
def from_list(cls, values_list, sequence):
"""Instantiate the FeaturesStore from a list of values.
Creates indexes for each value
Parameters
----------
values_list : list
List of values to store
sequence : array_like
sequence of apparitions of the features
Returns
-------
FeaturesStore
"""
# Check uniform shape of list
# Useful ? To rethink...
return cls(indexes=list(range(len(values_list))), values=values_list, sequence=sequence)

def __getitem__(self, sequence_index):
"""Subsets self with sequence_index.
Parameters
----------
sequence_index : (int, list, slice)
index position of the sequence
Returns
-------
array_like
features corresponding to the sequence_index-th position of sequence
"""
if isinstance(sequence_index, int):
sequence_index = [sequence_index]
new_sequence = self.sequence[sequence_index]
store = {}
for k, v in self.store.items():
if k in new_sequence:
store[k] = v
else:
print(f"Key {k} of store with value {v} not in sequence anymore")

return FeaturesStore.from_dict(store, new_sequence)

def astype(self, dtype):
"""Change the dtype of the features.
The type of the features should implement the astype method.
Typically, should work like np.ndarrays.
Parameters
----------
dtype : str or type
type to set the features as
"""
for k, v in self.store.items():
self.store[k] = v.astype(dtype)


class OneHotStore(Store):
"""Specific FeaturesStore for one hot features storage.
Inherits from FeaturesStore.
For example can be used to store a OneHot representation of the days of week.
Has the same attributes as FeaturesStore, only differs whit some One-Hot optimized methods.
"""

def __init__(
self,
indexes=None,
values=None,
sequence=None,
name=None,
dtype=np.float32,
):
"""Build the OneHot features store.
Parameters
----------
indexes : array_like or None
list of indexes of features to store. If None is given, indexes are created from
the values' order of appearance
values : array_like or None
list of one-hot feature values to store. If None is given, they are created from
the order of appearance in the sequence
sequence : array_like
sequence of appearances of the features
name : string, optional
name of the features store -- not used at the moment
dtype : type, optional
dtype of the one-hot vectors, by default np.float32
"""
self.name = name
self.sequence = np.array(sequence)

if values is None:
# from_sequence builds a new instance; copy its attributes instead of
# rebinding `self`, which would silently discard the built store
built = self.from_sequence(sequence)
self.store = built.store
self.shape = built.shape
else:
self.store = {k: v for (k, v) in zip(indexes, values)}
self.shape = (len(sequence), np.max(values) + 1)

self.dtype = dtype
self.indexer = OneHotStoreIndexer(self)

@classmethod
def from_sequence(cls, sequence):
"""Create a OneHotFeatureStore from a sequence of apparition.
One Hot vector are created from the order of apparition in the sequence: feature vectors
created have a length of the number of different values in the sequence and the 1 is
positioned in order of first appartitions in the sequence.
Parameters
----------
sequence : array-like
Sequence of apparitions of values, or indexes. Will be used to index self.store
Returns
-------
FeatureStore
Created from the sequence.
"""
all_indexes = np.unique(sequence)
values = np.arange(len(all_indexes))
return cls(indexes=all_indexes, values=values, sequence=sequence)

def __getitem__(self, sequence_index):
"""Get an element at sequence_index-th position of self.sequence.
Parameters
----------
sequence_index : (int, list, slice)
index from sequence of element to get
Returns
-------
np.ndarray
OneHot features corresponding to the sequence_index-th position of sequence
"""
if isinstance(sequence_index, int):
sequence_index = [sequence_index]
new_sequence = self.sequence[sequence_index]
store = {}
for k, v in self.store.items():
if k in new_sequence:
store[k] = v
else:
print(f"Key {k} of store with value {v} not in sequence anymore")

return OneHotStore(
indexes=list(store.keys()), values=list(store.values()), sequence=new_sequence
)

def astype(self, dtype):
"""Change (mainly int or float) type of returned OneHot features vectors.
Parameters
----------
dtype : type
Type to set the features as
"""
self.dtype = dtype
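To make the one-hot mapping concrete, here is a minimal, self-contained sketch of what `OneHotStore.from_sequence` computes, using numpy only. The helper name `one_hot_from_sequence` is illustrative, not part of choice-learn: one column per distinct value in the sequence, with the column index given by `np.unique`'s sorted ordering of those values.

```python
import numpy as np

def one_hot_from_sequence(sequence):
    """Illustrative sketch of OneHotStore.from_sequence's mapping.

    np.unique sorts the distinct values, so columns follow sorted
    order of the values, not their order of first appearance.
    """
    all_indexes = np.unique(sequence)                    # sorted distinct values
    mapping = {k: i for i, k in enumerate(all_indexes)}  # value -> column index
    vectors = np.zeros((len(sequence), len(all_indexes)), dtype=np.float32)
    for row, key in enumerate(sequence):
        vectors[row, mapping[key]] = 1.0
    return vectors

days = ["mon", "tue", "mon", "wed"]
vecs = one_hot_from_sequence(days)  # shape (4, 3); "mon" -> column 0
```

The actual store only keeps the dict `{value: column_index}` and materializes the one-hot vectors lazily at batching time through its indexer, which is the optimization the class docstring alludes to.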
Contributor

medium

This file seems to be a duplicate or an alternative implementation of the functionality in choice_learn/data/storage.py. The ChoiceDataset class uses storage.py. To avoid code duplication and potential confusion for future maintainers, it would be best to consolidate these into a single implementation. If store.py is no longer used, it should be removed.

@github-actions
Contributor

github-actions bot commented Dec 23, 2025

Coverage

Coverage Report for Python 3.9
File                             Stmts   Miss  Cover  Missing
choice_learn
   __init__.py                       2      0   100%
   tf_ops.py                        62      1    98%  283
choice_learn/basket_models
   __init__.py                       4      0   100%
   alea_carta.py                   148     22    85%  86–90, 92–96, 98–102, 106, 109, 131, 159, 308, 431–455
   base_basket_model.py            235     27    89%  111–112, 123, 141, 185, 255, 377, 485, 585–587, 676, 762, 772, 822–830, 891–894, 934–935
   basic_attention_model.py         89      4    96%  424, 427, 433, 440
   self_attention_model.py         133      9    93%  71, 73, 75, 450–454, 651
   shopper.py                      184      9    95%  130, 159, 325, 345, 360, 363, 377, 489, 618
choice_learn/basket_models/data
   __init__.py                       2      0   100%
   basket_dataset.py               190     30    84%  74–77, 295–297, 407, 540–576, 636, 658–661, 700–705, 790–801, 849
   preprocessing.py                 94     78    17%  43–45, 128–364
choice_learn/basket_models/datasets
   __init__.py                       3      0   100%
   bakery.py                        38      3    92%  47, 51, 61
   synthetic_dataset.py             81      6    93%  62, 194–199, 247
choice_learn/basket_models/utils
   __init__.py                       0      0   100%
   permutation.py                   22      1    95%  37
choice_learn/data
   __init__.py                       3      0   100%
   choice_dataset.py               649     33    95%  198, 250, 283, 421, 463–464, 589, 724, 738, 840, 842, 937, 957–961, 1140, 1159–1161, 1179–1181, 1209, 1214, 1223, 1240, 1281, 1293, 1307, 1346, 1361, 1366, 1395, 1408, 1443–1444
   indexer.py                      241     23    90%  20, 31, 45, 60–67, 202–204, 219–230, 265, 291, 582
   storage.py                      161      6    96%  22, 33, 51, 56, 61, 71
   store.py                         72     72     0%  3–275
choice_learn/datasets
   __init__.py                       4      0   100%
   base.py                         400      5    99%  42–43, 153–154, 714
   expedia.py                      102     83    19%  37–301
   tafeng.py                        49      0   100%
choice_learn/datasets/data
   __init__.py                       0      0   100%
choice_learn/models
   __init__.py                      14      2    86%  15–16
   base_model.py                   335     35    90%  145, 187, 289, 297, 303, 312, 352, 356–357, 362, 391, 395–396, 413, 426, 434, 475–476, 485–486, 587, 589, 605, 609, 611, 734–735, 935, 939–953
   baseline_models.py               49      0   100%
   conditional_logit.py            269     26    90%  49, 52, 54, 85, 88, 91–95, 98–102, 136, 206, 212–216, 351, 388, 445, 520–526, 651, 685, 822, 826
   halo_mnl.py                     124      2    98%  186, 374
   latent_class_base_model.py      286     39    86%  55–61, 273–279, 288, 325–330, 497–500, 605, 624, 665–701, 715, 720, 751–752, 774–775, 869–870, 974
   latent_class_mnl.py              62      6    90%  257–261, 296
   learning_mnl.py                  67      3    96%  157, 182, 188
   nested_logit.py                 291     12    96%  55, 77, 160, 269, 351, 484, 530, 600, 679, 848, 900, 904
   reslogit.py                     132      6    95%  285, 360, 369, 374, 382, 432
   rumnet.py                       236      3    99%  748–751, 982
   simple_mnl.py                   139      6    96%  167, 275, 347, 355, 357, 359
   tastenet.py                      94      3    97%  142, 180, 188
choice_learn/toolbox
   __init__.py                       0      0   100%
   assortment_optimizer.py          27      6    78%  28–30, 93–95, 160–162
   gurobi_opt.py                   236    236     0%  3–675
   or_tools_opt.py                 230     11    95%  103, 107, 296–305, 315, 319, 607, 611
choice_learn/utils
   metrics.py                       85     43    49%  74, 126–130, 147–166, 176, 190–199, 211–232, 242
TOTAL                            5644    851    85%

Tests   Skipped   Failures   Errors   Time
222     0 💤      0 ❌       0 🔥     6m 19s ⏱️

