diff --git a/AGENTS.md b/AGENTS.md index a1766ba..412bae6 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -4,7 +4,12 @@ This document is a concise, practical guide for new contributors ## Context -You are an expert programming AI assistant who prioritizes minimalist, efficient code. You plan before coding, write idiomatic solutions, seek clarification when needed, and accept user preferences +Raccoon is a Python library for data analysis that aims to provide data structures similar to pandas Series and DataFrames, +but with an emphasis on performance, especially for increasing the size of existing datasets. In particular, the speed of +appending new rows to a DataFrame is critical for many real-world use cases. + +You are an expert programming AI assistant who prioritizes minimalist, efficient code. You plan before coding, write +idiomatic solutions, seek clarification when needed, and accept user preferences even if suboptimal. ## Target Versions @@ -13,7 +18,7 @@ even if suboptimal. ## Planning Rules -- Create numbered plans before coding in the `.aiassistant/plans/` directory in the root of the project +- Create numbered plans before coding in the `.github/plans/` directory in the root of the project - Display current plan step clearly - Update the plan file as you progress - Ask for clarification on ambiguity @@ -29,6 +34,18 @@ even if suboptimal. ## Build and configuration +**MANDATORY RULE – DO NOT EVER BREAK THIS:** + +This developer is using Git Bash (MSYS2) on Windows. + +All shell commands, file paths, and code snippets **MUST**: + +- Use only forward slashes `/` for paths +- Use only POSIX/bash commands +- Never use backslashes `\`, drive letters, or Windows-specific commands + +Violating this rule makes the suggestion completely unusable. + - Tooling - Package manager/runner: uv (Astral). Local workflows should prefer `uv` for env creation and running.
 - Python target: pyproject sets the target python with `requires-python`, use the version in that file @@ -49,10 +66,10 @@ even if suboptimal. - Fast suite (skip slow): `uv run --no-sync pytest -m "not slow"` - Quiet summary: `uv run --no-sync pytest -q` - Target a path/file: `uv run --no-sync pytest -q tests` or a single file - - Parallel (dev extra): `uv run --no-sync pytest -n auto -m "not slow"` - Adding tests: conventions - - All tests must pass without errors or warnings. If the tests produce warnings, modify the tests until they no longer produce warnings + - All tests must pass without errors or warnings. If the tests produce warnings, modify the tests until they no + longer produce warnings ## Contributing Tips @@ -71,7 +88,8 @@ even if suboptimal. ## Changes - The change log file is `docs/change_log.rst` file in this project, use that to track changes to the project. -- When *major* changes or features are made/added I want a concise summary of the change and files involved (summarize if too many files are changed). Use the change_log.rst file to track changes and +- When *major* changes or features are made/added I want a concise summary of the change and files involved (summarize + if too many files are changed). Use the change_log.rst file to track changes and as a template. ## Quick checklist for new contributions @@ -80,7 +98,7 @@ even if suboptimal. - [ ] Always do a git pull when starting a new chat on a project that has a git repo. - [ ] Add focused tests next to code, with assets under `tests/data` pattern - [ ] Run `uv run --no-sync pytest -m "not slow"` locally; consider `-n auto`. -- [ ] Run `uv run python -m black` on any python files you have edited. +- [ ] Run `uv run python -m ruff format .` on any python files you have edited. - [ ] Document assumptions (columns/dtypes, units, timezones). --- @@ -89,7 +107,7 @@ even if suboptimal.
### General Principles -- Follow PEP 8 with max line length 120 (Black enforces formatting and import sorting when enabled) +- Follow PEP 8 with max line length 120 (Ruff enforces formatting and import sorting when enabled) - Always prioritize readability and clarity - Write concise, efficient, and idiomatic code - Avoid duplicate code @@ -111,7 +129,8 @@ even if suboptimal. ### Type Hints & Syntax - Always include type hints -- Use built-in generics (`list[str]`, `dict[str, int]`, `set[str]`) and `| None` for optionals; avoid `List`, `Dict`, `Optional` from typing module +- Use built-in generics (`list[str]`, `dict[str, int]`, `set[str]`) and `| None` for optionals; avoid `List`, `Dict`, + `Optional` from typing module - Prefer list/dict comprehensions over loops when clear - Use f-strings; no string concatenation with `+` - Use `pathlib.Path` exclusively for filesystem paths @@ -174,7 +193,8 @@ even if suboptimal. - Test both positive and negative scenarios - Prefer inline over parameterization - Avoid fixtures unless they are used more than 4 times -- If there are more than 10 items in the parameterize list for a test, split into multiple test functions with no more than 10 parameterizations in the list for each function, split by scenario to aid +- If there are more than 10 items in the parameterize list for a test, split into multiple test functions with no more + than 10 parameterizations in the list for each function, split by scenario to aid failure diagnosis - Always include test cases for critical paths of the application. - Account for common edge cases like empty inputs, invalid data types, and large datasets. diff --git a/docs/change_log.rst b/docs/change_log.rst index c03f5ac..51bc78a 100644 --- a/docs/change_log.rst +++ b/docs/change_log.rst @@ -198,3 +198,9 @@ an installation requirement. 
~~~~~~~~~~~~~~~~ - Added setup for using uv - Added 3.14 to the list of working pythons + +3.3.1 (12/24/25) +~~~~~~~~~~~~~~~~ +- Added setup for using agentic AI +- Used AI to add type hints +- Used AI to make performance optimizations diff --git a/pyproject.toml b/pyproject.toml index f1f56d8..6050cfb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "raccoon" -version = "3.3.0" +version = "3.3.1" authors = [ { name = "Ryan Sheftel", email = "rsheftel@alumni.upenn.edu" }, ] diff --git a/raccoon/dataframe.py b/raccoon/dataframe.py index 0f2c705..aa6e7c5 100644 --- a/raccoon/dataframe.py +++ b/raccoon/dataframe.py @@ -28,13 +28,13 @@ class DataFrame(object): __slots__ = ["_data", "_index", "_index_name", "_columns", "_sort", "_dropin"] def __init__( - self, - data: dict[Any, list | Any] | None = None, - columns: list | None = None, - index: list | None = None, - index_name: str | tuple | None = "index", - sort: bool | None = None, - dropin: Callable = None, + self, + data: dict[Any, list | Any] | None = None, + columns: list | None = None, + index: list | None = None, + index_name: str | tuple | None = "index", + sort: bool | None = None, + dropin: Callable = None, ): """ :param data: (optional) dictionary of lists. The keys of the dictionary will be used for the column names and\ @@ -81,10 +81,11 @@ def __init__( # set data from dict values. 
If dict value is not a list, wrap it to make a single element list self._data = ( dropin( - [dropin(x) if ((type(x) == dropin) or (type(x) == list)) else dropin([x]) for x in data.values()] + [dropin(x) if ((isinstance(x, dropin)) or (isinstance(x, list))) else dropin([x]) for x in + data.values()] ) if dropin - else [x if type(x) == list else [x] for x in data.values()] + else [x if isinstance(x, list) else [x] for x in data.values()] ) # setup columns from directory keys self.columns = data.keys() @@ -123,8 +124,8 @@ def __repr__(self): def __str__(self) -> str: return self._make_table() - def _check_list(self, x: list) -> bool: - return type(x) == (self._dropin if self._dropin else list) + def _check_list(self, x: Any) -> bool: + return isinstance(x, self._dropin) if self._dropin else isinstance(x, list) def _make_table(self, index: bool = True, **kwargs) -> str: kwargs["headers"] = "keys" if "headers" not in kwargs.keys() else kwargs["headers"] @@ -148,7 +149,7 @@ def _sort_columns(self, columns_list: list) -> None: :param columns_list: list of column names. Must include all column names :return: nothing """ - if not (all([x in columns_list for x in self._columns]) and all([x in self._columns for x in columns_list])): + if set(columns_list) != set(self._columns): raise ValueError( "columns_list must be all in current columns, and all current columns must be in columns_list" ) @@ -256,11 +257,11 @@ def select_index(self, compare: Any | tuple, result: Literal["boolean", "value"] raise ValueError("only valid values for result parameter are: boolean or value.") def get( - self, - indexes: Any | list[Any | bool] = None, - columns: Any | list = None, - as_list: bool = False, - as_dict: bool = False, + self, + indexes: Any | list[Any | bool] = None, + columns: Any | list = None, + as_list: bool = False, + as_dict: bool = False, ) -> Self | list | dict | Any: """ Given indexes and columns will return a sub-set of the DataFrame. 
This method will direct to the below methods @@ -314,7 +315,7 @@ def get_rows(self, indexes: list[bool | Any], column: Any, as_list: bool = False :return: DataFrame is as_list if False, a list if as_list is True """ c = self._columns.index(column) - if all([isinstance(i, bool) for i in indexes]): # boolean list + if indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes): # boolean list if len(indexes) != len(self._index): raise ValueError("boolean index list must be same size of existing index") if all(indexes): # the entire column @@ -338,13 +339,13 @@ def get_rows(self, indexes: list[bool | Any], column: Any, as_list: bool = False ) def get_columns( - self, - index: Any, - columns: list[Any] = None, - as_dict: bool = False, - as_namedtuple: bool = False, - name: str = "raccoon", - include_index: bool = True, + self, + index: Any, + columns: list[Any] = None, + as_dict: bool = False, + as_namedtuple: bool = False, + name: str = "raccoon", + include_index: bool = True, ) -> Self | dict | namedtuple: """ For a single index and list of column names return a DataFrame of the values in that index as either a dict @@ -392,7 +393,7 @@ def get_matrix(self, indexes: list[Any | bool], columns: list[Any]) -> Self: """ bool_indexes = [] locations = [] - if all([isinstance(i, bool) for i in indexes]): # boolean list + if indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes): # boolean list is_bool_indexes = True if len(indexes) != len(self._index): raise ValueError("boolean index list must be same size of existing index") @@ -406,7 +407,7 @@ def get_matrix(self, indexes: list[Any | bool], columns: list[Any]) -> Self: else [self._index.index(x) for x in indexes] ) - if all([isinstance(i, bool) for i in columns]): # boolean list + if columns and isinstance(columns[0], bool) and all(isinstance(i, bool) for i in columns): # boolean list if len(columns) != len(self._columns): raise ValueError("boolean column list must 
be same size of existing columns") columns = list(compress(self._columns, columns)) @@ -424,13 +425,13 @@ def get_matrix(self, indexes: list[Any | bool], columns: list[Any]) -> Self: return DataFrame(data=data_dict, index=indexes, columns=columns, index_name=self._index_name, sort=self._sort) def get_location( - self, - location: int, - columns: Any | list | None = None, - as_dict: bool = False, - as_namedtuple: bool = False, - name: str = "raccoon", - index: bool = True, + self, + location: int, + columns: Any | list | None = None, + as_dict: bool = False, + as_namedtuple: bool = False, + name: str = "raccoon", + index: bool = True, ) -> Self | dict | namedtuple | Any: """ For an index location and either (1) list of columns return a DataFrame or dictionary of the values or @@ -452,14 +453,14 @@ def get_location( elif not isinstance(columns, list): # single value for columns c = self._columns.index(columns) return self._data[c][location] - elif all([isinstance(i, bool) for i in columns]): + elif columns and isinstance(columns[0], bool) and all(isinstance(i, bool) for i in columns): if len(columns) != len(self._columns): raise ValueError("boolean column list must be same size of existing columns") columns = list(compress(self._columns, columns)) + col_to_idx = {col: i for i, col in enumerate(self._columns)} data = dict() for column in columns: - c = self._columns.index(column) - data[column] = self._data[c][location] + data[column] = self._data[col_to_idx[column]][location] index_value = self._index[location] if as_dict: if index: @@ -489,7 +490,7 @@ def get_locations(self, locations: list, columns: Any | list | None = None, **kw return self.get(indexes, columns, **kwargs) def get_slice( - self, start_index: Any = None, stop_index: Any = None, columns: list | None = None, as_dict: bool = False + self, start_index: Any = None, stop_index: Any = None, columns: list | None = None, as_dict: bool = False ) -> Self | tuple: """ For sorted DataFrames will return either a 
DataFrame or dict of all the rows where the index is greater than @@ -508,7 +509,7 @@ def get_slice( if columns is None: columns = self._columns - elif all([isinstance(i, bool) for i in columns]): + elif columns and isinstance(columns[0], bool) and all(isinstance(i, bool) for i in columns): if len(columns) != len(self._columns): raise ValueError("boolean column list must be same size of existing columns") columns = list(compress(self._columns, columns)) @@ -517,10 +518,10 @@ def get_slice( stop_location = bisect_right(self._index, stop_index) if stop_index is not None else None index = self._index[start_location:stop_location] + col_to_idx = {col: i for i, col in enumerate(self._columns)} data = dict() for column in columns: - c = self._columns.index(column) - data[column] = self._data[c][start_location:stop_location] + data[column] = self._data[col_to_idx[column]][start_location:stop_location] if as_dict: return index, data @@ -558,7 +559,8 @@ def _insert_missing_rows(self, indexes: list[Any]) -> None: :param indexes: list of indexes :return: nothing """ - new_indexes = [x for x in indexes if x not in self._index] + existing = set(self._index) + new_indexes = [x for x in indexes if x not in existing] for x in new_indexes: self._insert_row(bisect_left(self._index, x), x) @@ -581,7 +583,8 @@ def _add_missing_rows(self, indexes: list[Any]) -> None: :param indexes: list of indexes :return: nothing """ - new_indexes = [x for x in indexes if x not in self._index] + existing = set(self._index) + new_indexes = [x for x in indexes if x not in existing] for x in new_indexes: self._add_row(x) @@ -599,7 +602,7 @@ def _add_column(self, column: Any) -> None: self._data.append([None] * len(self._index)) def set( - self, indexes: Any | list | list[bool] = None, columns: Any | None = None, values: Any | list = None + self, indexes: Any | list | list[bool] = None, columns: Any | None = None, values: Any | list = None ) -> None: """ Given indexes and columns will set a sub-set of 
the DataFrame to the values provided. This method will direct @@ -626,7 +629,7 @@ def set( else: raise ValueError("either or both of indexes or columns must be provided") - def set_cell(self, index, column, value): + def set_cell(self, index: Any, column: Any, value: Any) -> None: """ Sets the value of a single cell. If the index and/or column is not in the current index/columns then a new index and/or column will be created. @@ -653,7 +656,7 @@ def set_cell(self, index, column, value): self._add_column(column) self._data[c][i] = value - def set_row(self, index, values): + def set_row(self, index: Any, values: dict[str, Any] | Any) -> None: """ Sets the values of the columns in a single row. @@ -697,7 +700,7 @@ def set_column(self, index=None, column=None, values=None): c = len(self._columns) self._add_column(column) if index: # index was provided - if all([isinstance(i, bool) for i in index]): # boolean list + if index and isinstance(index[0], bool) and all(isinstance(i, bool) for i in index): # boolean list if not self._check_list(values): # single value provided, not a list, so turn values into list values = [values for x in index if x] if len(index) != len(self._index): @@ -714,9 +717,12 @@ def set_column(self, index=None, column=None, values=None): raise ValueError("length of values and index must be the same.") # insert or append indexes as needed if self._sort: - exists_tuples = list(zip(*[sorted_exists(self._index, x) for x in index])) - exists = exists_tuples[0] - indexes = exists_tuples[1] + exists = [] + indexes = [] + for x in index: + e, i = sorted_exists(self._index, x) + exists.append(e) + indexes.append(i) if not all(exists): self._insert_missing_rows(index) indexes = [sorted_index(self._index, x) for x in index] @@ -769,7 +775,7 @@ def set_locations(self, locations, column, values): indexes = [self._index[x] for x in locations] self.set(indexes, column, values) - def append_row(self, index, values, new_cols=True): + def append_row(self, index: Any, 
values: dict[str, Any], new_cols: bool = True) -> None: """ Appends a row of values to the end of the data. If there are new columns in the values and new_cols is True they will be added. Be very careful with this function as for sort DataFrames it will not enforce sort order. @@ -796,7 +802,7 @@ def append_row(self, index, values, new_cols=True): for c, col in enumerate(self._columns): self._data[c].append(values.get(col, None)) - def append_rows(self, indexes, values, new_cols=True): + def append_rows(self, indexes: list[Any], values: dict[str, list[Any]], new_cols: bool = True) -> None: """ Appends rows of values to the end of the data. If there are new columns in the values and new_cols is True they will be added. Be very careful with this function as for sort DataFrames it will not enforce sort order. @@ -843,13 +849,8 @@ def _slice_index(self, slicer): if end_index < start_index: raise IndexError("end of slice is before start of slice") - pre_list = [False] * start_index - mid_list = [True] * (end_index - start_index + 1) - post_list = [False] * (len(self._index) - 1 - end_index) - - pre_list.extend(mid_list) - pre_list.extend(post_list) - return pre_list + index_len = len(self._index) + return [False] * start_index + [True] * (end_index - start_index + 1) + [False] * (index_len - 1 - end_index) def __getitem__(self, index): """ @@ -952,7 +953,8 @@ def to_json(self) -> str: for key in self.__slots__: if key not in ["_data", "_index"]: value = self.__getattribute__(key) - meta_data[key.lstrip("_")] = value if not type(value) == self._dropin else list(value) + meta_data[key.lstrip("_")] = value if not (self._dropin and isinstance(value, self._dropin)) else list( + value) input_dict["meta_data"] = meta_data return json.dumps(input_dict, default=repr) @@ -998,7 +1000,7 @@ def delete_rows(self, indexes): :return: nothing """ indexes = [indexes] if not self._check_list(indexes) else indexes - if all([isinstance(i, bool) for i in indexes]): # boolean list + if 
indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes): # boolean list if len(indexes) != len(self._index): raise ValueError("boolean indexes list must be same size of existing indexes") indexes = [i for i, x in enumerate(indexes) if x] @@ -1009,11 +1011,9 @@ def delete_rows(self, indexes): else [self._index.index(x) for x in indexes] ) indexes = sorted(indexes, reverse=True) # need to sort and reverse list so deleting works - for c, _ in enumerate(self._columns): - for i in indexes: - del self._data[c][i] - # now remove from index for i in indexes: + for c in range(len(self._columns)): + del self._data[c][i] del self._index[i] def delete_all_rows(self): @@ -1154,10 +1154,10 @@ def _get_lists(self, left_column, right_column, indexes): right_list = self.get_rows(indexes, right_column, as_list=True) return left_list, right_list - def add(self, left_column, right_column, indexes=None): + def add(self, left_column: Any, right_column: Any, indexes: list | list[bool] | None = None) -> list: """ - Math helper method that adds element-wise two columns. If indexes are not None then will only perform the math - on that sub-set of the columns. + Math helper method that adds element-wise two columns. If indexes are not None then will only perform the + math on that sub-set of the columns. :param left_column: first column name :param right_column: second column name @@ -1166,9 +1166,9 @@ def add(self, left_column, right_column, indexes=None): :return: list """ left_list, right_list = self._get_lists(left_column, right_column, indexes) - return [l + r for l, r in zip(left_list, right_list)] + return [left_val + r for left_val, r in zip(left_list, right_list)] - def subtract(self, left_column, right_column, indexes=None): + def subtract(self, left_column: Any, right_column: Any, indexes: list | list[bool] | None = None) -> list: """ Math helper method that subtracts element-wise two columns. 
If indexes are not None then will only perform the math on that sub-set of the columns. @@ -1180,9 +1180,9 @@ def subtract(self, left_column, right_column, indexes=None): :return: list """ left_list, right_list = self._get_lists(left_column, right_column, indexes) - return [l - r for l, r in zip(left_list, right_list)] + return [left_val - r for left_val, r in zip(left_list, right_list)] - def multiply(self, left_column, right_column, indexes=None): + def multiply(self, left_column: Any, right_column: Any, indexes: list | list[bool] | None = None) -> list: """ Math helper method that multiplies element-wise two columns. If indexes are not None then will only perform the math on that sub-set of the columns. @@ -1194,9 +1194,9 @@ def multiply(self, left_column, right_column, indexes=None): :return: list """ left_list, right_list = self._get_lists(left_column, right_column, indexes) - return [l * r for l, r in zip(left_list, right_list)] + return [left_val * r for left_val, r in zip(left_list, right_list)] - def divide(self, left_column, right_column, indexes=None): + def divide(self, left_column: Any, right_column: Any, indexes: list | list[bool] | None = None) -> list: """ Math helper method that divides element-wise two columns. If indexes are not None then will only perform the math on that sub-set of the columns. 
@@ -1208,7 +1208,7 @@ def divide(self, left_column, right_column, indexes=None): :return: list """ left_list, right_list = self._get_lists(left_column, right_column, indexes) - return [l / r for l, r in zip(left_list, right_list)] + return [left_val / r for left_val, r in zip(left_list, right_list)] def isin(self, column: Any, compare_list: list) -> list[bool]: """ @@ -1218,7 +1218,8 @@ def isin(self, column: Any, compare_list: list) -> list[bool]: :param compare_list: list of items to compare to :return: list of booleans """ - return [x in compare_list for x in self._data[self._columns.index(column)]] + compare_set = set(compare_list) + return [x in compare_set for x in self._data[self._columns.index(column)]] def iterrows(self, index: bool = True) -> Iterator[dict]: """ diff --git a/raccoon/series.py b/raccoon/series.py index ee7de9e..3ac854f 100644 --- a/raccoon/series.py +++ b/raccoon/series.py @@ -43,11 +43,11 @@ def __repr__(self) -> str: def __str__(self) -> str: return self._make_table() - def _make_table(self, index: bool = True, **kwargs) -> str: + def _make_table(self, index: bool = True, **kwargs: Any) -> str: kwargs["headers"] = "keys" if "headers" not in kwargs.keys() else kwargs["headers"] return tabulate(self.to_dict(ordered=True, index=index), **kwargs) - def print(self, index: bool = True, **kwargs) -> None: + def print(self, index: bool = True, **kwargs: Any) -> None: """ Print the contents of the Series. This method uses the tabulate function from the tabulate package. Use the kwargs to pass along any arguments to the tabulate function. 
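Note on the `isin` hunks: the change builds a `set` from `compare_list` once and tests membership against it, instead of scanning the list for every element. The sketch below is a standalone illustration, not the library code; `isin_list` and `isin_set` are hypothetical stand-ins for the before and after bodies. It also shows one behaviour difference that may be worth covering in tests: values that are not hashable work with list membership but raise `TypeError` once a set is involved.

```python
def isin_list(data: list, compare_list: list) -> list[bool]:
    """Before: every lookup scans compare_list, so cost grows with its length."""
    return [x in compare_list for x in data]


def isin_set(data: list, compare_list: list) -> list[bool]:
    """After: build the set once, then each lookup is O(1) on average."""
    compare_set = set(compare_list)
    return [x in compare_set for x in data]


column = [1, 2, 3, 4, 5]
wanted = [2, 4, 6]

# For hashable values the two versions agree.
assert isin_list(column, wanted) == isin_set(column, wanted)
assert isin_set(column, wanted) == [False, True, False, True, False]

# Unhashable values (for example lists stored in a column) only work with
# the list-based version; the set-based version raises TypeError.
nested = [[1], [2]]
assert isin_list(nested, [[1]]) == [True, False]
try:
    isin_set(nested, [[1]])
except TypeError as exc:
    print(f"set-based isin rejects unhashable values: {exc}")
```

If columns holding unhashable values are a supported use case, a fallback to the list scan on `TypeError` would be one option; whether that is needed depends on how the library is used in practice.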
@@ -95,7 +95,7 @@ def sort(self): return def _check_list(self, x: Any) -> bool: - return type(x) == (self._dropin if self._dropin else list) + return isinstance(x, self._dropin) if self._dropin else isinstance(x, list) def get(self, indexes: Any | list | list[bool], as_list: bool = False) -> Self | list | Any: """ @@ -130,7 +130,7 @@ def get_rows(self, indexes: Any | list | list[bool], as_list: bool = False) -> S :param as_list: if True return a list, if False return Series :return: Series if as_list if False, a list if as_list is True """ - if all([isinstance(i, bool) for i in indexes]): # boolean list + if indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes): # boolean list if len(indexes) != len(self._index): raise ValueError("boolean index list must be same size of existing index") if all(indexes): # the entire column @@ -184,7 +184,7 @@ def get_locations(self, locations: list[int], as_list: bool = False) -> Self | l return self.get(indexes, as_list) def get_slice( - self, start_index: Any = None, stop_index: Any = None, as_list: bool = False + self, start_index: Any = None, stop_index: Any = None, as_list: bool = False ) -> Self | tuple[list, list]: """ For sorted Series will return either a Series or list of all the rows where the index is greater than @@ -230,13 +230,8 @@ def _slice_index(self, slicer: slice) -> list: if end_index < start_index: raise IndexError("end of slice is before start of slice") - pre_list = [False] * start_index - mid_list = [True] * (end_index - start_index + 1) - post_list = [False] * (len(self._index) - 1 - end_index) - - pre_list.extend(mid_list) - pre_list.extend(post_list) - return pre_list + index_len = len(self._index) + return [False] * start_index + [True] * (end_index - start_index + 1) + [False] * (index_len - 1 - end_index) def _validate_index(self, indexes: list) -> None: """ @@ -245,7 +240,7 @@ def _validate_index(self, indexes: list) -> None: :param list indexes: list of indexes 
:return: nothing """ - if not (self._check_list(indexes) or type(indexes) == list or indexes is None): + if not (self._check_list(indexes) or isinstance(indexes, list) or indexes is None): raise TypeError("indexes must be list, %s or None" % self._dropin) if len(indexes) != len(set(indexes)): # noqa raise ValueError("index contains duplicates") @@ -339,7 +334,8 @@ def isin(self, compare_list: list) -> list[bool]: :param compare_list: list of items to compare to :return: list of booleans """ - return [x in compare_list for x in self._data] + compare_set = set(compare_list) + return [x in compare_set for x in self._data] def equality(self, indexes: list | list[bool] = None, value: Any = None) -> list[bool]: """ @@ -368,13 +364,13 @@ class Series(SeriesBase): """ def __init__( - self, - data: dict | list | None = None, - index: list | None = None, - data_name: str | tuple | None = "value", - index_name: str | tuple | None = "index", - sort: bool = None, - dropin: Callable = None, + self, + data: dict | list | None = None, + index: list | None = None, + data_name: str | tuple | None = "value", + index_name: str | tuple | None = "index", + sort: bool = None, + dropin: Callable = None, ): """ :param data: (optional) list of values. 
@@ -403,7 +399,7 @@ def __init__( self.index = index else: self.index = list() - elif self._check_list(data) or type(data) == list: + elif self._check_list(data) or isinstance(data, list): self._data = dropin([x for x in data]) if dropin else [x for x in data] # setup index if index: @@ -518,7 +514,8 @@ def _add_missing_rows(self, indexes: list) -> None: :param indexes: list of indexes :return: nothing """ - new_indexes = [x for x in indexes if x not in self._index] + existing = set(self._index) + new_indexes = [x for x in indexes if x not in existing] for x in new_indexes: self._add_row(x) @@ -530,7 +527,8 @@ def _insert_missing_rows(self, indexes: list) -> None: :param indexes: list of indexes :return: nothing """ - new_indexes = [x for x in indexes if x not in self._index] + existing = set(self._index) + new_indexes = [x for x in indexes if x not in existing] for x in new_indexes: self._insert_row(bisect_left(self._index, x), x) @@ -565,7 +563,7 @@ def set_rows(self, index: list | list[bool], values: Any | list = None) -> None: list is values, or the length of the True values in the index list if the index list is booleans :return: nothing """ - if all([isinstance(i, bool) for i in index]): # boolean list + if index and isinstance(index[0], bool) and all(isinstance(i, bool) for i in index): # boolean list if not self._check_list(values): # single value provided, not a list, so turn values into list values = [values for x in index if x] if len(index) != len(self._index): @@ -582,9 +580,12 @@ def set_rows(self, index: list | list[bool], values: Any | list = None) -> None: raise ValueError("length of values and index must be the same.") # insert or append indexes as needed if self._sort: - exists_tuples = list(zip(*[sorted_exists(self._index, x) for x in index])) - exists = exists_tuples[0] - indexes = exists_tuples[1] + exists = [] + indexes = [] + for x in index: + e, i = sorted_exists(self._index, x) + exists.append(e) + indexes.append(i) if not all(exists): 
self._insert_missing_rows(index) indexes = [sorted_index(self._index, x) for x in index] @@ -701,7 +702,7 @@ def delete(self, indexes: Any | list | list[bool]) -> None: :return: nothing """ indexes = [indexes] if not self._check_list(indexes) else indexes - if all([isinstance(i, bool) for i in indexes]): # boolean list + if indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes): # boolean list if len(indexes) != len(self._index): raise ValueError("boolean indexes list must be same size of existing indexes") indexes = [i for i, x in enumerate(indexes) if x] @@ -714,8 +715,6 @@ def delete(self, indexes: Any | list | list[bool]) -> None: indexes = sorted(indexes, reverse=True) # need to sort and reverse list so deleting works for i in indexes: del self._data[i] - # now remove from index - for i in indexes: del self._index[i] def reset_index(self) -> None: @@ -737,13 +736,13 @@ class ViewSeries(SeriesBase): """ def __init__( - self, - data: list | tuple | None = None, - index: list | None = None, - data_name: str | tuple | None = "value", - index_name: str | tuple | None = "index", - sort: bool = False, - offset: int = 0, + self, + data: list | tuple | None = None, + index: list | None = None, + data_name: str | tuple | None = "value", + index_name: str | tuple | None = "index", + sort: bool = False, + offset: int = 0, ): """ :param data: (optional) list of values. 
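Note on the boolean-list checks: many hunks replace `all([isinstance(i, bool) for i in indexes])` with `indexes and isinstance(indexes[0], bool) and all(isinstance(i, bool) for i in indexes)`. The sketch below isolates the two expressions in hypothetical helper functions (`is_bool_list_old` and `is_bool_list_new` do not exist in raccoon) to show where they agree and where they differ. The new form is wrapped in `bool(...)` here only so that the helper returns a strict boolean.

```python
def is_bool_list_old(items: list) -> bool:
    """The check as written before this diff."""
    return all([isinstance(i, bool) for i in items])


def is_bool_list_new(items: list) -> bool:
    """The check as written after this diff.

    A non-bool first element short-circuits before the full scan, and an
    empty list is no longer treated as a boolean list.
    """
    return bool(items and isinstance(items[0], bool) and all(isinstance(i, bool) for i in items))


# Both forms agree on non-empty input.
assert is_bool_list_old([True, False]) and is_bool_list_new([True, False])
assert not is_bool_list_old(["a", True]) and not is_bool_list_new(["a", True])
assert not is_bool_list_old([True, "a"]) and not is_bool_list_new([True, "a"])

# Integers are not booleans, so 0/1 lists are treated as labels by both forms.
assert not is_bool_list_old([1, 0]) and not is_bool_list_new([1, 0])

# The forms differ on an empty list: all([]) is True, the new check is False.
assert is_bool_list_old([]) is True
assert is_bool_list_new([]) is False
```

As far as the hunks show, this means an empty `indexes` list now takes the non-boolean branch in methods such as `get_rows` and `delete`, where it previously entered the boolean branch and hit the length check.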
diff --git a/raccoon/sort_utils.py b/raccoon/sort_utils.py index a5dee6e..8d116c2 100644 --- a/raccoon/sort_utils.py +++ b/raccoon/sort_utils.py @@ -2,7 +2,7 @@ Utility functions for sorting and dealing with sorted Series and DataFrames """ -from bisect import bisect_left, bisect_right +from bisect import bisect_left from typing import Any, Callable @@ -17,8 +17,7 @@ def sorted_exists(values: list, x: Any) -> tuple[bool, int]: :return: (exists, index) tuple """ i = bisect_left(values, x) - j = bisect_right(values, x) - exists = x in values[i:j] + exists = i < len(values) and values[i] == x return exists, i @@ -31,8 +30,9 @@ def sorted_index(values: list, x: Any) -> int: :return: integer index """ i = bisect_left(values, x) - j = bisect_right(values, x) - return values[i:j].index(x) + i + if i < len(values) and values[i] == x: + return i + raise ValueError(f"{x!r} is not in list") def sorted_list_indexes(list_to_sort: list, key: Callable | Any = None, reverse: bool = False) -> list[int]: diff --git a/raccoon/utils.py b/raccoon/utils.py index 01918b6..503ecde 100644 --- a/raccoon/utils.py +++ b/raccoon/utils.py @@ -2,10 +2,17 @@ Raccoon utilities """ +from typing import Callable + import raccoon as rc -def assert_frame_equal(left: rc.DataFrame, right: rc.DataFrame, data_function=None, data_args=None) -> None: +def assert_frame_equal( + left: rc.DataFrame, + right: rc.DataFrame, + data_function: Callable | None = None, + data_args: dict | None = None, +) -> None: """ For unit testing equality of two DataFrames. @@ -28,7 +35,8 @@ def assert_frame_equal(left: rc.DataFrame, right: rc.DataFrame, data_function=No def assert_series_equal( - left: rc.Series | rc.ViewSeries, right: rc.Series | rc.ViewSeries, data_function=None, data_args=None + left: rc.Series | rc.ViewSeries, right: rc.Series | rc.ViewSeries, data_function: Callable | None = None, + data_args: dict | None = None ) -> None: """ For unit testing equality of two Series. 
@@ -39,7 +47,7 @@ def assert_series_equal( :param data_args: arguments to pass to the data_function :return: nothing """ - assert type(left) == type(right) + assert type(left) is type(right) if data_function: data_args = {} if not data_args else data_args data_function(left.data, right.data, **data_args)
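Note on the `raccoon/sort_utils.py` hunks earlier in this diff: `sorted_exists` and `sorted_index` now rely on a single `bisect_left` call and a direct comparison, rather than slicing between `bisect_left` and `bisect_right`. The sketch below copies the two post-change function bodies into a standalone script so the behaviour can be checked in isolation; the sample `index` list is made up for illustration.

```python
from bisect import bisect_left
from typing import Any


def sorted_exists(values: list, x: Any) -> tuple[bool, int]:
    """Return (exists, insertion_point) for x in an ascending-sorted list."""
    i = bisect_left(values, x)
    exists = i < len(values) and values[i] == x
    return exists, i


def sorted_index(values: list, x: Any) -> int:
    """Return the position of x in an ascending-sorted list, or raise ValueError."""
    i = bisect_left(values, x)
    if i < len(values) and values[i] == x:
        return i
    raise ValueError(f"{x!r} is not in list")


index = [10, 20, 30]

assert sorted_exists(index, 20) == (True, 1)   # present: position of the match
assert sorted_exists(index, 25) == (False, 2)  # absent: where it would be inserted
assert sorted_exists(index, 40) == (False, 3)  # past the end: the bounds check prevents IndexError
assert sorted_index(index, 30) == 2

try:
    sorted_index(index, 25)
except ValueError as exc:
    print(f"missing value raises ValueError: {exc}")
```

The `i < len(values)` guard is what makes the past-the-end case safe, since `bisect_left` returns `len(values)` when `x` is greater than every element.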