Enable Unrestricted LLM Output with Parsing + Add One-Shot Anomaly Detection Support #39

Conversation
sarahmish left a comment
Thanks @scherkao31! Great features to add! I have a few minor comments.
...ves/jsons/sigllm.primitives.prompting.timeseries_preprocessing.rolling_window_sequences.json
Outdated
sigllm/primitives/transformation.py
Outdated
    if normal:
        # Handle as single time series (window_size, 1)
        return _as_string(X)
    else:
        # Handle as multiple windows (num_windows, window_size, 1)
        results = list(map(_as_string, X))
        return np.array(results)
I'm wondering if we should have it more generalized, regardless of normal or not:

    if X.ndim < 2:
        return _as_string(X)
    else:
        ...

What do you think?
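A minimal sketch of what that dimensionality-based dispatch could look like. The function name `format_as_string` and the body of `_as_string` are illustrative stand-ins (not taken from the PR), and the exact dimensionality check may need adjusting to the shapes the pipeline actually produces:

```python
import numpy as np


def _as_string(values):
    # Hypothetical stand-in for the PR's helper: join one window's values
    # into a comma-separated string.
    return ','.join(str(v) for v in np.asarray(values).flatten())


def format_as_string(X):
    # Branch on array dimensionality instead of a normal/single flag,
    # as the review suggests.
    X = np.asarray(X)
    if X.ndim < 2:
        # A single series of shape (window_size,).
        return _as_string(X)

    # Multiple windows of shape (num_windows, window_size) or (num_windows, window_size, 1).
    return np.array([_as_string(window) for window in X])
```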
def parse_anomaly_response(X):
    """Parse a list of lists of LLM responses to extract anomaly values and format them as strings.

    Args:
        X (List[List[str]]): List of lists of response texts from the LLM in the format
            "Answer: no anomalies" or "Answer: [val1, val2, ..., valN]"

    Returns:
        List[List[str]]: List of lists of parsed responses where each element is either
            "val1,val2,...,valN" if anomalies are found,
            or empty string if no anomalies are present
    """
reformat docstrings to keep arg name and description on separate lines:
"""Extract anomalies from LLM response.
Parse a list of lists of LLM responses to extract anomaly
values and format them as strings.
Args:
X (List[List[str]]):
List of lists of response texts from the LLM in the format
"Answer: no anomalies" or "Answer: [val1, val2, ..., valN]".
Returns:
List[List[str]]:
List of lists of parsed responses where each element is either
"val1,val2,...,valN" if anomalies are found, or empty string if
no anomalies are present.
"""
The first line of docstrings should be shorter. I already included a description in my comment above.
sigllm/primitives/transformation.py
Outdated
        or empty string if no anomalies are present
    """


def parse_single_response(text: str) -> str:
I would remove the type annotations here since it's a private function. I also recommend removing comments that are not necessary.
For private functions, we normally make them start with _, so that would be:

    def _parse_single_response(text):

You still have str typing in your implementation.
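A minimal sketch of what the renamed private helper (and the public wrapper that calls it) could look like, assuming the response formats quoted in the docstring above; the regex-based body is illustrative, not taken from the PR:

```python
import re


def _parse_single_response(text):
    # Extract anomaly values from one LLM response such as
    # "Answer: [val1, val2, ..., valN]" or "Answer: no anomalies".
    match = re.search(r'\[(.*?)\]', text)
    if match is None:
        # No bracketed list found, e.g. "Answer: no anomalies".
        return ''

    values = [value.strip() for value in match.group(1).split(',')]
    return ','.join(value for value in values if value)


def parse_anomaly_response(X):
    """Extract anomalies from LLM responses."""
    return [[_parse_single_response(text) for text in responses] for responses in X]
```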
sarahmish left a comment
Thanks @scherkao31 for making the changes! Some previous comments were not addressed so I kept them as unresolved.
In addition, I want to note:
- mistral_detector should be modified to set restrict_tokens to True.
- the PR is missing the hyperparameter settings for MSL.
- do you plan to include hyperparameters for Yahoo?
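For reference, MLBlocks-based pipelines such as the ones in this project typically take hyperparameters as a nested dictionary keyed by primitive name, so pinning restrict_tokens could look roughly like the sketch below; the primitive key shown is a made-up example, not the actual name used in mistral_detector:

```python
# Purely illustrative: the nested-dict hyperparameter shape used by
# MLBlocks-style pipelines. The primitive key below is hypothetical.
hyperparameters = {
    'sigllm.primitives.forecasting.huggingface.HF#1': {
        'restrict_tokens': True,
    },
}
```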
sigllm/data.py
Outdated
| """Load the CSV with the given name from S3. | ||
| If the CSV has never been loaded before, it will be downloaded | ||
| from the [d3-ai-orion bucket](https://d3-ai-orion.s3.amazonaws.com) or |
update d3-ai-orion to sintel-sigllm
sigllm/data.py
Outdated
    directory, and then returned.

    Otherwise, if it has been downloaded and cached before, it will be directly
    loaded from the `orion/data` folder without contacting S3.
update orion/data to sigllm/data
sigllm/data.py
Outdated
    If a `test_size` value is given, the data will be split in two parts
    without altering its order, making the second one proportionally as
    big as the given value.
The test_size argument doesn't make sense in load_normal. I would remove it. Please make sure the code, docstrings, and function arguments are not using this variable.
sigllm/data.py
Outdated
    use_timestamps (bool):
        If True, start and end are interpreted as timestamps.
        If False, start and end are interpreted as row indices.
I would not make this a function argument, but rather I would check if start or end are of integer or timestamp type. See comment below.
sigllm/data.py
Outdated
    # Handle slicing if start or end is specified
    if start is not None or end is not None:
        if use_timestamps:
            # If start and end are timestamps
            mask = (data['timestamp'] >= start) & (data['timestamp'] <= end)
            data = data[mask]
        else:
            # If start and end are indices
            data = data.iloc[start:end]
A couple of things:
- we don't know the exact name of the timestamp column, so you should use the variable.
- the user typically is not keen on specifying so many parameters. We can check whether start or end are integers and then use them to slice; otherwise, assume that they are timestamps.
- check that the edge cases where start=None but end has a value (and vice versa) still work.

    if start is not None or end is not None:
        if data.index.isin([start, end]).any():
            # If start and end are indices
            data = data.iloc[start:end]
        else:
            # If start and end are timestamps
            mask = (data[timestamp_column] >= start) & (data[timestamp_column] <= end)
            data = data[mask]
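A self-contained sketch of the behavior described above, with the integer-vs-timestamp check and the one-sided edge cases handled; the helper name and the `timestamp_column` default are placeholders, not names from the PR:

```python
import numpy as np
import pandas as pd


def _slice_data(data, start=None, end=None, timestamp_column='timestamp'):
    # Slice by position when start/end look like integers, otherwise treat
    # them as timestamps. Handles start=None or end=None on either side.
    if start is None and end is None:
        return data

    bounds = [bound for bound in (start, end) if bound is not None]
    if all(isinstance(bound, (int, np.integer)) for bound in bounds):
        # Positional slicing; a missing bound leaves that side open.
        return data.iloc[start:end]

    # Timestamp-based slicing on the named column.
    mask = pd.Series(True, index=data.index)
    if start is not None:
        mask &= data[timestamp_column] >= start
    if end is not None:
        mask &= data[timestamp_column] <= end

    return data[mask]
```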
sigllm/primitives/transformation.py
Outdated
        or empty string if no anomalies are present
    """


def parse_single_response(text: str) -> str:
You still have str typing in your implementation.
def parse_anomaly_response(X):
    """Parse a list of lists of LLM responses to extract anomaly values and format them as strings.

    Args:
        X (List[List[str]]): List of lists of response texts from the LLM in the format
            "Answer: no anomalies" or "Answer: [val1, val2, ..., valN]"

    Returns:
        List[List[str]]: List of lists of parsed responses where each element is either
            "val1,val2,...,valN" if anomalies are found,
            or empty string if no anomalies are present
    """
The first line of docstrings should be shorter. I already included a description in my comment above.
        Additional padding token to forecast to reduce short horizon predictions.
        Default to `0`.
    restrict_tokens (bool):
        Whether to restrict tokens or not. Default to `True`.
I still don't see a change to mistral_detector
| "user_message": "Below is a [SEQUENCE], please return the anomalies in that sequence in [RESPONSE]. Only return the numbers. [SEQUENCE]" | ||
| "system_message": "You are an expert in time series analysis. Your task is to detect anomalies in time series data.", | ||
| "user_message": "Below is a [SEQUENCE], please return the anomalies in that sequence in [RESPONSE]. Only return the numbers. [SEQUENCE]", | ||
| "user_message_2": "Below is a [SEQUENCE], analyze the following time series and identify any anomalies. If you find anomalies, provide their values in the format [first_anomaly, ..., last_anomaly]. If no anomalies are found, respond with 'no anomalies'. Be concise, do not write code, do not permorm any calculations, just give your answers as told.: [SEQUENCE]", |
typo permorm -> perform
…tection Support (#39)

* Core Changed to get normal behavior in pipeline
* Transformation changed
* anomalies.py changed
* Hugginface.py changed : no restrictions token, and also has normal as input if 1-shot
* Timeseries preprocessing.py
* jsons files added for primitives
* jsons files added for primitives
* pipelines 0shot and 1shot added
* add boolean for restrict_tokens in HF
* good messages.json for prompt
* Added load_normal in sigllm.data
* Fixed load_normal in sigllm.data
* Fixed lint format
* Fixed lint format Ruff
* Fixed from review Sarah
* Fixed lint format after working on Sarah's reviews
* Dataset prompter parameters
* .jons removed from input names in 1_shot pipeline.json
* .jons removed from input names in 1_shot pipeline.json
* fix PR issues & add unittests
* add unittests for parse_anomaly_response
* remove unused functions
* add new functionality tests
* update ubuntu image
* change normal->single
* fix lint
* swap normal -> single

Co-authored-by: Salim Cherkaoui <[email protected]>
Co-authored-by: Sarah Alnegheimish <[email protected]>
* init benchmark
* fix details
* fix lint
* add benchmark tests
* paper benchmark results (#41)
* Enable Unrestricted LLM Output with Parsing + Add One-Shot Anomaly Detection Support (#39) (same commit list as above)
* support load_normal

Co-authored-by: scherkao31 <[email protected]>
Co-authored-by: Salim Cherkaoui <[email protected]>
This PR introduces several changes: