tlm-core is a library for generating and scoring chat completions. It implements the Trustworthy Language Model algorithm.
tlm-core is designed to be modular, configurable, and easy to extend.
ReferenceAnswer objects are responsible for generating reference answers. They are the start of a Pipeline.
Implementations of ReferenceAnswer must provide two asynchronous methods: agenerate, which generates a reference answer, and acost, which calculates the cost of generating it.
Two implementations of ReferenceAnswer are provided: GeneratedAnswer and StubAnswer. GeneratedAnswer generates answers using a language model. StubAnswer provides a stub implementation of ReferenceAnswer that is used for scoring external reference answers, as well as for testing and debugging.
tlm/core/reference contains implementations of reference answer generators.
tlm/core/config/reference contains configurations for reference answer generators.
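The interface above can be sketched with a toy implementation. This is a hypothetical stand-in, not the real base class from tlm/core/reference, and the exact signatures (message types, return types) are assumptions:

```python
import asyncio
from abc import ABC, abstractmethod

# Hypothetical stand-in for tlm-core's ReferenceAnswer interface; the real
# base class lives in tlm/core/reference and its signatures may differ.
class ReferenceAnswer(ABC):
    @abstractmethod
    async def agenerate(self, messages: list[dict]) -> str:
        """Asynchronously generate a reference answer."""

    @abstractmethod
    async def acost(self, messages: list[dict]) -> float:
        """Asynchronously calculate the cost of generating the answer."""

class CannedAnswer(ReferenceAnswer):
    """Toy implementation that returns a fixed answer at zero cost,
    similar in spirit to StubAnswer."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    async def agenerate(self, messages: list[dict]) -> str:
        return self.answer

    async def acost(self, messages: list[dict]) -> float:
        return 0.0

answer = asyncio.run(
    CannedAnswer("The capital of France is Paris.").agenerate(
        [{"role": "user", "content": "What is the capital of France?"}]
    )
)
print(answer)  # The capital of France is Paris.
```

Both methods are async so that real implementations can await language-model calls without blocking other runs.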
Score objects are responsible for scoring chat completions. Aggregated with other scores in a ScoreGroup, they form the end of a Pipeline.
Implementations of Score must provide two asynchronous methods: acalculate, which scores a chat completion, and acost, which calculates the cost of scoring it.
Score objects are expected to return a score between 0 and weight, where weight is set in the configuration. All weights in a ScoreGroup must sum to 1.
tlm/core/scoring contains implementations of scoring methods.
tlm/core/config/scoring contains configurations for scoring methods.
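The weight contract can be illustrated with a small sketch. The `aggregate` helper here is hypothetical, not part of the tlm-core API; it only demonstrates that each score lies in [0, weight] and that the aggregate therefore lands in [0, 1] when the weights sum to 1:

```python
# Hypothetical sketch of how a ScoreGroup might combine Score results.
# Each score is already scaled into [0, weight], so aggregation is a plain
# sum that stays within [0, 1] when the weights sum to 1.
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    for name, value in scores.items():
        assert 0.0 <= value <= weights[name], f"{name} outside [0, weight]"
    return sum(scores.values())

total = aggregate(
    scores={"perplexity": 0.45, "reflection": 0.40},
    weights={"perplexity": 0.5, "reflection": 0.5},
)
print(round(total, 2))  # 0.85
```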
A Pipeline is a sequence of ReferenceAnswer objects followed by a ScoreGroup. It is used to generate a chat completion and score it.
tlm/core/pipeline defines the Pipeline class.
An Ensemble is a collection of Pipeline objects. It is used to generate a set of completions, score each of them, and return the one with the highest score.
tlm/core/ensemble defines the Ensemble class.
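The best-of-N idea behind Ensemble can be sketched as follows. `run_pipeline` and `best_of` are hypothetical stand-ins; the real Ensemble class in tlm/core/ensemble may structure this differently:

```python
import asyncio

# Hypothetical sketch of the Ensemble idea: run several pipelines
# concurrently, score each completion, and keep the highest-scoring result.
async def run_pipeline(completion: str, score: float) -> tuple[str, float]:
    await asyncio.sleep(0)  # stand-in for real async generation and scoring
    return completion, score

async def best_of(pipelines) -> tuple[str, float]:
    results = await asyncio.gather(*pipelines)
    return max(results, key=lambda r: r[1])

completion, score = asyncio.run(best_of([
    run_pipeline("Paris.", 0.95),
    run_pipeline("It is Paris, France.", 0.88),
]))
print(completion, score)  # Paris. 0.95
```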
Callbacks are used to trigger actions at various points in a TLM run.
When calling the chat or achat functions, callbacks are defined as Subscriptions. A subscription pairs a list of EventType values with a Callback object (a function or class implementing the Callback protocol).
See the following examples:
from tlm.core.callbacks.events import EventType, BaseEvent
def my_callback(event: BaseEvent) -> None:
# do something with the event
pass
# a subscription that triggers on the CONTENT_INITIALIZE event
initialize_subscription = ([EventType.CONTENT_INITIALIZE], my_callback)
# a subscription that triggers on the CONTENT_DELTA event
delta_subscription = ([EventType.CONTENT_DELTA], my_callback)
# a subscription that triggers on the END event
end_subscription = ([EventType.END], my_callback)
# a subscription that triggers on the CONTENT_INITIALIZE or CONTENT_DELTA event
initialize_or_delta_subscription = ([EventType.CONTENT_INITIALIZE, EventType.CONTENT_DELTA], my_callback)
# a subscription that triggers on any event
all_events_subscription = (None, my_callback)
Then, to use the callbacks, pass them into the chat or achat functions:
# subscribe to all events
achat(messages, config, subscriptions=[all_events_subscription])
tlm/core/callbacks defines the objects that are used to define callbacks.
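The matching rule for subscriptions can be sketched with stand-in types. The `EventType` members mirror the examples above, but `BaseEvent` and `dispatch` here are simplified assumptions, not tlm-core's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Minimal stand-ins for tlm-core's callback machinery; the real EventType
# and event classes live in tlm/core/callbacks and may differ in detail.
class EventType(Enum):
    CONTENT_INITIALIZE = auto()
    CONTENT_DELTA = auto()
    END = auto()

@dataclass
class BaseEvent:
    type: EventType

def dispatch(event: BaseEvent, subscriptions) -> int:
    """Invoke every callback whose subscription matches the event's type.
    A subscription whose event-type list is None matches every event."""
    fired = 0
    for event_types, callback in subscriptions:
        if event_types is None or event.type in event_types:
            callback(event)
            fired += 1
    return fired

seen: list[EventType] = []
subs = [
    ([EventType.CONTENT_DELTA], lambda e: seen.append(e.type)),  # deltas only
    (None, lambda e: seen.append(e.type)),                       # every event
]
fired = dispatch(BaseEvent(EventType.CONTENT_DELTA), subs)
print(fired)  # 2 -- both subscriptions match a delta event
```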
The Log object is used to store metadata about a TLM run. It stores data in a dictionary, which can be extended arbitrarily with new keys and data.
The Log data has a flexible schema. However, data is expected to be added in a consistent manner, such that the metadata can be used to produce aggregate statistics about TLM runs.
To prevent log entries from clobbering one another, LogUpsertEvent objects carry reference_index and score_index fields.
See the following for an example of how the log data is structured:
{'0': {'0': {'custom_criteria': {'explanation': 'The answer is both helpful '
'and accurate, as it correctly '
'identifies Paris as the '
'capital of France, which is a '
'well-known fact.',
'rating': 5,
'scores': {'perplexity': np.float64(0.8568962810889648),
'raw_score': 1.0,
'score': np.float64(0.4642240702722412)}}},
'1': {'reflection': {'choice': True,
'explanation': 'The proposed answer is correct '
'because Paris is indeed the '
'capital city of France.',
'scores': {'perplexity': np.float64(0.9411705333782624),
'raw_score': 1.0,
'score': np.float64(0.4852926333445656)}}},
'result': {'completion': {'content': 'The capital of France is Paris.',
'logprobs': [{'logprob': 0.0, 'token': 'The'},
{'logprob': 0.0,
'token': ' capital'},
{'logprob': 0.0, 'token': ' of'},
{'logprob': 0.0,
'token': ' France'},
{'logprob': 0.0, 'token': ' is'},
{'logprob': -1.6240566e-06,
'token': ' Paris'},
{'logprob': -3.1281633e-07,
'token': '.'}],
'role': 'assistant',
'top_logprobs': [[{'logprob': 0.0,
'token': 'The'},
{'logprob': -17.125,
'token': ' The'},
{'logprob': -20.875,
'token': 'the'}],
[{'logprob': 0.0,
'token': ' capital'},
{'logprob': -19.625,
'token': 'capital'},
{'logprob': -21.0,
'token': ' Capital'}],
[{'logprob': 0.0,
'token': ' of'},
{'logprob': -25.375,
'token': 'of'},
{'logprob': -25.75,
'token': ' city'}],
[{'logprob': 0.0,
'token': ' France'},
{'logprob': -17.75,
'token': 'France'},
{'logprob': -20.125,
'token': ' Paris'}],
[{'logprob': 0.0,
'token': ' is'},
{'logprob': -21.875,
'token': ' هو'},
{'logprob': -22.5,
'token': ' Is'}],
[{'logprob': -1.6240566e-06,
'token': ' Paris'},
{'logprob': -13.625002,
'token': 'Paris'},
{'logprob': -15.750002,
'token': ' Пари'}],
[{'logprob': -3.1281633e-07,
'token': '.'},
{'logprob': -15.125,
'token': '.\n'},
{'logprob': -17.0,
'token': '.\n\n'}]]},
'score': np.float64(0.9495167036168068)}}}
tlm/core/log defines the Log class.
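The way reference_index and score_index namespace the log (the string keys '0' and '1' in the example above) can be sketched with a hypothetical `upsert` helper; the real Log class may store data differently:

```python
from typing import Optional

# Hypothetical sketch of how reference_index and score_index could namespace
# log entries so concurrent pipeline stages do not clobber one another.
def upsert(log: dict, data: dict,
           reference_index: int, score_index: Optional[int] = None) -> None:
    entry = log.setdefault(str(reference_index), {})
    if score_index is not None:
        entry = entry.setdefault(str(score_index), {})
    entry.update(data)

log_data: dict = {}
upsert(log_data, {"custom_criteria": {"rating": 5}}, reference_index=0, score_index=0)
upsert(log_data, {"reflection": {"choice": True}}, reference_index=0, score_index=1)
print(log_data)
# {'0': {'0': {'custom_criteria': {'rating': 5}},
#        '1': {'reflection': {'choice': True}}}}
```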
from tlm.core import chat
from tlm.core.config import Config
from tlm.core.config.reference import GeneratedAnswerConfig
from tlm.core.config.scoring import PerplexityConfig
from tlm.core.types import Message  # assumed location of Message, alongside Completion
config = Config(
reference=[GeneratedAnswerConfig(model="gpt-4o-mini", temperature=0.0)],
scoring=[PerplexityConfig(weight=1.0)],
)
messages = [Message(role="user", content="What is the capital of France?")]
completion, score, log = chat(messages, config)
from tlm.core import score
from tlm.core.config import Config
from tlm.core.config.reference import StubAnswerConfig
from tlm.core.config.scoring import PerplexityConfig
from tlm.core.types import Completion, Message  # Message assumed to live alongside Completion
config = Config(
reference=[StubAnswerConfig(completion=Completion(role="assistant", content="The capital of France is Paris."))],
scoring=[PerplexityConfig(weight=1.0)],
)
messages = [Message(role="user", content="What is the capital of France?")]
_, score, log = score(messages, config)
After activating the tlm-core venv, run pytest to run the test suite.
source .venv/bin/activate
python -m coverage run -m pytest
To view the test coverage report, run python -m coverage report.
tlm-core implements the TLM algorithm using asynchronous methods. This is to enable streaming of chat completions and scoring and ensure that many parallel TLM runs can be performed efficiently (like on a web server). To this end, all methods that perform I/O (e.g. calling a language model) are asynchronous. All I/O calls must be performed asynchronously.
Metadata is added to the log primarily by emitting LogUpsertEvent objects from within reference and scoring functions.
You can set the path for the log entry by setting the jsonpath parameter in the LogUpsertEvent constructor. For more information on the jsonpath syntax, see the jsonpath language documentation. The data will be inserted into the log under the specified path. Ensure that your data is JSON serializable.
When creating a new LogUpsertEvent, you can set the reference_index and score_index parameters to specify the path to the log entry. reference_index is always required and score_index is required if you are logging from a Score.
See the following example:
await self.forward(LogUpsertEvent(jsonpath="$.my.nested.key", data={"my": "data"}, reference_index=0))
await self.forward(LogUpsertEvent(jsonpath="$.my.score.key", data={"my": "data"}, reference_index=0, score_index=0))
tlm-core runs are highly parallelizable. It is encouraged to use achat for benchmarking, as it allows many parallel runs to be performed efficiently.
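Concurrent runs can be fanned out with asyncio.gather. `fake_achat` below is a hypothetical stand-in for achat, used only to show the pattern:

```python
import asyncio

# Sketch of running many TLM-style requests concurrently. fake_achat is a
# hypothetical stand-in for tlm-core's achat; being async lets many runs
# share one event loop instead of blocking each other.
async def fake_achat(prompt: str) -> tuple[str, float]:
    await asyncio.sleep(0.01)  # stand-in for network I/O to a language model
    return f"answer to: {prompt}", 0.9

async def benchmark(prompts: list[str]) -> list[tuple[str, float]]:
    return await asyncio.gather(*(fake_achat(p) for p in prompts))

results = asyncio.run(benchmark(["q1", "q2", "q3"]))
print(len(results))  # 3
```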
!! TODO: Add benchmarking instructions !!