Imagine this: you are an AI engineer at a purely fictitious company named "Hecks", and you have been put in charge of improving the company's "natural language question answering" product. This product is a compound AI system built on top of a large language model with access to some tools.
Your users have provided you with a database (hecks.duckdb, see the zip file), and your agent should use it to answer their questions!
Additionally, they have provided you with a collection of markdown files that capture some important business rules for their company.
For the sake of simplicity, your agent will ultimately answer user questions with a SQL query - this query is submitted via the "submit_answer" tool.
To compare you against other vendors providing similar services, the customer has given you two evaluation datasets: evaluation/data/evals_easy.json and evaluation/data/evals_hard.json.
Each dataset is a collection of pairs of the form:
- a natural language question
- a SQL query whose result defines the expected dataframe
For example, a simple evaluation pair might be:
prompt: "How many users are there at our company?"
query: "SELECT COUNT(*) FROM our_company.users"

The easy set does not require information from the rules files, but the hard set does.
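For illustration, here is a minimal sketch of loading one pair and materializing its expected dataframe. It assumes the JSON files are plain lists of objects with "prompt" and "query" keys (mirroring the example above), which may not match the actual file layout exactly.

```python
import json
import duckdb

# Assumed layout: a list of {"prompt": ..., "query": ...} objects, as in the example above.
with open("evaluation/data/evals_easy.json") as f:
    pairs = json.load(f)

con = duckdb.connect("hecks.duckdb", read_only=True)
first = pairs[0]
expected = con.execute(first["query"]).fetchdf()  # the dataframe your agent's SQL should match
print(first["prompt"])
print(expected)
```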
To determine how well your agent performs, the queries in the eval set are executed, and the resulting dataframes are compared against the dataframes returned by your agent's SQL queries.
This comparison is "loose" (see the sketch after this list) - in your agent's output dataframe:
- column order is ignored (if your agent returns a different order of columns, it's still correct)
- column names are ignored (if your agent returns a different name for a column with the right values, it's still correct)
- extra columns on the returned dataframe are not penalized (if your agent returns extra columns, it's still correct)
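The authoritative comparison lives in the evaluation code; the following is only a rough sketch of what a comparison with those three relaxations might look like. The row-order handling and string normalization here are assumptions of the sketch, not guaranteed behavior of the real script.

```python
import pandas as pd

def loosely_equal(expected: pd.DataFrame, actual: pd.DataFrame) -> bool:
    """Sketch: every expected column must have a value-matching column in `actual`.

    Column names and order are ignored, and extra columns in `actual` are allowed.
    Sorting stringified values to ignore row order is an assumption of this sketch.
    """
    if len(expected) != len(actual):
        return False
    used = set()
    for col in expected.columns:
        want = sorted(expected[col].astype(str))
        match = next(
            (c for c in actual.columns
             if c not in used and sorted(actual[c].astype(str)) == want),
            None,
        )
        if match is None:
            return False
        used.add(match)
    return True
```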
Your goal is to build tools and structures for an agent that can reproduce the appropriate answers to as many of the given HARD questions as possible!
Be careful not to overfit, though! They have a held-out test set that will be used to evaluate your agent's performance.
- You're strongly encouraged to take no more than four hours here. More than six hours is out-of-scope.
- You have seven days to complete the take-home; please find time to work on it that's compatible with your schedule.
- Feel free to fragment the time over several days if that's preferable for you.
- Please present your work in a format that communicates what you've done clearly. We'd love to see two components:
  - A prose write-up (PDF, document, etc.) covering the choices and tradeoffs you made during the exercise, along with your ultimate conclusions and guidance.
  - A forked version of this repo with your code changes.
- Once complete, please email a link to your fork of this repository to your recruiter.
The duckdb database file is included in this repository via Git LFS as hecks.duckdb.zip.
After cloning the repository, you'll need to unzip it:
unzip hecks.duckdb.zip
This will extract hecks.duckdb (~635 MB) to the root directory of the repository.
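If you want to sanity-check the extracted file, something like the following should work, assuming the duckdb Python package is available in your environment:

```python
import duckdb

# Optional sanity check: open the extracted database read-only and list its tables.
con = duckdb.connect("hecks.duckdb", read_only=True)
print(con.execute("SHOW ALL TABLES").fetchdf())
```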
Additionally, you'll need an OpenRouter API Key. Your recruiter will provision you with a key to use for the week.
Install the dependencies with uv:
uv sync
Once you have the duckdb file and the OpenRouter API Key, you can run the following command to start the agent:
uv run interactive --api-key YOUR_API_KEY
This will start the agent in a REPL-like environment. You can enter natural language questions, and the agent will respond. You can exit by typing "quit" and start a new conversation by typing "reset".
To evaluate your agent, you can run the following command:
uv run evaluate --api-key YOUR_API_KEY --concurrency 16
This will run the agent against the evaluation dataset and report the results.
Your goal is to make whatever changes necessary (outside of outright reward hacking the evaluation script) to get as many of the HARD questions correct as possible.
This might take a number of forms! Tweaking parameters, adding more tools, prompt engineering, context manipulation, training custom models (ok, probably not this one)...
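As one purely hypothetical illustration of the "adding more tools" direction, a schema-inspection tool could let the model see column names and types before it writes SQL. The function name and how it would be registered with the agent are assumptions, not part of this repo:

```python
import duckdb

# Hypothetical extra tool: return column names and types for a table so the model
# can ground its SQL in the real schema. How this gets wired into the agent's tool
# list depends on the existing implementation.
def describe_table(table_name: str) -> str:
    con = duckdb.connect("hecks.duckdb", read_only=True)
    rows = con.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = ? ORDER BY ordinal_position",
        [table_name],
    ).fetchall()
    return "\n".join(f"{name}: {dtype}" for name, dtype in rows)
```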
Please let us know what you tried, and how far you got! The implementation we've got here is intentionally pretty bare-bones. You're free to use any resources you like, including AI assistants, to help you complete the assignment.
Once you have completed the assignment, please email a link to your repository to your recruiter.