
Conversation

@836hardik-agrawal
Contributor

Added a new problem implementation for TD(0) Policy Evaluation. The function performs a single pass of value updates over an episode of (state, action, reward, next_state) transitions that follow a deterministic policy π. Includes:

1. Core implementation of the TD(0) update rule (the rule is sketched after this list).

2. Markdown version of the algorithm.

3. Structured test cases with reasoning and expected outputs.
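
For reference, the TD(0) update applied to each transition, with the discount factor fixed at 1 as the question specifies:

V(s) ← V(s) + α · (r + V(s′) − V(s))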

Collaborator

@moe18 moe18 left a comment


Had a small comment on the TD implementation.

@836hardik-agrawal
Contributor Author

@moe18
Thanks for reviewing.
The question description asks for the discount factor to be taken as 1, which is why it isn't included in the solution.
If you'd like, I can remove that constraint from the question description and add the discount factor to the solution.
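
If the constraint were lifted, a minimal sketch of what the generalized solution could look like (γ exposed as a parameter; a hypothetical variant, not the submitted code):

def td0_policy_evaluation(episode, V, pi, alpha, gamma=1.0):
    # General TD(0) target is r + gamma * V[s_next]; gamma=1.0 recovers the current behavior.
    for (s, a, r, s_next) in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V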

@moe18
Copy link
Collaborator

moe18 commented Aug 15, 2025

That makes sense, so I'll push the TD question.

Collaborator

@moe18 moe18 left a comment


Had a few small comments; sorry for the late response.

@@ -0,0 +1,3 @@
def td0_policy_evaluation(episode, V, pi, alpha):
    for (s, a, r, s_next) in episode:
        V[s] += alpha * (r + V[s_next] - V[s])
Collaborator


You need to return the V value.
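
A minimal sketch of the fix, assuming everything else stays as submitted:

def td0_policy_evaluation(episode, V, pi, alpha):
    for (s, a, r, s_next) in episode:
        V[s] += alpha * (r + V[s_next] - V[s])
    return V  # return the updated value table so callers can use it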

},
{
"test": "episode = [\n ('A', 'left', 5.0, 'B'),\n ('B', 'right', 0.0, 'C'),\n ('C', 'down', 1.0, 'terminal')\n]\nV = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'terminal': 0.0}\npi = {'A': 'left', 'B': 'right', 'C': 'down'}\nalpha = 0.5\nV_updated = td0_policy_evaluation(episode, V, pi, alpha)\nprint({k: round(v, 2) for k, v in V_updated.items()})",
"expected_output": "{'A': 2.5, 'B': 0.5, 'C': 0.5, 'terminal': 0.0}"
Collaborator


The solution you provided gave this output:
{'A': 2.5, 'B': 0.0, 'C': 0.5, 'terminal': 0.0}
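
Working the single forward pass by hand agrees with that output, so the expected 'B' value in the test likely needs updating:

V['A'] += 0.5 * (5.0 + V['B'] - V['A']) = 0.5 * 5.0 = 2.5
V['B'] += 0.5 * (0.0 + V['C'] - V['B']) = 0.5 * 0.0 = 0.0   (V['C'] is still 0.0 at this point)
V['C'] += 0.5 * (1.0 + V['terminal'] - V['C']) = 0.5 * 1.0 = 0.5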

@@ -0,0 +1,23 @@

# Learn Section
Collaborator


No need to say "Learn Section".

Collaborator

@moe18 moe18 left a comment


Some small changes to the SARSA question, but it looks good.

@@ -0,0 +1,15 @@
{
"id": "173",
"title": "implement_the_SARSA_Algorithm_on_policy",
Collaborator


No need for the underscores; it should just be "Implement the SARSA Algorithm on policy".

},
{
"test": "transitions = {\n ('A', 'x'): (0.0, 'terminal'),\n ('A', 'y'): (5.0, 'B'),\n ('B', 'z'): (2.0, 'terminal')\n}\ninitial_states = ['A']\nalpha = 0.4\ngamma = 0.9\nmax_steps = 3\nQ = sarsa_update(transitions, initial_states, alpha, gamma, max_steps)\nfor k in sorted(Q):\n print(f\"Q{str(k):15} = {Q[k]:.4f}\")",
"expected_output": "Q('A', 'x') = 0.0000\nQ('A', 'y') = 0.0000\nQ('B', 'z') = 0.0000"
Collaborator


Got this output from the solution (note the missing ('B', 'z') entry):
Q('A', 'x') = 0.0000
Q('A', 'y') = 0.0000
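
Since the submitted solution isn't shown here, a hypothetical sketch matching the test's call signature. The transition-table format, greedy action selection with alphabetical tie-breaking, and initializing Q over every known (state, action) pair (which makes all three keys print) are all assumptions:

def sarsa_update(transitions, initial_states, alpha, gamma, max_steps):
    # Hypothetical sketch, not the submitted solution. Assumes transitions
    # maps (state, action) -> (reward, next_state) and 'terminal' ends an episode.
    actions = {}
    for (s, a) in transitions:
        actions.setdefault(s, []).append(a)

    # Initializing Q over every known pair guarantees all three keys print.
    Q = {sa: 0.0 for sa in transitions}

    def pick(s):
        # Greedy selection; ties broken by sorted action name (an assumption
        # under which the all-zero expected output above is reproduced).
        return min(actions[s], key=lambda a: (-Q[(s, a)], a))

    for s in initial_states:
        a = pick(s)
        for _ in range(max_steps):
            r, s_next = transitions[(s, a)]
            if s_next == 'terminal':
                Q[(s, a)] += alpha * (r - Q[(s, a)])   # no successor value at terminal
                break
            a_next = pick(s_next)                      # on-policy: choose a' before updating
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q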

@836hardik-agrawal
Contributor Author

@moe18
Thanks for reviewing; I'll update it.

