
Commit ab56cce

Add Clay agent evaluation workshop images and detailed notes
1 parent d1bf515 commit ab56cce

File tree

7 files changed (+205 −3 lines)

assets/image_1740270980969_0.png (1.79 MB)
assets/image_1740271062509_0.png (807 KB)
assets/image_1740271184217_0.png (1.13 MB)
assets/image_1740271572484_0.png (2.35 MB)
assets/image_1740271686595_0.png (1.13 MB)

journals/2025_02_22.md

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@
  - #Filed
  - [[CLI/Tool/ffmpeg]]
  - [[CLI/Tool/yt-dlp]]
+ - [[GitHub Personal Access Token]] https://github.com/settings/personal-access-tokens/new
  - #Met
  - [[Person/Vitaly Kleban]]
  - [[Crypto/Smart Contract]]s expert with statistical enforcement

pages/AI___ES___25___ws___3___How Clay Performs Agent Evaluation.md

Lines changed: 204 additions & 3 deletions

@@ -328,7 +328,7 @@ tags:: [[AI/Agent]], [[LangChain]], [[Workshop]], [[Tutorial]]
  - Observability insights continuously refine development evaluations in a feedback loop.
  - ## Evals at Clay - **Evaluation Pipeline Overview**
  - image here
-   - tbd
+   - ![image.png](../assets/image_1740270980969_0.png)
  - **Production / Observability Evals**
  - **Tools Used:**
  - Segment
@@ -352,5 +352,206 @@ tags:: [[AI/Agent]], [[LangChain]], [[Workshop]], [[Tutorial]]
- Integration Test CI
- **GitHub Actions handles CI runs**
- **Final step: Deploy! 🚀**
- ## Development Evals – Ensuring Quality & Stability
- **Blackbox E2E Smoke Testing**
	- Environment parity
	- Early problem detection
	- Confidence in real-world conditions
- **Integration Testing**
	- Performance tracking on key use cases
	- Regression prevention
	- Comprehensive coverage
- ## **Claygent Blackbox E2E Smoke Test CI Workflow**
- ![image.png](../assets/image_1740271062509_0.png)
- **Workflow Steps:** (a minimal sketch of the smoke-test step follows this list)
	1. **Checkout Branch Changes** (Git icon)
	2. **Install Packages** (npm icon)
	3. **Deploy to Staging** (AWS Lambda icon)
	4. **Run Smoketest Script** (`smoke_test.ts` file)
	5. **Clay API**
		- Response: `200`
	6. **Clay Tables in Staging**
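- **Smoke-test script sketch** – a minimal illustration of what step 4 might look like, assuming the staging deploy exposes an HTTP endpoint. The endpoint path, payload shape, and environment variable names below are placeholders for illustration, not Clay's actual internal API.
```javascript
// smoke_test.ts (sketch): fire one representative mission at the freshly deployed
// staging stack and fail the CI job unless the Clay API answers with a 200.
// CLAY_STAGING_API_URL / CLAY_STAGING_API_KEY are assumed env vars, not real ones.
const STAGING_API_URL = process.env.CLAY_STAGING_API_URL;
const API_KEY = process.env.CLAY_STAGING_API_KEY;

async function runSmokeTest() {
  const response = await fetch(`${STAGING_API_URL}/claygent/run`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({
      mission: 'Is http://www.horizonrealtygroup.com a B2B or B2C company?',
    }),
  });

  // Step 5 of the workflow: the Clay API must respond with a 200.
  if (response.status !== 200) {
    throw new Error(`Smoke test failed: expected 200, got ${response.status}`);
  }
  console.log('Smoke test passed:', await response.json());
}

runSmokeTest().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit fails the GitHub Actions run
});
```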
- ## **Claygent Blackbox E2E Smoke Test: Sample Staging Clay Table**
- ![image.png](../assets/image_1740271184217_0.png)
- **Table Overview:**
	- **Columns:**
		- `domain`
		- `b2b_or_b2c`
		- `TRUTH: claygent`
		- `Send Test Case to C`
		- `testid`
		- `model`
	- **Example Entries:**
		- `horizonvalleygroup.com` → `B2C` → `Pass` → `gpt-4o`
		- `bostrom.com` → `B2B` → `Pass` → `gpt-4o`
		- `owshousing.com` → `B2B` → `Bug` → `gpt-4o`
		- `culinaryproperties.com` → `B2B` → `Pass` → `gpt-4o`
		- `globalmarininsurance.com` → `B2B` → `Fail` → `gpt-4o`
		- `greatgood.org` → `B2C` → `Pass` → `gpt-4o`
		- `fortcapitalp.com` → `B2B` → `Bug` → `gpt-4o`
		- `servicethread.com` → `B2B` → `Pass` → `gpt-4o`
	- **Key Status Indicators:**
		- ✅ Pass
		- ❌ Fail
		- 🐞 Bug
- ## **Claygent Integration Testing: Types of Evals**
- **Key Points:**
	- Check your agent's performance in development to ensure high-quality output in production.
	- Compare agent output to ground truth.
- **Comparison Methods – the [LangChain openevals framework](https://github.com/langchain-ai/openevals) approach** (a small scoring sketch follows this list):
	- Exact Match
	- Levenshtein Distance
	- Embedding Similarity
	- LLM as a Judge
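- **Levenshtein scoring sketch** – of the comparison methods above, edit distance needs no framework at all; the snippet below shows a normalized 0–1 similarity of the kind such string evaluators return. It is an illustration, not the openevals implementation itself.
```javascript
// Normalized Levenshtein similarity: 1.0 means identical strings, 0.0 means
// completely different. dp[i][j] holds the edit distance between prefixes.
function levenshteinScore(output, reference) {
  const a = output.toLowerCase();
  const b = reference.toLowerCase();
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return 1 - dp[a.length][b.length] / Math.max(a.length, b.length, 1);
}

console.log(levenshteinScore('B2C', 'B2C')); // 1
console.log(levenshteinScore('B2C', 'B2B')); // ≈ 0.67
```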
- ## **Claygent Integration Test Example 1**
- **Exact Matching**
	- Integration Test
	- LangSmith integration with Vitest
- **Code Example (JavaScript)** – the three snippets below form one test; a fully assembled version follows them.
- **Test Definition**
```javascript
// Test definition: name plus inputs and reference outputs.
// The test function itself is passed as the third argument (next two snippets).
ls.test(
  'horizonrealtygroup',
  {
    inputs: {
      mission: createTestActionInputs('http://www.horizonrealtygroup.com').mission,
    },
    referenceOutputs: {
      answer: 'B2C',
    },
  },
```
- **Execute Test**
```javascript
  async ({ referenceOutputs }) => {
    // call claygent action code
    const actionInputs = createTestActionInputs('http://www.horizonrealtygroup.com');
    const response = await runClaygentActionCode(actionInputs, testContext);
    ls.logOutputs({ response: response.data.result });
```
- **Assert Exact Match**
```javascript
    const exactMatchResult = await exactMatch({
      outputs: response.data.result,
      referenceOutputs: referenceOutputs?.answer,
    });
    expect(exactMatchResult.score).toBe(true);
  }
);
```
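- **Assembled test (sketch)** – the three snippets above combined into one file. The `langsmith/vitest` entry point and the Vitest `expect` import follow the documented LangSmith Vitest integration; `exactMatch` comes from the openevals package linked above; `createTestActionInputs`, `runClaygentActionCode`, and `testContext` are Clay-internal helpers from the snippets, so the helper module path is a placeholder.
```javascript
import * as ls from 'langsmith/vitest';
import { expect } from 'vitest';
import { exactMatch } from 'openevals';
// Clay-internal helpers shown in the snippets above; the path is a placeholder.
import { createTestActionInputs, runClaygentActionCode, testContext } from './claygentTestHelpers';

ls.describe('claygent b2b/b2c classification', () => {
  ls.test(
    'horizonrealtygroup',
    {
      inputs: {
        mission: createTestActionInputs('http://www.horizonrealtygroup.com').mission,
      },
      referenceOutputs: {
        answer: 'B2C',
      },
    },
    async ({ referenceOutputs }) => {
      // Call the Claygent action code and log its output to LangSmith.
      const actionInputs = createTestActionInputs('http://www.horizonrealtygroup.com');
      const response = await runClaygentActionCode(actionInputs, testContext);
      ls.logOutputs({ response: response.data.result });

      // Compare the agent's answer against the ground-truth label.
      const exactMatchResult = await exactMatch({
        outputs: response.data.result,
        referenceOutputs: referenceOutputs?.answer,
      });
      expect(exactMatchResult.score).toBe(true);
    }
  );
});
```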
- ## LangSmith Debugging: Inspecting Model Discrepancies
- **Bucket Overview:**
	- C Bucket: ✅ Passed
	- Ideal Candidate Bucket: ❌ Meta failed, ✅ OpenAI passed
- **Using LangSmith for Debugging**
	- Navigate to the **Datasets and Experiments** page on LangSmith.
	- Identify and select the **Ideal Candidate** bucket.
	- Click on the **latest run** to inspect errors.
- **Error Inspection Process**
	- **Input:** "Give me an Ideal Candidate for the job at this particular link, at company Meta."
	- **Reference Output:** Seems to be related to an iOS job.
	- **Model Error:** Discrepancy detected.
- **Analyzing the Judge's Thought Process**
	- **Judge's Observation:**
		- Input suggests scraping a job for Mechanical Engineering.
		- Reference output describes an iOS job.
	- **Final Evaluation:**
		- "Upon analyzing the reference and model outputs under the given rubric, it is evident that they describe candidates for entirely different roles."
- **Key Takeaway:**
	- LangSmith provides insight into model performance discrepancies by exposing errors in judgment and alignment with expected outputs.
- ## Observability Evals – Real-Time Insights & Optimization
- **User Prompt Classification**
	- Capture user patterns and trends.
	- Sharpen evaluation for precise tuning.
	- **Methods:** Classifier training, LLM judging, clustering, semantic analysis.
- **Realtime Agent Log Analysis**
	- Monitor responses live.
	- Receive immediate, context-aware feedback.
	- Detect anomalies as they occur.
	- Extract data for training and/or dev evals.
- ## Extracting Value from Customer Usage Data
- **Importance of Customer Logs**
	- Understanding how customers interact with the platform provides the most valuable insights.
	- Logs help assess whether an agent performed well or poorly on specific interactions.
	- Extracting logs for training or development evaluations can refine model performance.
- **Classification, Clustering, and Tagging**
	- Once logs are classified, clustering techniques can be applied.
	- Tagging strategies help organize and analyze data for targeted improvements.
- **Key Takeaway:**
	- Observability-driven insights allow for continuous tuning and refinement of AI agents.
- ## Prompt Classification - Clustering & Tagging
- ![image.png](../assets/image_1740271572484_0.png)
- **Using Embeddings for Clustering**
	- Cohere has recently released embeddings tailored for clustering.
	- Obtain embeddings for prompt properties.
	- Perform clustering using methods such as:
		- Hierarchical clustering
		- K-means clustering
	- Experimentation is necessary to determine the most effective clustering approach (a toy sketch follows this section).
- **Generating Tags for Clusters**
	- Once clusters are formed, send a request to the LLM to generate category tags.
	- Results in well-defined buckets of prompt categories.
- **Analyzing Prompt Categories**
	- Count the number of prompts in each category.
	- Identify the most common customer use cases.
	- Enables deeper insights into how customers are using the platform.
- **Key Takeaway:**
	- Clustering prompts enables structured categorization and prioritization of real customer needs.
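- **Clustering and tagging sketch** – a toy version of the flow above: k-means over precomputed prompt embeddings, then one tagging request per cluster. It assumes the embeddings (e.g., from Cohere with a clustering-oriented input type) have already been fetched into `embeddings`, and it only builds the tagging prompt, since the actual LLM call is provider-specific.
```javascript
// embeddings: Array<{ prompt: string, vector: number[] }>, fetched beforehand.
function kMeans(vectors, k, iterations = 20) {
  let centroids = vectors.slice(0, k).map((v) => [...v]); // naive init: first k points
  let assignments = new Array(vectors.length).fill(0);
  const dist = (a, b) => a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);

  for (let it = 0; it < iterations; it++) {
    // Assign each vector to its nearest centroid.
    assignments = vectors.map((v) =>
      centroids.reduce((best, c, i) => (dist(v, c) < dist(v, centroids[best]) ? i : best), 0)
    );
    // Recompute each centroid as the mean of its members.
    centroids = centroids.map((c, i) => {
      const members = vectors.filter((_, j) => assignments[j] === i);
      if (members.length === 0) return c;
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return assignments;
}

function buildTaggingPrompt(prompts) {
  // One request per cluster: ask the LLM for a short category tag for the bucket.
  return (
    'These customer prompts landed in the same cluster:\n' +
    prompts.slice(0, 20).map((p) => `- ${p}`).join('\n') +
    '\n\nReply with a short category tag (2-4 words) describing this bucket.'
  );
}

// Usage sketch (the embedding fetch and the LLM call live elsewhere):
// const assignments = kMeans(embeddings.map((e) => e.vector), 8);
// const clusterZeroPrompts = embeddings.filter((_, i) => assignments[i] === 0).map((e) => e.prompt);
// const tagRequest = buildTaggingPrompt(clusterZeroPrompts);
```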
- ## Prompt Classification - Claygent Categories
- ![image.png](../assets/image_1740271686595_0.png)
- **Identified Prompt Categories:**
	- **Company Profile**
		- **Users:** 776
		- **Events:** 306,944
		- **Example Task:** Find the Facebook and Instagram profile links for a company.
	- **Google Search**
		- **Users:** 767
		- **Events:** 171,276
		- **Example Task:** Google an address and check if a company is listed.
	- **Boolean**
		- **Users:** 756
		- **Events:** 162,082
		- **Example Task:** Determine whether a statement is true or false.
	- **Business Website**
		- **Users:** 706
		- **Events:** 160,212
		- **Example Task:** Visit a company website and extract relevant details.
- **Analysis & Insights:**
	- Understanding which categories are most commonly used helps prioritize improvements.
	- Some categories may be less popular but still provide valuable use cases.
- ## Realtime Agent Performance Analysis
- **Live data is the best data!**
- **Key Insights:**
	- Examples where the agent performs well → **Great training data**.
	- Examples where the agent needs improvement → **Useful for development evals**.
	- The feedback loop between **production evals, observability evals, and development evals** helps refine performance.
- **Process:**
	1. Stream in live data.
	2. Identify examples where the agent succeeds.
	3. Export those cases to LangSmith for training (see the sketch after this section).
	4. Identify failure cases and move them to development evals.
- **Key Takeaway:**
	- Creating a structured evaluation pipeline ensures continuous improvements in agent performance.
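- **Exporting live successes (sketch)** – one way step 3 of the process above could look with the LangSmith SDK. The dataset name and the shape of `successfulLogs` are assumptions; verify the client methods against the LangSmith docs before relying on them.
```javascript
import { Client } from 'langsmith';

// Push verified live interactions into a LangSmith dataset so they can seed
// training data or future development evals.
async function exportSuccessesToLangSmith(successfulLogs) {
  const client = new Client(); // reads LANGSMITH_API_KEY from the environment
  const dataset = await client.createDataset('claygent-live-successes', {
    description: 'Live Claygent runs that passed observability checks',
  });

  for (const log of successfulLogs) {
    // Each log becomes a ground-truth example: mission in, verified answer out.
    await client.createExample(
      { mission: log.mission },
      { answer: log.result },
      { datasetId: dataset.id }
    );
  }
}
```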
- ## LLM as a Judge - Example Evaluation
- **Input:**
	- **Answer:** `valid`
	- **Mission:**
		- **Context:** Validate if a given website landing page is operational.
		- **Objective:** Determine if the landing page at `kw-platinum.com` is connected to a business.
		- **Instructions:**
			1. Visit the website provided in the `kw-platinum.com` column.
			2. Check the landing page for indications that it is associated with a business.
			3. If the page appears to be improperly set up (e.g., showing a generic hosting page like the `wix.com` default), mark it as `'broken'`.
			4. If the page loads correctly and is associated with a company, mark it as `'valid'`.
			5. Return **only** `'broken'` or `'valid'` based on the page status.
			6. Do **not** include any additional commentary or information in the response.
		- **Examples:**
			- If a Wix default page appears → Respond `'broken'`.
			- If the landing page shows a functioning business site → Respond `'valid'`.
- **Output Evaluation:**
	- **Reasoning:**
		- The student's answer (`valid`) is concise and correctly follows the instructions.
		- The response format adheres to the requirement to return **only** `'broken'` or `'valid'`.
		- The student has determined that the landing page is operational and associated with a business, aligning with the objective.
	- **Score:** 10 (Perfect response)
	- **Conclusion:** The answer is both relevant and helpful, fully meeting evaluation criteria.
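- **LLM-as-a-judge sketch** – the rubric-style grading above, expressed with the openevals judge factory mentioned earlier. The prompt text, feedback key, and model string are illustrative, and the option names should be checked against the openevals README rather than taken as exact.
```javascript
import { createLLMAsJudge } from 'openevals';

// Judge for the landing-page check: score 1 when the agent's 'broken'/'valid'
// answer matches the ground truth and respects the output format.
const landingPageJudge = createLLMAsJudge({
  prompt:
    'You are grading an agent that must answer only "broken" or "valid" for a website landing-page check.\n' +
    'Mission: {inputs}\nAgent answer: {outputs}\nGround truth: {reference_outputs}\n' +
    'Return a score of 1 if the answer matches the ground truth and follows the format, otherwise 0.',
  model: 'openai:gpt-4o',
  feedbackKey: 'landing_page_correctness',
});

// Usage sketch for the kw-platinum.com example in these notes:
const result = await landingPageJudge({
  inputs: 'Determine if the landing page at kw-platinum.com is operational.',
  outputs: 'valid',
  referenceOutputs: 'valid',
});
console.log(result); // e.g. { key: 'landing_page_correctness', score: 1, comment: '...' }
```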
