- Observability insights continuously refine development evaluations in a feedback loop.
- ## Evals at Clay
- **Evaluation Pipeline Overview**
- image here
- **Production / Observability Evals**
- **Tools Used:**
- Segment
- Integration Test CI
- **GitHub Actions handles CI runs**
- **Final step: Deploy! 🚀**
- ## Development Evals – Ensuring Quality & Stability
- **Blackbox E2E Smoke Testing**
- Environment parity
- Early problem detection
- Confidence in real-world conditions
- **Integration Testing**
- Performance tracking on key use cases
- Regression prevention
- Comprehensive coverage
- ## Claygent Blackbox E2E Smoke Test CI Workflow
- **Workflow Steps:**
1. **Checkout Branch Changes** (Git icon)
2. **Install Packages** (npm icon)
3. **Deploy to Staging** (AWS Lambda icon)
4. **Run Smoketest Script** (`smoke_test.ts` file; see the sketch after this list)
5. **Clay API**
   - Response: `200`
6. **Clay Tables in Staging**
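- **Smoke-test script sketch:** a minimal illustration of what step 4's script could do, assuming a hypothetical staging URL, route, payload, and environment variable name (the real `smoke_test.ts` and Clay API details are internal). The only behavior taken from the workflow above is "call the Clay API on the staging deployment and treat a `200` response as a pass."
```javascript
// Hypothetical smoke test -- endpoint, payload, and env var name are illustrative.
const STAGING_URL = process.env.CLAYGENT_STAGING_URL; // assumed to be exported by the CI job

async function runSmokeTest() {
  // Fire one representative Claygent mission at the staging deployment.
  const response = await fetch(`${STAGING_URL}/claygent/run`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      mission: 'Is horizonvalleygroup.com a B2B or B2C company?',
      model: 'gpt-4o',
    }),
  });

  // The workflow treats a 200 from the Clay API as the pass signal.
  if (response.status !== 200) {
    throw new Error(`Smoke test failed: expected 200, got ${response.status}`);
  }
  console.log('Smoke test passed: staging deployment is responding.');
}

runSmokeTest().catch((err) => {
  console.error(err);
  process.exit(1); // a non-zero exit code fails the GitHub Actions step
});
```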
- ## Claygent Blackbox E2E Smoke Test: Sample Staging Clay Table
- **Table Overview:**
- **Columns:**
- `domain`
- `b2b_or_b2c`
- `TRUTH: claygent`
- `Send Test Case to C`
- `testid`
- `model`
- **Example Entries:**
- `horizonvalleygroup.com` → `B2C` → `Pass` → `gpt-4o`
- `bostrom.com` → `B2B` → `Pass` → `gpt-4o`
- `owshousing.com` → `B2B` → `Bug` → `gpt-4o`
- `culinaryproperties.com` → `B2B` → `Pass` → `gpt-4o`
- `globalmarininsurance.com` → `B2B` → `Fail` → `gpt-4o`
- `greatgood.org` → `B2C` → `Pass` → `gpt-4o`
- `fortcapitalp.com` → `B2B` → `Bug` → `gpt-4o`
- `servicethread.com` → `B2B` → `Pass` → `gpt-4o`
- **Key Status Indicators:**
- ✅ Pass
- ❌ Fail
- 🐞 Bug
- ## Claygent Integration Testing: Types of Evals
- **Key Points:**
- Check your agent's performance in development to ensure high-quality output in production.
- Compare agent output to ground truth.
- **Comparison Methods – [the LangChain OpenEvals framework](https://github.com/langchain-ai/openevals) approach (see the sketch after this list):**
- Exact Match
- Levenshtein Distance
- Embedding Similarity
- LLM as a Judge
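- The first three comparison methods are simple enough to sketch directly. The snippet below uses illustrative helper functions rather than OpenEvals' own API, and leaves the embedding call itself as a placeholder; LLM-as-a-judge is sketched at the end of these notes.
```javascript
// Illustrative comparison metrics -- not the OpenEvals API itself.

// 1. Exact match: binary pass/fail against the ground-truth answer.
const exactMatchScore = (output, reference) => output.trim() === reference.trim();

// 2. Levenshtein distance: minimum number of single-character edits between two strings.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// 3. Embedding similarity: cosine similarity between two embedding vectors
//    (obtain the vectors from whichever embedding model you use).
function cosineSimilarity(u, v) {
  const dot = u.reduce((sum, x, i) => sum + x * v[i], 0);
  const norm = (w) => Math.sqrt(w.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(u) * norm(v));
}

// Example: grading a Claygent answer against ground truth.
console.log(exactMatchScore('B2C', 'B2C')); // true
console.log(levenshtein('B2C', 'B2B'));     // 1
```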
- ## Claygent Integration Test Example 1
- **Exact Matching**
- Integration Test
- LangSmith integration with Vitest
- **Code Example (JavaScript)**
- **Test Definition**
```javascript
// Imports for the full test (the three snippets below form one Vitest test):
// LangSmith's Vitest integration, Vitest's expect, and the OpenEvals exact-match evaluator.
import * as ls from 'langsmith/vitest';
import { expect } from 'vitest';
import { exactMatch } from 'openevals';

ls.test(
  'horizonrealtygroup',
  {
    inputs: {
      mission: createTestActionInputs('http://www.horizonrealtygroup.com').mission,
    },
    referenceOutputs: {
      answer: 'B2C',
    },
  },
  // the test function (next snippet) is passed as the third argument
```
- **Execute Test**
```javascript
  async ({ referenceOutputs }) => {
    // call claygent action code and log its output to LangSmith
    const actionInputs = createTestActionInputs('http://www.horizonrealtygroup.com');
    const response = await runClaygentActionCode(actionInputs, testContext);
    ls.logOutputs({ response: response.data.result });
```
- **Assert Exact Match**
```javascript
    // grade the agent's answer against the reference with OpenEvals' exactMatch
    const exactMatchResult = await exactMatch({
      outputs: response.data.result,
      referenceOutputs: referenceOutputs?.answer,
    });
    expect(exactMatchResult.score).toBe(true);
  }
);
```
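- Because the test uses LangSmith's Vitest integration, it runs like any other Vitest suite from the project's normal test command in CI, and each run should show up in LangSmith as an experiment against the dataset, which is where the debugging workflow in the next section picks up.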
- ## LangSmith Debugging: Inspecting Model Discrepancies
- **Bucket Overview:**
- C Bucket: ✅ Passed
- Ideal Candidate Bucket: ❌ Meta failed, ✅ OpenAI passed
- **Using LangSmith for Debugging**
- Navigate to the **Datasets and Experiments** page on LangSmith.
- Identify and select the **Ideal Candidate** bucket.
- Click on the **latest run** to inspect errors.
- **Error Inspection Process**
- **Input:** "Give me an Ideal Candidate for the job at this particular link, at company Meta."
- **Reference Output:** Seems to be related to an iOS job.
- **Model Error:** Discrepancy detected.
- **Analyzing the Judge's Thought Process**
- **Judge's Observation:**
- Input suggests scraping a job for Mechanical Engineering.
- Reference output describes an iOS job.
- **Final Evaluation:**
- "Upon analyzing the reference and model outputs under the given rubric, it is evident that they describe candidates for entirely different roles."
- **Key Takeaway:**
- LangSmith provides insight into model performance discrepancies by exposing errors in judgment and alignment with expected outputs.
- ## Observability Evals – Real-Time Insights & Optimization
- **User Prompt Classification**
- Capture user patterns and trends.
- Sharpen evaluation for precise tuning.
- **Methods:** Classifier training, LLM judging, clustering, semantic analysis.
- **Realtime Agent Log Analysis** (see the sketch after this list)
- Monitor responses live.
- Receive immediate, context-aware feedback.
- Detect anomalies as they occur.
- Extract data for training and/or dev evals.
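- A minimal sketch of the realtime log-analysis idea: score each incoming agent log with a lightweight LLM judge and flag anomalies. The log shape, model choice, judge prompt, and threshold are all assumptions; the OpenAI client is just one possible backend.
```javascript
import OpenAI from 'openai';

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set

// Score a single live agent log entry; the shape of `log` is hypothetical.
async function scoreAgentLog(log) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content:
          'You grade AI agent outputs. Reply with JSON: {"score": 0-10, "anomaly": true|false, "reason": "..."}',
      },
      {
        role: 'user',
        content: `Prompt: ${log.prompt}\nAgent response: ${log.response}`,
      },
    ],
  });
  return JSON.parse(completion.choices[0].message.content);
}

// Hypothetical stream consumer: flag anomalies live, keep strong examples for later reuse.
async function handleLiveLog(log) {
  const verdict = await scoreAgentLog(log);
  if (verdict.anomaly) {
    console.warn('Anomaly detected in live traffic:', verdict.reason);
  } else if (verdict.score >= 8) {
    console.log('Strong example worth extracting for training or dev evals:', log.prompt);
  }
}
```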
- ## Extracting Value from Customer Usage Data
- **Importance of Customer Logs**
- Understanding how customers interact with the platform provides the most valuable insights.
- Logs help assess whether an agent performed well or poorly on specific interactions.
- Extracting logs for training or development evaluations can refine model performance.
- **Classification, Clustering, and Tagging**
- Once logs are classified, clustering techniques can be applied.
- Tagging strategies help organize and analyze data for targeted improvements.
- **Key Takeaway:**
- Observability-driven insights allow for continuous tuning and refinement of AI agents.
- ## Prompt Classification - Clustering & Tagging
- **Using Embeddings for Clustering** (see the sketch after this section)
- Cohere has recently released embeddings tailored for clustering.
- Obtain embeddings for prompt properties.
- Perform clustering using methods such as:
- Hierarchical clustering
- K-means clustering
- Experimentation is necessary to determine the most effective clustering approach.
- **Generating Tags for Clusters**
- Once clusters are formed, send a request to the LLM to generate category tags.
- Results in well-defined buckets of prompt categories.
- **Analyzing Prompt Categories**
- Count the number of prompts in each category.
- Identify the most common customer use cases.
- Enables deeper insights into how customers are using the platform.
- **Key Takeaway:**
- Clustering prompts enables structured categorization and prioritization of real customer needs.
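- A rough sketch of the clustering-and-tagging flow described above, assuming Cohere's embed endpoint with the clustering input type (via the `cohere-ai` SDK) and a small hand-rolled k-means pass; model names, `k`, and the tagging prompt are illustrative, not Clay's actual pipeline.
```javascript
import { CohereClient } from 'cohere-ai';
import OpenAI from 'openai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
const openai = new OpenAI();

// 1. Embed customer prompts with a clustering-oriented input type.
async function embedPrompts(prompts) {
  const res = await cohere.embed({
    texts: prompts,
    model: 'embed-english-v3.0',
    inputType: 'clustering',
  });
  return res.embeddings; // one vector per prompt
}

// 2. Tiny k-means, just enough to group the vectors (a library implementation works just as well).
function kMeans(vectors, k, iterations = 20) {
  let centroids = vectors.slice(0, k);
  let assignments = new Array(vectors.length).fill(0);
  const dist = (a, b) => a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0);
  for (let it = 0; it < iterations; it++) {
    assignments = vectors.map((v) =>
      centroids.reduce((best, c, i) => (dist(v, c) < dist(v, centroids[best]) ? i : best), 0)
    );
    centroids = centroids.map((c, i) => {
      const members = vectors.filter((_, j) => assignments[j] === i);
      if (members.length === 0) return c;
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return assignments; // cluster index per prompt
}

// 3. Ask an LLM to name each cluster so it becomes a category tag.
async function tagCluster(promptsInCluster) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: `Give a short category tag (2-4 words) for these prompts:\n${promptsInCluster.join('\n')}`,
      },
    ],
  });
  return completion.choices[0].message.content.trim();
}
```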
- ## Prompt Classification - Claygent Categories
- **Identified Prompt Categories:**

| Category | Users | Events | Example Task |
| --- | --- | --- | --- |
| Company Profile | 776 | 306,944 | Find the Facebook and Instagram profile links for a company. |
| Google Search | 767 | 171,276 | Google an address and check if a company is listed. |
| Boolean | 756 | 162,082 | Determine whether a statement is true or false. |
| Business Website | 706 | 160,212 | Visit a company website and extract relevant details. |

- **Analysis & Insights:**
- Understanding which categories are most commonly used helps prioritize improvements.
- Some categories may be less popular but still provide valuable use cases.
- ## Realtime Agent Performance Analysis
- **Live data is the best data!**
- **Key Insights:**
- Examples where the agent performs well → **great training data**.
- Examples where the agent needs improvement → **useful for development evals**.
- The feedback loop between **production evals, observability evals, and development evals** helps refine performance.
- **Process:** (see the sketch after this list)
- 1. Stream in live data.
- 2. Identify examples where the agent succeeds.
- 3. Export those cases to LangSmith for training.
- 4. Identify failure cases and move them to development evals.
- **Key Takeaway:**
- Creating a structured evaluation pipeline ensures continuous improvements in agent performance.
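- A small sketch of the routing in steps 3 and 4, assuming the `langsmith` JS client's `createExample(inputs, outputs, { datasetName })` form and pre-existing datasets; the example shape and dataset names are made up for illustration.
```javascript
import { Client } from 'langsmith';

const client = new Client(); // assumes LANGSMITH_API_KEY is set

// Route one live production example: good runs become candidate training data,
// failures become new development-eval cases. The `example` shape is hypothetical.
async function routeLiveExample(example) {
  const datasetName = example.passed
    ? 'claygent-training-candidates' // hypothetical dataset names
    : 'claygent-dev-eval-failures';

  await client.createExample(
    { mission: example.prompt },        // inputs
    { answer: example.agentResponse },  // reference outputs
    { datasetName }                     // dataset is assumed to already exist in LangSmith
  );
}
```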
- ## LLM as a Judge - Example Evaluation
- **Input:**
- **Answer:** `valid`
- **Mission:**
- **Context:** Validate if a given website landing page is operational.
- **Objective:** Determine if the landing page at `kw-platinum.com` is connected.
- **Instructions:**
1. Visit the website provided in the `kw-platinum.com` column.
2. Check the landing page for indications that it is associated with a business.
3. If the page appears to be improperly set up (e.g., showing a generic hosting page like a `wix.com` default), mark it as `'broken'`.
4. If the page loads correctly and is associated with a company, mark it as `'valid'`.
5. Return **only** `'broken'` or `'valid'` based on the page status.
6. Do **not** include any additional commentary or information in the response.
- **Examples:**
- If a Wix default page appears → respond `'broken'`.
- If the landing page shows a functioning business site → respond `'valid'`.
- **Output Evaluation:**
- **Reasoning:**
- The student's answer (`valid`) is concise and correctly follows the instructions.
- The response format adheres to the requirement to return **only** `'broken'` or `'valid'`.
- The student has determined that the landing page is operational and associated with a business, aligning with the objective.
- **Score:** 10 (perfect response)
- **Conclusion:** The answer is both relevant and helpful, fully meeting the evaluation criteria.
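- A minimal sketch of an LLM-as-a-judge evaluator in the spirit of the example above, using OpenEvals' `createLLMAsJudge` helper as shown in its README; the rubric prompt and model string are illustrative, not the judge Clay actually runs, and it assumes the matching LangChain chat-model integration and API key are available.
```javascript
import { createLLMAsJudge } from 'openevals';

// Judge prompt mirroring the rubric above: the answer must be exactly 'broken' or 'valid',
// with no extra commentary, and must reflect whether the landing page is operational.
const landingPageJudge = createLLMAsJudge({
  prompt:
    'You are grading an AI agent.\n' +
    'Mission: {inputs}\n' +
    'Agent answer: {outputs}\n' +
    "The answer must be exactly 'broken' or 'valid' with no additional commentary, " +
    'and must correctly reflect whether the landing page is operational. ' +
    'Does the answer fully satisfy this rubric?',
  feedbackKey: 'landing_page_correctness',
  model: 'openai:gpt-4o',
});

// Example usage with the kw-platinum.com case from above.
const result = await landingPageJudge({
  inputs: "Determine whether the landing page at kw-platinum.com is operational; answer only 'broken' or 'valid'.",
  outputs: 'valid',
});
console.log(result); // e.g. { key: 'landing_page_correctness', score: true, comment: '...' }
```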