Refactor feed processing into testable utilities with comprehensive tests

arpith · web-flow · commit 69dd9814a74c · 2025-10-12T16:02:53.000-07:00
This PR refactors feed processing code into small, testable utility functions with comprehensive test coverage using YAML test data and GitHub Actions CI.

Key improvements:
- Extracted feed processing logic into pure utility functions
- Added comprehensive test coverage with YAML test data
- Set up GitHub Actions CI to run tests automatically
- Split tests into separate files for article and feed utilities
- Added Redis key prefix constants for maintainability
- Improved code documentation with explanatory comments

All tests passing (26/26) ✓
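
For reviewers, here is a rough sketch of what the extracted helpers might look like, reconstructed only from the lines this PR removes from `src/feeds.js`. It is an illustration under assumptions, not the contents of `src/lib/feedUtils.js`: the constant names (`FEED_KEY_PREFIX` and friends) are hypothetical, the explicit radix passed to `parseInt` is an addition, and `processArticle`, `generateArticleBody`, and `extractFeedMetadata` are left out because their exact behavior is not visible in the removed lines.

```javascript
// Hypothetical sketch; the real src/lib/feedUtils.js may differ.
// Prefix constant names are assumptions (the PR only says such constants exist).
const FEED_KEY_PREFIX = 'feed:';
const ARTICLES_KEY_PREFIX = 'articles:';
const ARTICLE_KEY_PREFIX = 'article:';

// User agent string copied from the lines removed from src/feeds.js.
const USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36';

// Adds conditional GET headers only when the stored feed has the values.
function buildRequestHeaders(storedFeed) {
  const headers = { 'user-agent': USER_AGENT };
  if (storedFeed.lastModified) headers['If-Modified-Since'] = storedFeed.lastModified;
  if (storedFeed.etag) headers['If-None-Match'] = storedFeed.etag;
  return headers;
}

// Key names for the feed hash and the sorted set of its articles.
function buildRedisKeys(feedURI) {
  return {
    feedKey: `${FEED_KEY_PREFIX}${feedURI}`,
    articlesKey: `${ARTICLES_KEY_PREFIX}${feedURI}`,
  };
}

// Member name used inside the articles sorted set.
function buildArticleKey(hash) {
  return `${ARTICLE_KEY_PREFIX}${hash}`;
}

// Mirrors the old inline check: !article || !article.guid || !article.description.
function isValidArticle(article) {
  return Boolean(article && article.guid && article.description);
}

// Redis returns scores as strings, hence the parseInt before comparing.
// The radix argument is an addition; the removed line called parseInt(oldscore).
function shouldStoreArticle(oldScore, newScore) {
  return oldScore === null || newScore !== parseInt(oldScore, 10);
}

// Equivalent to the old key.substr(8), since 'article:' is 8 characters long.
function extractArticleIds(articleKeys) {
  return articleKeys.map(key => key.slice(ARTICLE_KEY_PREFIX.length));
}

console.log(buildRedisKeys('https://xkcd.com/rss.xml'));
console.log(shouldStoreArticle('1760000000000', 1760000000000));
```

Deriving the slice offset from the prefix constant, rather than hard-coding `8`, is one way the prefix constants could pay off for maintainability.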
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,32 @@
+name: Run Tests
+
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    strategy:
+      matrix:
+        node-version: [18.x, 20.x]
+
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Use Node.js ${{ matrix.node-version }}
+      uses: actions/setup-node@v3
+      with:
+        node-version: ${{ matrix.node-version }}
+
+    - name: Install dependencies
+      run: npm install
+
+    - name: Run article utility tests
+      run: node src/lib/articleUtils.test.js
+
+    - name: Run feed utility tests
+      run: node src/lib/feedUtils.test.js
diff --git a/.gitignore b/.gitignore
@@ -7,4 +7,4 @@ aws-config.json
 .env
 dump.rdb
 npm-debug.log
-lib
+/lib
diff --git a/TESTING.md b/TESTING.md
@@ -0,0 +1,103 @@
+# Testing Guide for Feed Processing Functions
+
+This document describes the refactored, testable feed processing utilities and how to run tests.
+
+## Overview
+
+The feed processing logic has been extracted into small, testable functions in `src/lib/`:
+
+- **`articleUtils.js`** - Pure functions for article hashing and scoring (no external dependencies)
+- **`feedUtils.js`** - Utility functions for feed processing (headers, Redis keys, validation, etc.)
+
+## Running Tests
+
+```bash
+node src/lib/articleUtils.test.js && node src/lib/feedUtils.test.js
+```
+
+All tests use the test data in `testdata/test-cases.yaml`, which contains expected inputs and outputs generated from the actual Node.js implementation.
+
+## Test Coverage
+
+### Article Functions (`articleUtils.js`)
+
+1. **`hash(article)`** - MD5 hash of article GUID
+   - Tests: 3 test cases verifying hash consistency
+   - Implementation matches `src/articles.js` exactly
+
+2. **`score(article)`** - Unix timestamp score
+   - Tests: 3 test cases with different date field names (pubDate, pubdate, date)
+   - Implementation matches `src/articles.js` exactly
+
+### Feed Functions (`feedUtils.js`)
+
+1. **`buildRequestHeaders(storedFeed)`** - Builds HTTP headers for conditional GET
+   - Tests: 4 test cases (no headers, If-Modified-Since, If-None-Match, both)
+
+2. **`buildRedisKeys(feedURI)`** - Creates Redis key names
+   - Tests: 2 test cases with different feed URLs
+
+3. **`buildArticleKey(hash)`** - Creates article key for Redis sorted set
+   - Tests: 1 test case verifying format
+
+4. **`processArticle(article, feedURI, hashFn, scoreFn)`** - Adds computed fields
+   - Tests: 1 test case verifying hash, score, and feedurl are added
+
+5. **`shouldStoreArticle(oldScore, newScore)`** - Determines if article needs S3 storage
+   - Tests: 4 test cases (new article, changed score, unchanged score, type coercion)
+
+6. **`isValidArticle(article)`** - Validates article has required fields
+   - Tests: 4 test cases (valid, missing guid, missing description, null)
+
+7. **`extractFeedMetadata(meta)`** - Extracts title and link from parser meta
+   - Tests: 1 test case
+
+8. **`extractArticleIds(articleKeys)`** - Strips "article:" prefix from Redis keys
+   - Tests: 1 test case
+
+## Test Data Format
+
+The `testdata/test-cases.yaml` file groups test cases under one top-level key per function, for example:
+
+```yaml
+hash_function_tests:
+  - ...
+score_function_tests:
+  - ...
+request_headers_tests:
+  - ...
+```
+
+Each test case has:
+- `description` - Human-readable test description
+- `input` - Input value(s) for the function
+- `expected` - Expected output value
+
+## Adding New Tests
+
+1. Add test data to `testdata/test-cases.yaml`
+2. Add corresponding test code in `src/lib/articleUtils.test.js` or `src/lib/feedUtils.test.js`
+3. Run tests to verify
+
+## Future Work
+
+Next steps:
+1. Finish moving `src/feeds.js` onto these utility functions (it does not yet import `extractFeedMetadata`)
+2. Add integration tests for Redis and S3 operations
+3. Create Go implementation with matching behavior (in `feedfetcher/` directory)
+4. Create Go tests that use the same `testdata/test-cases.yaml` file
+
+## Why These Functions?
+
+These functions were extracted because they are:
+1. **Pure or nearly pure** - Deterministic output for given input
+2. **Core business logic** - Critical for feed processing correctness
+3. **Reusable** - Can be used by both Node.js and Go implementations
+4. **Independently testable** - No mocking of Redis/S3 needed
+
+The goal is to ensure both Node.js and Go implementations produce identical results for:
+- Article hashing (critical for deduplication)
+- Article scoring (critical for sorting)
+- Request headers (critical for conditional GET optimization)
+- Redis key naming (critical for data storage)
+- S3 storage decisions (critical for performance)
diff --git a/package.json b/package.json
@@ -47,6 +47,7 @@
     "eslint-config-airbnb": "^11.1.0",
     "eslint-plugin-import": "^1.15.0",
     "eslint-plugin-jsx-a11y": "^2.2.2",
-    "eslint-plugin-react": "^6.2.0"
+    "eslint-plugin-react": "^6.2.0",
+    "js-yaml": "^4.1.0"
   }
 }
diff --git a/src/articles.js b/src/articles.js
@@ -1,16 +1,11 @@
-import crypto from 'crypto';
 import AWS from 'aws-sdk';
 import labels from './labels';
+// Import hash and score functions from testable utilities
+import { hash as hashArticle, score as scoreArticle } from './lib/articleUtils.js';
 
-export function hash(article) {
-  return crypto.createHash('md5').update(article.guid).digest('hex');
-}
-
-export function score(article) {
-  const articleDate = article.pubDate || article.pubdate || article.date;
-  const articleScore = Date.parse(articleDate) || Date.now();
-  return articleScore;
-}
+// Re-export for backward compatibility
+export const hash = hashArticle;
+export const score = scoreArticle;
 
 function post(req, res) {
   res.json({
diff --git a/src/feeds.js b/src/feeds.js
@@ -3,6 +3,16 @@ import FeedParser from 'feedparser';
 import request from 'request';
 import AWS from 'aws-sdk';
 import { hash, score } from './articles';
+import {
+  buildRequestHeaders,
+  buildRedisKeys,
+  buildArticleKey,
+  processArticle,
+  shouldStoreArticle,
+  isValidArticle,
+  extractArticleIds,
+  generateArticleBody,
+} from './lib/feedUtils.js';
 
 const redisURL = process.env.REDIS_URL;
 const redisClient = redis.createClient(redisURL);
@@ -83,7 +93,7 @@ function get(req, res) {
                           const feed = storedFeed;
                           feed.key = feedurl;
                           feeds.push(feed);
-                          const articleIDs = articles.map(key => key.substr(8));
+                          const articleIDs = extractArticleIds(articles);
                           if (feedurlPosition === feedurls.length - 1) {
                             res.json({
                               success: true,
@@ -111,17 +121,12 @@ const feed = {
     const params = { Bucket: 'feedreader2018-articles' };
     const s3 = new AWS.S3({ params });
     const feedURI = decodeURIComponent(req.url.slice(10));
-    const feedKey = `feed:${feedURI}`;
-    const articlesKey = `articles:${feedURI}`;
+    const { feedKey, articlesKey } = buildRedisKeys(feedURI);
 
     redisClient.hgetall(feedKey, (e, storedFeed) => {
       let fetchedFeed = {};
       if ((!e) && storedFeed) fetchedFeed = storedFeed;
-      const headers = {
-        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
-      };
-      if (fetchedFeed.lastModified) headers['If-Modified-Since'] = fetchedFeed.lastModified;
-      if (fetchedFeed.etag) headers['If-None-Match'] = fetchedFeed.etag;
+      const headers = buildRequestHeaders(fetchedFeed);
 
       const requ = request({
         uri: feedURI,
@@ -187,16 +192,14 @@ const feed = {
         const stream = this;
         for (;;) {
           const article = stream.read();
-          if (!article || !article.guid || !article.description) {
+          if (!isValidArticle(article)) {
             return;
           }
-          article.hash = hash(article);
-          article.score = score(article);
-          article.feedurl = feedURI;
 
-          const key = article.hash;
-          const rank = article.score;
-          const articleKey = `article:${key}`;
+          const processedArticle = processArticle(article, feedURI, hash, score);
+          const key = processedArticle.hash;
+          const rank = processedArticle.score;
+          const articleKey = buildArticleKey(key);
 
           redisClient.zscore(articlesKey, articleKey, (zscoreErr, oldscore) => {
             if (zscoreErr) {
@@ -211,9 +214,9 @@ const feed = {
                   articleAddErr.type = 'Redis Error';
                   articleAddErr.log = zaddErr.message;
                   stream.emit('error', articleAddErr);
-                } else if ((oldscore === null) || (rank !== parseInt(oldscore))) {
+                } else if (shouldStoreArticle(oldscore, rank)) {
                   // Only stringify when we actually need to store it
-                  const body = JSON.stringify(article);
+                  const body = generateArticleBody(processedArticle);
                   s3.putObject({
                     Key: key,
                     Body: body,
@@ -245,7 +248,7 @@ const feed = {
             });
           } else {
             fetchedFeed.success = true;
-            fetchedFeed.articles = allArticles.map(key => key.substr(8));
+            fetchedFeed.articles = extractArticleIds(allArticles);
             res.json(fetchedFeed);
           }
         });
diff --git a/src/lib/articleUtils.js b/src/lib/articleUtils.js
@@ -0,0 +1,28 @@
+// Pure utility functions for article processing (no external dependencies)
+// These can be tested without AWS or Redis
+
+const crypto = require('crypto');
+
+/**
+ * Generates MD5 hash of article GUID
+ * Reference: api/src/articles.js hash() function
+ * @param {Object} article - Article object with guid field
+ * @returns {string} MD5 hash in hex format
+ */
+function hash(article) {
+  return crypto.createHash('md5').update(article.guid).digest('hex');
+}
+
+/**
+ * Generates score (timestamp) for article
+ * Reference: api/src/articles.js score() function
+ * @param {Object} article - Article object with date fields
+ * @returns {number} Unix timestamp in milliseconds
+ */
+function score(article) {
+  const articleDate = article.pubDate || article.pubdate || article.date;
+  const articleScore = Date.parse(articleDate) || Date.now();
+  return articleScore;
+}
+
+module.exports = { hash, score };
diff --git a/src/lib/articleUtils.test.js b/src/lib/articleUtils.test.js
@@ -0,0 +1,63 @@
+// Tests for article utility functions (hash and score)
+// Run from the repository root (test data path is relative): node src/lib/articleUtils.test.js
+
+const { hash, score } = require('./articleUtils.js');
+const fs = require('fs');
+const yaml = require('js-yaml');
+const assert = require('assert');
+
+// Load test cases from YAML
+const testCasesYaml = fs.readFileSync('./testdata/test-cases.yaml', 'utf8');
+const testCases = yaml.load(testCasesYaml);
+
+// Simple test runner
+let passed = 0;
+let failed = 0;
+
+function test(name, fn) {
+  try {
+    fn();
+    passed++;
+    console.log(`✓ ${name}`);
+  } catch (error) {
+    failed++;
+    console.error(`✗ ${name}`);
+    console.error(`  ${error.message}`);
+  }
+}
+
+// Run all tests
+console.log('\n=== Testing Article Utility Functions ===\n');
+
+// Test hash function
+testCases.hash_function_tests.forEach((testCase) => {
+  test(testCase.description, () => {
+    const result = hash(testCase.input);
+    assert.strictEqual(result, testCase.expected,
+      `Hash mismatch: got ${result}, expected ${testCase.expected}`);
+  });
+});
+
+// Test score function
+testCases.score_function_tests.forEach((testCase) => {
+  test(testCase.description, () => {
+    const result = score(testCase.input);
+    if (testCase.expected_type === 'timestamp') {
+      // For invalid dates that fall back to Date.now(), just check it's a number
+      assert.strictEqual(typeof result, 'number',
+        `Score should be a number: got ${typeof result}`);
+      assert.ok(result > 0, `Score should be positive: got ${result}`);
+    } else {
+      assert.strictEqual(result, testCase.expected,
+        `Score mismatch: got ${result}, expected ${testCase.expected}`);
+    }
+  });
+});
+
+// Print summary
+console.log(`\n=== Test Summary ===`);
+console.log(`Passed: ${passed}`);
+console.log(`Failed: ${failed}`);
+console.log(`Total:  ${passed + failed}\n`);
+
+process.exit(failed > 0 ? 1 : 0);
diff --git a/src/lib/feedUtils.js b/src/lib/feedUtils.js
diff --git a/src/lib/feedUtils.test.js b/src/lib/feedUtils.test.js
diff --git a/testdata/test-cases.yaml b/testdata/test-cases.yaml
diff --git a/testdata/xkcd.xml b/testdata/xkcd.xml
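
The behavior of `hash` and `score` from the `src/lib/articleUtils.js` diff above can be exercised standalone. The snippet below copies the two function bodies from that diff; the sample articles and the `example.com` GUIDs are made up for illustration. It also shows one consequence of the `Date.parse(...) || Date.now()` fallback that may be worth a test case: a date that parses to exactly `0` (the Unix epoch) is falsy, so it is treated like a missing date.

```javascript
const crypto = require('crypto');

// Copied from the src/lib/articleUtils.js diff.
function hash(article) {
  return crypto.createHash('md5').update(article.guid).digest('hex');
}

// Copied from the src/lib/articleUtils.js diff.
function score(article) {
  const articleDate = article.pubDate || article.pubdate || article.date;
  const articleScore = Date.parse(articleDate) || Date.now();
  return articleScore;
}

// Hypothetical sample articles, not taken from testdata/.
const dated = { guid: 'https://example.com/post/1', pubDate: '2025-10-13T00:00:00Z' };
const undated = { guid: 'https://example.com/post/2' };
const epoch = { guid: 'https://example.com/post/3', date: '1970-01-01T00:00:00Z' };

// MD5 hex digests are always 32 characters, and the same GUID always hashes the same.
console.log(hash(dated).length === 32, hash(dated) === hash({ guid: dated.guid }));

// A parseable date yields its millisecond timestamp.
console.log(score(dated) === Date.UTC(2025, 9, 13));

// Missing or unparseable dates fall back to the current time, so the result
// is non-deterministic and can only be checked by type and sign.
console.log(typeof score(undated) === 'number' && score(undated) > 0);

// Date.parse returns 0 for the epoch, and 0 || Date.now() picks Date.now().
console.log(score(epoch) !== 0);
```

This is presumably why the test runner checks only the type and sign of the result when a case is marked `expected_type: timestamp`.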
