This guide explains how to set up Archivist for development, whether you're contributing to the project or using it as a base for your own work.
- Development Setup
- Working with Forks
- Creating Feature Branches
- Installing as a Local Tool
- Project Structure
- Development Workflow
- Testing
- Debugging
- Contributing
Prerequisites:
- Bun v1.0 or higher
- Git
- A Pure.md API key (for testing actual crawling)
- Clone the repository:
  git clone https://github.com/stellarwp/archivist.git
  cd archivist
- Install dependencies:
  bun install
- Set up environment (optional):
  cp .env.example .env
  # Edit .env with your Pure.md API key
If you plan to make significant changes or use Archivist as a base for your own project:
- Fork the repository on GitHub
- Clone your fork:
  git clone https://github.com/YOUR_USERNAME/archivist.git
  cd archivist
- Add upstream remote:
  git remote add upstream https://github.com/stellarwp/archivist.git
- Keep your fork updated:
  # Fetch upstream changes
  git fetch upstream
  # Merge upstream main into your main
  git checkout main
  git merge upstream/main
  git push origin main
For new features or experiments:
- Create a new branch:
  git checkout -b feature/my-awesome-feature
- Make your changes and commit regularly:
  git add .
  git commit -m "feat: Add awesome feature"
- Push to your fork:
  git push origin feature/my-awesome-feature
To use Archivist in other projects while developing:
- In the archivist directory:
  bun link
- In your project directory:
  bun link @stellarwp/archivist
- Use in your project:
  archivist crawl --config ./my-config.json
Alternatively, to install from a local path:
- Build the package (if needed):
  bun run build
- In your project, install from the local path:
  bun add file:../path/to/archivist
Install from the npm registry, or directly from a specific Git branch or commit:
# From NPM registry
bun add @stellarwp/archivist
# From main branch
bun add git+https://github.com/stellarwp/archivist.git
# From specific branch
bun add git+https://github.com/stellarwp/archivist.git#feature/my-branch
# From your fork
bun add git+https://github.com/YOUR_USERNAME/archivist.git#your-branch
archivist/
├── src/
│ ├── cli.ts # CLI entry point with commands
│ ├── index.ts # Main module export
│ ├── di/ # Dependency injection
│ │ └── container.ts # tsyringe DI container setup
│ ├── services/ # Core service layer
│ │ ├── archive-crawler.service.ts # Crawls individual archives
│ │ ├── web-crawler.service.ts # Orchestrates crawling process
│ │ ├── config.service.ts # Configuration management
│ │ ├── state.service.ts # State and progress tracking
│ │ ├── logger.service.ts # Logging and output formatting
│ │ ├── http.service.ts # HTTP client configuration
│ │ ├── link-discoverer.ts # Cheerio-based link discovery
│ │ └── pure-md.ts # Pure.md API client
│ ├── strategies/ # Source crawling strategies
│ │ ├── base-strategy.ts # Abstract base strategy
│ │ ├── explorer-strategy.ts # Single page link extraction
│ │ ├── pagination-strategy.ts # Paginated content handling
│ │ └── strategy-factory.ts # Strategy creation factory
│ ├── utils/ # Utility functions
│ │ ├── content-formatter.ts # Output formatting (MD/JSON/HTML)
│ │ ├── file-naming.ts # File naming strategies
│ │ ├── link-extractor.ts # Link extraction helper
│ │ ├── pattern-matcher.ts # URL pattern matching
│ │ ├── markdown-parser.ts # Markdown parsing utilities
│ │ ├── pure-api-key.ts # API key resolution
│ │ └── axios-config.ts # Axios configuration
│ ├── types/ # TypeScript type definitions
│ │ └── source-strategy.ts # Strategy type definitions
│ └── version.ts # Version management
├── tests/
│ ├── unit/ # Unit tests
│ │ ├── services/ # Service tests
│ │ ├── utils/ # Utility tests
│ │ └── strategies/ # Strategy tests
│ └── integration/ # Integration tests
├── examples/ # Example configurations
│ ├── simple-single-archive.config.json
│ ├── multi-archive.config.json
│ ├── documentation-crawler.config.json
│ ├── pagination.config.json
│ └── link-collection.config.json
├── .github/
│ └── workflows/ # GitHub Actions
├── archivist.config.ts # Configuration schema
├── package.json # Package metadata
├── tsconfig.json # TypeScript config
└── bun.lockb # Lock file
Watch for changes and auto-restart:
bun run dev
# Run all tests
bun test
# Run specific test file
bun test tests/unit/utils/file-naming.test.ts
# Run tests in watch mode
bun test:watch
# Run with coverage
bun test:coverage
Type-check the project without emitting output:
bunx tsc --noEmit
Create a test configuration:
# Create test config
cat > test.config.json << EOF
{
  "archives": [{
    "name": "Test Archive",
    "sources": [{
      "url": "https://example.com",
      "depth": 0
    }],
    "output": {
      "directory": "./test-output",
      "format": "markdown"
    }
  }],
  "crawl": {
    "maxConcurrency": 1,
    "delay": 1000
  }
}
EOF
# Run crawler with test config
bun run src/cli.ts crawl --config test.config.json
# All tests
bun test
# Unit tests only
bun test tests/unit
# Integration tests only
bun test tests/integration
# Specific test suite
bun test tests/unit/utils
Example test structure:
import { describe, expect, it } from 'bun:test';
import { myFunction } from '../src/myModule';
describe('myFunction', () => {
  it('should do something', () => {
    const result = myFunction('input');
    expect(result).toBe('expected output');
  });
});
For external dependencies like Pure.md, mock the HTTP client module:
import { mock } from 'bun:test';
mock.module('axios', () => ({
  default: {
    create: mock(() => ({
      get: mock().mockResolvedValue({ data: 'mocked data' })
    }))
  }
}));
Add debug output:
console.log('Debug:', variable);
Run with the inspector enabled:
bun --inspect run src/cli.ts crawl
Then open chrome://inspect in Chrome.
Set debug environment variables:
DEBUG=* bun run archive crawl
- Use TypeScript for all code
- Follow existing patterns in the codebase
- Prefer functional programming where appropriate
- Add types for all function parameters and returns
Follow conventional commits:
feat: Add new feature
fix: Fix bug
docs: Update documentation
test: Add tests
refactor: Refactor code
chore: Update dependencies
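The convention above can be checked mechanically. The following is a minimal sketch, assuming only the six commit types listed here; `parseCommitType` is a hypothetical helper and not part of Archivist, and the optional `(scope)` segment is an assumption borrowed from the wider conventional-commits format.

```typescript
// Hypothetical helper, not part of Archivist: checks a commit subject line
// against the conventional-commit types listed above.
const COMMIT_TYPES = ["feat", "fix", "docs", "test", "refactor", "chore"] as const;

type CommitType = (typeof COMMIT_TYPES)[number];

// Matches "<type>: <description>" or "<type>(<scope>): <description>".
// The optional scope is an assumption; the list above only shows "<type>: ...".
const COMMIT_PATTERN = new RegExp(
  `^(${COMMIT_TYPES.join("|")})(\\([\\w-]+\\))?: \\S.*$`,
);

// Returns the commit type when the subject follows the convention, else null.
function parseCommitType(subject: string): CommitType | null {
  const match = COMMIT_PATTERN.exec(subject);
  return match ? (match[1] as CommitType) : null;
}

console.log(parseCommitType("feat: Add awesome feature")); // feat
console.log(parseCommitType("Added awesome feature")); // null
```

A check like this could run in a commit hook or CI step, though whether the project enforces the convention automatically is not stated here.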
- Create a feature branch from main
- Make changes with clear commits
- Add tests for new functionality
- Update documentation if needed
- Run tests locally
- Push branch and create PR
- Address review feedback
Before submitting a PR:
- All tests pass (bun test)
- TypeScript compiles (bunx tsc --noEmit)
- New features have tests
- Documentation is updated
- Commit messages follow convention
Archivist uses tsyringe for dependency injection, providing better testability and separation of concerns:
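import { singleton } from 'tsyringe';
// Import paths assume this file lives in src/services/
import { ConfigService } from './config.service';
import { LoggerService } from './logger.service';
// @singleton() registers one shared instance; the container resolves the constructor dependencies
@singleton()
export class MyService {
  constructor(
    private configService: ConfigService,
    private logger: LoggerService
  ) {}
}
WebCrawlerService is the main orchestrator that coordinates the crawling process across multiple archives.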
const webCrawler = appContainer.resolve(WebCrawlerService);
await webCrawler.collectAllUrls();
await webCrawler.crawlAll({ clean: true });
ArchiveCrawlerService handles crawling individual archives, including URL discovery and content extraction.
const archiveCrawler = appContainer.resolve(ArchiveCrawlerService);
const urls = await archiveCrawler.collectUrlsFromSource(sourceUrl, sourceConfig);
await archiveCrawler.crawlUrls(archive, archiveState);
ConfigService manages configuration with environment variable fallbacks.
const configService = appContainer.resolve(ConfigService);
configService.initialize(config, configPath);
const archives = configService.getArchives();
StateService tracks crawling progress and manages state persistence.
const stateService = appContainer.resolve(StateService);
stateService.initializeArchive(archiveName);
stateService.addToQueue(archiveName, url);
Source strategies allow flexible URL discovery:
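// Creating a custom strategy
// (import BaseStrategy, StrategyFactory, and StrategyResult from src/strategies/ and src/types/ as appropriate)
export class CustomStrategy extends BaseStrategy {
  type = 'custom';
  async execute(sourceUrl: string, config: any): Promise<StrategyResult> {
    // Custom URL discovery logic goes here; this placeholder returns only the source URL
    const discoveredUrls: string[] = [sourceUrl];
    return { urls: discoveredUrls };
  }
}
// Registering the strategy
StrategyFactory.registerStrategy('custom', () => new CustomStrategy());
To add a new output format: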
- Add the format to the schema in archivist.config.ts
- Implement the formatter in src/utils/content-formatter.ts
- Update the crawler to use the new format
- Add tests for the new format
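The steps above can be sketched as follows. This is an illustration only: the actual contents of src/utils/content-formatter.ts are not shown in this guide, so the `PageContent` type, the `formatContent` function, and the new `"text"` format are all hypothetical names chosen for the example.

```typescript
// Hypothetical sketch: the real src/utils/content-formatter.ts may look different.
// "markdown", "json" and "html" come from this guide; "text" is the new format.
type OutputFormat = "markdown" | "json" | "html" | "text";

interface PageContent {
  url: string;
  title: string;
  body: string;
}

// Escapes the characters that would otherwise be parsed as HTML markup.
function escapeHtml(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function formatContent(page: PageContent, format: OutputFormat): string {
  switch (format) {
    case "markdown":
      return `# ${page.title}\n\nSource: ${page.url}\n\n${page.body}\n`;
    case "json":
      return JSON.stringify(page, null, 2);
    case "html":
      return `<article><h1>${escapeHtml(page.title)}</h1><p>${escapeHtml(page.body)}</p></article>`;
    case "text":
      // The new format: plain text with no markup.
      return `${page.title}\n${page.url}\n\n${page.body}\n`;
    default: {
      // Compile-time exhaustiveness check: adding a format to the union
      // without handling it here becomes a type error.
      const unreachable: never = format;
      throw new Error(`Unsupported format: ${String(unreachable)}`);
    }
  }
}
```

The `never` check in the default branch is one way to make sure a format added to the schema is not forgotten in the formatter.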
To add a new naming strategy:
- Add the strategy to the schema
- Implement it in src/utils/file-naming.ts
- Update the crawler to use the new strategy
- Add tests
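As a rough illustration of these steps, the sketch below adds one naming strategy next to two existing ones. The strategy names (`"url"`, `"title"`, `"date-title"`) and the `fileNameFor` function are assumptions made for the example; the actual strategies in src/utils/file-naming.ts may be named and structured differently.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not Archivist's actual API.
type NamingStrategy = "url" | "title" | "date-title"; // "date-title" is the new strategy

// Lowercases the value and reduces it to dash-separated alphanumerics.
function slugify(value: string): string {
  return value
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each non-alphanumeric run into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
}

function fileNameFor(
  strategy: NamingStrategy,
  page: { url: string; title: string },
  extension: string,
  now: Date = new Date(),
): string {
  // Record<NamingStrategy, ...> forces an entry for every strategy in the union.
  const names: Record<NamingStrategy, () => string> = {
    url: () => slugify(page.url.replace(/^[a-z]+:\/\//i, "")), // drop the protocol
    title: () => slugify(page.title),
    "date-title": () => `${now.toISOString().slice(0, 10)}-${slugify(page.title)}`,
  };
  // Fall back to a fixed name when the slug comes out empty.
  return `${names[strategy]() || "untitled"}.${extension}`;
}
```

Passing `now` as a parameter keeps the date-based strategy deterministic in tests.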
To add new crawling features:
- Update the configuration schema
- Modify the crawler services in src/services/ (for example, web-crawler.service.ts or archive-crawler.service.ts)
- Add integration tests
- Update documentation
If Bun commands aren't working:
# Reinstall Bun
curl -fsSL https://bun.sh/install | bash
If you see TypeScript errors:
# Clean and reinstall
rm -rf node_modules bun.lockb
bun install
If tests fail unexpectedly:
# Clear test cache
rm -rf .bun
# Run tests with more output
bun test --verbose