Commit db19dfe

Merge pull request #1 from cpetersen/svd
Adding SVD
2 parents e31fe88 + 52d19a8 commit db19dfe

7 files changed: +299 -117 lines changed

.github/workflows/ci.yml

Lines changed: 23 additions & 81 deletions
@@ -1,84 +1,26 @@
-name: CI
-
-on:
-  push:
-    branches: [ main, develop ]
-  pull_request:
-    branches: [ main ]
-
+name: build
+on: pull_request
 jobs:
-  test:
-    runs-on: ${{ matrix.os }}
-    strategy:
-      fail-fast: false
-      matrix:
-        os: [ubuntu-latest, macos-latest, windows-latest]
-        ruby: ['2.7', '3.0', '3.1', '3.2', '3.3']
-        exclude:
-          # Windows doesn't play nice with older Ruby versions
-          - os: windows-latest
-            ruby: '2.7'
-
-    steps:
-      - uses: actions/checkout@v4
-
-      - name: Set up Ruby
-        uses: ruby/setup-ruby@v1
-        with:
-          ruby-version: ${{ matrix.ruby }}
-          bundler-cache: true
-
-      - name: Install Rust
-        uses: actions-rs/toolchain@v1
-        with:
-          toolchain: stable
-          override: true
-          components: rustfmt, clippy
-
-      - name: Install dependencies
-        run: bundle install
-
-      - name: Run Rust checks
-        run: |
-          bundle exec rake rust:fmt
-          bundle exec rake rust:clippy
-
-      - name: Compile extension
-        run: bundle exec rake compile
-
-      - name: Run tests
-        run: bundle exec rake test
-
-  build-gems:
+  build:
     runs-on: ubuntu-latest
-    needs: test
-    if: github.ref == 'refs/heads/main'
-
     steps:
-      - uses: actions/checkout@v4
-
-      - name: Set up Ruby
-        uses: ruby/setup-ruby@v1
-        with:
-          ruby-version: '3.2'
-
-      - name: Install Rust
-        uses: actions-rs/toolchain@v1
-        with:
-          toolchain: stable
-          override: true
-
-      - name: Install dependencies
-        run: |
-          gem install rake-compiler rake-compiler-dock
-          bundle install
-
-      - name: Build native gems
-        run: |
-          bundle exec rake gem:native
-
-      - name: Upload artifacts
-        uses: actions/upload-artifact@v3
-        with:
-          name: native-gems
-          path: pkg/*.gem
+      - uses: actions/checkout@v3
+      - uses: actions/cache@v3
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            tmp
+          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
+
+      - uses: ruby/setup-ruby@v1
+        with:
+          ruby-version: ruby
+          bundler-cache: true
+
+      - name: Compile native extension
+        run: bundle exec rake compile
+
+      - name: Run specs
+        run: |
+          bundle exec rake spec
README.md

Lines changed: 150 additions & 17 deletions
@@ -4,7 +4,8 @@ High-performance dimensionality reduction for Ruby, powered by the [annembed](ht
 
 ## Features
 
-- **UMAP algorithm**: State-of-the-art dimensionality reduction
+- **Multiple algorithms**: UMAP, t-SNE, LargeVis, and Diffusion Maps for dimensionality reduction
+- **SVD**: Randomized Singular Value Decomposition for linear dimensionality reduction
 - **High performance**: Leverages Rust's speed and parallelization
 - **Easy to use**: Simple, scikit-learn-like API
 - **Model persistence**: Save and load trained models
@@ -31,37 +32,169 @@ Or install it yourself as:
 - Ruby 2.7 or higher
 - Rust toolchain (for building from source)
 
-## Quick Start
+## Quick Start - Interactive Example
+
+Copy and paste this into IRB to try out the main features:
 
 ```ruby
 require 'annembed'
+require 'annembed/svd' # For SVD functionality
 
-# Generate some sample data (2D array)
+# Generate sample high-dimensional data
+# Imagine this is text embeddings, image features, or any high-dim data
+puts "Creating sample data: 100 points in 50 dimensions"
 data = Array.new(100) { Array.new(50) { rand } }
 
-# Create a UMAP instance
-umap = AnnEmbed::UMAP.new(n_components: 2, n_neighbors: 15)
+# ============================================================
+# 1. UMAP - State-of-the-art non-linear dimensionality reduction
+# ============================================================
+puts "\n1. UMAP - Reducing to 2D for visualization"
+embedder = AnnEmbed::Embedder.new(
+  method: :umap,
+  n_components: 2, # Reduce to 2D
+  n_neighbors: 15 # Balance local/global structure
+)
+
+# Fit and transform the data
+umap_result = embedder.fit_transform(data)
+puts " Shape: #{umap_result.size} points × #{umap_result.first.size} dimensions"
+puts " First point: [#{umap_result.first.map { |v| v.round(3) }.join(', ')}]"
 
-# Fit and transform in one step
-embedding = umap.fit_transform(data)
+# Save the trained model
+embedder.save("umap_model.bin")
+puts " Model saved to umap_model.bin"
+
+# ============================================================
+# 2. t-SNE - Popular for visualization, especially clusters
+# ============================================================
+puts "\n2. t-SNE - Alternative visualization method"
+tsne = AnnEmbed::Embedder.new(
+  method: :tsne,
+  n_components: 2,
+  perplexity: 30.0 # Balances local/global structure
+)
 
-# Or fit and transform separately
-umap.fit(data)
-embedding = umap.transform(data)
+tsne_result = tsne.fit_transform(data)
+puts " Shape: #{tsne_result.size} points × #{tsne_result.first.size} dimensions"
+puts " First point: [#{tsne_result.first.map { |v| v.round(3) }.join(', ')}]"
+
+# ============================================================
+# 3. SVD - Fast linear dimensionality reduction
+# ============================================================
+puts "\n3. SVD - Linear dimensionality reduction (like PCA)"
+# Reduce to top 10 components
+u, s, vt = AnnEmbed.svd(data, 10, n_iter: 2)
+puts " U shape: #{u.size}×#{u.first.size} (transformed data)"
+puts " S values: [#{s[0..2].map { |v| v.round(2) }.join(', ')}, ...]"
+puts " V^T shape: #{vt.size}×#{vt.first.size} (components)"
+
+# The reduced data is in U
+svd_result = u
+puts " First point: [#{svd_result.first[0..2].map { |v| v.round(3) }.join(', ')}, ...]"
+
+# ============================================================
+# 4. Transform new data with a trained model
+# ============================================================
+puts "\n4. Transforming new data with saved UMAP model"
+# Load the saved model
+loaded = AnnEmbed::Embedder.load("umap_model.bin")
+
+# New data (5 new points)
+new_data = Array.new(5) { Array.new(50) { rand } }
+new_embedding = loaded.transform(new_data)
+puts " New data shape: #{new_embedding.size}×#{new_embedding.first.size}"
+puts " First new point: [#{new_embedding.first.map { |v| v.round(3) }.join(', ')}]"
+
+# ============================================================
+# 5. Comparison - Which method to use?
+# ============================================================
+puts "\n5. Quick comparison:"
+puts " UMAP: Best for preserving both local and global structure"
+puts " t-SNE: Great for visualizing clusters, but slower"
+puts " SVD: Fastest, linear, good for denoising or pre-processing"
+
+# ============================================================
+# 6. Practical tip: Reduce dimensions for faster similarity search
+# ============================================================
+puts "\n6. Example: Speeding up similarity search"
+# Original: 100 points × 50 dimensions = 5000 numbers to store
+# After UMAP: 100 points × 2 dimensions = 200 numbers to store
+# That's 25× less storage and faster distance calculations!
+
+puts "\nStorage comparison:"
+puts " Original: #{data.size * data.first.size} floats"
+puts " After UMAP: #{umap_result.size * umap_result.first.size} floats"
+puts " Reduction: #{((1 - (umap_result.first.size.to_f / data.first.size)) * 100).round(1)}%"
+
+puts "\n✅ Done! You've just reduced 50D data to 2D using three different methods!"
+```
 
-# Check if model is fitted
-puts "Model fitted: #{umap.fitted?}"
+## Quick Start - Simplified API
 
-# Save the model for later use
-umap.save("model.bin")
+For convenience, you can also use the simplified API:
 
-# Load and use a saved model
-loaded_umap = AnnEmbed::UMAP.load("model.bin")
-new_embedding = loaded_umap.transform(new_data)
+```ruby
+require 'annembed'
+require 'annembed/svd'
+
+# Generate sample data
+data = Array.new(100) { Array.new(50) { rand } }
+
+# One-line dimensionality reduction
+umap_2d = AnnEmbed.umap(data, n_components: 2)
+tsne_2d = AnnEmbed.tsne(data, n_components: 2)
+u, s, vt = AnnEmbed.svd(data, 10) # Top 10 components
+
+# Results are ready to use!
+puts "UMAP result: #{umap_2d.first}"
+puts "t-SNE result: #{tsne_2d.first}"
+puts "SVD result: #{u.first}"
 ```
 
 ## API Reference
 
+### AnnEmbed::Embedder
+
+The universal class for all dimensionality reduction algorithms.
+
+```ruby
+# Create an embedder with any supported method
+embedder = AnnEmbed::Embedder.new(
+  method: :umap, # :umap, :tsne, :largevis, or :diffusion
+  n_components: 2, # Target dimensions
+  **options # Method-specific options
+)
+
+# Methods work the same for all algorithms
+result = embedder.fit_transform(data)
+embedder.save("model.bin")
+loaded = AnnEmbed::Embedder.load("model.bin")
+```
+
+### AnnEmbed::SVD
+
+Randomized Singular Value Decomposition for fast linear dimensionality reduction.
+
+```ruby
+# Perform SVD
+u, s, vt = AnnEmbed.svd(matrix, k, n_iter: 2)
+
+# Parameters:
+# matrix: 2D array of data
+# k: Number of components to keep
+# n_iter: Number of iterations for randomized algorithm (default: 2)
+
+# Returns:
+# u: Left singular vectors (transformed data)
+# s: Singular values (importance of each component)
+# vt: Right singular vectors transposed (components)
+
+# Example: Reduce 100×50 matrix to 100×10
+data = Array.new(100) { Array.new(50) { rand } }
+u, s, vt = AnnEmbed.svd(data, 10)
+reduced_data = u # This is your reduced 100×10 data
+```
+
 ### AnnEmbed::UMAP
 
 The main class for UMAP dimensionality reduction.
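
The README added in this commit treats `u` as the reduced data. In a standard SVD the columns of U are unit-length, so PCA-style coordinates are usually taken as U scaled by the singular values (U · diag(S)), and U · diag(S) · Vᵀ gives the rank-k approximation of the input. Whether `AnnEmbed.svd` already returns a scaled U is not stated in the diff, so the sketch below is only an illustration under that assumption. It uses plain Ruby arrays shaped like the documented return values, and the helper names `scale_by_singular_values` and `reconstruct` are hypothetical, not part of the gem.

```ruby
# Hypothetical helper (not part of the annembed gem): scale each column of U
# by its singular value to get PCA-style coordinates, i.e. U * diag(S).
def scale_by_singular_values(u, s)
  u.map { |row| row.each_with_index.map { |value, j| value * s[j] } }
end

# Hypothetical helper: rebuild the rank-k approximation U * diag(S) * V^T.
def reconstruct(u, s, vt)
  scaled = scale_by_singular_values(u, s)
  scaled.map do |row|
    (0...vt.first.size).map do |col|
      row.each_index.sum { |j| row[j] * vt[j][col] }
    end
  end
end

# Tiny hand-made factors (2 points, k = 2, 3 original features), chosen only
# so the arithmetic is easy to follow; they do not come from AnnEmbed.svd.
u  = [[1.0, 0.0], [0.0, 1.0]]
s  = [3.0, 2.0]
vt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

scores = scale_by_singular_values(u, s) # 2 x 2 scaled coordinates
approx = reconstruct(u, s, vt)          # 2 x 3 approximation of the input
```

With real output from `AnnEmbed.svd(data, 10)`, the same helpers would apply to the returned `u`, `s`, and `vt`, provided they have the shapes the README describes (100×10, length 10, and 10×50).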

annembed-ruby.gemspec

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ Gem::Specification.new do |spec|
   # spec.add_dependency "numo-narray", "~> 0.9"
 
   # Development dependencies
+  spec.add_development_dependency "csv"
   spec.add_development_dependency "rake", "~> 13.0"
   spec.add_development_dependency "rake-compiler", "~> 1.2"
   spec.add_development_dependency "rb_sys", "~> 0.9"

ext/annembed_ruby/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ crate-type = ["cdylib"]
 magnus = { version = "0.6", features = ["embed"] }
 annembed = { git = "https://github.com/jean-pierreBoth/annembed" }
 hnsw_rs = "0.3"
-ndarray = "0.15"
+ndarray = "0.16"
 num-traits = "0.2"
 rayon = "1.7"
 serde = { version = "1.0", features = ["derive"] }
