Skip to content

Commit e8ef4a2

Browse files
alanshawlidel
andauthored
feat: custom chunker registry (#1116)
* feat: custom chunker registry * doc: update changelog * fix: protect chunker registry from concurrent access Register writes and FromString reads the splitters map with no synchronization, which is a data race if called concurrently. - chunker/registry.go: add splittersMu sync.RWMutex, Lock in Register - chunker/parse.go: RLock/RUnlock around map read in FromString * fix: validate inputs and reject duplicates in chunker Register Register silently accepted empty names, names with dashes (unmatchable by FromString which splits on dash), nil constructors, and duplicate registrations that could overwrite built-ins. - chunker/registry.go: panic on empty name, dash in name, nil ctor, duplicate name (follows database/sql.Register convention) - chunker/registry_test.go: use unique names per subtest, add panic test cases, add comment explaining why not parallel * refactor: rename CtorFunc to SplitterFunc CtorFunc is an uncommon abbreviation. SplitterFunc follows Go naming conventions (like http.HandlerFunc) and clearly communicates purpose. - chunker/registry.go: rename type and all usages * docs: improve chunker registry and FromString godocs FromString doc only listed built-in chunkers and didn't mention custom ones. Package doc didn't mention extensibility. SplitterFunc had no doc. - chunker/registry.go: add SplitterFunc doc, rewrite Register doc with init-time usage example and panic contract - chunker/parse.go: rewrite FromString doc to list built-in chunkers and mention Register - chunker/splitting.go: expand package doc to mention extensibility via Register and FromString * test: cover default, unknown, and strict size-{n} parse paths - chunker/parse_test.go: add TestParseDefault ("", "default"), TestParseUnknown, reject "size-123-extra" in TestParseSize * docs: note size- parsing strictness change in CHANGELOG The old FromString silently accepted "size-123-extra" by only reading the first segment after the dash. The refactored parseSizeString now validates the format and rejects extra parameters. * fix: use Register in init to apply validation uniformly Built-in chunkers were written directly to the map, bypassing the name and duplicate checks in Register. Call Register instead so the same rules apply to built-ins and custom chunkers alike. * refactor: use strings.Cut for chunker name extraction Replaces strings.Index + manual slicing with strings.Cut. * refactor: rename ctor parameter to fn in Register ctor was a leftover from the original CtorFunc type name. fn is the idiomatic Go convention for function-typed parameters. * docs: clarify SplitterGen vs SplitterFunc godocs --------- Co-authored-by: Marcin Rataj <lidel@lidel.org>
1 parent 95bf79f commit e8ef4a2

File tree

6 files changed

+272
-31
lines changed

6 files changed

+272
-31
lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,8 +16,12 @@ The following emojis are used to highlight certain changes:
1616

1717
### Added
1818

19+
- `chunker`: added `Register` function to allow custom chunkers to be registered for use with `FromString`.
20+
1921
### Changed
2022

23+
- `chunker`: `FromString` now rejects malformed `size-` strings with extra parameters (e.g. `size-123-extra` was previously silently accepted).
24+
2125
### Removed
2226

2327
- `cmd/boxo-migrate`: removed code for go-ipfs migration -- no longer needed.

chunker/parse.go

Lines changed: 41 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -39,35 +39,51 @@ var (
3939
ErrSizeMax = fmt.Errorf("chunker parameters may not exceed the maximum chunk size of %d", ChunkSizeLimit)
4040
)
4141

42-
// FromString returns a Splitter depending on the given string:
43-
// it supports "default" (""), "size-{size}", "rabin", "rabin-{blocksize}",
44-
// "rabin-{min}-{avg}-{max}" and "buzhash".
42+
// FromString returns a [Splitter] for the given chunker specification string.
43+
//
44+
// Built-in chunkers:
45+
//
46+
// - "" or "default" -- fixed-size chunks using [DefaultBlockSize]
47+
// - "size-{size}" -- fixed-size chunks of the given byte size
48+
// - "rabin" -- Rabin fingerprint chunking with [DefaultBlockSize] average
49+
// - "rabin-{avg}" -- Rabin fingerprint chunking with the given average size
50+
// - "rabin-{min}-{avg}-{max}" -- Rabin with explicit bounds
51+
// - "buzhash" -- Buzhash content-defined chunking
52+
//
53+
// Custom chunkers registered via [Register] are also available.
54+
// The name is extracted as everything before the first dash.
4555
func FromString(r io.Reader, chunker string) (Splitter, error) {
46-
switch {
47-
case chunker == "" || chunker == "default":
56+
if chunker == "" || chunker == "default" {
4857
return DefaultSplitter(r), nil
49-
50-
case strings.HasPrefix(chunker, "size-"):
51-
sizeStr := strings.Split(chunker, "-")[1]
52-
size, err := strconv.Atoi(sizeStr)
53-
if err != nil {
54-
return nil, err
55-
} else if size <= 0 {
56-
return nil, ErrSize
57-
} else if size > ChunkSizeLimit {
58-
return nil, ErrSizeMax
59-
}
60-
return NewSizeSplitter(r, int64(size)), nil
61-
62-
case strings.HasPrefix(chunker, "rabin"):
63-
return parseRabinString(r, chunker)
64-
65-
case chunker == "buzhash":
66-
return NewBuzhash(r), nil
67-
68-
default:
58+
}
59+
name, _, _ := strings.Cut(chunker, "-")
60+
splittersMu.RLock()
61+
ctor, ok := splitters[name]
62+
splittersMu.RUnlock()
63+
if !ok {
6964
return nil, fmt.Errorf("unrecognized chunker option: %s", chunker)
7065
}
66+
return ctor(r, chunker)
67+
}
68+
69+
func parseSizeString(r io.Reader, chunker string) (Splitter, error) {
70+
parts := strings.Split(chunker, "-")
71+
if len(parts) != 2 {
72+
return nil, errors.New("incorrect chunker string format (expected size-{size})")
73+
}
74+
size, err := strconv.Atoi(parts[1])
75+
if err != nil {
76+
return nil, err
77+
} else if size <= 0 {
78+
return nil, ErrSize
79+
} else if size > ChunkSizeLimit {
80+
return nil, ErrSizeMax
81+
}
82+
return NewSizeSplitter(r, int64(size)), nil
83+
}
84+
85+
func parseBuzhashString(r io.Reader, _ string) (Splitter, error) {
86+
return NewBuzhash(r), nil
7187
}
7288

7389
func parseRabinString(r io.Reader, chunker string) (Splitter, error) {

chunker/parse_test.go

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -102,4 +102,43 @@ func TestParseSize(t *testing.T) {
102102
if err != ErrSizeMax {
103103
t.Fatalf("Expected 'ErrSizeMax', got: %#v", err)
104104
}
105+
106+
_, err = FromString(r, "size-123-extra")
107+
if err == nil {
108+
t.Fatal("expected error for size string with extra parameters")
109+
}
110+
}
111+
112+
func TestParseDefault(t *testing.T) {
113+
t.Parallel()
114+
115+
r := bytes.NewReader(randBuf(t, 1000))
116+
117+
s, err := FromString(r, "")
118+
if err != nil {
119+
t.Fatalf("expected success for empty string, got: %v", err)
120+
}
121+
if s == nil {
122+
t.Fatal("expected non-nil splitter for empty string")
123+
}
124+
125+
r.Reset(randBuf(t, 1000))
126+
s, err = FromString(r, "default")
127+
if err != nil {
128+
t.Fatalf("expected success for \"default\", got: %v", err)
129+
}
130+
if s == nil {
131+
t.Fatal("expected non-nil splitter for \"default\"")
132+
}
133+
}
134+
135+
func TestParseUnknown(t *testing.T) {
136+
t.Parallel()
137+
138+
r := bytes.NewReader(randBuf(t, 1000))
139+
140+
_, err := FromString(r, "unknown-chunker")
141+
if err == nil {
142+
t.Fatal("expected error for unregistered chunker")
143+
}
105144
}

chunker/registry.go

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
package chunk
2+
3+
import (
4+
"io"
5+
"strings"
6+
"sync"
7+
)
8+
9+
// SplitterFunc creates a [Splitter] from a reader and a specification
10+
// string such as "mychunker-param1-param2". It is used to register
11+
// custom chunkers via [Register] so they become available globally
12+
// through [FromString]. The function is responsible for parsing and
13+
// validating any parameters encoded in the string.
14+
type SplitterFunc func(r io.Reader, chunker string) (Splitter, error)
15+
16+
var (
17+
splittersMu sync.RWMutex
18+
splitters = map[string]SplitterFunc{}
19+
)
20+
21+
func init() {
22+
Register("size", parseSizeString)
23+
Register("rabin", parseRabinString)
24+
Register("buzhash", parseBuzhashString)
25+
}
26+
27+
// Register makes a custom chunker available to [FromString] under the given
28+
// name. The name is matched against the portion of the chunker string before
29+
// the first dash. For example, passing "mychunker-128" to [FromString]
30+
// selects the chunker registered as "mychunker", and the [SplitterFunc]
31+
// receives the full string "mychunker-128" so it can parse its own parameters.
32+
//
33+
// Register is typically called from an init function:
34+
//
35+
// func init() {
36+
// chunk.Register("mychunker", func(r io.Reader, s string) (chunk.Splitter, error) {
37+
// // parse parameters from s, return a Splitter
38+
// })
39+
// }
40+
//
41+
// Register panics if name is empty, contains a dash, fn is nil, or a
42+
// chunker with the same name is already registered. This follows the
43+
// convention established by [database/sql.Register].
44+
//
45+
// Register is safe for concurrent use.
46+
func Register(name string, fn SplitterFunc) {
47+
splittersMu.Lock()
48+
defer splittersMu.Unlock()
49+
if name == "" {
50+
panic("chunk: Register name is empty")
51+
}
52+
if strings.Contains(name, "-") {
53+
panic("chunk: Register name must not contain a dash: " + name)
54+
}
55+
if fn == nil {
56+
panic("chunk: Register fn is nil")
57+
}
58+
if _, dup := splitters[name]; dup {
59+
panic("chunk: Register called twice for chunker " + name)
60+
}
61+
splitters[name] = fn
62+
}

chunker/registry_test.go

Lines changed: 113 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,113 @@
1+
package chunk_test
2+
3+
import (
4+
"bytes"
5+
"io"
6+
"testing"
7+
8+
chunk "github.com/ipfs/boxo/chunker"
9+
)
10+
11+
type noSplits struct {
12+
r io.Reader
13+
drained bool
14+
}
15+
16+
func (ns *noSplits) Reader() io.Reader {
17+
return ns.r
18+
}
19+
20+
func (ns *noSplits) NextBytes() ([]byte, error) {
21+
if ns.drained {
22+
return nil, io.EOF
23+
}
24+
ns.drained = true
25+
return io.ReadAll(ns.r)
26+
}
27+
28+
// TestRegister is not parallel because Register mutates package-level state
29+
// and panics on duplicate names.
30+
func TestRegister(t *testing.T) {
31+
t.Run("name only", func(t *testing.T) {
32+
chunk.Register("mockplain", func(r io.Reader, _ string) (chunk.Splitter, error) {
33+
return &noSplits{r: r}, nil
34+
})
35+
r := bytes.NewReader([]byte{1, 2, 3})
36+
s, err := chunk.FromString(r, "mockplain")
37+
if err != nil {
38+
t.Fatal(err)
39+
}
40+
if _, ok := s.(*noSplits); !ok {
41+
t.Fatal("unexpected splitter type")
42+
}
43+
})
44+
45+
t.Run("name and params", func(t *testing.T) {
46+
const chunkerStr = "mockparams-123"
47+
chunk.Register("mockparams", func(r io.Reader, c string) (chunk.Splitter, error) {
48+
if c != chunkerStr {
49+
t.Fatalf("expected chunker string %q, got %q", chunkerStr, c)
50+
}
51+
return &noSplits{r: r}, nil
52+
})
53+
r := bytes.NewReader([]byte{1, 2, 3})
54+
s, err := chunk.FromString(r, chunkerStr)
55+
if err != nil {
56+
t.Fatal(err)
57+
}
58+
if _, ok := s.(*noSplits); !ok {
59+
t.Fatal("unexpected splitter type")
60+
}
61+
})
62+
63+
t.Run("unregistered name", func(t *testing.T) {
64+
r := bytes.NewReader([]byte{1, 2, 3})
65+
_, err := chunk.FromString(r, "nonexistent")
66+
if err == nil {
67+
t.Fatal("expected error for unregistered chunker")
68+
}
69+
})
70+
71+
t.Run("panic on empty name", func(t *testing.T) {
72+
defer func() {
73+
if r := recover(); r == nil {
74+
t.Fatal("expected panic for empty name")
75+
}
76+
}()
77+
chunk.Register("", func(r io.Reader, _ string) (chunk.Splitter, error) {
78+
return nil, nil
79+
})
80+
})
81+
82+
t.Run("panic on name with dash", func(t *testing.T) {
83+
defer func() {
84+
if r := recover(); r == nil {
85+
t.Fatal("expected panic for name with dash")
86+
}
87+
}()
88+
chunk.Register("my-format", func(r io.Reader, _ string) (chunk.Splitter, error) {
89+
return nil, nil
90+
})
91+
})
92+
93+
t.Run("panic on nil ctor", func(t *testing.T) {
94+
defer func() {
95+
if r := recover(); r == nil {
96+
t.Fatal("expected panic for nil ctor")
97+
}
98+
}()
99+
chunk.Register("nilctor", nil)
100+
})
101+
102+
t.Run("panic on duplicate name", func(t *testing.T) {
103+
defer func() {
104+
if r := recover(); r == nil {
105+
t.Fatal("expected panic for duplicate name")
106+
}
107+
}()
108+
// "mockplain" was already registered in "name only" subtest.
109+
chunk.Register("mockplain", func(r io.Reader, _ string) (chunk.Splitter, error) {
110+
return nil, nil
111+
})
112+
})
113+
}

chunker/splitting.go

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,11 @@
11
// Package chunk implements streaming block splitters.
2-
// Splitters read data from a reader and provide byte slices (chunks)
3-
// The size and contents of these slices depend on the splitting method
4-
// used.
2+
//
3+
// Splitters read data from a reader and produce byte slices (chunks).
4+
// The size and contents of these slices depend on the splitting method.
5+
//
6+
// Built-in methods include fixed-size, Rabin fingerprint, and Buzhash
7+
// content-defined chunking. Additional methods can be registered with
8+
// [Register] and instantiated through [FromString].
59
package chunk
610

711
import (
@@ -26,16 +30,19 @@ type Splitter interface {
2630
NextBytes() ([]byte, error)
2731
}
2832

29-
// SplitterGen is a splitter generator, given a reader.
33+
// SplitterGen creates a [Splitter] from a reader.
34+
// It is used at runtime by callers that already know which chunking
35+
// strategy and parameters they want (e.g. "fixed-size at 256 KiB").
36+
// See [SizeSplitterGen] for a convenient way to build one.
3037
type SplitterGen func(r io.Reader) Splitter
3138

3239
// DefaultSplitter returns a SizeSplitter with the DefaultBlockSize.
3340
func DefaultSplitter(r io.Reader) Splitter {
3441
return NewSizeSplitter(r, DefaultBlockSize)
3542
}
3643

37-
// SizeSplitterGen returns a SplitterGen function which will create
38-
// a splitter with the given size when called.
44+
// SizeSplitterGen returns a [SplitterGen] that creates a fixed-size
45+
// [Splitter] with the given block size.
3946
func SizeSplitterGen(size int64) SplitterGen {
4047
return func(r io.Reader) Splitter {
4148
return NewSizeSplitter(r, size)

0 commit comments

Comments
 (0)