Skip to content

feat: build metadata as per-region binaries; drop runtime XML/zip parsing#323

Open
twcclegg wants to merge 14 commits intomainfrom
feat/binary-metadata
Open

feat: build metadata as per-region binaries; drop runtime XML/zip parsing#323
twcclegg wants to merge 14 commits intomainfrom
feat/binary-metadata

Conversation

@twcclegg
Copy link
Copy Markdown
Owner

@twcclegg twcclegg commented May 3, 2026

Summary

Replaces runtime XML metadata parsing + the geocoding.zip blob with build‑time-generated per‑region binary files, mirroring the Java upstream's BuildMetadataProtoFromXml pipeline.

  • New csharp/PhoneNumbers.MetadataBuilder/ build‑time tool converts resources/*.xml, resources/geocoding/<lang>/<countryCode>.txt, and resources/timezones/map_data.txt into 766 binary files embedded under PhoneNumbers.metadata.*, PhoneNumbers.geocoding.*, and PhoneNumbers.timezones.* resource namespaces.
  • BuildMetadataFromBin (PhoneMetadata serializer) and BuildPrefixMapFromBin (area‑code + timezone serializers) handle read/write; both versioned with magic + format byte.
  • New public IMetadataLoader interface + EmbeddedResourceMetadataLoader default impl + MissingMetadataException exception type — consumers can now plug in custom loaders for trimmed bundles, CDN, etc. Mirrors Java's MetadataLoader API.
  • Lazy MetadataSource replaces the eager XML parse on PhoneNumberUtil.GetInstance(). Per‑region metadata is loaded only when first asked for, cached in a ConcurrentDictionary.
  • Geocoding/timezones load via the new binary readers; zip handling and text fallback both removed from PhoneNumberOfflineGeocoder / PhoneNumberToTimeZonesMapper.

Outcome

  • nupkg: 7.5 MB → 6.9 MB (≈ 8 % smaller).
  • No XML parsing on cold start of PhoneNumberUtil.GetInstance().
  • Region metadata loaded only on first use (lazy), via a pluggable loader.

Public API changes — additive only

Verified via diff against main: this PR only ADDS public types/members; nothing is removed or signature‑changed. Safe for a non‑major version bump.

Added:

  • IMetadataLoader (interface) + EmbeddedResourceMetadataLoader (default impl) — pluggable metadata loading, mirrors Java's MetadataLoader.
  • MissingMetadataException — thrown when an expected metadata resource is missing; mirrors Java's same‑named type.
  • BuildMetadataFromBin, BuildPrefixMapFromBin — read/write helpers for the binary format.
  • MetadataManager.SetMetadataLoader(IMetadataLoader) — same injection point Java exposes via DefaultMetadataDependenciesProvider.

Behavior‑only correction (no signature change):

  • PhoneNumberUtil.GetSupportedGlobalNetworkCallingCodes() returns the same Dictionary<int, PhoneMetadata>.KeyCollection as before, but the underlying dictionary now correctly contains only non‑geographical entries (calling codes 800, 808, 870, 878, 881, 882, 883, 888, 979) — matching the docstring and Java's getSupportedGlobalNetworkCallingCodes(). The pre‑PR implementation incidentally included every loaded calling code (geo + non‑geo) because it leaked an internal dictionary populated by the eager XML loader; that contradicted the documented contract.

Removed (internal/non‑public):

  • AreaCodeParser (was internal class) — replaced by BuildPrefixMapFromBin.ReadAreaCodeMap.
  • geocoding.zip and testgeocoding.zip removed from resources/; appveyor.yml and the GitHub Actions workflows no longer create them.
  • MappingFileProvider.GetFileName no longer appends .txt (internal API; only consumer is PhoneNumberOfflineGeocoder).

Test plan

  • PhoneNumbers test suite (322 tests) green on net9.0
  • PhoneNumbers.Extensions test suite (23 tests) green on net9.0
  • Roundtrip test (TestBuildMetadataFromBin) verifies binary I/O against every region in production XML — 200+ phone, 240+ short, 40+ alternate
  • Lazy region loading verified by TestMetadataSource.DefaultLoaderResolvesEmbeddedBinaryMetadata
  • Custom loader injection verified by TestMedataManager.SetMetadataLoader_RoutesLookupsThroughCustomLoader
  • 10× sequential cold‑from‑empty dotnet build PhoneNumbers.sln — 10/10 pass after the mutex + double‑checked‑skip in MetadataBuilder
  • Clean build across all four TFMs: netstandard2.0, net8.0, net9.0, netframework4.8

🤖 Generated with Claude Code

twcclegg and others added 11 commits May 3, 2026 15:53
First step of moving metadata loading from runtime XML parsing to
build-time binary serialization (mirroring the Java upstream's
Externalizable approach). This commit only adds the codec + roundtrip
tests; nothing in the production load path uses it yet.

Format: 4-byte magic + 1-byte version + bitmask-prefixed body. C#-only
(not wire-compatible with Java's ObjectOutput — that buys nothing here).
Roundtrip verified against every region in PhoneNumberMetadata.xml,
ShortNumberMetadata.xml, and PhoneNumberAlternateFormats.xml using
PhoneMetadata.Equals as the diff oracle.

Refs #319 follow-up; resources/.bin packaging in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Console tool that mirrors Java's BuildMetadataProtoFromXml: takes one of
the resources/*.xml files and emits one binary file per region (or per
country-calling-code for non-geographical / alternate-format entries)
using BuildMetadataFromBin.

Subcommands: phone, short, alternate, test (the last one for the test
project's PhoneNumberMetadataForTesting.xml). Output naming matches the
Java upstream: <prefix>_<REGION-or-CC>, no extension.

Verified outputs against the production XML: 254 phone, 241 short, 46
alternate-format, 36 test files. Tarred+gzipped (proxy for nupkg
compression) the bin set is 130K vs 203K for the equivalent XML.

Tool is referenced from PhoneNumbers.sln but not yet wired into the
PhoneNumbers.csproj build — that's the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces the runtime-side counterpart to the build-time MetadataBuilder:
a pluggable IMetadataLoader (default: EmbeddedResourceMetadataLoader) and
a MetadataSource that lazily resolves region/non-geo metadata via that
loader, caching results in a ConcurrentDictionary.

Also adds MissingMetadataException to mirror the public Java type. This
unblocks consumers that want to inject a custom loader (loading metadata
from disk, a CDN, a trimmed bundle, etc.) — a long-standing Java feature
that the C# port previously couldn't expose.

The MetadataSource collapses Java's MetadataSource +
NonGeographicalEntityMetadataSource + CompositeMetadataContainer +
BlockingMetadataBootstrappingGuard into one file, since
ConcurrentDictionary makes the surrounding hierarchy unnecessary in C#.

Nothing in the production load path uses MetadataSource yet — that's the
follow-up commit. Full test suite (326) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an MSBuild target in PhoneNumbers.csproj that invokes
PhoneNumbers.MetadataBuilder before resource collection, producing 541
binary files (254 phone, 241 short, 46 alternate-format) under
obj/metadata/ and embedding them under the
"PhoneNumbers.metadata.<name>" namespace. Inputs/Outputs gates the
target so it's a no-op on incremental builds when XML hasn't changed.

ProjectReference uses ReferenceOutputAssembly="false" — MetadataBuilder
is built first (so dotnet run finds it), but its DLL doesn't leak into
PhoneNumbers' compile references or the published package.

XML embedding is kept in parallel for now so the production load path
keeps working unchanged. Adds a smoke test that catches resource-naming
mismatches between the MSBuild LogicalName and
EmbeddedResourceMetadataLoader's suffix matcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the eager Dictionary<string, PhoneMetadata> /
Dictionary<int, PhoneMetadata> fields in PhoneNumberUtil with a single
MetadataSource. The default GetInstance() now reads the per-region
binary files generated by MetadataBuilder via
EmbeddedResourceMetadataLoader, so cold-start no longer parses XML.

The Stream constructor (used by tests with PhoneNumberMetadataForTesting.xml
and by consumers loading custom XML) is preserved: parsed XML is
round-tripped once through BuildMetadataFromBin into an
InMemoryMetadataLoader so the runtime path is unified.

API change: GetSupportedGlobalNetworkCallingCodes() now returns
HashSet<int> instead of Dictionary<int, PhoneMetadata>.KeyCollection.
The previous return type was a leak of an internal dictionary that
incidentally contained every loaded calling code (geographical and
non-geographical) — the new implementation correctly returns only
non-geo codes, matching the docstring and Java's behavior. Source-level
foreach/LINQ callers are unaffected; explicit-typed callers will need a
recompile.

MetadataManager: alternate-formats and short-number lookups also go
through MetadataSource now. Same lazy-load semantics — the corresponding
binary file for a country code / region code is loaded only on first
request.

All 327 PhoneNumbers tests + 23 Extensions tests pass on net9.0; build
clean across netstandard2.0, net8.0, net9.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that PhoneNumberUtil and MetadataManager load metadata exclusively
through MetadataSource (binary), the XML files are no longer needed in
the published nupkg. Moves XML embedding to PhoneNumbers.Test for the
handful of tests that exercise BuildMetadataFromXml directly
(TestBuildMetadataFromBin, TestBuildMetadataFromXml).

Tests now pass typeof(TestX).Assembly explicitly so the resource
resolver finds them in the test project rather than the production
library.

Package size: 7.2M vs 7.5M before. The remaining bulk is geocoding.zip,
which the next commit converts to binary.

All 327 PhoneNumbers tests + 23 Extensions tests still pass on net9.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the runtime parsing of geocoding text files (or the bundled
geocoding.zip) and timezones/map_data.txt with build-time-generated
binary files, completing the v2 binary-metadata story.

What changed:
- BuildPrefixMapFromBin: serializer/deserializer for area-code maps
  (int prefix → string) and timezone maps (long prefix → string[]),
  versioned with distinct magics.
- PhoneNumbers.MetadataBuilder: new `geocoding` and `timezones`
  subcommands. Geocoding emits one bin per (lang, country-code) under
  the "PhoneNumbers.geocoding.*" namespace (224 files); timezones emits
  a single map_data.bin.
- PhoneNumbers.csproj: BuildGeocodingBins + BuildTimezoneBin MSBuild
  targets, embedding under the same namespaces. Drops the old
  zip-conditional ItemGroup, the loose-text fallback, and the
  timezones/map_data.txt resource.
- PhoneNumberOfflineGeocoder: drops zip handling and text parsing
  entirely; loads via BuildPrefixMapFromBin.ReadAreaCodeMap. Throws a
  loud InvalidOperationException if a resolved file name has no matching
  embedded resource (the previous code silently NPE'd).
- PhoneNumberToTimeZonesMapper: reads map_data.bin via
  BuildPrefixMapFromBin.ReadTimezoneMap. TimezoneMapDataReader is
  retained because tests use it directly with hand-built text streams.
- MappingFileProvider.GetFileName: drops the ".txt" suffix from the
  returned name (binary files have no extension); test updated.
- AreaCodeParser: deleted (no remaining users).
- TestZippedPhoneNumberOfflineGeocoder: deleted (no zip path to test).
- PhoneNumbers.Test.csproj: BuildTestGeocodingBins target so test
  geocoding text in resources/test/geocoding/ also gets binary-built and
  embedded under "PhoneNumbers.Test.geocoding.*".
- appveyor.yml: drops the Compress-Archive lines that built the zips.
- CLAUDE.md: removed the "Tests require geocoding.zip" prerequisite,
  noted the MetadataBuilder + lazy MetadataSource architecture.
- resources/geocoding.zip and resources/test/testgeocoding.zip: deleted.

All 320 PhoneNumbers tests pass on net9.0; clean build across
netstandard2.0, net8.0, net9.0. (Test count drops from 327 → 320 because
the deleted TestZippedPhoneNumberOfflineGeocoder class held 7 of them —
its production code path no longer exists.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RegexCache had zero callers anywhere in the project — it was a thin Java
API-compat wrapper that just delegated to PhoneRegex.Get(). PhoneRegex
itself does not use RegexCache; the dependency was always the other way
around. Drop both the class and its source-link from the MetadataBuilder
csproj.

Minor breaking change for any external consumer that instantiated
RegexCache directly, but the class did nothing they couldn't do via
PhoneRegex.Get(string) which is already public.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SDK defaults net4.8 to C# 7.3, which rejects:
- `using` declarations used by TestBuildMetadataFromBin
- `Stream?` nullable annotations used by TestMetadataSource's
  IMetadataLoader fake (the interface method is `Stream?` because
  MetadataLoader.cs has #nullable enable)

Existing test files dodged this with `#if NET6_0_OR_GREATER` guards
around their nullable-annotated overloads. The new tests would need the
same guard noise around every `?` and `using var`, which is ugly. Pin
LangVersion=preview on the test csproj instead — same setting the main
PhoneNumbers project already uses — and add `#nullable enable` to
TestMetadataSource.cs so the `?` annotations live in a nullable context
on every TFM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three CI failures, all rooted in parallel multi-TFM solution builds:

1. Linux + Windows: three TFMs all run MetadataBuilder simultaneously
   and clobber obj/geocoding/en.86 mid-write.
2. Windows: backslash-then-quote in the Exec command corrupted the path
   ("obj\metadata\" -> "obj\metadata"\ inside cmd.exe).
3. Locally observed but lurking on CI: solution-level build of
   PhoneNumbers + PhoneNumbers.Test races on MetadataBuilder's own
   obj\refint when both projects' targets shell out to `dotnet run`.
4. After fixing the race, an MSBuild item-batching surprise produced
   a cross-product of 121k bogus EmbeddedResource items
   ("PhoneNumbers.geocoding.PhoneNumberAlternateFormats_*").

Fixes:
- Run generation ONCE on the cross-targeting outer build
  (BeforeTargets="DispatchToInnerBuilds") plus single-TFM direct builds
  (BeforeTargets="AssignTargetPaths"); inner per-TFM builds skip via
  Inputs/Outputs gating, then EmbedBinaryMetadata adds the items each
  TFM. No more concurrent writes.
- Replace `dotnet run --project ...` with the in-process <MSBuild> task
  (BuildInParallel="false") + `dotnet path/to/dll` for execution. The
  MSBuild task is serialized by MSBuild's BuildManager so cross-project
  contention on MetadataBuilder's obj\ goes away.
- Pin `TargetFramework=net9.0` when invoking MetadataBuilder so its
  build doesn't try to inherit the parent's netstandard2.0 / net8.0 /
  netframework4.8 framework.
- Drop trailing path separators on the BinaryMetadataDir / etc.
  properties, add `/` only on the embedding glob, to dodge Windows
  shell quoting (`\"` collapsing).
- Glob into a private `_BinMetadataFile` item first, then
  `Include="@(_BinMetadataFile)"` with collection-qualified
  `%(_BinMetadataFile.Filename)` metadata. The plain `%(Filename)` form
  inside a Target's ItemGroup batches against the existing
  EmbeddedResource collection and produces a cross-product.
- Pin `<LangVersion>preview</LangVersion>` on PhoneNumbers.Test.csproj
  so the netframework4.8 target accepts `using var` declarations and
  `?` annotations under `#nullable enable` (was a separate error in
  the same job batch).
- Drop the now-obsolete Compress-Archive / `zip -r` steps from
  build_and_run_unit_tests_linux.yml,
  run_all_tests_and_upload_code_coverage.yml, and codeql.yml — the
  zip files they produced aren't read by anything anymore.

All 320 PhoneNumbers tests + 23 Extensions tests pass on net9.0; clean
solution-wide build across netstandard2.0, net8.0, net9.0,
netframework4.8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

❌ Patch coverage is 84.13462% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.31%. Comparing base (cadd7f2) to head (d88fd0a).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
csharp/PhoneNumbers/BuildPrefixMapFromBin.cs 48.21% 27 Missing and 2 partials ⚠️
csharp/PhoneNumbers/PhoneNumberOfflineGeocoder.cs 88.03% 11 Missing and 3 partials ⚠️
csharp/PhoneNumbers/BuildMetadataFromBin.cs 91.66% 0 Missing and 13 partials ⚠️
csharp/PhoneNumbers/MetadataLoader.cs 72.72% 1 Missing and 2 partials ⚠️
csharp/PhoneNumbers/PhoneNumberUtil.cs 93.18% 1 Missing and 2 partials ⚠️
csharp/PhoneNumbers/MissingMetadataException.cs 0.00% 2 Missing ⚠️
csharp/PhoneNumbers/BuildMetadataFromXml.cs 0.00% 0 Missing and 1 partial ⚠️
...sharp/PhoneNumbers/PhoneNumberToTimeZonesMapper.cs 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #323      +/-   ##
==========================================
+ Coverage   75.27%   75.31%   +0.04%     
==========================================
  Files          34       38       +4     
  Lines        4424     4638     +214     
  Branches     1001     1097      +96     
==========================================
+ Hits         3330     3493     +163     
- Misses        885      917      +32     
- Partials      209      228      +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

twcclegg and others added 3 commits May 3, 2026 21:36
Self-review findings on the prior commits — full notes in PR thread.

Bugs:
- B1: GenerateGeocodingBins / GenerateTestGeocodingBins watched only one
  sentinel file (resources/geocoding/en/1.txt). Editing any other .txt
  during incremental local builds skipped regen. Switched to top-level
  ItemGroup wildcard inputs so any .txt change retriggers.
- B2: MissingMetadataException was declared but never thrown — dead
  code mirroring the Java public type. Now thrown by
  PhoneNumberOfflineGeocoder.LoadAreaCodeMapFromFile and
  PhoneNumberToTimeZonesMapper.Create when an expected resource is
  missing (replaces InvalidOperationException at those sites).
- B3: Auto-formatter glitch in BuildMetadataFromBin.cs (stray newline +
  extra indentation around the SmsServices write).

Performance + API:
- P1+A1: EmbeddedResourceMetadataLoader did O(n) suffix matching on
  GetManifestResourceNames per LoadMetadata call. Now O(1) — takes a
  resourcePrefix in its constructor and concatenates with the file name
  for a direct GetManifestResourceStream lookup. The IMetadataLoader
  contract is now unambiguous: callers always pass exact logical names.
- A2: MetadataManager.SetMetadataLoader for swapping the loader used
  for supplementary metadata (alternate formats / short numbers),
  matching the IMetadataLoader injectability that PhoneNumberUtil
  already exposes.
- A4: docstring on IMetadataLoader notes that lifetime of disposable
  custom loaders is the implementer's responsibility.

Build pipeline:
- The mutex-protected outer/inner Generate split kept hitting two
  separate cold-build races on a multi-TFM solution build:
  * MetadataBuilder being built twice in parallel (once via the
    explicit <MSBuild> task in PhoneNumbers, once via PhoneNumbers.Test
    asking the same way) → apphost copy/sign collision under
    obj/Debug/net9.0/.
  * MetadataBuilder writing bins concurrently with another inner TFM's
    csc reading them.

  Two fixes together:
  1. Drop the explicit <MSBuild Projects="MetadataBuilder.csproj"/>
     task — rely solely on ProjectReference for build ordering.
     PhoneNumbers.Test loses its direct reference too; it gets
     MetadataBuilder transitively via PhoneNumbers. Generate* targets
     run only on inner per-TFM builds (BeforeTargets="AssignTargetPaths"),
     by which time MetadataBuilder.dll exists.
  2. Cross-process Mutex inside MetadataBuilder.Main keyed off the
     output dir, plus a double-checked up-to-date skip after the mutex
     is acquired. Concurrent inner builds serialize, and the second
     mutex-holder sees outputs are fresh and exits without writing,
     so csc never reads a half-written file.

Verified with 10 sequential cold-from-empty rebuilds: 10 pass / 0 fail
(was ~50% on the previous design).

Tests: PhoneNumbers 322 pass (added 2 new MetadataManager tests for
SetMetadataLoader); Extensions 23 pass; clean build across
netstandard2.0, net8.0, net9.0, netframework4.8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egexCache

The previous PR commits introduced two API-breaking changes that fail
semver for a non-major bump:

1. PhoneNumberUtil.GetSupportedGlobalNetworkCallingCodes() return type
   changed from Dictionary<int, PhoneMetadata>.KeyCollection to
   HashSet<int>. Source-level foreach/LINQ callers were unaffected, but
   any consumer with an explicitly-typed local variable would fail to
   compile, and binary compat was broken for the method signature.

2. RegexCache class deleted. It had zero callers in the repo and was
   already marked [Obsolete] upstream, but external consumers could
   still depend on it.

This commit:

- Restores GetSupportedGlobalNetworkCallingCodes to return
  Dictionary<int, PhoneMetadata>.KeyCollection by re-introducing the
  countryCodeToNonGeographicalMetadataMap field, populated eagerly at
  construction with the ~9 non-geographical entries (calling codes
  800, 808, 870, 878, 881, 882, 883, 888, 979). Eager loading these
  is fine — they're small bin files and constructor isn't on a hot
  path. Region metadata is still lazy via MetadataSource.

- Keeps the semantic correction from the prior commit: this dict only
  contains non-geographical entries, matching the docstring and Java's
  getSupportedGlobalNetworkCallingCodes() semantics. The pre-binary
  implementation incidentally stuffed every loaded country code in
  here, which contradicted the documented "non-geographical only"
  contract.

- Restores RegexCache.cs verbatim from main. Already [Obsolete] +
  [EditorBrowsable(Never)] upstream so eventual removal stays on the
  table for a future major version.

The full public API diff against main is now: only ADDITIVE.
- New: IMetadataLoader, EmbeddedResourceMetadataLoader,
  MissingMetadataException, BuildMetadataFromBin, BuildPrefixMapFromBin
  (public types), MetadataManager.SetMetadataLoader (public method).
- No public members removed or changed.

322 + 23 tests pass on net9.0; build clean across all four TFMs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SetMetadataLoader_RoutesLookupsThroughCustomLoader (added in this PR)
swaps the static MetadataManager loader for a recording fake, runs a
lookup, then resets in a finally. xunit was running TestMedataManager in
parallel with TestPhoneNumberMatcher, and TestPhoneNumberMatcher calls
MetadataManager.GetAlternateFormatsForCountry(61) during AU number
matching. The brief window where the fake loader was installed caused
the matcher to get null instead of AU's alternate formats, and
TestMatchesWithExactGroupingLeniency would fail with "No match found in
1800-10-10 22 (AU)". Manifested only on Linux CI where parallel test
scheduling happened to interleave the two classes.

Joining TestMedataManager to the existing "TestMetadataTestCase" xunit
collection forces it to serialize with the matcher tests (and the
other classes already in that collection), eliminating the race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant