feat: build metadata as per-region binaries; drop runtime XML/zip parsing#323
Open
feat: build metadata as per-region binaries; drop runtime XML/zip parsing#323
Conversation
First step of moving metadata loading from runtime XML parsing to build-time binary serialization (mirroring the Java upstream's Externalizable approach). This commit only adds the codec + roundtrip tests; nothing in the production load path uses it yet. Format: 4-byte magic + 1-byte version + bitmask-prefixed body. C#-only (not wire-compatible with Java's ObjectOutput — that buys nothing here). Roundtrip verified against every region in PhoneNumberMetadata.xml, ShortNumberMetadata.xml, and PhoneNumberAlternateFormats.xml using PhoneMetadata.Equals as the diff oracle. Refs #319 follow-up; resources/.bin packaging in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Console tool that mirrors Java's BuildMetadataProtoFromXml: takes one of the resources/*.xml files and emits one binary file per region (or per country-calling-code for non-geographical / alternate-format entries) using BuildMetadataFromBin. Subcommands: phone, short, alternate, test (the last one for the test project's PhoneNumberMetadataForTesting.xml). Output naming matches the Java upstream: <prefix>_<REGION-or-CC>, no extension. Verified outputs against the production XML: 254 phone, 241 short, 46 alternate-format, 36 test files. Tarred+gzipped (proxy for nupkg compression) the bin set is 130K vs 203K for the equivalent XML. Tool is referenced from PhoneNumbers.sln but not yet wired into the PhoneNumbers.csproj build — that's the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces the runtime-side counterpart to the build-time MetadataBuilder: a pluggable IMetadataLoader (default: EmbeddedResourceMetadataLoader) and a MetadataSource that lazily resolves region/non-geo metadata via that loader, caching results in a ConcurrentDictionary. Also adds MissingMetadataException to mirror the public Java type. This unblocks consumers that want to inject a custom loader (loading metadata from disk, a CDN, a trimmed bundle, etc.) — a long-standing Java feature that the C# port previously couldn't expose. The MetadataSource collapses Java's MetadataSource + NonGeographicalEntityMetadataSource + CompositeMetadataContainer + BlockingMetadataBootstrappingGuard into one file, since ConcurrentDictionary makes the surrounding hierarchy unnecessary in C#. Nothing in the production load path uses MetadataSource yet — that's the follow-up commit. Full test suite (326) green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an MSBuild target in PhoneNumbers.csproj that invokes PhoneNumbers.MetadataBuilder before resource collection, producing 541 binary files (254 phone, 241 short, 46 alternate-format) under obj/metadata/ and embedding them under the "PhoneNumbers.metadata.<name>" namespace. Inputs/Outputs gates the target so it's a no-op on incremental builds when XML hasn't changed. ProjectReference uses ReferenceOutputAssembly="false" — MetadataBuilder is built first (so dotnet run finds it), but its DLL doesn't leak into PhoneNumbers' compile references or the published package. XML embedding is kept in parallel for now so the production load path keeps working unchanged. Adds a smoke test that catches resource-naming mismatches between the MSBuild LogicalName and EmbeddedResourceMetadataLoader's suffix matcher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the eager Dictionary<string, PhoneMetadata> / Dictionary<int, PhoneMetadata> fields in PhoneNumberUtil with a single MetadataSource. The default GetInstance() now reads the per-region binary files generated by MetadataBuilder via EmbeddedResourceMetadataLoader, so cold-start no longer parses XML. The Stream constructor (used by tests with PhoneNumberMetadataForTesting.xml and by consumers loading custom XML) is preserved: parsed XML is round-tripped once through BuildMetadataFromBin into an InMemoryMetadataLoader so the runtime path is unified. API change: GetSupportedGlobalNetworkCallingCodes() now returns HashSet<int> instead of Dictionary<int, PhoneMetadata>.KeyCollection. The previous return type was a leak of an internal dictionary that incidentally contained every loaded calling code (geographical and non-geographical) — the new implementation correctly returns only non-geo codes, matching the docstring and Java's behavior. Source-level foreach/LINQ callers are unaffected; explicit-typed callers will need a recompile. MetadataManager: alternate-formats and short-number lookups also go through MetadataSource now. Same lazy-load semantics — the corresponding binary file for a country code / region code is loaded only on first request. All 327 PhoneNumbers tests + 23 Extensions tests pass on net9.0; build clean across netstandard2.0, net8.0, net9.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that PhoneNumberUtil and MetadataManager load metadata exclusively through MetadataSource (binary), the XML files are no longer needed in the published nupkg. Moves XML embedding to PhoneNumbers.Test for the handful of tests that exercise BuildMetadataFromXml directly (TestBuildMetadataFromBin, TestBuildMetadataFromXml). Tests now pass typeof(TestX).Assembly explicitly so the resource resolver finds them in the test project rather than the production library. Package size: 7.2M vs 7.5M before. The remaining bulk is geocoding.zip, which the next commit converts to binary. All 327 PhoneNumbers tests + 23 Extensions tests still pass on net9.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the runtime parsing of geocoding text files (or the bundled geocoding.zip) and timezones/map_data.txt with build-time-generated binary files, completing the v2 binary-metadata story. What changed: - BuildPrefixMapFromBin: serializer/deserializer for area-code maps (int prefix → string) and timezone maps (long prefix → string[]), versioned with distinct magics. - PhoneNumbers.MetadataBuilder: new `geocoding` and `timezones` subcommands. Geocoding emits one bin per (lang, country-code) under the "PhoneNumbers.geocoding.*" namespace (224 files); timezones emits a single map_data.bin. - PhoneNumbers.csproj: BuildGeocodingBins + BuildTimezoneBin MSBuild targets, embedding under the same namespaces. Drops the old zip-conditional ItemGroup, the loose-text fallback, and the timezones/map_data.txt resource. - PhoneNumberOfflineGeocoder: drops zip handling and text parsing entirely; loads via BuildPrefixMapFromBin.ReadAreaCodeMap. Throws a loud InvalidOperationException if a resolved file name has no matching embedded resource (the previous code silently NPE'd). - PhoneNumberToTimeZonesMapper: reads map_data.bin via BuildPrefixMapFromBin.ReadTimezoneMap. TimezoneMapDataReader is retained because tests use it directly with hand-built text streams. - MappingFileProvider.GetFileName: drops the ".txt" suffix from the returned name (binary files have no extension); test updated. - AreaCodeParser: deleted (no remaining users). - TestZippedPhoneNumberOfflineGeocoder: deleted (no zip path to test). - PhoneNumbers.Test.csproj: BuildTestGeocodingBins target so test geocoding text in resources/test/geocoding/ also gets binary-built and embedded under "PhoneNumbers.Test.geocoding.*". - appveyor.yml: drops the Compress-Archive lines that built the zips. - CLAUDE.md: removed the "Tests require geocoding.zip" prerequisite, noted the MetadataBuilder + lazy MetadataSource architecture. - resources/geocoding.zip and resources/test/testgeocoding.zip: deleted. All 320 PhoneNumbers tests pass on net9.0; clean build across netstandard2.0, net8.0, net9.0. (Test count drops from 327 → 320 because the deleted TestZippedPhoneNumberOfflineGeocoder class held 7 of them — its production code path no longer exists.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RegexCache had zero callers anywhere in the project — it was a thin Java API-compat wrapper that just delegated to PhoneRegex.Get(). PhoneRegex itself does not use RegexCache; the dependency was always the other way around. Drop both the class and its source-link from the MetadataBuilder csproj. Minor breaking change for any external consumer that instantiated RegexCache directly, but the class did nothing they couldn't do via PhoneRegex.Get(string) which is already public. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SDK defaults net4.8 to C# 7.3, which rejects: - `using` declarations used by TestBuildMetadataFromBin - `Stream?` nullable annotations used by TestMetadataSource's IMetadataLoader fake (the interface method is `Stream?` because MetadataLoader.cs has #nullable enable) Existing test files dodged this with `#if NET6_0_OR_GREATER` guards around their nullable-annotated overloads. The new tests would need the same guard noise around every `?` and `using var`, which is ugly. Pin LangVersion=preview on the test csproj instead — same setting the main PhoneNumbers project already uses — and add `#nullable enable` to TestMetadataSource.cs so the `?` annotations live in a nullable context on every TFM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three CI failures, all rooted in parallel multi-TFM solution builds:
1. Linux + Windows: three TFMs all run MetadataBuilder simultaneously
and clobber obj/geocoding/en.86 mid-write.
2. Windows: backslash-then-quote in the Exec command corrupted the path
("obj\metadata\" -> "obj\metadata"\ inside cmd.exe).
3. Locally observed but lurking on CI: solution-level build of
PhoneNumbers + PhoneNumbers.Test races on MetadataBuilder's own
obj\refint when both projects' targets shell out to `dotnet run`.
4. After fixing the race, an MSBuild item-batching surprise produced
a cross-product of 121k bogus EmbeddedResource items
("PhoneNumbers.geocoding.PhoneNumberAlternateFormats_*").
Fixes:
- Run generation ONCE on the cross-targeting outer build
(BeforeTargets="DispatchToInnerBuilds") plus single-TFM direct builds
(BeforeTargets="AssignTargetPaths"); inner per-TFM builds skip via
Inputs/Outputs gating, then EmbedBinaryMetadata adds the items each
TFM. No more concurrent writes.
- Replace `dotnet run --project ...` with the in-process <MSBuild> task
(BuildInParallel="false") + `dotnet path/to/dll` for execution. The
MSBuild task is serialized by MSBuild's BuildManager so cross-project
contention on MetadataBuilder's obj\ goes away.
- Pin `TargetFramework=net9.0` when invoking MetadataBuilder so its
build doesn't try to inherit the parent's netstandard2.0 / net8.0 /
netframework4.8 framework.
- Drop trailing path separators on the BinaryMetadataDir / etc.
properties, add `/` only on the embedding glob, to dodge Windows
shell quoting (`\"` collapsing).
- Glob into a private `_BinMetadataFile` item first, then
`Include="@(_BinMetadataFile)"` with collection-qualified
`%(_BinMetadataFile.Filename)` metadata. The plain `%(Filename)` form
inside a Target's ItemGroup batches against the existing
EmbeddedResource collection and produces a cross-product.
- Pin `<LangVersion>preview</LangVersion>` on PhoneNumbers.Test.csproj
so the netframework4.8 target accepts `using var` declarations and
`?` annotations under `#nullable enable` (was a separate error in
the same job batch).
- Drop the now-obsolete Compress-Archive / `zip -r` steps from
build_and_run_unit_tests_linux.yml,
run_all_tests_and_upload_code_coverage.yml, and codeql.yml — the
zip files they produced aren't read by anything anymore.
All 320 PhoneNumbers tests + 23 Extensions tests pass on net9.0; clean
solution-wide build across netstandard2.0, net8.0, net9.0,
netframework4.8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #323 +/- ##
==========================================
+ Coverage 75.27% 75.31% +0.04%
==========================================
Files 34 38 +4
Lines 4424 4638 +214
Branches 1001 1097 +96
==========================================
+ Hits 3330 3493 +163
- Misses 885 917 +32
- Partials 209 228 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Self-review findings on the prior commits — full notes in PR thread.
Bugs:
- B1: GenerateGeocodingBins / GenerateTestGeocodingBins watched only one
sentinel file (resources/geocoding/en/1.txt). Editing any other .txt
during incremental local builds skipped regen. Switched to top-level
ItemGroup wildcard inputs so any .txt change retriggers.
- B2: MissingMetadataException was declared but never thrown — dead
code mirroring the Java public type. Now thrown by
PhoneNumberOfflineGeocoder.LoadAreaCodeMapFromFile and
PhoneNumberToTimeZonesMapper.Create when an expected resource is
missing (replaces InvalidOperationException at those sites).
- B3: Auto-formatter glitch in BuildMetadataFromBin.cs (stray newline +
extra indentation around the SmsServices write).
Performance + API:
- P1+A1: EmbeddedResourceMetadataLoader did O(n) suffix matching on
GetManifestResourceNames per LoadMetadata call. Now O(1) — takes a
resourcePrefix in its constructor and concatenates with the file name
for a direct GetManifestResourceStream lookup. The IMetadataLoader
contract is now unambiguous: callers always pass exact logical names.
- A2: MetadataManager.SetMetadataLoader for swapping the loader used
for supplementary metadata (alternate formats / short numbers),
matching the IMetadataLoader injectability that PhoneNumberUtil
already exposes.
- A4: docstring on IMetadataLoader notes that lifetime of disposable
custom loaders is the implementer's responsibility.
Build pipeline:
- The mutex-protected outer/inner Generate split kept hitting two
separate cold-build races on a multi-TFM solution build:
* MetadataBuilder being built twice in parallel (once via the
explicit <MSBuild> task in PhoneNumbers, once via PhoneNumbers.Test
asking the same way) → apphost copy/sign collision under
obj/Debug/net9.0/.
* MetadataBuilder writing bins concurrently with another inner TFM's
csc reading them.
Two fixes together:
1. Drop the explicit <MSBuild Projects="MetadataBuilder.csproj"/>
task — rely solely on ProjectReference for build ordering.
PhoneNumbers.Test loses its direct reference too; it gets
MetadataBuilder transitively via PhoneNumbers. Generate* targets
run only on inner per-TFM builds (BeforeTargets="AssignTargetPaths"),
by which time MetadataBuilder.dll exists.
2. Cross-process Mutex inside MetadataBuilder.Main keyed off the
output dir, plus a double-checked up-to-date skip after the mutex
is acquired. Concurrent inner builds serialize, and the second
mutex-holder sees outputs are fresh and exits without writing,
so csc never reads a half-written file.
Verified with 10 sequential cold-from-empty rebuilds: 10 pass / 0 fail
(was ~50% on the previous design).
Tests: PhoneNumbers 322 pass (added 2 new MetadataManager tests for
SetMetadataLoader); Extensions 23 pass; clean build across
netstandard2.0, net8.0, net9.0, netframework4.8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…egexCache The previous PR commits introduced two API-breaking changes that fail semver for a non-major bump: 1. PhoneNumberUtil.GetSupportedGlobalNetworkCallingCodes() return type changed from Dictionary<int, PhoneMetadata>.KeyCollection to HashSet<int>. Source-level foreach/LINQ callers were unaffected, but any consumer with an explicitly-typed local variable would fail to compile, and binary compat was broken for the method signature. 2. RegexCache class deleted. It had zero callers in the repo and was already marked [Obsolete] upstream, but external consumers could still depend on it. This commit: - Restores GetSupportedGlobalNetworkCallingCodes to return Dictionary<int, PhoneMetadata>.KeyCollection by re-introducing the countryCodeToNonGeographicalMetadataMap field, populated eagerly at construction with the ~9 non-geographical entries (calling codes 800, 808, 870, 878, 881, 882, 883, 888, 979). Eager loading these is fine — they're small bin files and constructor isn't on a hot path. Region metadata is still lazy via MetadataSource. - Keeps the semantic correction from the prior commit: this dict only contains non-geographical entries, matching the docstring and Java's getSupportedGlobalNetworkCallingCodes() semantics. The pre-binary implementation incidentally stuffed every loaded country code in here, which contradicted the documented "non-geographical only" contract. - Restores RegexCache.cs verbatim from main. Already [Obsolete] + [EditorBrowsable(Never)] upstream so eventual removal stays on the table for a future major version. The full public API diff against main is now: only ADDITIVE. - New: IMetadataLoader, EmbeddedResourceMetadataLoader, MissingMetadataException, BuildMetadataFromBin, BuildPrefixMapFromBin (public types), MetadataManager.SetMetadataLoader (public method). - No public members removed or changed. 322 + 23 tests pass on net9.0; build clean across all four TFMs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SetMetadataLoader_RoutesLookupsThroughCustomLoader (added in this PR) swaps the static MetadataManager loader for a recording fake, runs a lookup, then resets in a finally. xunit was running TestMedataManager in parallel with TestPhoneNumberMatcher, and TestPhoneNumberMatcher calls MetadataManager.GetAlternateFormatsForCountry(61) during AU number matching. The brief window where the fake loader was installed caused the matcher to get null instead of AU's alternate formats, and TestMatchesWithExactGroupingLeniency would fail with "No match found in 1800-10-10 22 (AU)". Manifested only on Linux CI where parallel test scheduling happened to interleave the two classes. Joining TestMedataManager to the existing "TestMetadataTestCase" xunit collection forces it to serialize with the matcher tests (and the other classes already in that collection), eliminating the race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces runtime XML metadata parsing + the
geocoding.zipblob with build‑time-generated per‑region binary files, mirroring the Java upstream'sBuildMetadataProtoFromXmlpipeline.csharp/PhoneNumbers.MetadataBuilder/build‑time tool convertsresources/*.xml,resources/geocoding/<lang>/<countryCode>.txt, andresources/timezones/map_data.txtinto 766 binary files embedded underPhoneNumbers.metadata.*,PhoneNumbers.geocoding.*, andPhoneNumbers.timezones.*resource namespaces.BuildMetadataFromBin(PhoneMetadata serializer) andBuildPrefixMapFromBin(area‑code + timezone serializers) handle read/write; both versioned with magic + format byte.IMetadataLoaderinterface +EmbeddedResourceMetadataLoaderdefault impl +MissingMetadataExceptionexception type — consumers can now plug in custom loaders for trimmed bundles, CDN, etc. Mirrors Java'sMetadataLoaderAPI.MetadataSourcereplaces the eager XML parse onPhoneNumberUtil.GetInstance(). Per‑region metadata is loaded only when first asked for, cached in aConcurrentDictionary.PhoneNumberOfflineGeocoder/PhoneNumberToTimeZonesMapper.Outcome
PhoneNumberUtil.GetInstance().Public API changes — additive only
Verified via diff against
main: this PR only ADDS public types/members; nothing is removed or signature‑changed. Safe for a non‑major version bump.Added:
IMetadataLoader(interface) +EmbeddedResourceMetadataLoader(default impl) — pluggable metadata loading, mirrors Java'sMetadataLoader.MissingMetadataException— thrown when an expected metadata resource is missing; mirrors Java's same‑named type.BuildMetadataFromBin,BuildPrefixMapFromBin— read/write helpers for the binary format.MetadataManager.SetMetadataLoader(IMetadataLoader)— same injection point Java exposes viaDefaultMetadataDependenciesProvider.Behavior‑only correction (no signature change):
PhoneNumberUtil.GetSupportedGlobalNetworkCallingCodes()returns the sameDictionary<int, PhoneMetadata>.KeyCollectionas before, but the underlying dictionary now correctly contains only non‑geographical entries (calling codes 800, 808, 870, 878, 881, 882, 883, 888, 979) — matching the docstring and Java'sgetSupportedGlobalNetworkCallingCodes(). The pre‑PR implementation incidentally included every loaded calling code (geo + non‑geo) because it leaked an internal dictionary populated by the eager XML loader; that contradicted the documented contract.Removed (internal/non‑public):
AreaCodeParser(wasinternal class) — replaced byBuildPrefixMapFromBin.ReadAreaCodeMap.geocoding.zipandtestgeocoding.zipremoved fromresources/;appveyor.ymland the GitHub Actions workflows no longer create them.MappingFileProvider.GetFileNameno longer appends.txt(internal API; only consumer isPhoneNumberOfflineGeocoder).Test plan
TestBuildMetadataFromBin) verifies binary I/O against every region in production XML — 200+ phone, 240+ short, 40+ alternateTestMetadataSource.DefaultLoaderResolvesEmbeddedBinaryMetadataTestMedataManager.SetMetadataLoader_RoutesLookupsThroughCustomLoaderdotnet build PhoneNumbers.sln— 10/10 pass after the mutex + double‑checked‑skip in MetadataBuilder🤖 Generated with Claude Code