Conversation

@yaroslav-vasylyshyn (Contributor) commented May 4, 2025

dev

JIRA

Summary of issue

The task aims to refactor the logic for fetching and updating toponym data in the WebParsingUtils.cs file. Currently, the system downloads a ZIP file from a predefined URL, extracts a CSV file (houses.csv), processes it, compares the results with a local data.csv, and then saves the toponyms into an MS SQL database. The new requirement is to eliminate local CSV file storage.

Summary of change

Deleted the local CSV files and implemented new logic that fetches existing data from the database, parses it into temporary files, and deletes those files once processing completes.

Summary by CodeRabbit

  • New Features

    • Improved street toponym processing with enhanced address normalization and geocoding.
    • Automatic deduplication of street entries, now ignoring postal code differences.
    • More accurate mapping and restoration of street name formats.
  • Bug Fixes

    • Resolved issues with inconsistent address parsing and duplicate street entries.
  • Chores

    • Streamlined data import and cleanup processes for improved reliability.

coderabbitai bot commented May 4, 2025

Walkthrough

The WebParsingUtils class in the Streetcode.WebApi.Utils namespace has been fully implemented and enhanced. The updates introduce static mappings for street type normalization, improved parsing and deduplication logic, and deeper integration with the database. The processing workflow now reconstructs CSV data from the database, deduplicates entries while ignoring postal code differences, fetches coordinates asynchronously, and manages file cleanup. The class also features a more systematic approach to parsing and reconstructing street names and types. These changes collectively establish a robust pipeline for handling, geocoding, and saving street toponym data.

Changes

File(s): Streetcode/.../Utils/WebParsingUtils.cs
Change summary: Fully implemented WebParsingUtils with static street type mappings and a reverse map, improved street name parsing (OptimizeStreetname), address reconstruction, deduplication excluding the postal code column, async coordinate fetching with fallback, CSV reconstruction, file cleanup, and integrated database save/load logic.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WebParsingUtils
    participant Database
    participant OpenStreetMapAPI
    participant FileSystem

    User->>WebParsingUtils: ParseZipFileFromWebAsync()
    WebParsingUtils->>FileSystem: Download and extract ZIP file
    WebParsingUtils->>Database: Fetch existing toponyms
    WebParsingUtils->>FileSystem: Write reconstructed CSV (data.csv)
    WebParsingUtils->>FileSystem: Read and deduplicate extracted CSV (ignore postal code)
    loop For each new row
        WebParsingUtils->>OpenStreetMapAPI: FetchCoordsByAddressAsync(address)
        alt If fetch fails
            WebParsingUtils->>OpenStreetMapAPI: FetchCoordsByAddressAsync(shorter address)
        end
        WebParsingUtils->>FileSystem: Append row with coordinates to data.csv
    end
    WebParsingUtils->>FileSystem: Clean up temporary files (if flagged)
    WebParsingUtils->>Database: SaveToponymsToDbAsync(data.csv)
    Database-->>WebParsingUtils: Save completion status

Poem

In the land of streets and names so neat,
A mapping was born, making parsing complete.
With zip files unzipped and data deduped,
Coordinates fetched, old rows regrouped.
Now toponyms thrive in the database hive,
As WebParsingUtils helps addresses survive!
🏙️✨

coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (3)

250-275: Coordinate look-ups run sequentially – consider limited parallelism

FetchCoordsByAddressAsync is awaited inside a foreach, so N calls are serialized.
With thousands of rows this can take hours and risks hitting Nominatim’s rate limits.

A balanced approach:

  • Batch with Parallel.ForEachAsync or Task.WhenAll, but cap throughput to at most one request per second so that Nominatim’s usage policy is respected.
  • Cache failures to avoid retrying the same address.

This change is optional but will drastically cut processing time.
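
A hedged sketch of the throttled variant follows; FetchCoordsByAddressAsync is passed in as a delegate because its exact signature lives in WebParsingUtils, and the failure cache, worker count, and pacing delay are illustrative assumptions rather than part of this PR:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

internal static class ThrottledGeocoding
{
    // Remember addresses that already failed so they are not retried on the next pass.
    private static readonly ConcurrentDictionary<string, byte> FailedAddresses = new();

    public static async Task FetchAllAsync(
        IEnumerable<string> addresses,
        Func<string, Task<(string? Lat, string? Lon)>> fetchCoords)
    {
        // Two workers, each sleeping ~2 s per item, keeps overall throughput near one request per second.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        await Parallel.ForEachAsync(addresses, options, async (address, _) =>
        {
            if (FailedAddresses.ContainsKey(address))
            {
                return; // already known to fail, skip the round-trip
            }

            var (lat, lon) = await fetchCoords(address);
            if (lat is null || lon is null)
            {
                FailedAddresses.TryAdd(address, 0);
            }

            await Task.Delay(TimeSpan.FromSeconds(2));
        });
    }
}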


377-384: GetDistinctRows – index handling is fragile

The postal-code removal relies on positional index 5 within the truncated array (cols.Take(beforeColumn)), not the original one.
If beforeColumn is ever changed, the constant 5 may point to the wrong column.

Pass the column index explicitly:

-private static List<string> GetDistinctRows(IEnumerable<string> rows, byte beforeColumn = 7) =>
+private static List<string> GetDistinctRows(
+        IEnumerable<string> rows,
+        byte beforeColumn = 7,
+        byte columnToIgnore = 5) =>
 rows.Select(x =>
 {
     var cols = x.Split(';');
-    var filtered = cols.Take(beforeColumn).Where((_, idx) => idx != 5).ToList();
-    filtered.Insert(5, string.Empty);
+    var filtered = cols.Take(beforeColumn)
+                       .Where((_, idx) => idx != columnToIgnore)
+                       .ToList();
+    filtered.Insert(columnToIgnore, string.Empty);
     return string.Join(";", filtered);
 })
 .Distinct()
 .ToList();

300-302: Full table truncation may cause data-loss race conditions

RemoveRange + SaveChangesAsync deletes all toponyms before new ones are inserted.
If the import fails halfway, the table remains empty.

Wrap the whole operation in a transaction or use SqlBulkCopy into a temp table and ALTER TABLE … SWITCH. At minimum:

await using var tx = await _streetcodeContext.Database.BeginTransactionAsync();
// delete + insert
await tx.CommitAsync();

This guarantees atomicity.
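
A minimal sketch of the transactional variant (the Toponyms DbSet and the newToponyms list are assumed names; adapt them to the actual context and entities):

await using var tx = await _streetcodeContext.Database.BeginTransactionAsync();
try
{
    // Delete the old rows and insert the freshly parsed ones inside a single transaction.
    _streetcodeContext.Toponyms.RemoveRange(_streetcodeContext.Toponyms);
    await _streetcodeContext.SaveChangesAsync();

    await _streetcodeContext.Toponyms.AddRangeAsync(newToponyms);
    await _streetcodeContext.SaveChangesAsync();

    await tx.CommitAsync();
}
catch
{
    // Any failure rolls the import back, so the table is never left empty.
    await tx.RollbackAsync();
    throw;
}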

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d60822a and c157790.

⛔ Files ignored due to path filters (1)
  • Streetcode/Streetcode.DAL/houses.csv is excluded by !**/*.csv, !**/Streetcode/Streetcode.DAL/houses.csv
📒 Files selected for processing (1)
  • Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (1 hunks)

Comment on lines 83 to 88
var zipPath = $"houses.zip";

// Change the path of extractTo to your path, smth like "/Users/admin/Documents/GitHub/StreetCode/Streetcode/Streetcode.DAL";
var extractTo = $"/root/build/StreetCode/Streetcode/Streetcode.DAL";

var cancellationToken = new CancellationTokenSource().Token;

🛠️ Refactor suggestion

Hard-coded extractTo path reduces portability

extractTo is set to /root/build/StreetCode/Streetcode/Streetcode.DAL.
On any non-CI machine (local dev, staging, Windows runner, etc.) this path will be invalid, causing the ZIP extraction to fail.

Make the target directory configurable (e.g. via IConfiguration, env-var, or constructor parameter) and fall back to a temp folder when no explicit path is provided:

-var extractTo = $"/root/build/StreetCode/Streetcode/Streetcode.DAL";
+var extractTo = _configuration["ToponymExtractionPath"]                // appsettings.json
+               ?? Path.Combine(Path.GetTempPath(), "StreetcodeToponyms");
+Directory.CreateDirectory(extractTo);

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +130 to +132
clientHandler.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
clientHandler.ServerCertificateCustomValidationCallback = (_, _, _, _) => true;


⚠️ Potential issue

Disabling TLS certificate validation is a security risk

clientHandler.ServerCertificateCustomValidationCallback = (_, _, _, _) => true; accepts any certificate, exposing you to MITM attacks.

Unless you have a very good reason (e.g., an internal test CA), remove this override or restrict it to DEBUG builds only:

-clientHandler.ServerCertificateCustomValidationCallback = (_, _, _, _) => true;
+#if DEBUG
+clientHandler.ServerCertificateCustomValidationCallback = (_, _, _, _) => true;
+#endif

Comment on lines +219 to +226
string excelPath = Directory.GetFiles(extractTo).First(fName => fName.EndsWith("houses.csv"));

var rows = new List<string>(await File.ReadAllLinesAsync(excelPath, Encoding.GetEncoding(1251)));

// Grouping all rows from initial csv in order to get rid of duplicated streets

var forParsingRows = GetDistinctRows(rows);


🛠️ Refactor suggestion

First(…EndsWith("houses.csv")) can throw when the file is missing

If the ZIP’s content changes or extraction fails, First will raise InvalidOperationException.

Safer:

-var excelPath = Directory.GetFiles(extractTo)
-               .First(fName => fName.EndsWith("houses.csv"));
+var excelPath = Directory.GetFiles(extractTo)
+               .FirstOrDefault(f => f.EndsWith("houses.csv"));
+if (excelPath is null)
+{
+    Console.WriteLine("houses.csv not found after extraction");
+    return;
+}
📝 Committable suggestion

Suggested change:

var excelPath = Directory.GetFiles(extractTo)
               .FirstOrDefault(f => f.EndsWith("houses.csv"));
if (excelPath is null)
{
    Console.WriteLine("houses.csv not found after extraction");
    return;
}

var rows = new List<string>(await File.ReadAllLinesAsync(excelPath, Encoding.GetEncoding(1251)));

// Grouping all rows from initial csv in order to get rid of duplicated streets
var forParsingRows = GetDistinctRows(rows);

coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (2)

81-83: Hard-coded extraction path still limits portability
The absolute path /root/build/StreetCode/Streetcode/Streetcode.DAL will break on most local or Windows environments and was flagged in a previous review. Please make the target directory configurable (e.g., IConfiguration, env-var, or fallback to Path.GetTempPath()).


125-128: TLS validation is disabled – serious security risk
ServerCertificateCustomValidationCallback = (_, _, _, _) => true; accepts any certificate and leaves the download vulnerable to MITM attacks. Either remove this line or wrap it in #if DEBUG so it never reaches production.

🧹 Nitpick comments (2)
Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (2)

78-80: Remove dead code to keep ParseZipFileFromWebAsync tidy

Directory.GetParent(Environment.CurrentDirectory)?.FullName is evaluated and immediately discarded.
Unless you intend to use the parent directory later, delete the statement to avoid confusing future readers.

-        _ = Directory.GetParent(Environment.CurrentDirectory)?.FullName!;

295-319: Empty latitude/longitude strings will throw FormatException

decimal.Parse runs even when the CSV cell is empty (e.g., rows originating from legacy data); the resulting FormatException is caught and silently ignored, which drops those toponyms without any trace.

Either:

  1. Skip rows without coordinates earlier, or
  2. Use decimal.TryParse and decide how to handle missing values (e.g., keep Coordinate = null).

Fail-fast or log with context so data loss is visible.
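
A hedged sketch of option 2 (latCell, lonCell, and the Latitude/Longitude properties of Coordinate are illustrative names; requires using System.Globalization, and the number style/culture must match how the CSV actually stores decimals):

// latCell and lonCell are the raw coordinate cells read from the CSV row.
Coordinate? coordinate = null;
if (decimal.TryParse(latCell, NumberStyles.Float, CultureInfo.InvariantCulture, out var lat)
    && decimal.TryParse(lonCell, NumberStyles.Float, CultureInfo.InvariantCulture, out var lon))
{
    coordinate = new Coordinate { Latitude = lat, Longitude = lon };
}
else
{
    // Keep Coordinate = null, but make the missing data visible instead of swallowing it.
    Console.WriteLine($"Missing coordinates for row: '{latCell}';'{lonCell}'");
}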

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c157790 and 92e4b98.

📒 Files selected for processing (1)
  • Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (1 hunks)
🔇 Additional comments (1)
Streetcode/Streetcode.WebApi/Utils/WebParsingUtils.cs (1)

30-58: Verify street-type mapping correctness

Prefix "жилий масив " is mapped to street type "парк", which looks semantically incorrect. Please double-check with domain experts; an erroneous mapping will skew search queries and downstream analytics.

Comment on lines +246 to +271
foreach (var row in remainsToParse)
{
    var communityCol = row[CommunityColumn];
    string cityStringSearchOptimized = communityCol.Substring(communityCol.IndexOf(" ", StringComparison.Ordinal) + 1);

    var (streetName, streetType) = OptimizeStreetname(row[AddressColumn]);
    string addressRow = $"{cityStringSearchOptimized} {streetName} {streetType}";

    var (latitude, longitude) = await FetchCoordsByAddressAsync(addressRow);

    if (latitude is null || longitude is null)
    {
        addressRow = cityStringSearchOptimized;
        (latitude, longitude) = await FetchCoordsByAddressAsync(addressRow);
    }

    var newRow = string.Empty;
    for (int i = 0; i <= AddressColumn; i++)
    {
        newRow += $"{row[i]};";
    }

    newRow += $"{latitude};{longitude}";

    await File.AppendAllTextAsync(csvPath, newRow + "\n", Encoding.GetEncoding(1251));
}

🛠️ Refactor suggestion

Coordinate fetch loop creates a new HttpClient on every iteration

FetchCoordsByAddressAsync instantiates HttpClient each call (see L345). Thousands of requests will exhaust sockets and slow the job.

Refactor to reuse a single static HttpClient, or inject one. Example:

+private static readonly HttpClient _nominatimClient = CreateNominatimClient();

+private static HttpClient CreateNominatimClient()
+{
+    var client = new HttpClient();
+    client.DefaultRequestHeaders.Add("User-Agent", "HistoryCode");
+    return client;
+}

And in FetchCoordsByAddressAsync:

-            using var client = new HttpClient();
-            client.DefaultRequestHeaders.Add("User-Agent", "HistoryCode");
+            var client = _nominatimClient;

Committable suggestion skipped: line range outside the PR's diff.


sonarqubecloud bot commented May 6, 2025

Quality Gate failed

Failed conditions
1 Security Hotspot
0.0% Coverage on New Code (required ≥ 80%)
D Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

