URL Detector

This codebase is an open source C# port of the Java code in https://github.com/URL-Detector/URL-Detector, which in turn was a fork of the LinkedIn Engineering team's open-sourced https://github.com/linkedin/URL-Detector, which seems to be abandoned. This port was originally based on the Java code as of July 20, 2019, and was created to allow continued maintenance by the open source C# community.

All future releases will follow Semantic Versioning (SemVer).

eladaus.urldetector is available on NuGet. Install it from the NuGet Package Manager Console with:

Install-Package eladaus.urldetector

Known Issues

None

Description

The URL detector is a library originally created by the LinkedIn Security Team to detect and extract URLs in a body of text.

It is able to find and detect many kinds of URLs, including web addresses, email addresses, and IPv4 and IPv6 addresses.

Note: This C# port improves upon the original LinkedIn library by detecting emails with RFC 5322 dot-atom local-parts, including Gmail/Fastmail-style sub-addressing with a + tag (e.g. user.name+tag@gmail.com) as a single match. The full set of dot-atom specials (! # $ % & ' * + - = ? ^ _ ` { | } ~) is also accepted in local-parts, including in combination with dots.

Note: Keep in mind that for security purposes, it's better to overdetect URLs and check more against blacklists than to miss a URL that was submitted. As such, some things that we detect might not be URLs but only look somewhat like URLs. Also, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), we try to detect based on browser behavior, optimizing detection for URLs that are visitable through the address bar of Chrome, Firefox, Internet Explorer, and Safari.

It is also able to identify the parts of the identified urls. For example, for the url: http://user@linkedin.com:39000/hello?boo=ff#frag

  • Scheme - "http"
  • Username - "user"
  • Password - null
  • Host - "linkedin.com"
  • Port - 39000
  • Path - "/hello"
  • Query - "?boo=ff"
  • Fragment - "#frag"
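
As a cross-check on what each part name refers to, the sketch below splits the same URL with .NET's built-in System.Uri rather than with this library. It is only an illustration of the terminology: System.Uri follows RFC 3986 more strictly than the detector does, it reports the username and password together as UserInfo, and the UriPartsDemo class name is made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; it uses the BCL's System.Uri, not eladaus.urldetector.
public static class UriPartsDemo
{
    // Splits an absolute URL into the named parts listed above.
    public static Dictionary<string, string> Split(string text)
    {
        var uri = new Uri(text);
        return new Dictionary<string, string>
        {
            ["Scheme"] = uri.Scheme,
            ["UserInfo"] = uri.UserInfo, // "user" or "user:password"
            ["Host"] = uri.Host,
            ["Port"] = uri.Port.ToString(),
            ["Path"] = uri.AbsolutePath,
            ["Query"] = uri.Query, // keeps the leading '?'
            ["Fragment"] = uri.Fragment, // keeps the leading '#'
        };
    }

    public static void Main()
    {
        string sample = "http://user@linkedin.com:39000/hello?boo=ff#frag";
        foreach (var part in Split(sample))
        {
            Console.WriteLine(part.Key + ": " + part.Value);
        }
    }
}
```

The query and fragment keep their leading ? and # here, which matches the way the parts are written in the list above.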

How to Use:

Using the URL detector library is simple: create a UrlDetector with the input text and some options, then call Detect(). In response, you will get a list of the URLs that were detected.

For example, the following code will find the URL linkedin.com:

    UrlDetector parser = new UrlDetector("hello this is a url Linkedin.com", UrlDetectorOptions.Default);
    List<Url> found = parser.Detect();

    foreach (Url url in found)
    {
        Console.WriteLine("Scheme: " + url.GetScheme());
        Console.WriteLine("Host: " + url.GetHost());
        Console.WriteLine("Path: " + url.GetPath());
    }

Quote Matching and HTML

Depending on your input string, you may want to handle certain characters in a special way. If you are parsing HTML, for instance, you probably want to break out of things like quotes and brackets. Suppose your input looks like

<a href="http://linkedin.com/abc">linkedin.com</a>

You probably want the surrounding quotes and brackets to be left out of the detected URLs. For that reason, UrlDetectorOptions lets you change the sensitivity level of detection based on your expected input type. This way you can detect linkedin.com instead of linkedin.com</a>.

In code this looks like:

    // Inner double quotes must be escaped inside a C# string literal.
    UrlDetector parser = new UrlDetector("<a href=\"linkedin.com/abc\">linkedin.com</a>", UrlDetectorOptions.HTML);
    List<Url> found = parser.Detect();
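
To see the effect of the option on the HTML input above, a minimal sketch can run the same text through both modes and print what each one reports. It uses only calls that appear elsewhere in this README (the UrlDetector constructor, Detect(), GetHost(), GetPath()). The two library using directives are an assumption about the package's namespaces and may need adjusting, and HtmlOptionDemo is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using urldetector; // assumed namespace of Url
using urldetector.detection; // assumed namespace of UrlDetector and UrlDetectorOptions

// Hypothetical demo class, not part of the library.
public static class HtmlOptionDemo
{
    // Runs one detection pass and returns host + path for each detected URL.
    public static List<string> Hits(string text, UrlDetectorOptions options)
    {
        var hits = new List<string>();
        foreach (Url url in new UrlDetector(text, options).Detect())
        {
            hits.Add(url.GetHost() + url.GetPath());
        }
        return hits;
    }

    public static void Main()
    {
        string html = "<a href=\"linkedin.com/abc\">linkedin.com</a>";
        List<string> plain = Hits(html, UrlDetectorOptions.Default);
        List<string> aware = Hits(html, UrlDetectorOptions.HTML);
        Console.WriteLine("Default: " + string.Join(" | ", plain));
        Console.WriteLine("HTML:    " + string.Join(" | ", aware));
    }
}
```

Comparing the two printed lines shows where quotes or tag fragments cling to a match in Default mode for a given input.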

Contributing

Prerequisites

Getting Started

After cloning the repository, restore the local dotnet tools and install the Husky git hooks:

dotnet tool restore
dotnet husky install

This installs CSharpier (code formatter) and Husky.Net (git hooks) as local tools, and sets up a pre-commit hook that automatically formats staged .cs files with CSharpier before each commit.

Code Style

This project enforces consistent code style through several layers:

Tool / File                         Purpose
CSharpier (.csharpierrc.yaml)       Opinionated code formatter; runs automatically on pre-commit
.editorconfig                       Enforces brace requirements and Rider/ReSharper formatting rules
EnforceCodeStyleInBuild (csproj)    Promotes IDE code-style analyzers (e.g. IDE0011) to build errors

Key rules:

  • Braces are always required for if, else, for, foreach, while, do-while, using, lock, and fixed statements. The build will fail (error IDE0011) if braces are missing.
  • Line width is 100 characters (CSharpier).
  • Do not modify files under src/urldetector.tests/ or src/urldetector.tests.custom/ unless specifically adding new test cases.

Formatting

CSharpier runs automatically via the pre-commit hook. To manually format the entire solution:

dotnet csharpier format src/

If you use JetBrains Rider, install the CSharpier plugin for format-on-save support. The .editorconfig rules are picked up automatically by Rider's code cleanup and inspections.

Building & Testing

# Build
dotnet build src/eladaus.urldetector.sln

# Run tests
dotnet test src/eladaus.urldetector.sln

The build enforces code-style rules as errors — fix any IDE0011 (or similar) violations before committing.


Benchmarks

The project includes a BenchmarkDotNet suite that measures CPU and memory usage across different input sizes, URL densities, URL types, detector options, and structured content formats.

Benchmarks must be run in Release mode. From the repository root:

dotnet run -c Release --project src/urldetector.benchmarks

You will be prompted to select a benchmark class to run. To run a specific suite directly, pass a --filter:

# Run only the input-size scaling benchmarks
dotnet run -c Release --project src/urldetector.benchmarks -- --filter '*InputSizeBenchmarks*'

# Run only the URL density benchmarks
dotnet run -c Release --project src/urldetector.benchmarks -- --filter '*UrlDensityBenchmarks*'

# Run all benchmarks (no filter prompt)
dotnet run -c Release --project src/urldetector.benchmarks -- --filter '*'

Available benchmark suites:

Suite                          What it measures
InputSizeBenchmarks            Scaling from 200 B to 5 MB inputs
UrlDensityBenchmarks           Impact of URL density (0%–90%) on 50 KB input
UrlTypeBenchmarks              Cost per URL type (web, email, IPv4, IPv6, mixed)
DetectorOptionsBenchmarks      Overhead of each UrlDetectorOptions mode
StructuredContentBenchmarks    Matching options against HTML/JSON/XML/JS content
RealWorldHtmlBenchmarks        Real HTML sample files with default vs. all flags
TestCpuAndMemoryUsage          Legacy suite: real HTML files with all flags enabled

Results are written to src/urldetector.benchmarks/BenchmarkDotNet.Artifacts/.


About:

This C# port was originally created by Dale Holborow of eladaus oy. Future contributions are welcome.

The original Java library was written by the security team at LinkedIn when other options did not exist. Some of the primary authors are:


License

The C# port provided by eladaus is released under the Apache License, Version 2.0, as per the original LinkedIn Java library.

Original Java code is Copyright 2015 LinkedIn Corp. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the license at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
