Domain Crawler API

A Java-based web crawler API built with Spring Boot that fetches domain pricing data from Namecheap's domain search page and saves it in both CSV and JSON formats. The API exposes RESTful endpoints to trigger a crawl, retrieve the crawled data, and download the generated files.

🚀 Features

Web Crawling: Extracts domain TLDs, free domain privacy status, price per year, and renewal price.
Data Storage: Saves crawled data to domains.csv and domains.json.
REST API:
- POST /api/v1/crawl: Triggers a crawl and saves data to files.
- GET /api/v1/domains: Retrieves the most recent crawled data.
- GET /api/v1/files/{type}: Downloads the CSV or JSON file.
Modular Design: Follows OOP principles with separation of concerns.
Spring Boot: Leverages dependency injection and robust configuration options.

🛠 Prerequisites

Java 17 or higher
Maven 3.6+
Internet connection (to fetch pages and download dependencies)

⚙️ Setup Instructions

1. Clone the Repository

    git clone <repository-url>
    cd domain-crawler

2. Install Dependencies

Make sure Maven is installed, then run:

    mvn clean install

This will download dependencies (Spring Boot, Jsoup, Jackson, etc.).

3. Update CSS Selectors

The crawler uses placeholder selectors in DomainParser.java.

Open Namecheap's domain search page.
Inspect the HTML (right-click → Inspect).
Identify correct classes/IDs for:
- TLDs
- Prices
- Privacy status
- Renewal prices

Update the parse() method in:

    src/main/java/com/crawler/parser/DomainParser.java

Example:

    String tld = row.select("div.domain-name").text().trim(); // Replace with actual selector

4. Run the Application

    mvn spring-boot:run

The API will be available at http://localhost:8080.

📡 API Usage

1. Trigger a Crawl

    curl -X POST http://localhost:8080/api/v1/crawl

Response: JSON array of crawled domain data.

[
  {
    "TLD": "sale.com",
    "Free Domain Privacy": true,
    "Price / Year": "$11.28",
    "Renewal Price": "$16.98"
  },
  ...
]

Side effect: Creates domains.csv and domains.json in the project root. You can change this location, for example to the resources directory.

2. Get Crawled Data

  curl http://localhost:8080/api/v1/domains

Response: Latest crawled data (or empty array if none).

3. Download Files

CSV:

  curl http://localhost:8080/api/v1/files/csv -o domains.csv

JSON:

  curl http://localhost:8080/api/v1/files/json -o domains.json
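
The same three calls can also be made from Java. The sketch below uses only the JDK's java.net.http client and assumes the API is running at http://localhost:8080 as above. The class name CrawlerApiClient and its method names are illustrative and not part of the project.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical client for the endpoints listed above; not part of the project.
final class CrawlerApiClient {
    private final String baseUrl;
    private final HttpClient http = HttpClient.newHttpClient();

    CrawlerApiClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // POST /api/v1/crawl with an empty body.
    HttpRequest crawlRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/crawl"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // GET /api/v1/domains
    HttpRequest domainsRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/domains")).GET().build();
    }

    // GET /api/v1/files/{type}, where type is "csv" or "json".
    HttpRequest fileRequest(String type) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/files/" + type)).GET().build();
    }

    // Sends a request and returns the response body as text.
    String sendForText(HttpRequest request) throws IOException, InterruptedException {
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Downloads a generated file to the given path.
    Path download(String type, Path target) throws IOException, InterruptedException {
        return http.send(fileRequest(type), HttpResponse.BodyHandlers.ofFile(target)).body();
    }
}
```

With the application running, `new CrawlerApiClient("http://localhost:8080").download("csv", Path.of("domains.csv"))` would be the counterpart of the CSV curl command.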

📄 File Examples

Example domains.csv

  TLD,Free Domain Privacy,Price / Year,Renewal Price
  sale.com,true,$11.28,$16.98
  sale.org,true,$7.48,$14.98
  sale.eu,false,$4.48,$8.98
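
A minimal sketch of how one crawled row could be turned into a line of this file. CsvLines is a hypothetical helper, not the project's actual saver class. It quotes fields following the usual CSV convention, which only matters if a value ever contains a comma or a quote (for example a price such as $1,298.00).

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that formats one crawled row as a CSV line.
final class CsvLines {
    static final String HEADER = "TLD,Free Domain Privacy,Price / Year,Renewal Price";

    // Quotes a field only when it contains a comma, a quote, or a line break;
    // embedded quotes are doubled.
    static String escape(String field) {
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the four columns in the same order as HEADER.
    static String toLine(String tld, boolean freePrivacy, String pricePerYear, String renewalPrice) {
        return Stream.of(tld, String.valueOf(freePrivacy), pricePerYear, renewalPrice)
                .map(CsvLines::escape)
                .collect(Collectors.joining(","));
    }
}
```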

Example domains.json

[
  {
    "TLD": "sale.com",
    "Free Domain Privacy": true,
    "Price / Year": "$11.28",
    "Renewal Price": "$16.98"
  }
]

⚠️ Important Notes

CSS Selectors

- The default selectors are placeholders.
- Always inspect Namecheap's current HTML structure to update them.
- If the page changes, repeat the inspection.

Dynamic Content (If Jsoup Doesn't Work)

- If the data is loaded via JavaScript, use Selenium instead:

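  import java.io.IOException;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.openqa.selenium.WebDriver;
  import org.openqa.selenium.chrome.ChromeDriver;

  // Renders the page in a real browser so JavaScript-loaded content is present,
  // then hands the resulting HTML to Jsoup. `url` is the fetcher's target URL field.
  public Document fetchPage() throws IOException {
      WebDriver driver = new ChromeDriver();
      try {
          driver.get(url);
          return Jsoup.parse(driver.getPageSource());
      } finally {
          driver.quit(); // always close the browser, even if loading fails
      }
  }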

Add Selenium to pom.xml:

  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.15.0</version>
  </dependency>

Ensure ChromeDriver is installed.

Rate Limiting

- Add a delay between requests to avoid being blocked:

  Thread.sleep(1000);

- Respect Namecheap's robots.txt.
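
If the crawler fetches more than one page, a fixed sleep can be wrapped in a small helper that enforces a minimum gap between requests. This is a sketch only: CrawlThrottle and its methods are hypothetical names, and the interval is whatever value you judge polite for the target site.

```java
// Hypothetical helper that enforces a minimum interval between requests.
final class CrawlThrottle {
    private final long minIntervalMillis;
    private long lastRequestAt;
    private boolean hasRequested;

    CrawlThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long a caller at time nowMillis still has to wait before the next request.
    synchronized long millisToWait(long nowMillis) {
        if (!hasRequested) {
            return 0;
        }
        return Math.max(0, minIntervalMillis - (nowMillis - lastRequestAt));
    }

    // Records that a request was sent at nowMillis.
    synchronized void markRequest(long nowMillis) {
        lastRequestAt = nowMillis;
        hasRequested = true;
    }

    // Blocks until the minimum interval has passed, then records the new request.
    synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

Calling `throttle.acquire()` immediately before each page fetch keeps requests at least the configured interval apart, even when a fetch itself takes a variable amount of time.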

File Storage

Default: Root of the project (domains.csv, domains.json)
You can configure a custom path in application.properties.

Error Handling

Basic error handling is present.
Improve with custom exceptions, logging, and retry mechanisms for production.
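
As one possible shape for the retry mechanism mentioned above, the sketch below retries a task with a doubling pause between attempts. Retry.withBackoff is a hypothetical helper, not existing project code, and the attempt count and delay are placeholders to tune.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper with exponential backoff.
final class Retry {
    // Runs the task up to maxAttempts times, doubling the pause after each failure.
    // Rethrows the last failure if every attempt fails.
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

A page fetch could then be wrapped as `Retry.withBackoff(() -> fetcher.fetchPage(), 3, 1000)`, where `fetcher` stands for whatever object performs the fetch.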

Security

The current version is unauthenticated.
Add Spring Security for production use (JWT, API keys, etc.).

📦 Dependencies

Spring Boot 3.2.5: API framework
Jsoup 1.17.2: HTML parsing
Jackson 2.17.2: JSON handling

Full list in pom.xml.

🧩 Extending the Project

New Output Formats: Add an XmlSaver or other DataSaver implementations.
Database Storage: Use Spring Data JPA for persistence.
Async Crawling: Annotate crawl method with @Async.
Authentication: Add JWT or OAuth2 security with Spring Security.
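
To illustrate the first extension point, here is a rough sketch of an XML saver. The DataSaver interface shown is an assumed shape, since the project's real interface may take different parameters, and rows are modelled as simple string maps for the sake of the example.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;

// Assumed shape of the saver abstraction; the project's real DataSaver may differ.
interface DataSaver {
    void save(List<Map<String, String>> rows, Writer out) throws IOException;
}

// Hypothetical saver that writes each row as a <domain> element.
final class XmlSaver implements DataSaver {
    @Override
    public void save(List<Map<String, String>> rows, Writer out) throws IOException {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<domains>\n");
        for (Map<String, String> row : rows) {
            out.write("  <domain>\n");
            for (Map.Entry<String, String> field : row.entrySet()) {
                // Column names such as "Price / Year" are not valid element names,
                // so they are stored in a name attribute instead.
                out.write("    <field name=\"" + escape(field.getKey()) + "\">"
                        + escape(field.getValue()) + "</field>\n");
            }
            out.write("  </domain>\n");
        }
        out.write("</domains>\n");
        out.flush();
    }

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```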

🛠 Troubleshooting

No Data Crawled: Check and update the selectors, verify that the URL is reachable, and use Selenium if needed.
API Errors: Check the logs printed by mvn spring-boot:run.
Dependency Issues: Ensure Java and Maven are up to date, then rerun mvn clean install.

📄 License

This project is licensed under the MIT License. See LICENSE for details.

🤝 Contributing

Contributions are welcome!

1. Fork the repo.
2. Create a new branch:

  git checkout -b feature/your-feature

3. Commit your changes:

  git commit -m "Add your feature"

4. Push your branch:

  git push origin feature/your-feature

5. Open a pull request 🚀
