Domain Crawler API

A Java-based web crawler API built with Spring Boot that fetches domain pricing data from Namecheap's domain search page and saves it to both CSV and JSON formats. The API exposes RESTful endpoints to trigger crawling, retrieve crawled data, and download the generated files.


🚀 Features

  • Web Crawling: Extracts domain TLDs, free domain privacy status, price per year, and renewal price.
  • Data Storage: Saves crawled data to domains.csv and domains.json.
  • REST API:
    • POST /api/v1/crawl: Triggers a crawl and saves data to files.
    • GET /api/v1/domains: Retrieves the most recent crawled data.
    • GET /api/v1/files/{type}: Downloads the CSV or JSON file.
  • Modular Design: Follows OOP principles with separation of concerns.
  • Spring Boot: Leverages dependency injection and robust configuration options.

🛠 Prerequisites

  • Java 17 or higher
  • Maven 3.6+
  • Internet connection (to fetch pages and download dependencies)

⚙️ Setup Instructions

1. Clone the Repository

    git clone <repository-url>
    cd domain-crawler

2. Install Dependencies

Make sure Maven is installed, then run:

    mvn clean install
  • This downloads the dependencies (Spring Boot, Jsoup, Jackson, etc.) and builds the project.

3. Update CSS Selectors

The crawler uses placeholder selectors in DomainParser.java.

  • Open Namecheap's domain search page.
  • Inspect the HTML (right-click → Inspect).
  • Identify correct classes/IDs for:
    • TLDs
    • Prices
    • Privacy status
    • Renewal prices

Update the parse() method in:

    src/main/java/com/crawler/parser/DomainParser.java
  • Example:
    String tld = row.select("div.domain-name").text().trim(); // Replace with actual selector

4. Run the Application

    mvn spring-boot:run

📡 API Usage

1. Trigger a Crawl

    curl -X POST http://localhost:8080/api/v1/crawl
  • Response: JSON array of crawled domain data.
    [
      {
      "TLD": "sale.com",
      "Free Domain Privacy": true,
      "Price / Year": "$11.28",
      "Renewal Price": "$16.98"
      },
      ...
    ]
  • Side effect: Creates domains.csv and domains.json in the project root. The output location can be changed, for example to the resources directory.

2. Get Crawled Data

  curl http://localhost:8080/api/v1/domains
  • Response: Latest crawled data (or empty array if none).

3. Download Files

  • CSV:
  curl http://localhost:8080/api/v1/files/csv -o domains.csv
  • JSON:
  curl http://localhost:8080/api/v1/files/json -o domains.json
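
The same three endpoints can also be called from Java. The following is a minimal sketch using the JDK's built-in HttpClient. The class name CrawlerClient is hypothetical and not part of the project, the base URL is taken from the curl examples, and response bodies are treated as plain text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Hypothetical client wrapper around the documented endpoints (not part of the project). */
class CrawlerClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;

    CrawlerClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** Builds POST /api/v1/crawl with an empty request body. */
    HttpRequest crawlRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/crawl"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** Builds GET /api/v1/domains. */
    HttpRequest domainsRequest() {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/domains")).GET().build();
    }

    /** Builds GET /api/v1/files/{type}, where type is "csv" or "json". */
    HttpRequest fileRequest(String type) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/v1/files/" + type)).GET().build();
    }

    /** Sends a request and returns the body as text; non-2xx responses become an IOException. */
    String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    /** Downloads the CSV or JSON file to the given target path. */
    Path download(String type, Path target) throws IOException, InterruptedException {
        HttpResponse<Path> response =
                http.send(fileRequest(type), HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() / 100 != 2) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

With the application running, `new CrawlerClient("http://localhost:8080").send(client.crawlRequest())` would trigger a crawl and return the JSON array as a string.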

📄 File Examples

  • Example domains.csv
  TLD,Free Domain Privacy,Price / Year,Renewal Price
  sale.com,true,$11.28,$16.98
  sale.org,true,$7.48,$14.98
  sale.eu,false,$4.48,$8.98
  • Example domains.json
[
  {
    "TLD": "sale.com",
    "Free Domain Privacy": true,
    "Price / Year": "$11.28",
    "Renewal Price": "$16.98"
  }
]
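
To work with these files in code, one row can be modeled as a small value type. This is a sketch only: the record name DomainRow and its methods are hypothetical, and the CSV handling assumes the simple layout shown above (four columns, no quoted fields or embedded commas).

```java
import java.math.BigDecimal;

/** Hypothetical row type mirroring the four columns of domains.csv. */
record DomainRow(String tld, boolean freePrivacy, String pricePerYear, String renewalPrice) {

    static final String HEADER = "TLD,Free Domain Privacy,Price / Year,Renewal Price";

    /** Parses one data line; assumes exactly four comma-separated, unquoted columns. */
    static DomainRow fromCsvLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length != 4) {
            throw new IllegalArgumentException("Expected 4 columns: " + line);
        }
        return new DomainRow(cols[0].trim(), Boolean.parseBoolean(cols[1].trim()),
                cols[2].trim(), cols[3].trim());
    }

    /** Renders the row in the same column order as HEADER. */
    String toCsvLine() {
        return String.join(",", tld, String.valueOf(freePrivacy), pricePerYear, renewalPrice);
    }

    /** Strips the leading "$" so a price can be used numerically. */
    static BigDecimal amount(String price) {
        return new BigDecimal(price.replace("$", "").trim());
    }

    /** Difference between the renewal price and the first-year price. */
    BigDecimal renewalMarkup() {
        return amount(renewalPrice).subtract(amount(pricePerYear));
    }
}
```

Prices are kept as strings because that is how they appear in the files; converting to BigDecimal only when needed avoids floating-point rounding.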

⚠️ Important Notes

  • CSS Selectors

    • The default selectors are placeholders.
    • Always inspect Namecheap's current HTML structure to update them.
    • If the page changes, repeat the inspection.
  • Dynamic Content (If Jsoup Doesn't Work)

    • If the data is loaded via JavaScript, use Selenium instead:
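  import java.io.IOException;

  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.openqa.selenium.WebDriver;
  import org.openqa.selenium.chrome.ChromeDriver;

  // Renders the page in a real browser so JavaScript-loaded content is present in the HTML.
  public Document fetchPage() throws IOException {
      WebDriver driver = new ChromeDriver();
      try {
          driver.get(url); // url: the page address held by the enclosing class
          return Jsoup.parse(driver.getPageSource());
      } finally {
          driver.quit(); // Always close the browser, even if loading fails
      }
  }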
  • Add Selenium to pom.xml
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.15.0</version>
  </dependency>
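  • Ensure Chrome is installed. Selenium 4.6 and later can fetch a matching ChromeDriver automatically through Selenium Manager; otherwise install ChromeDriver manually.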

  • Rate Limiting

    • Add a delay between requests to avoid being blocked:
  Thread.sleep(1000);
  • Respect Namecheap’s robots.txt.
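
A fixed Thread.sleep pauses even when the previous request was long ago. One possible refinement is to wait only for the time still missing from a minimum gap. This is a sketch; the class name PoliteThrottle is hypothetical and not part of the project.

```java
import java.time.Duration;

/** Hypothetical helper that enforces a minimum gap between consecutive requests. */
class PoliteThrottle {
    private final long minGapNanos;
    private long lastRequestNanos;
    private boolean first = true;

    PoliteThrottle(Duration minGap) {
        this.minGapNanos = minGap.toNanos();
    }

    /** Blocks until at least minGap has elapsed since the previous call returned. */
    synchronized void await() throws InterruptedException {
        if (!first) {
            long remaining = minGapNanos - (System.nanoTime() - lastRequestNanos);
            if (remaining > 0) {
                // Sleep only for the part of the gap that has not yet passed.
                Thread.sleep(remaining / 1_000_000, (int) (remaining % 1_000_000));
            }
        }
        first = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

Calling `throttle.await()` immediately before each page fetch keeps requests at least one gap apart without delaying the very first request.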

File Storage

  • Default: Root of the project (domains.csv, domains.json)
  • You can configure a custom path in application.properties.

Error Handling

  • Basic error handling is present.
  • Improve with custom exceptions, logging, and retry mechanisms for production.
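
As one possible shape for the retry mechanism mentioned above, the sketch below retries a task on IOException with exponential backoff. The class and method names are hypothetical, and treating every IOException as transient is an assumption that a production crawler would likely narrow.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical retry helper (not part of the project). */
class Retry {

    /**
     * Runs the task, retrying on IOException up to maxAttempts times in total.
     * The wait doubles after each failed attempt, starting at initialDelayMillis.
     */
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // Out of attempts: surface the last failure
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

A page fetch could then be wrapped as `Retry.withBackoff(() -> fetchPage(), 3, 1000)`, which waits 1 s and then 2 s between attempts.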

Security

  • The current version is unauthenticated.
  • Add Spring Security for production use (JWT, API keys, etc.).
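
If API keys are the chosen route, the key comparison itself can be sketched with the JDK alone. The class name ApiKeyCheck is hypothetical, and how the key reaches the check (for example a request header read in a Spring Security filter) is left open here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Hypothetical API key validator (not part of the project). */
class ApiKeyCheck {
    private final byte[] expected;

    ApiKeyCheck(String expectedKey) {
        this.expected = expectedKey.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true only if the presented key matches the configured one. */
    boolean isAuthorized(String presentedKey) {
        if (presentedKey == null) {
            return false;
        }
        // MessageDigest.isEqual is designed to compare in constant time,
        // which avoids leaking how many leading characters matched.
        return MessageDigest.isEqual(expected, presentedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```

The expected key should come from configuration or an environment variable rather than being hard-coded.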

📦 Dependencies

  • Spring Boot 3.2.5: API framework
  • Jsoup 1.17.2: HTML parsing
  • Jackson 2.17.2: JSON handling

The full list is in pom.xml.

🧩 Extending the Project

  • New Output Formats: Add an XmlSaver or other DataSaver implementations.
  • Database Storage: Use Spring Data JPA for persistence.
  • Async Crawling: Annotate the crawl method with @Async and enable it with @EnableAsync on a configuration class.
  • Authentication: Add JWT or OAuth2 security with Spring Security.
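
To illustrate the first extension point, here is a rough sketch of an XML saver. The DataSaver interface declared below is a stand-in written for this example; the project's real DataSaver may have a different signature, so treat the method shape and the row type (a map of column name to value) as assumptions.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;

/** Stand-in for the project's DataSaver; the real interface may differ. */
interface DataSaver {
    void save(List<Map<String, String>> rows, Writer out) throws IOException;
}

/** Hypothetical saver that writes each row as a <domain> element. */
class XmlSaver implements DataSaver {

    @Override
    public void save(List<Map<String, String>> rows, Writer out) throws IOException {
        out.write("<domains>\n");
        for (Map<String, String> row : rows) {
            out.write("  <domain>\n");
            for (Map.Entry<String, String> field : row.entrySet()) {
                // Column names such as "Price / Year" are not valid XML element
                // names, so they are stored in a name attribute instead.
                out.write("    <field name=\"" + escape(field.getKey()) + "\">"
                        + escape(field.getValue()) + "</field>\n");
            }
            out.write("  </domain>\n");
        }
        out.write("</domains>\n");
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

Writing to a Writer rather than a fixed file name keeps the saver easy to test and leaves the choice of output path to the caller.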

🛠 Troubleshooting

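  • No Data Crawled: Check and update the selectors, verify that the URL is reachable, and use Selenium if the content is loaded by JavaScript.
  • API Errors: Check the application logs in the console where mvn spring-boot:run is running.
  • Dependency Issues: Ensure Java and Maven meet the versions listed under Prerequisites, then rerun mvn clean install.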

📄 License

  • This project is licensed under the MIT License. See LICENSE for details.

🤝 Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a new branch
     git checkout -b feature/your-feature
  3. Commit changes
     git commit -m "Add your feature"
  4. Push your branch
     git push origin feature/your-feature
  5. Open a pull request 🚀