
Commit 3ba4ebd

committed
amazon
1 parent 28b32b7 commit 3ba4ebd

File tree

7 files changed

+9583
-0
lines changed


amazon_scrapping/Amazon-dataset/Product listing.csv

Lines changed: 3155 additions & 0 deletions

amazon_scrapping/README.md

Lines changed: 59 additions & 0 deletions
<h1 align="center">Amazon Scraping</h1>

<blockquote align="center">Scraping the product listing ✏️ using the Python programming language 💻.</blockquote>

<p align="center">To generate new data for the <b>classification part</b>, we wrote a Python script to fetch 📊 data from the Amazon website 🌐 and convert it into CSV files.</p>


# Introduction

The **`Semi-supervised-sequence-learning-Project`** :computer: replication is carried out here, and new data needs to be generated for further analysis.

- This directory includes the following:
  - `scrapping.py` - script to scrape the product data from the Amazon website
  - Product label listing: `Laptop`, `Phones`, `Printers`, `Desktops`, `Monitors`, `Mouse`, `Pendrive`, `Earphones`, `Smart TV`, `Power banks`


## Dependencies

- Install Selenium using `pip install -U selenium`

- Install Python 3 using the [MSI available on the python.org download page](http://www.python.org/download).

- Load the driver:

```python
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://selenium.dev/')
```

- Selenium Server (optional):

```bash
java -jar selenium-server-standalone-4.0.0.jar
```

## Installation

**1️⃣ Fork the `Semi-supervised-sequence-learning-Project/` repository**
Follow these instructions on [how to fork a repository](https://help.github.com/en/articles/fork-a-repo).

**2️⃣ Clone the repository**
Once you have set up your fork of the `/Semi-supervised-sequence-learning-Project` repository, clone it to your local machine so you can make and test all of your personal edits before adding them to the master version of `/Semi-supervised-sequence-learning-Project`.

Navigate to the location on your computer where you want to host your code. Once in the appropriate folder, run the following command to clone the repository to your local machine.

```bash
git clone https://github.com/sanjay-kv/Semi-supervised-sequence-learning-Project.git
```

## Final Dataset

1️⃣ Here is the link to the **final dataset:** [Drive Link](https://drive.google.com/drive/folders/1HB8FCUVqkQpSbV7syq2ZsZ59kT5B-SGo?usp=sharing)
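Besides the raw listing, the `data_scrapped/` folder also carries `train.csv` and `test.csv` splits. A split like that can be reproduced with pandas — a minimal sketch, assuming an 80/20 ratio and toy column names (neither is taken from the repository):

```python
import pandas as pd

# Toy stand-in for the scraped product listing (hypothetical rows/columns)
listing = pd.DataFrame({
    "name": ["Laptop A", "Phone B", "Printer C", "Monitor D", "Mouse E"],
    "label": ["Laptop", "Phones", "Printers", "Monitors", "Mouse"],
})

train = listing.sample(frac=0.8, random_state=42)  # 80% of rows for training
test = listing.drop(train.index)                   # the remaining 20% for testing

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
print(len(train), len(test))  # → 4 1
```

`drop(train.index)` guarantees the two splits are disjoint, which is the property the downstream classification experiments rely on.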

amazon_scrapping/data_scrapped/Product listing.csv

Lines changed: 3157 additions & 0 deletions

amazon_scrapping/data_scrapped/test.csv

Lines changed: 395 additions & 0 deletions

amazon_scrapping/data_scrapped/train.csv

Lines changed: 2759 additions & 0 deletions

amazon_scrapping/scrapping.py

Lines changed: 58 additions & 0 deletions
# Scrape product names from Amazon search results
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

## One way to load the Chrome webdriver
# from webdriver_manager.chrome import ChromeDriverManager
# driver = webdriver.Chrome(ChromeDriverManager().install())

## Another way to load the Chrome webdriver (Selenium 4 style)
path = '/Users/mohammedrizwan/Downloads/chromedriver'
driver = webdriver.Chrome(service=Service(path))


def product_listing(txt):
    # Search Amazon for the given product keyword
    driver.get("https://www.amazon.in/")
    driver.implicitly_wait(2)
    driver.find_element(By.ID, 'twotabsearchtextbox').send_keys(txt)
    driver.implicitly_wait(2)
    driver.find_element(By.ID, 'nav-search-submit-button').click()
    driver.implicitly_wait(5)

    # Collect the product titles on the first results page
    items = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
        (By.XPATH, '//a[@class="a-link-normal a-text-normal"]')))
    for item in items:
        name_list.append(item.text)

    driver.implicitly_wait(5)
    # The pagination widget renders one entry per line; the last-but-one
    # line holds the number of result pages
    pagination_text = driver.find_element(By.CLASS_NAME, "a-pagination").text
    num_of_pg = pagination_text.splitlines()[-2]

    # Walk through the remaining result pages via the "Next" link
    for i in range(int(num_of_pg) - 5):
        print(i)
        items = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
            (By.XPATH, '//a[@class="a-link-normal a-text-normal"]')))
        for item in items:
            name_list.append(item.text)
        link = driver.find_element(By.CLASS_NAME, "a-section.a-spacing-none.a-padding-base")
        next_lin = link.find_element(By.CLASS_NAME, "a-last").find_element(By.TAG_NAME, "a").get_attribute("href")
        driver.get(next_lin)
        driver.implicitly_wait(2)


names = ['Laptop', 'Phones', 'Printers', 'Desktops', 'Monitors', 'Mouse',
         'Pendrive', 'Earphones', 'Smart TV', 'Power banks']
name_list = []
for i in names:
    product_listing(i)

df = pd.DataFrame(name_list)
df.to_csv('./prod_listings.csv')
print(df)
driver.quit()
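The script derives the page count from the text layout of Amazon's pagination widget. Isolated as a pure helper, that parsing step is easier to test — a sketch, where the sample string is hypothetical rather than a captured Amazon page:

```python
def page_count(pagination_text: str) -> int:
    """Number of result pages from the a-pagination widget text.

    The widget renders one entry per line and ends with the 'Next'
    control, so the last-but-one line is the final page number.
    """
    return int(pagination_text.splitlines()[-2])

# Hypothetical pagination text resembling Amazon's widget
sample = "1\n2\n3\n...\n20\nNext"
print(page_count(sample))  # → 20
```

Keeping the parsing separate from the WebDriver calls also means a markup change on Amazon's side only breaks one small, testable function.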
