Commit ec813b5

[ADDED]: LinkedIn Profile
1 parent 08b7bdd commit ec813b5

4 files changed

+199
-0
lines changed

LinkedIn_Profile_Info/README.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# LinkedIn Profile Picture

A Python package that finds LinkedIn profile pictures using the Google Custom Search API.

## Overview

Given a LinkedIn profile URL, this package extracts the LinkedIn ID from the URL and uses the Google Custom Search API to search for the profile picture associated with that ID.
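
As a rough illustration of how a picture URL can be read from a single search result, the sketch below follows the `og:image` lookup and URL pattern used in profile.py. The function name `pick_picture_url` and the sample result are made up for this example, and the sketch skips the HEAD request that profile.py uses to confirm the URL still exists.

```python
import re
from typing import Dict, List

# Same pattern that profile.py uses to recognise profile photo URLs.
PICTURE_PATTERN = re.compile(r"media-exp\d\.licdn\.com.+?profile-displayphoto-shrink_")


def pick_picture_url(item: Dict) -> str:
    """Return the first og:image value that looks like a profile photo, else ''."""
    metatags: List[Dict] = item.get("pagemap", {}).get("metatags", [])
    for tag in metatags:
        url = tag.get("og:image", "")
        if url and PICTURE_PATTERN.search(url):
            return url
    return ""


# Hypothetical search result; only the fields read above are included.
sample_item = {
    "link": "https://www.linkedin.com/in/jane-doe",
    "pagemap": {
        "metatags": [
            {"og:image": "https://static.licdn.com/some/default-logo.png"},
            {"og:image": "https://media-exp1.licdn.com/dms/image/XYZ/profile-displayphoto-shrink_200_200/0/1"},
        ]
    },
}

picture_url = pick_picture_url(sample_item)
print(picture_url)
```

The first `og:image` value is skipped because it does not match the pattern; the second one is returned.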

## Features

- Retrieve a LinkedIn profile picture URL from the profile URL.
- Get additional profile information such as name, headline, and public URL.

## Installation

You can install the package using pip:

```bash
pip install linkedin-profile-picture
```

## Information about Google API

The Google API code is in google_API.py.

That file defines a Python class, `GoogleSearchAPI`, which queries the Google Custom Search API and retrieves search results related to a specific LinkedIn profile. The sections below describe how the class works and how it fits into the rest of the package.

1. GoogleSearchAPI class:

- Initialization (`__init__`): The class is initialized with two parameters, `key` and `cx`: the API key and the custom search engine ID required to access the Google Custom Search API.

- API Request (`_hit_api`): This method sends requests to the Google Custom Search API. It takes a LinkedIn ID (extracted from the LinkedIn profile URL), searches for the exact term `/in/<linkedin_id>`, and requests up to 10 results per page. While the response contains a `nextPage` entry, the method requests the next page and appends its results.

- Response Handling: In the google_API.py added by this commit, the response is handled inside `_hit_api`; these files contain no separate `_create_api_response` method or `APIResponse` object. If the status code is 200, the returned items are appended to the results list. A 429 response triggers a wait and a retry. Any other error status raises an exception, which is logged, and the results collected so far are returned.
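
The pagination described above can be sketched without any network access. This is a minimal illustration, not the package's code: `collect_items`, `fetch_page`, and `max_pages` are hypothetical names, and the fake pages mimic only the `items` and `queries.nextPage` fields that `_hit_api` reads.

```python
from typing import Callable, Dict, List


def collect_items(
    fetch_page: Callable[[Dict], Dict],
    params: Dict,
    max_pages: int = 10,
) -> List[Dict]:
    """Follow queries.nextPage until it is absent or max_pages is reached."""
    results: List[Dict] = []
    params = dict(params)  # work on a copy so the caller's dict is untouched
    for _ in range(max_pages):
        data = fetch_page(params)
        results.extend(data.get("items", []))
        next_page = data.get("queries", {}).get("nextPage", [])
        if not next_page:
            break
        params["start"] = next_page[0]["startIndex"]
    return results


# Fake two-page response, keyed by the "start" index of each page.
FAKE_PAGES = {
    1: {
        "items": [{"link": "a"}, {"link": "b"}],
        "queries": {"nextPage": [{"startIndex": 3}]},
    },
    3: {"items": [{"link": "c"}]},
}


def fake_fetch(params: Dict) -> Dict:
    """Stand-in for the HTTP call; returns the page for the requested start index."""
    return FAKE_PAGES[params.get("start", 1)]


links = [item["link"] for item in collect_items(fake_fetch, {"exactTerms": "/in/some-id"})]
print(links)  # prints ['a', 'b', 'c']
```

Passing the fetch function in as an argument keeps the paging logic testable; the `max_pages` cap is an addition of this sketch and is not present in `_hit_api`.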

2. Usage:

To use the package, you would typically follow these steps:

- Import the `ProfilePicture` class and create an instance of it, providing your API key and custom search engine ID. `ProfilePicture` creates its own `GoogleSearchAPI` instance internally.

- Get the LinkedIn profile URL for which you want to find the profile picture. If you need the LinkedIn ID on its own, extract it from this URL using the `extract_id` method of the `ProfilePicture` class.

- Use the `get_profile_picture` method of the `ProfilePicture` class to get the profile picture URL by passing the LinkedIn profile URL as input. This method extracts the LinkedIn ID, runs the custom search through `GoogleSearchAPI`, and returns the profile picture URL, or an empty string if none is found.

- Optionally, use the `get_profile_info` method of the `ProfilePicture` class to fetch additional profile information (name, headline, and public URL). These values are taken from the title, snippet, and link of the matching search result.
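
The ID-extraction step can be illustrated in isolation. The sketch below is a standalone function that mirrors the logic of `extract_id` in profile.py; the name `linkedin_id_from_url` and the sample URLs are made up for this example.

```python
import re
from urllib.parse import unquote, urlparse


def linkedin_id_from_url(link: str) -> str:
    """Return the path segment after /in/, URL-decoded; fall back to the input."""
    match = re.search(r"/in/([^/]+)", urlparse(link).path)
    linkedin_id = match.group(1) if match else link
    return unquote(linkedin_id.strip().strip("/"))


plain = linkedin_id_from_url("https://www.linkedin.com/in/jane-doe/")
encoded = linkedin_id_from_url("https://www.linkedin.com/in/jos%C3%A9-garc%C3%ADa?trk=x")
bare = linkedin_id_from_url("jane-doe/")

print(plain)    # prints jane-doe
print(encoded)  # prints josé-garcía
print(bare)     # prints jane-doe
```

Because the pattern is applied to the parsed URL path, query strings such as `?trk=x` never reach it. An input without an `/in/` segment is treated as an ID already.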

3. API Rate Limiting:

The `GoogleSearchAPI` class in google_API.py retries when it is rate limited. If the Google Custom Search API returns status code 429 (Too Many Requests), the code waits for the number of seconds given in the "Retry-After" response header, or 5 seconds if the header is absent, and then repeats the request. There is no limit on the number of retries. The copy of `GoogleSearchAPI` defined in profile.py, which `ProfilePicture` uses, does not retry: on any non-200 status it logs a warning and stops paginating.
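
One way to keep a Retry-After loop from running indefinitely is to cap the number of retries. The sketch below is an assumption-laden illustration and not the code in this commit: `get_with_retry`, `retry_delay`, `max_retries`, and `FakeResponse` are hypothetical names, and the fake responses stand in for real HTTP calls.

```python
import time
from typing import Callable, Optional


class FakeResponse:
    """Stand-in for requests.Response with only the attributes used here."""

    def __init__(self, status_code: int, headers: Optional[dict] = None):
        self.status_code = status_code
        self.headers = headers or {}


def retry_delay(headers: dict, default: float = 5.0) -> float:
    """Seconds to wait; use default if Retry-After is missing or not a number."""
    try:
        return max(0.0, float(headers.get("Retry-After", default)))
    except (TypeError, ValueError):
        return default


def get_with_retry(
    send: Callable[[], FakeResponse],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> FakeResponse:
    """Call send() until it returns a non-429 response or retries run out."""
    resp = send()
    for _ in range(max_retries):
        if resp.status_code != 429:
            break
        sleep(retry_delay(resp.headers))
        resp = send()
    return resp  # may still be a 429 if every retry was rate limited


# Two rate-limited responses followed by a success.
responses = iter([
    FakeResponse(429, {"Retry-After": "2"}),
    FakeResponse(429),
    FakeResponse(200),
])
waits = []  # records the delays instead of actually sleeping
final = get_with_retry(lambda: next(responses), sleep=waits.append)
print(final.status_code, waits)  # prints 200 [2.0, 5.0]
```

Injecting `sleep` keeps the example instant to run. In real use the default `time.sleep` would apply, and `send` would wrap the actual HTTP request.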

LinkedIn_Profile_Info/google_API.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
import logging
import time

import requests

logger = logging.getLogger(__name__)


class GoogleSearchAPI:
    """Small client for the Google Custom Search API."""

    def __init__(self, key: str, cx: str):
        self._cx = cx
        self._key = key
        self._api_url = "https://www.googleapis.com/customsearch/v1"
        # Base query parameters sent with every request.
        self._params = {
            "num": 10,
            "cx": self._cx,
            "key": self._key
        }

    def _hit_api(self, linkedin_id: str) -> list:
        """Search for the exact term "/in/<linkedin_id>" and return the items from all pages."""
        results = []
        try:
            params = self._params.copy()
            params["exactTerms"] = f"/in/{linkedin_id}"
            while True:
                # A timeout keeps a stalled connection from blocking forever.
                resp = requests.get(self._api_url, params=params, timeout=10)
                if resp.status_code == 200:
                    data = resp.json()
                    items = data.get("items", [])
                    results.extend(items)

                    # Follow pagination until the API stops reporting a next page.
                    next_page = data.get("queries", {}).get("nextPage", [])
                    if not next_page:
                        break
                    params["start"] = next_page[0]["startIndex"]
                elif resp.status_code == 429:  # API rate limiting
                    retry_after = int(resp.headers.get("Retry-After", 5))
                    logger.warning(f"Google Custom Search API rate limit reached. Retrying in {retry_after} seconds.")
                    time.sleep(retry_after)
                else:
                    resp.raise_for_status()  # Raise an exception for HTTP error status codes
                    break  # Non-error, non-200 status: stop instead of looping forever
        except requests.exceptions.RequestException as e:
            logger.exception(f"Error in _hit_api: {e}")
        except Exception:
            logger.exception("An error occurred while processing the API response.")
        return results

LinkedIn_Profile_Info/profile.py

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
import re
import requests
from urllib.parse import urlparse, unquote
import logging

logger = logging.getLogger(__name__)


class GoogleSearchAPI:
    # Simplified copy of the class in google_API.py: any non-200 response
    # (including 429) is logged and ends pagination; there is no retry.
    def __init__(self, key: str, cx: str):
        self._cx = cx
        self._key = key
        self._api_url = "https://www.googleapis.com/customsearch/v1"
        self._params = {
            "num": 10,
            "cx": self._cx,
            "key": self._key
        }

    def _hit_api(self, linkedin_id: str) -> list:
        """Search for the exact term "/in/<linkedin_id>" and return the items from all pages."""
        results = []
        try:
            params = self._params.copy()
            params["exactTerms"] = f"/in/{linkedin_id}"
            while True:
                # A timeout keeps a stalled connection from blocking forever.
                resp = requests.get(self._api_url, params=params, timeout=10)
                if resp.status_code != 200:
                    logger.warning(f"Google Custom Search API error: {resp.status_code} - {resp.text}")
                    break

                data = resp.json()
                items = data.get("items", [])
                results.extend(items)

                # Follow pagination until the API stops reporting a next page.
                next_page = data.get("queries", {}).get("nextPage", [])
                if not next_page:
                    break
                params["start"] = next_page[0]["startIndex"]
        except Exception:
            logger.exception("Error in _hit_api:")
        return results

class ProfilePicture:
    def __init__(self, key: str, cx: str):
        self._api_obj = GoogleSearchAPI(key, cx)

    def extract_id(self, link: str) -> str:
        """ To get a clean LinkedIn ID """
        # Fall back to the raw input when the URL has no /in/<id> segment.
        linkedin_id = link
        match = re.findall(r"/in/([^/]+)/?", urlparse(link).path)
        if match:
            linkedin_id = match[0].strip()
        linkedin_id = linkedin_id.strip("/")
        linkedin_id = unquote(linkedin_id)
        return linkedin_id

    def _check_picture_url(self, link: str) -> bool:
        # True only for URLs that look like LinkedIn profile photos.
        match = re.search(r"(media-exp\d\.licdn\.com).+?(profile-displayphoto-shrink_)", link)
        return bool(match)

    def _check_url_exists(self, link: str) -> bool:
        # A HEAD request checks that the image URL is reachable without downloading it.
        try:
            resp = requests.head(link, timeout=5)
            return resp.status_code == 200
        except requests.RequestException:
            return False

    def _extract_profile_picture(self, linkedin_id: str, res: list) -> str:
        """Return the first valid og:image URL from a result matching linkedin_id, else ""."""
        link = ""
        for item in res:
            linkedin_url = item.get("link", "")
            search_id = self.extract_id(linkedin_url)
            if search_id == linkedin_id:
                metatags = item.get("pagemap", {}).get("metatags", [])
                metatags = [tag.get("og:image") for tag in metatags if "og:image" in tag]

                for url in metatags:
                    if self._check_picture_url(url) and self._check_url_exists(url):
                        link = url
                        break
            if link:
                break
        return link

    def _extract_profile_info(self, linkedin_id: str, res: list) -> dict:
        """Return name, headline, and public URL from the first result matching linkedin_id."""
        info = {}
        for item in res:
            linkedin_url = item.get("link", "")
            search_id = self.extract_id(linkedin_url)
            if search_id == linkedin_id:
                info["name"] = item.get("title")
                info["headline"] = item.get("snippet")
                info["public_url"] = linkedin_url
                break
        return info

    def get_profile_picture(self, link: str) -> str:
        linkedin_id = self.extract_id(link)
        api_resp = self._api_obj._hit_api(linkedin_id)
        profile_picture_url = self._extract_profile_picture(linkedin_id, api_resp)
        return profile_picture_url

    def get_profile_info(self, link: str) -> dict:
        # Runs its own search; results are not shared with get_profile_picture.
        linkedin_id = self.extract_id(link)
        api_resp = self._api_obj._hit_api(linkedin_id)
        profile_info = self._extract_profile_info(linkedin_id, api_resp)
        return profile_info
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
requests>=2.26.0
