README.md: 2 additions & 250 deletions
@@ -39,253 +39,5 @@ Output :
39
39
"comments": ""
40
40
}
41
41
```
42
-
43
-
44
-
## Available properties and methods
45
-
```python
# You can use any of the properties and methods below instead of `a_tags_mp3`
page.a_tags_mp3
```
49
-
<details>
50
-
51
-
<summary>Click to expand!</summary>
52
-
53
-
54
-
#### <kbd>property</kbd> a_tag_hrefs
55
-
56
-
57
-
58
-
59
-
60
-
---
61
-
62
-
#### <kbd>property</kbd> a_tag_texts
63
-
64
-
65
-
66
-
67
-
68
-
---
69
-
70
-
#### <kbd>property</kbd> a_tags_mp3
71
-
72
-
73
-
74
-
75
-
76
-
---
77
-
78
-
#### <kbd>property</kbd> a_tags_rar
79
-
80
-
81
-
82
-
83
-
84
-
---
85
-
86
-
#### <kbd>property</kbd> a_tags_with_href
87
-
88
-
89
-
90
-
91
-
92
-
---
93
-
94
-
#### <kbd>property</kbd> article_tag
95
-
96
-
Returns the `article` tag with the longest text.
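
As a rough illustration of the "longest article" idea, here is a minimal standard-library sketch. It is not the library's implementation: `ArticleTextCollector` and `longest_article_text` are hypothetical names, and the sketch returns the article's text rather than a tag object.

```python
from html.parser import HTMLParser


class ArticleTextCollector(HTMLParser):
    """Collects the text of every <article> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.articles = []  # text of each closed <article>
        self._open = []     # one text buffer per currently open <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._open.append([])

    def handle_endtag(self, tag):
        if tag == "article" and self._open:
            self.articles.append("".join(self._open.pop()).strip())

    def handle_data(self, data):
        # Text counts toward every article that is currently open.
        for buffer in self._open:
            buffer.append(data)


def longest_article_text(html):
    """Return the text of the longest <article>, or None if there is none."""
    parser = ArticleTextCollector()
    parser.feed(html)
    parser.close()
    return max(parser.articles, key=len, default=None)


sample = "<article>short</article><article>a much longer body</article>"
best = longest_article_text(sample)
```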
97
-
98
-
---
99
-
100
-
#### <kbd>property</kbd> children
101
-
102
-
Returns a list of `EzSoup` instances built from `self.important_hrefs`. Uses `ThreadPoolExecutor` to crawl the children concurrently, which is much faster than a plain `for` loop.
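
The concurrent crawl described above could look roughly like the sketch below. It only shows the `ThreadPoolExecutor` pattern: `crawl_children` and `fake_build` are hypothetical names, and `fake_build` stands in for constructing an `EzSoup` instance so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_children(hrefs, build, max_workers=8):
    """Call build(href) for every href concurrently, keeping input order."""
    hrefs = list(hrefs)
    if not hrefs:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(build, hrefs))


def fake_build(href):
    # Assumption: a stand-in for building a page object from a URL.
    return {"url": href, "length": len(href)}


children = crawl_children(
    ["https://example.com/a", "https://example.com/bb"], fake_build
)
```

Threads suit this case because fetching pages is I/O-bound, so the workers spend most of their time waiting on responses rather than competing for the interpreter.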
103
-
104
-
---
105
-
106
-
#### <kbd>property</kbd> favicon_href
107
-
108
-
109
-
110
-
111
-
112
-
---
113
-
114
-
#### <kbd>property</kbd> important_a_tags
115
-
116
-
Returns `a` tags that contain a header (`h2`, `h3`), `a` tags inside headers, or elements with class `item` or `post`. I call these important because they are most likely to be crawlable, content-rich web pages.
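
A minimal sketch of these selection rules, using only the standard library, might look like this. It is an assumption-laden illustration, not the library's code: `ImportantLinkParser` and `important_hrefs` are hypothetical names, and the `item`/`post` rule is read here as "the link itself carries that class", which the description above does not confirm.

```python
from html.parser import HTMLParser

HEADERS = {"h2", "h3"}
IMPORTANT_CLASSES = {"item", "post"}  # assumed to be set on the <a> itself


class ImportantLinkParser(HTMLParser):
    """Collects hrefs of links that match the rules above (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._header_depth = 0
        self._current = None  # [href, is_important] for the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADERS:
            self._header_depth += 1
            if self._current is not None:
                self._current[1] = True  # header nested inside the link
        elif tag == "a" and attrs.get("href"):
            classes = set((attrs.get("class") or "").split())
            important = self._header_depth > 0 or bool(classes & IMPORTANT_CLASSES)
            self._current = [attrs["href"], important]

    def handle_endtag(self, tag):
        if tag in HEADERS and self._header_depth > 0:
            self._header_depth -= 1
        elif tag == "a" and self._current is not None:
            href, important = self._current
            if important and href not in self.hrefs:
                self.hrefs.append(href)
            self._current = None


def important_hrefs(html):
    parser = ImportantLinkParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<h2><a href="/a">A</a></h2>'
    '<a href="/b"><h3>B</h3></a>'
    '<a class="post" href="/c">C</a>'
    '<a href="/d">D</a>'
)
links = important_hrefs(sample)
```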
Returns possible topic/breadcrumb names for the web page. Values can be unreliable, since they are not yet generated with NLP methods.
195
-
196
-
---
197
-
198
-
#### <kbd>property</kbd> summary_dict
199
-
200
-
201
-
202
-
203
-
204
-
---
205
-
206
-
#### <kbd>property</kbd> text
207
-
208
-
209
-
210
-
211
-
212
-
---
213
-
214
-
#### <kbd>property</kbd> title
215
-
216
-
The `<h1>` content of a web page is usually cleaner than the original page `<title>` text, so if the `h1` or `h2` text is similar to the title, it is returned instead of the original title text.
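
One way to sketch this heuristic is with `difflib.SequenceMatcher` from the standard library. The function name `pick_title`, the `0.6` threshold, and the substring shortcut are all assumptions made for illustration; the README does not say how similarity is actually measured.

```python
from difflib import SequenceMatcher


def pick_title(title_text, heading_texts, threshold=0.6):
    """Return the first heading similar to the <title> text, else the title."""
    title_text = title_text.strip()
    lowered_title = title_text.lower()
    for heading in heading_texts:
        heading = heading.strip()
        if not heading:
            continue
        ratio = SequenceMatcher(None, heading.lower(), lowered_title).ratio()
        # Also accept a heading fully contained in the title,
        # e.g. a title of the form "Post name | Site name".
        if ratio >= threshold or heading.lower() in lowered_title:
            return heading
    return title_text


cleaned = pick_title("How to crawl pages | My Blog", ["How to crawl pages"])
fallback = pick_title("My Blog", ["Totally unrelated heading text"])
```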