feat: add parsing of response body for encoding #501
s0ph1e left a comment
First of all, thank you for creating a PR with this improvement 👍
I think we can merge it soon, please take a look at a few minor comments ⬇️
- Encoding changes
Resource type can be set after this functionality is executed, see node-website-scraper/lib/scraper.js, lines 179 to 182 in 07c4c02:
    // if type was not determined by mime we can try to get it from filename after it was generated
    if (!resource.getType()) {
        resource.setType(getTypeByFilename(filename));
    }
I'm not sure if it will work fine together.
Now we parse html and css content in lib/request.js, but there are CssHandler and HtmlHandler - they may be more appropriate places to manipulate the content. I suggest leaving everything related to request/response (e.g. working with headers) in lib/request.js and moving file content parsing to the html and css handlers. Does that make sense to you?
- About package-lock file
package-lock.json was intentionally added to .gitignore because it's only for developers & CI, and if package versions are locked with package-lock.json it's harder to identify when something started to break because of a nested dependency update during development.
Without package-lock.json we can be sure that the latest possible versions are installed and rely on the nightly test to see if something started to break after a dependency update.
Maybe there is a better way to always have up-to-date dependencies and check it on CI. I will be happy to discuss it, but I suggest doing it in a separate PR.
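To illustrate the suggested split, here is a minimal sketch of content-based charset detection of the kind that could live in an HTML handler, separate from any header handling in lib/request.js. The function name `getCharsetFromHtml` and the regular expression are assumptions made for illustration; they are not taken from the PR or from the library.

```javascript
// Hypothetical helper, not part of website-scraper's API.
// Looks for a charset declared inside HTML markup and returns it lowercased,
// or null when no declaration is found.
function getCharsetFromHtml(html) {
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    const match = /<meta[^>]*charset\s*=\s*["']?\s*([\w-]+)/i.exec(html);
    return match ? match[1].toLowerCase() : null;
}

console.log(getCharsetFromHtml('<meta charset="UTF-8">'));
console.log(getCharsetFromHtml('<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'));
console.log(getCharsetFromHtml('<p>no declaration</p>'));
```

A regular expression keeps the sketch short; a handler that already parses the document could read the same attributes from the parsed tree instead.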
    return contentTypeHeader && contentTypeHeader.includes('utf-8') ? 'utf8' : 'binary';
    }

    return undefined;
Why do we return undefined from this function, but null from extractMimeTypeFromResponse? Maybe it would be better to return the same result (null or undefined) from both functions, because they seem to be very similar.
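As a sketch of what a consistent "not found" value could look like, the two header helpers below both return null when the Content-Type header gives no answer. The names `getMimeTypeFromHeader` and `getCharsetFromHeader` are hypothetical and are not the functions from the PR.

```javascript
// Hypothetical helpers; both return null when the header gives no answer,
// so callers only ever have to check for one "not found" value.
function getMimeTypeFromHeader(contentTypeHeader) {
    if (!contentTypeHeader) {
        return null;
    }
    const mimeType = contentTypeHeader.split(';')[0].trim().toLowerCase();
    return mimeType || null;
}

function getCharsetFromHeader(contentTypeHeader) {
    if (!contentTypeHeader) {
        return null;
    }
    const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentTypeHeader);
    return match ? match[1].toLowerCase() : null;
}

console.log(getMimeTypeFromHeader('text/html; charset=UTF-8'));
console.log(getCharsetFromHeader('text/html; charset=UTF-8'));
console.log(getCharsetFromHeader('text/css'));
```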
      let body = null;
      if (data instanceof Buffer) {
    -     body = data.toString(encoding);
    +     body = data.toString(encoding || 'binary');
Do we need || here? It looks like encoding will always be 'binary' even if it was not found in headers or file content.
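For context on what the fallback changes, here is a small sketch. The `decodeBody` helper is hypothetical and only mirrors the reviewed line. In Node.js, `Buffer.prototype.toString()` defaults to 'utf8' when no encoding is given, and 'binary' is an alias for 'latin1', so the explicit fallback decodes each byte as one character rather than as UTF-8.

```javascript
// Hypothetical helper mirroring the reviewed line; not the library's actual code.
function decodeBody(data, encoding) {
    if (!Buffer.isBuffer(data)) {
        return data;
    }
    // '||' replaces any falsy value (undefined, null, '') with 'binary'.
    return data.toString(encoding || 'binary');
}

const bytes = Buffer.from('à', 'utf8'); // two bytes: 0xC3 0xA0

console.log(decodeBody(bytes, 'utf8') === 'à');          // explicit encoding is respected
console.log(decodeBody(bytes, undefined) === 'Ã\u00a0'); // fallback: one character per byte
console.log(bytes.toString() === 'à');                   // Node's own default is 'utf8'
```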

Currently on holiday so can't address changes at the moment but yes that makes sense. Regarding the package-lock, how about the nightly is just updated to install with

Of course I could just pull the branch and check, duh...

@phawxby yes, --package-lock=false may work. @marcfielding1 we expect some encoding issues to be fixed by this PR, especially when encoding is set inside the html file in a tag. But it would be nice if you could check whether this branch fixes the issue for you.

FYI, I scraped https://tonclubtonmaillot.groupama.fr (I'm hosting the result in [1]) using this PR, but got "�" instead of "à" in the index page. I'll try to troubleshoot it when I have a chance.
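One common way to end up with "�" in place of "à" is decoding single-byte (Latin-1) content as UTF-8. The sketch below only reproduces that general effect with Node's Buffer API; whether this is what happened for the site mentioned above is not established in this thread.

```javascript
// In Latin-1, 'à' is the single byte 0xE0. That byte is not valid UTF-8 on its
// own, so a UTF-8 decoder substitutes the replacement character U+FFFD.
const latin1Bytes = Buffer.from('à', 'latin1');

const decodedAsUtf8 = latin1Bytes.toString('utf8');
const decodedAsLatin1 = latin1Bytes.toString('latin1');

console.log(latin1Bytes.length);           // number of bytes in the buffer
console.log(decodedAsUtf8 === '\ufffd');   // decoding with the wrong charset
console.log(decodedAsLatin1 === 'à');      // decoding with the matching charset
```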

sorry @phawxby it's out of my skills 😕

I'm closing this PR because similar changes were merged in #504 and will be released in the next version in the next 1-2 days |




Quickly threw this together, it should work in theory and close #500.
This is my last day before vacation so any additional work to get it merged in will need to be picked up by someone else. @Jeremytijal ?
Changes:
Removed package-lock.json from .gitignore