feat: add parsing of response body for encoding #501

phawxby · 2022-07-01T09:15:07Z

Quickly threw this together; it should work in theory and close #500.

This is my last day before vacation so any additional work to get it merged in will need to be picked up by someone else. @Jeremytijal ?

Changes:

Adds parsing of the response body to look for a charset
Removes package-lock.json from .gitignore
Increases ecmaVersion in the ESLint config to allow optional chaining, which is an easy way to reduce cognitive complexity. It's supported in Node 14 and above
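For readers following along, here is a minimal sketch of what "parsing the response body to look for a charset" could look like. It is not the code from this PR: the function names (`charsetFromContentType`, `charsetFromHtml`, `detectCharset`) and the 1024-byte scan window are assumptions made for illustration only.

```javascript
// Hypothetical helpers for illustration; not the actual implementation in lib/request.js.

// Reads the charset parameter from a Content-Type header value, e.g. "text/html; charset=UTF-8".
function charsetFromContentType(header) {
	const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(header || '');
	return match ? match[1].toLowerCase() : null;
}

// Looks for <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">.
function charsetFromHtml(html) {
	const match = /<meta[^>]*?charset\s*=\s*["']?([^\s"'/>;]+)/i.exec(html || '');
	return match ? match[1].toLowerCase() : null;
}

// Prefers the header and falls back to the start of the body.
function detectCharset(contentTypeHeader, bodyBuffer) {
	const fromHeader = charsetFromContentType(contentTypeHeader);
	if (fromHeader) {
		return fromHeader;
	}
	// latin1 maps each byte to exactly one character, so ASCII markup such as
	// the <meta> tag stays readable whatever the real encoding is.
	const head = bodyBuffer.subarray(0, 1024).toString('latin1');
	return charsetFromHtml(head);
}
```

Both helpers return `null` when nothing is found, so a caller can treat "no charset" uniformly.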

s0ph1e

First of all, thank you for creating a PR with this improvement 👍
I think we can merge it soon; please take a look at a few minor comments ⬇️

Encoding changes

The resource type can be set after this functionality has run; see

node-website-scraper/lib/scraper.js

Lines 179 to 182 in 07c4c02

    
           // if type was not determined by mime we can try to get it from filename after it was generated 
        
           if (!resource.getType()) { 
        
           	resource.setType(getTypeByFilename(filename)); 
        
           }

I'm not sure whether the two will work well together.

We now parse HTML and CSS content in lib/request.js, but CssHandler and HtmlHandler already exist and may be more appropriate places to manipulate the content. I suggest leaving everything related to the request/response (e.g. working with headers) in lib/request.js and moving file content parsing to the HTML and CSS handlers. Does that make sense to you?

About package-lock file

package-lock.json was intentionally added to .gitignore because it's only for developers & CI, and if package versions are locked with package-lock.json, it's harder to identify when something starts to break because of a nested dependency update during development.

Without package-lock.json we can be sure that the latest possible versions are installed, and we can rely on the nightly test to see whether something started to break after a dependency update.

Maybe there is a better way to always have up-to-date dependencies and check them on CI. I'll be happy to discuss it, but I suggest doing that in a separate PR.

s0ph1e · 2022-07-09T19:50:41Z

lib/request.js

+		return contentTypeHeader && contentTypeHeader.includes('utf-8') ? 'utf8' : 'binary';
+	}
+
+	return undefined;


Why do we return undefined from this function, but null from extractMimeTypeFromResponse? Maybe it would be better to return the same result (null or undefined) from both functions, because they seem to be very similar.

s0ph1e · 2022-07-09T19:53:00Z

lib/request.js

 	let body = null;
 	if (data instanceof Buffer) {
-		body = data.toString(encoding);
+		body = data.toString(encoding || 'binary');


Do we need || here? It looks like encoding will always be 'binary' even if it was not found in the headers or the file content.
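As a side note on the fallback in this hunk, the sketch below shows how the two calls differ when `encoding` is undefined. It relies only on documented Node.js behavior: `Buffer#toString` defaults to 'utf8' when no encoding is given, and 'binary' is an alias for 'latin1'. The sample string is an assumption chosen for illustration.

```javascript
// "à" encoded as UTF-8 occupies two bytes: 0xC3 0xA0.
const bytes = Buffer.from('à', 'utf8');

const encoding = undefined; // e.g. nothing found in headers or content

// Without the fallback, toString(undefined) decodes as UTF-8.
const withoutFallback = bytes.toString(encoding);

// With the fallback, each byte becomes one latin1 character.
const withFallback = bytes.toString(encoding || 'binary');

console.log(JSON.stringify(withoutFallback)); // the original one-character string
console.log(withFallback.length); // one character per byte
```

So whether the `||` is needed depends on whether an undefined `encoding` can actually reach this line, which is the question raised above.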

phawxby · 2022-07-11T07:03:49Z

Currently on holiday so can't address changes at the moment but yes that makes sense.

Regarding the package-lock, how about just updating the nightly job to install with --package-lock=false? That way you get install consistency as well as nightly checks.
https://docs.npmjs.com/cli/v8/using-npm/config#package-lock

marcfielding1 · 2022-07-14T13:08:00Z

Yeah this looks cool so for example the site html I'm scraping looks like this:

and I end up with:

Which is the difference between:

And this:

Which I'm assuming this would fix?

marcfielding1 · 2022-07-14T13:20:04Z

Of course I could just pull the branch and check, duh...

s0ph1e · 2022-07-14T14:35:25Z

@phawxby yes, --package-lock=false may work.

@marcfielding1 we expect some encoding issues to be fixed by this PR, especially when the encoding is set inside the HTML file in a tag. It would be nice if you could check whether this branch fixes the issue for you.

joelcapitao · 2022-08-01T14:17:58Z

FYI, I scraped https://tonclubtonmaillot.groupama.fr using this PR (I'm hosting the result at [1]), but got "�" instead of "à" on the index page. I'll try to troubleshoot it when I have a chance.

[1] https://test-node-website-scraper.netlify.app/
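One common way to end up with "�" in place of "à" is decoding single-byte (latin1-style) content as UTF-8. The sketch below reproduces that symptom in isolation; whether this is what happens for the site above is an assumption and has not been verified here.

```javascript
// In latin1, "à" is the single byte 0xE0.
const latin1Bytes = Buffer.from('à', 'latin1');

// 0xE0 on its own is an incomplete UTF-8 sequence, so decoding it as UTF-8
// yields the replacement character U+FFFD.
const decodedAsUtf8 = latin1Bytes.toString('utf8');

// Decoding with the encoding the bytes were written in restores the text.
const decodedAsLatin1 = latin1Bytes.toString('latin1');

console.log(decodedAsUtf8 === '\uFFFD');
console.log(decodedAsLatin1 === 'à');
```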

Jeremytijal · 2022-08-16T15:02:48Z

Sorry @phawxby, it's beyond my skills 😕

s0ph1e · 2022-08-29T19:59:59Z

I'm closing this PR because similar changes were merged in #504 and will be released in the next version within the next 1-2 days.

phawxby added 2 commits July 1, 2022 10:12

feat: add parsing of response body for encoding

0964586

fix: reduce complexity

473920c

s0ph1e reviewed Jul 9, 2022

s0ph1e mentioned this pull request Aug 29, 2022

Use encoding from resource text #504

Merged

s0ph1e closed this Aug 29, 2022
