Skip to content

Conversation

@phawxby
Copy link
Contributor

@phawxby phawxby commented Jul 1, 2022

Quickly threw this together, it should work in theory and close #500.

This is my last day before vacation so any additional work to get it merged in will need to be picked up by someone else. @Jeremytijal ?

Changes:

Copy link
Member

@s0ph1e s0ph1e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for creating a PR with this improvement 👍
I think we can merge it soon, please take a look on few minor comments ⬇️

  • Encoding changes

    Resource type can be set after this functionality is executed, see

    // if type was not determined by mime we can try to get it from filename after it was generated
    if (!resource.getType()) {
    resource.setType(getTypeByFilename(filename));
    }

    not sure if it will work fine together.

    Now we parse html and css content in lib/request.js, but there are CssHandler and HtmlHandler - they may be more appropriate places to manipulate the content. I suggest to leave everything related to request/response (e.g. working with headers) in lib/request.js and move file content parsing to html- and css handlers. Does it makes sense for you?

  • About package-lock file

    package-lock.json was intentionally added to .gitignore because it's only for developers & CI and if package versions are locked with package-lock.json it's harder to identify when something started to break because of nested dependency update during development.

    Without package-lock.json we can be sure that latest possible versions are installed and rely on nightly test to see if something started to break after dependency update.

    Maybe there is a better way to have always up-to-date dependencies and check it on CI. I will be happy to discuss it, but I suggest to do it in separate PR.

return contentTypeHeader && contentTypeHeader.includes('utf-8') ? 'utf8' : 'binary';
}

return undefined;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we return undefined from this functions, but null from extractMimeTypeFromResponse? Maybe it will be better to return same resule (null or undefined) from both functions because they seem to be very similar.

let body = null;
if (data instanceof Buffer) {
body = data.toString(encoding);
body = data.toString(encoding || 'binary');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need || here? It looks like encoding will be always 'binary' even if it was not found in headers or file content

@phawxby
Copy link
Contributor Author

phawxby commented Jul 11, 2022

Currently on holiday so can't address changes at the moment but yes that makes sense.

Regarding the package-lock, how about the nightly is just updated to install with --package-lock=false? You get install consistency with nightly checks.
https://docs.npmjs.com/cli/v8/using-npm/config#package-lock

@marcfielding1
Copy link

marcfielding1 commented Jul 14, 2022

Yeah this looks cool so for example the site html I'm scraping looks like this:

image

and I end up with:

image

Which is the difference between:

image

And this:

image

Which I'm assuming this would fix?

@marcfielding1
Copy link

Of course I could just pull the branch and check, duh...

@s0ph1e
Copy link
Member

s0ph1e commented Jul 14, 2022

@phawxby yes, --package-lock=false may work.

@marcfielding1 we expect some encoding issues to be fixed by this PR, especially when encoding is set inside html file in tag. But it would be nice if you check whether this branch fixes an issue for you

@joelcapitao
Copy link

FYI, I scrapped https://tonclubtonmaillot.groupama.fr (I'm hostingthe result in [1]) from this PR, but go "�" instead of "à" in index page. I'll try to troubleshoot it when I have a chance

[1] https://test-node-website-scraper.netlify.app/

@Jeremytijal
Copy link

sorry @phawxby it's out of my skills 😕

@s0ph1e
Copy link
Member

s0ph1e commented Aug 29, 2022

I'm closing this PR because similar changes were merged in #504 and will be released in the next version in the next 1-2 days

@s0ph1e s0ph1e closed this Aug 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

French characters "à" are not converted correctly

5 participants