Content.Text empty despite response code OK and Content stream contains data #238

@seanarmstrong87

I am trying to crawl this page:

https://www.tzb-info.cz/kontakty

I pass it as validUri to the following code:

        // Request the page with a default crawl configuration.
        var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

        var crawledPage = await pageRequester.MakeRequestAsync(validUri).ConfigureAwait(false);

        // Log the requested URL and the numeric HTTP status code.
        Log.Logger.Information("{@Result}", new
        {
            url = crawledPage.Uri,
            status = (int)crawledPage.HttpResponseMessage.StatusCode
        });

        // This is empty for the page above, even though the status is 200.
        return crawledPage.Content.Text;

That website declares a less common charset, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

As a result, Content.Text is always empty even though the response status code indicates success.

If I try to read the response stream directly, I get this exception:

The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

If I manually change the CharSet on the response, I am then able to read the stream:

args.CrawledPage.HttpResponseMessage.Content.Headers.ContentType.CharSet = @"ISO-8859-1";

This is my workaround for now.
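
One thing worth noting about this workaround: iso-8859-1 and iso-8859-2 map some bytes to different characters, so Czech text may come out garbled. A minimal sketch of the difference follows, assuming the code runs on .NET Core or .NET 5+, where legacy code pages such as iso-8859-2 are, as far as I know, only available after registering CodePagesEncodingProvider (on older targets this needs the System.Text.Encoding.CodePages package). Latin2Demo and Decode are hypothetical names used only for illustration; they are not part of Abot.

```csharp
using System;
using System.Text;

public static class Latin2Demo
{
    // Hypothetical helper, not part of Abot: decodes raw bytes
    // using the encoding registered under the given charset name.
    public static string Decode(byte[] bytes, string charset) =>
        Encoding.GetEncoding(charset).GetString(bytes);

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+. Without this one-time
        // registration, Encoding.GetEncoding("iso-8859-2") throws.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] raw = { 0xF8 }; // the byte for "ř" in iso-8859-2

        Console.WriteLine(Decode(raw, "iso-8859-2")); // ř
        Console.WriteLine(Decode(raw, "iso-8859-1")); // ø (wrong for Czech text)
    }
}
```

If the same registration is done once at application startup, before any request is made, reading the response as a string should no longer reject the iso-8859-2 charset. I have not confirmed that this also fixes Content.Text.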

Is it a bug that the "iso-8859-2" charset is not being interpreted correctly? Or am I missing something in the configuration or setup that is needed to handle this charset?
