Don't insist on answer component of URL #16

crawler.py can be used to retrieve Quora blog posts, not just answers. For that to work, the requirement that the fetched URL contain an `/answer` component (`quora.com/<question>/answer/...`) needs to be relaxed. The current code is:

    # Get the part of the URL indicating the question title; we will save under this name
    m1 = re.search(r'quora\.com/([^/]+)/answer', url)
    # if there's a context topic
    m2 = re.search(r'quora\.com/[^/]+/([^/]+)/answer', url)
    filename = added_time + ' '
    if m1 is not None:
        filename += m1.group(1)
    elif m2 is not None:
        filename += m2.group(1)
    else:
        print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
        continue


I changed the last two lines (the error message and the `continue` in the `else` branch) to:

        # blog post: no /answer component, so take the first path segment
        m3 = re.search(r'quora\.com/([^/]+)', url)
        filename += m3.group(1)
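
One caveat with the change above: `m3` can still be `None` (for example, a URL with nothing after `quora.com/`), and `m3.group(1)` would then raise `AttributeError`. A minimal sketch of the same fallback with that case handled is below. The helper name `filename_stem` and the pattern list are hypothetical and not part of crawler.py; the three patterns are the ones quoted in this issue, tried in the same order.

```python
import re

# Patterns are tried in order; the first match supplies the name.
# These mirror m1, m2 and m3 from the issue text.
_PATTERNS = [
    re.compile(r'quora\.com/([^/]+)/answer'),        # <question>/answer/...
    re.compile(r'quora\.com/[^/]+/([^/]+)/answer'),  # <topic>/<question>/answer/...
    re.compile(r'quora\.com/([^/]+)'),               # fallback, e.g. a blog post
]


def filename_stem(url):
    """Return the URL segment to save under, or None if no pattern matches."""
    for pattern in _PATTERNS:
        m = pattern.search(url)
        if m is not None:
            return m.group(1)
    return None
```

In the crawler loop, a `None` result could keep the original behaviour of printing the `[ERROR]` message and hitting `continue`, so only URLs with no usable path segment are skipped.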

