Don't insist on answer component of URL #16

@opoudjis

crawler.py can be used to retrieve blog posts from Quora, not just answers. For that to work, the requirement that the fetched URL match quora.com/<question>/answer needs to be relaxed. The current code is:

# Get the part of the URL indicating the question title; we will save under this name
m1 = re.search('quora\.com/([^/]+)/answer', url)
# if there's a context topic
m2 = re.search('quora\.com/[^/]+/([^/]+)/answer', url)
filename = added_time + ' '
if not m1 is None:
    filename += m1.group(1)
elif not m2 is None:
    filename += m2.group(1)
else:
    print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
    continue

I changed the last two lines (the print and the continue in the else branch) to:

    # blog post
    m3 = re.search('quora\.com/([^/]+)', url)
    filename += m3.group(1)
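Note that the change above assumes every remaining URL matches the blog pattern; if it does not, m3 is None and m3.group(1) raises AttributeError. A minimal sketch of one way to keep the fallback while still guarding against that case is below. The function name url_slug, the constant names, and the example URLs are hypothetical and not part of crawler.py; the three patterns are the ones quoted above, written as raw strings.

```python
import re

# Patterns from the snippets above, tried in order of specificity.
ANSWER_RE = re.compile(r'quora\.com/([^/]+)/answer')              # question answer
TOPIC_ANSWER_RE = re.compile(r'quora\.com/[^/]+/([^/]+)/answer')  # answer with a context topic
BLOG_RE = re.compile(r'quora\.com/([^/]+)')                       # blog post (fallback)


def url_slug(url):
    """Return the URL part to use in the filename, or None if no pattern matches.

    Hypothetical helper, not part of crawler.py.
    """
    for pattern in (ANSWER_RE, TOPIC_ANSWER_RE, BLOG_RE):
        m = pattern.search(url)
        if m is not None:
            return m.group(1)
    return None


if __name__ == '__main__':
    # Made-up URLs, for illustration only.
    for u in ('https://www.quora.com/What-is-X/answer/Some-User',
              'https://someblog.quora.com/A-Post-Title',
              'https://example.org/page'):
        print(u, '->', url_slug(u))
```

In the crawler loop, the caller could then print the existing error and continue only when url_slug returns None, so URLs that are neither answers nor blog posts are still skipped rather than crashing the run.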
