diff --git a/codepile/github_issues/README.md b/codepile/github_issues/README.md
new file mode 100644
index 0000000..b091b5f
--- /dev/null
+++ b/codepile/github_issues/README.md
@@ -0,0 +1,53 @@

The primary data source is the BigQuery [githubarchive](https://www.gharchive.org/) dataset. Raw dumps can also be downloaded directly from https://www.gharchive.org/, but since the data is large (17 TB in BigQuery), downloading and analyzing it ourselves would be a significant effort. Starting with BigQuery is therefore a reasonable choice.

githubarchive contains event data, not a snapshot of GitHub: there will be multiple events (create, update, delete, etc.) for the same GitHub resource (issue, repo, etc.).

The BigQuery data has a top-level field called "type" holding the event type, which we use to filter down to the events we care about.

The events of interest are IssueCommentEvent and IssuesEvent; read more about them [here](https://docs.github.com/en/developers/webhooks-and-events/events/github-event-types). The documentation says the `payload.action` field can be "created", "edited" or "deleted", but BigQuery only seems to contain data for the "created" action. It is clarified [here](https://github.com/igrigorik/gharchive.org/issues/183) that edit events are not part of gharchive.

The GitHub APIs treat issues and pull requests in a similar manner ([ref](https://docs.github.com/en/rest/issues/issues)): `IssuesEvent` covers creation/close events for both issues and pull requests, and `IssueCommentEvent` covers comments on both. So we need to exclude the events that relate to pull requests.
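On the raw event payloads this distinction shows up as a `pull_request` key on the issue object, which is the same field the queries in docs/bigquery-queries.md test. A minimal sketch of the check, assuming `2015-01-01-12.json` is a hypothetical, already-decompressed hourly gharchive dump (one JSON event per line):

```python
import json

def is_pull_request(event: dict) -> bool:
    # Issues that are really pull requests carry a `pull_request` key
    # on payload.issue; plain issues do not.
    issue = event.get("payload", {}).get("issue", {})
    return "pull_request" in issue

# Hypothetical file name; gharchive ships one such file per hour.
with open("2015-01-01-12.json") as f:
    events = [json.loads(line) for line in f]

issue_comments = [
    e for e in events
    if e["type"] == "IssueCommentEvent" and not is_pull_request(e)
]
```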
The data format differs between the pre-2015 and later periods: pre-2015 records contain only the issue and comment ids, while the later records contain the title and body as well. So the content for the pre-2015 data has to be obtained by some other means.

`WatchEvent` can be used to rank repos by number of stars. The query below gets the list of repos and their star counts:
```
SELECT
  COUNT(*) naive_count,
  COUNT(DISTINCT actor.id) unique_by_actor_id,
  COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Note that the star counts from this query are only approximate. Read this [SO post](https://stackoverflow.com/questions/42918135/how-to-get-total-number-of-github-stars-for-a-given-repo-in-bigquery) to understand the nuances around star counts.

Data in BigQuery is organized in three ways (daily, monthly and yearly tables). Use the daily tables for exploration and testing, since BigQuery pricing is based on the amount of data scanned during query execution.

Issue and comment data was extracted from the monthly tables on 27-28 Oct 2022. The repo list was extracted from the daily tables on 30 Oct 2022.

#### Some stats
* Total repos extracted = ~25.7M
* Repos with >= 100 stars = ~324K
* post-2015 issues = ~85M
* pre-2015 issues = ~9.4M
* post-2015 issue comments = ~156M
* pre-2015 issue comments = ~17.7M
* filtered post-2015 issues = ~29.5M
* filtered pre-2015 issues = ~2.9M
* filtered post-2015 issue comments = ~100M
* filtered pre-2015 issue comments = ~11M

#### Other data sources explored
The [ghtorrent](https://ghtorrent.org/) project no longer seems to be active; its data only goes up to 2019, and even then only the issue ids can be obtained.
The BigQuery public dataset `github_repo` doesn't have data related to issues and comments. It only has code.

diff --git a/codepile/github_issues/docs/bigquery-queries.md b/codepile/github_issues/docs/bigquery-queries.md
new file mode 100644
index 0000000..71112f4
--- /dev/null
+++ b/codepile/github_issues/docs/bigquery-queries.md
@@ -0,0 +1,103 @@
#### Get information about different tables in the dataset
```
SELECT * from `.__TABLES__`
```

#### Sample events from a table
To explore the data in a table, use sampling instead of a normal query, since it is more cost-efficient:
```
SELECT * FROM `` TABLESAMPLE SYSTEM (.001 percent)
```

#### Get issues
```
SELECT json_payload.issue.id as issue_id, json_payload.issue.number as issue_no, issue_url, payload, repo, org, created_at, id, other FROM
  (SELECT
    id, SAFE.PARSE_JSON(payload) AS json_payload, JSON_VALUE(payload, '$.action') AS action, JSON_QUERY(payload, '$.issue.url') as issue_url, payload, repo, org, created_at, other
  FROM `githubarchive.month.20*`
  WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
  AND type = 'IssuesEvent'
  ) WHERE action = 'opened' AND issue_url IS NOT NULL
```

#### Get issue comments
```
SELECT issue_id, issue_no, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
  (SELECT
    id, JSON_VALUE(payload, '$.issue.id') AS issue_id, JSON_VALUE(payload, '$.issue.number') as issue_no, JSON_VALUE(payload, '$.comment.id') as comment_id, JSON_VALUE(payload, '$.comment.url') as comment_url, JSON_QUERY(payload, '$.issue.pull_request') as pull_request, payload, repo, org, created_at, other
  FROM `githubarchive.month.20*`
  WHERE _TABLE_SUFFIX BETWEEN '1412' and '2300'
  AND type = 'IssueCommentEvent'
  ) WHERE comment_url IS NOT NULL AND pull_request IS NULL
```

#### Get pre-2015 issues
```
SELECT tb1.json_payload.issue as issue_id, tb1.json_payload.number as issue_no, payload, repo, org, created_at, id, other FROM
(
  select SAFE.PARSE_JSON(payload) as json_payload, JSON_QUERY(payload, '$.action') as action, JSON_QUERY(payload, '$.issue.url') as issue_url, *
  from `githubarchive.year.201*`
  WHERE type = 'IssuesEvent'
  AND _TABLE_SUFFIX BETWEEN '1' and '5'
) tb1
WHERE tb1.action = '"opened"' AND tb1.issue_url IS NULL
```

#### Get pre-2015 issue comments
```
SELECT issue_id, comment_id, comment_url, payload, repo, org, created_at, id, other FROM
(
  select JSON_VALUE(payload, '$.comment_id') as comment_id, JSON_VALUE(payload, '$.issue_id') as issue_id, JSON_VALUE(other, '$.url') as comment_url, payload, repo, org, created_at, id, other
  from `githubarchive.month.201*`
  WHERE _TABLE_SUFFIX BETWEEN '000' AND '501' AND type = 'IssueCommentEvent'
) tb1
WHERE comment_id IS NOT NULL AND NOT CONTAINS_SUBSTR(tb1.comment_url, '/pull/')
LIMIT 100
```

#### Issues filtered
```
select stars, html_url as repo_url, issue_id, issue_no, title, body FROM
`.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, issue_url, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body from `.issues`) as t2 ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Issue comments filtered
```
select stars, html_url as repo_url, issue_id, issue_no, comment_id, title, body, comment, created_at FROM
`.100-star-repos` as t1 INNER JOIN
(select issue_id, issue_no, comment_id, repo, JSON_VALUE(payload, '$.issue.title') as title, JSON_VALUE(payload, '$.issue.body') as body, JSON_VALUE(payload, '$.comment.body') as comment, created_at from
`.issue-comments`) as t2
ON SUBSTR(t1.html_url, 20) = t2.repo.name
```

#### Star count per repo
```
SELECT
  COUNT(*) naive_count,
  COUNT(DISTINCT actor.id) unique_by_actor_id,
  COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```
Additional post-processing is done on the results of this query to unify the repo urls into a single format, since different events use different url formats (https://github.com, https://api.github.com etc.).
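A rough sketch of that post-processing, not the exact code used for this dataset; it assumes the query results were exported to a CSV with `repo_url` and `unique_by_actor_login` columns:

```python
import re

import pandas as pd

df = pd.read_csv("watch-event-counts.csv")  # hypothetical export of the query above

def normalize_repo_url(url: str) -> str:
    # https://api.github.com/repos/<owner>/<name>, https://api.github.dev/repos/<owner>/<name>
    # and https://github.com/<owner>/<name> all map to https://github.com/<owner>/<name>
    m = re.search(r"(?:api\.github\.(?:com|dev)/repos|github\.com)/([^/]+/[^/]+)", url)
    return "https://github.com/" + m.group(1) if m else url

df["html_url"] = df["repo_url"].map(normalize_repo_url)
stars = df.groupby("html_url")["unique_by_actor_login"].sum().rename("stars")
repo_list = stars[stars >= 100].reset_index()  # roughly the `100-star-repos` table
```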
#### pre-2015 issues filtered by 100-star repos
```
select t1.stars as repo_stars, t1.html_url as repo_url, t2.issue_id, t2.issue_no
  FROM `.100-star-repos` as t1
  INNER JOIN `.pre-2015-issues` as t2
  ON t1.html_url = t2.repo.url
```

#### pre-2015 issue comments filtered by 100-star repo list
```
select t1.html_url as repo_url, t1.stars as repo_stars, t2.issue_id, t2.comment_id, t2.comment_url
  FROM `.100-star-repos` as t1
  INNER JOIN `.pre-2015-issue-comments` as t2
  ON t1.html_url = t2.repo.url
```

diff --git a/codepile/github_issues/docs/repo-stars.md b/codepile/github_issues/docs/repo-stars.md
new file mode 100644
index 0000000..39b403b
--- /dev/null
+++ b/codepile/github_issues/docs/repo-stars.md
@@ -0,0 +1,16 @@
Get the number of stars per repo:
```
SELECT
  COUNT(*) naive_count,
  COUNT(DISTINCT actor.id) unique_by_actor_id,
  COUNT(DISTINCT actor.login) unique_by_actor_login, repo.id, repo.url
FROM `githubarchive.day.2*`
where type = 'WatchEvent'
GROUP BY repo.id, repo.url
```

Some of the events don't contain repo.id and some don't have actor.id, so `unique_by_actor_login` is the most accurate of the counts.

repo.url takes different formats over the time period: some urls look like https://api.github.com/repos/... while others look like https://api.github.dev/repos/...

The BigQuery result is post-processed to bring the repo urls into a single format and then sum the stars per normalized url. That list is then filtered down to the repos that have >= 100 stars.
\ No newline at end of file
diff --git a/codepile/github_issues/gh_graphql/README.md b/codepile/github_issues/gh_graphql/README.md
new file mode 100644
index 0000000..bc7553e
--- /dev/null
+++ b/codepile/github_issues/gh_graphql/README.md
@@ -0,0 +1,7 @@
* This Scrapy crawler is not a production-grade implementation and may not follow best practices.
* It uses the GitHub GraphQL endpoint to fetch data. The API lets us get up to 100 issues along with other metadata (labels, comments, author, etc.) in a single request.
* When fetching issues + comments + other metadata, the API returns "secondary rate limit" errors even when we haven't hit the 5k requests/hour limit. It is not entirely clear how to mitigate this.
* GraphQL requests and responses can be explored at https://docs.github.com/en/graphql/overview/explorer (a minimal request sketch follows below).
* Refer to the standalone `extract-*` scripts to convert the raw GraphQL responses into a flat list of issues/comments.
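For reference, a minimal sketch of a single request against the GraphQL endpoint outside of Scrapy. It assumes a personal access token in a `GITHUB_TOKEN` environment variable and uses `pytorch/pytorch` purely as an example repository; `issues.graphql` is the query file in this directory.

```python
import json
import os

import requests  # third-party: pip install requests

query = open("issues.graphql").read()
variables = {"repo_owner": "pytorch", "repo_name": "pytorch", "page_size": 100}

resp = requests.post(
    "https://api.github.com/graphql",
    headers={
        "Authorization": "token " + os.environ["GITHUB_TOKEN"],
        "User-Agent": "codepile-github-issues",  # GitHub requires a User-Agent header
    },
    data=json.dumps({"query": query, "variables": variables}),
    timeout=30,
)
data = resp.json()
issues = data["data"]["repository"]["issues"]
print(data["data"]["rateLimit"])        # limit / cost / remaining / resetAt
print(issues["pageInfo"]["endCursor"])  # pass as after_cursor to fetch the next page
```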
+ +Run scrapy spider using a command like this `python run.py 0|1|2...` \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/comments.graphql b/codepile/github_issues/gh_graphql/comments.graphql new file mode 100644 index 0000000..1667f5d --- /dev/null +++ b/codepile/github_issues/gh_graphql/comments.graphql @@ -0,0 +1,46 @@ +query($repo_owner: String!, $repo_name: String!, $page_size: Int!, $after_cursor: String) { + repository(owner: $repo_owner, name: $repo_name) { + issues(first: $page_size, after: $after_cursor) { + pageInfo { + endCursor + hasNextPage + }, + edges { + node { + number, + databaseId, + createdAt, + comments(first: 100) { + pageInfo { + hasNextPage, + endCursor + } + nodes { + databaseId + authorAssociation, + author { + login, + avatarUrl, + __typename + } + body + reactionGroups { + content, + reactors { + totalCount + } + } + }, + totalCount + } + } + } + } + }, + rateLimit { + limit + cost + remaining + resetAt + } +} \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/extract-comments.py b/codepile/github_issues/gh_graphql/extract-comments.py new file mode 100644 index 0000000..057e267 --- /dev/null +++ b/codepile/github_issues/gh_graphql/extract-comments.py @@ -0,0 +1,39 @@ +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode, col, filter, size, transform + +spark_dir = "/data/tmp/" +spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate() +df = spark.read.json("/data/comments-*.jsonl") + +df2 = df.select(["data.repository.issues.pageInfo.hasNextPage", explode("data.repository.issues.edges").alias("issue")]) +df3 = df2.select([ + col("issue.node.number").alias("issue_no"), + col("issue.node.databaseId").alias("issue_id"), + col("issue.node.createdAt").alias("issue_created_at"), + col("issue.node.comments.pageInfo.hasNextPage").alias("has_more_comments"), + col("issue.node.comments.pageInfo.endCursor").alias("next_comments_cursor"), + explode("issue.node.comments.nodes").alias("comment") +]) + +def filter_reactions(x): + return x.reactors.totalCount > 0 + +def transform_reactions(x): + print(x) + return {x.content: x.reactors.totalCount} + +df4 = df3.select([ + "issue_no", + "issue_id", + "issue_created_at", + "has_more_comments", + "next_comments_cursor", + "comment.databaseId", + "comment.authorAssociation", + col("comment.author.login").alias("comment_author"), + col("comment.body").alias("comment_body"), + filter("comment.reactionGroups", filter_reactions) + .alias("reaction_groups") +]).dropDuplicates(["databaseId"]) + +df4.write.parquet("/data/comments") \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/extract-issues-small.py b/codepile/github_issues/gh_graphql/extract-issues-small.py new file mode 100644 index 0000000..cc06994 --- /dev/null +++ b/codepile/github_issues/gh_graphql/extract-issues-small.py @@ -0,0 +1,33 @@ +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode, col, filter, size, transform + +spark_dir = "/data/tmp/" +spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate() + +def labels_transformer(x): + return {"name": x.node.name, "description": 
x.node.description} + +def filter_reactions(x): + return x.reactors.totalCount > 0 + +df = spark.read.json("/data/issues-*.jsonl") + +# separate issues into their own rows +df2 = df.select([ + col("data.repository.databaseId").alias("repo_id"), + col("data.repository.nameWithOwner").alias("repo_name_with_owner"), + explode("data.repository.issues.edges").alias("issue") +]) + +#extract, clean issue metadata +df3 = df2.select([ + "repo_id", + "repo_name_with_owner", + col("issue.node.number").alias("issue_no"), + col("issue.node.databaseId").alias("issue_id"), + col("issue.node.createdAt").alias("issue_created_at"), + col("issue.node.title").alias("title"), + col("issue.node.body").alias("body") +]).dropDuplicates(["issue_id"]) +df3.write.parquet("/data/issues-lite") +print(df3.count()) \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/extract-issues.py b/codepile/github_issues/gh_graphql/extract-issues.py new file mode 100644 index 0000000..d16c2cb --- /dev/null +++ b/codepile/github_issues/gh_graphql/extract-issues.py @@ -0,0 +1,49 @@ +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode, col, filter, size, transform + +spark_dir = "/data/tmp/" +spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "8G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate() + +def labels_transformer(x): + return {"name": x.node.name, "description": x.node.description} + +def filter_reactions(x): + return x.reactors.totalCount > 0 + +df = spark.read.json("/data/issues.jsonl") + +# separate issues into their own rows +df2 = df.select([ + col("data.repository.databaseId").alias("repo_id"), + col("data.repository.nameWithOwner").alias("repo_name_with_owner"), + col("data.repository.stargazerCount").alias("star_count"), + col("data.repository.description").alias("repo_description"), + col("data.repository.languages.edges").alias("languages"), + "data.repository.issues.pageInfo.hasNextPage", + col("data.repository.issues.totalCount").alias("issue_count"), + explode("data.repository.issues.edges").alias("issue") +]) + +#extract, clean issue metadata +df3 = df2.select([ + "repo_id", + "repo_name_with_owner", + "star_count", + "repo_description", + "languages", + "issue_count", + col("issue.node.number").alias("issue_no"), + col("issue.node.databaseId").alias("issue_id"), + col("issue.node.createdAt").alias("issue_created_at"), + col("issue.node.title").alias("title"), + col("issue.node.author.login").alias("author"), + col("issue.node.author.avatarUrl").alias("author_avatar"), + col("issue.node.author.__typename").alias("author_type"), + col("issue.node.authorAssociation").alias("author_association"), + col("issue.node.comments.totalCount").alias("comment_count"), + col("issue.node.labels.edges").alias("labels"), + filter("issue.node.reactionGroups", filter_reactions) + .alias("reaction_groups") +]).dropDuplicates(["issue_id"]) +df3.write.parquet("/data/issues") +print(df3.count()) \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/gh_graphql/__init__.py b/codepile/github_issues/gh_graphql/gh_graphql/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/codepile/github_issues/gh_graphql/gh_graphql/items.py b/codepile/github_issues/gh_graphql/gh_graphql/items.py new file mode 100644 index 0000000..8d8078d --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/items.py @@ -0,0 +1,12 @@ +# 
Define here the models for your scraped items +# +# See documentation in: +# https://docs.scrapy.org/en/latest/topics/items.html + +import scrapy + + +class GhGraphqlItem(scrapy.Item): + # define the fields for your item here like: + # name = scrapy.Field() + pass diff --git a/codepile/github_issues/gh_graphql/gh_graphql/middlewares.py b/codepile/github_issues/gh_graphql/gh_graphql/middlewares.py new file mode 100644 index 0000000..9422aef --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/middlewares.py @@ -0,0 +1,104 @@ +# Define here the models for your spider middleware +# +# See documentation in: +# https://docs.scrapy.org/en/latest/topics/spider-middleware.html + +from scrapy import signals + +# useful for handling different item types with a single interface +from itemadapter import is_item, ItemAdapter + + +class GhGraphqlSpiderMiddleware: + # Not all methods need to be defined. If a method is not defined, + # scrapy acts as if the spider middleware does not modify the + # passed objects. + + @classmethod + def from_crawler(cls, crawler): + # This method is used by Scrapy to create your spiders. + s = cls() + crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) + return s + + def process_spider_input(self, response, spider): + # Called for each response that goes through the spider + # middleware and into the spider. + + # Should return None or raise an exception. + return None + + def process_spider_output(self, response, result, spider): + # Called with the results returned from the Spider, after + # it has processed the response. + + # Must return an iterable of Request, or item objects. + for i in result: + yield i + + def process_spider_exception(self, response, exception, spider): + # Called when a spider or process_spider_input() method + # (from other spider middleware) raises an exception. + + # Should return either None or an iterable of Request or item objects. + print(response.body) + pass + + def process_start_requests(self, start_requests, spider): + # Called with the start requests of the spider, and works + # similarly to the process_spider_output() method, except + # that it doesn’t have a response associated. + + # Must return only requests (not items). + for r in start_requests: + yield r + + def spider_opened(self, spider): + spider.logger.info('Spider opened: %s' % spider.name) + + +class GhGraphqlDownloaderMiddleware: + # Not all methods need to be defined. If a method is not defined, + # scrapy acts as if the downloader middleware does not modify the + # passed objects. + + @classmethod + def from_crawler(cls, crawler): + # This method is used by Scrapy to create your spiders. + s = cls() + crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) + return s + + def process_request(self, request, spider): + # Called for each request that goes through the downloader + # middleware. + + # Must either: + # - return None: continue processing this request + # - or return a Response object + # - or return a Request object + # - or raise IgnoreRequest: process_exception() methods of + # installed downloader middleware will be called + return None + + def process_response(self, request, response, spider): + # Called with the response returned from the downloader. 
+ + # Must either; + # - return a Response object + # - return a Request object + # - or raise IgnoreRequest + return response + + def process_exception(self, request, exception, spider): + # Called when a download handler or a process_request() + # (from other downloader middleware) raises an exception. + + # Must either: + # - return None: continue processing this exception + # - return a Response object: stops process_exception() chain + # - return a Request object: stops process_exception() chain + pass + + def spider_opened(self, spider): + spider.logger.info('Spider opened: %s' % spider.name) diff --git a/codepile/github_issues/gh_graphql/gh_graphql/pipelines.py b/codepile/github_issues/gh_graphql/gh_graphql/pipelines.py new file mode 100644 index 0000000..09c9b8c --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/pipelines.py @@ -0,0 +1,13 @@ +# Define your item pipelines here +# +# Don't forget to add your pipeline to the ITEM_PIPELINES setting +# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html + + +# useful for handling different item types with a single interface +from itemadapter import ItemAdapter + + +class GhGraphqlPipeline: + def process_item(self, item, spider): + return item diff --git a/codepile/github_issues/gh_graphql/gh_graphql/settings.py b/codepile/github_issues/gh_graphql/gh_graphql/settings.py new file mode 100644 index 0000000..9ac1e94 --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/settings.py @@ -0,0 +1,92 @@ +# Scrapy settings for gh_graphql project +# +# For simplicity, this file contains only settings considered important or +# commonly used. You can find more settings consulting the documentation: +# +# https://docs.scrapy.org/en/latest/topics/settings.html +# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html +# https://docs.scrapy.org/en/latest/topics/spider-middleware.html + +BOT_NAME = 'gh_graphql' + +SPIDER_MODULES = ['gh_graphql.spiders'] +NEWSPIDER_MODULE = 'gh_graphql.spiders' + + +# Crawl responsibly by identifying yourself (and your website) on the user-agent +#USER_AGENT = 'gh_graphql (+http://www.yourdomain.com)' + +# Obey robots.txt rules +ROBOTSTXT_OBEY = True + +# Configure maximum concurrent requests performed by Scrapy (default: 16) +#CONCURRENT_REQUESTS = 32 + +# Configure a delay for requests for the same website (default: 0) +# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay +# See also autothrottle settings and docs +#DOWNLOAD_DELAY = 3 +# The download delay setting will honor only one of: +#CONCURRENT_REQUESTS_PER_DOMAIN = 16 +#CONCURRENT_REQUESTS_PER_IP = 16 + +# Disable cookies (enabled by default) +#COOKIES_ENABLED = False + +# Disable Telnet Console (enabled by default) +#TELNETCONSOLE_ENABLED = False + +# Override the default request headers: +#DEFAULT_REQUEST_HEADERS = { +# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', +# 'Accept-Language': 'en', +#} + +# Enable or disable spider middlewares +# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html +#SPIDER_MIDDLEWARES = { +# 'gh_graphql.middlewares.GhGraphqlSpiderMiddleware': 543, +#} + +# Enable or disable downloader middlewares +# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html +#DOWNLOADER_MIDDLEWARES = { +# 'gh_graphql.middlewares.GhGraphqlDownloaderMiddleware': 543, +#} + +# Enable or disable extensions +# See https://docs.scrapy.org/en/latest/topics/extensions.html +#EXTENSIONS = { +# 
'scrapy.extensions.telnet.TelnetConsole': None, +#} + +# Configure item pipelines +# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html +#ITEM_PIPELINES = { +# 'gh_graphql.pipelines.GhGraphqlPipeline': 300, +#} + +# Enable and configure the AutoThrottle extension (disabled by default) +# See https://docs.scrapy.org/en/latest/topics/autothrottle.html +#AUTOTHROTTLE_ENABLED = True +# The initial download delay +#AUTOTHROTTLE_START_DELAY = 5 +# The maximum download delay to be set in case of high latencies +#AUTOTHROTTLE_MAX_DELAY = 60 +# The average number of requests Scrapy should be sending in parallel to +# each remote server +#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 +# Enable showing throttling stats for every response received: +#AUTOTHROTTLE_DEBUG = False + +# Enable and configure HTTP caching (disabled by default) +# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings +#HTTPCACHE_ENABLED = True +#HTTPCACHE_EXPIRATION_SECS = 0 +#HTTPCACHE_DIR = 'httpcache' +#HTTPCACHE_IGNORE_HTTP_CODES = [] +#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' + +# Set settings whose default value is deprecated to a future-proof value +REQUEST_FINGERPRINTER_IMPLEMENTATION = '2.7' +TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' diff --git a/codepile/github_issues/gh_graphql/gh_graphql/spiders/__init__.py b/codepile/github_issues/gh_graphql/gh_graphql/spiders/__init__.py new file mode 100644 index 0000000..ebd689a --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/spiders/__init__.py @@ -0,0 +1,4 @@ +# This package will contain the spiders of your Scrapy project +# +# Please refer to the documentation for information on how to create and manage +# your spiders. diff --git a/codepile/github_issues/gh_graphql/gh_graphql/spiders/issues.py b/codepile/github_issues/gh_graphql/gh_graphql/spiders/issues.py new file mode 100644 index 0000000..5b955c2 --- /dev/null +++ b/codepile/github_issues/gh_graphql/gh_graphql/spiders/issues.py @@ -0,0 +1,191 @@ +from scrapy.http import Request +import scrapy +import pandas as pd +from random import randint +import json +import os, sys +import hashlib +import time + +# Loads a list of repos and makes requests to Github GraphQL API to get data +# Does pagination of top level issues +# keeps track of the information on what has been already fetched in json files so that scraping can be resumed (TODO: there is probably a better, scrapy native way of tracking and resuming) +# Multiple github tokens can be configured and they can be selected when launching scrapy via run.py. Originally, it was implemented in such a way that tokens are selected automatically, but because of the way scrapy schedules the requests there is a chance that same token can be used concurrently across two reqeusts which could lead to some rate limiting. 
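# Note on the TODO above (an assumption, not part of the original design): Scrapy's
# built-in job persistence (the JOBDIR setting, e.g. `scrapy crawl issues -s JOBDIR=crawls/issues-0`)
# is the framework-native way to pause and resume a crawl. The manual JSON tracking
# below may still be useful because it also records the GraphQL end cursor, which is
# needed to resume pagination within a repository.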
+ +class IssuesSpider(scrapy.Spider): + custom_settings = { + 'CONCURRENT_ITEMS': 1, + 'CONCURRENT_REQUESTS': 1, + 'DOWNLOAD_DELAY': .7 + } + name = 'issues' + allowed_domains = ['api.github.com'] + start_urls = ['http://api.github.com/'] + repo_list = "/data/github-issues_pre-2015-issues-without-content.parquet" + graphql_query_path = "./issues.graphql" + issue_query_template = open(graphql_query_path).read() + graphql_api_url = "https://api.github.com/graphql" + github_access_tokens = [ + {"token": "abcd", "id": "username"}, + {"token": "efgh", "id": "username2"} + ] + output_path = f"/data/{name}-{sys.argv[1]}.jsonl" + track_file = f"/data/{name}-track-{sys.argv[1]}.jsonl" + error_file = f"/data/{name}-error-{sys.argv[1]}.jsonl" + + def __init__(self, *a, **kw): + super().__init__(*a, **kw) + self.prepare() + + def prepare(self): + self.output_writer = open(self.output_path,'a+') + self.track_writer = open(self.track_file, 'a+') + self.error_writer = open(self.error_file, "a+") + self.finished_in_previous_runs = {} + with open(self.track_file, 'r') as track_fo: + tracked_list = track_fo.readlines() + for item in tracked_list: + item_o = json.loads(item) + track_item = { + "owner": item_o['owner'], + "name": item_o['name'], + "page": item_o['page'] + } + track_hash = self.get_hash_for_dict(track_item) + self.finished_in_previous_runs[track_hash] = item_o + + def get_hash_for_dict(self, d): + dhash = hashlib.md5() + encoded = json.dumps(d, sort_keys=True).encode() + dhash.update(encoded) + return dhash.hexdigest() + + def start_requests(self): + print(sys.argv[1]) + df = pd.read_parquet(self.repo_list) + df.sort_values(by="repo_stars", ascending=False, inplace=True) + df.reset_index(level=0, inplace=True) + no_of_tokens = len(self.github_access_tokens) + shard = int(sys.argv[1]) + for index, url in df['repo_url'][shard::no_of_tokens].items(): + repo_part = url.replace("https://github.com/", "") + if len(repo_part.split("/")) != 2: + print(repo_part) + repo_owner, repo_name = repo_part.split("/") + variables = { + "repo_owner": repo_owner, + "repo_name": repo_name, + "page_size": 100 + } + req_body = { + "query": self.issue_query_template, + "variables": variables + } + meta = {"variables": variables, "page": 0, "referrer_policy": "no-referrer"} + # remainder = index % len(self.github_access_tokens) + meta['token_idx'] = shard + request = self.get_request(req_body, meta) + if request: + yield request + + def parse(self, response): + res_json = response.json() + if "errors" in res_json and len(res_json['errors']) > 0: + print("Graphql errors found") + print(res_json['errors']) + raise Exception("graphql error") + + + self.output_writer.write(response.text + "\n") + track_content = { + "owner": response.meta.get("variables")['repo_owner'], + "name": response.meta.get("variables")['repo_name'], + "page": response.meta.get("page") + } + + issues = res_json['data']['repository']['issues'] + if issues['pageInfo']['hasNextPage']: + end_cursor = issues['pageInfo']['endCursor'] + track_content["next_cursor"] = end_cursor + next_request = self.get_next_page_request(response.meta, end_cursor) + if next_request: + yield next_request + self.track_writer.write(json.dumps(track_content) + "\n") + + def get_request(self, req_body, meta, nested_call=False): + if not nested_call: + track_item = { + "owner": meta.get("variables")['repo_owner'], + "name": meta.get("variables")['repo_name'], + "page": meta.get("page") + } + track_hash = self.get_hash_for_dict(track_item) + while track_hash in 
self.finished_in_previous_runs: + print(f"Already fetched in previous run: {track_item}") + stored_track_item = self.finished_in_previous_runs[track_hash] + if "next_cursor" in stored_track_item: + # check if next page is already scraped and proceed accordingly + print(f"next page available: {stored_track_item}") + track_item['page'] = track_item['page'] + 1 + track_hash = self.get_hash_for_dict(track_item) + req_body['variables']['after_cursor'] = stored_track_item['next_cursor'] + meta['variables'] = req_body['variables'] + meta['page'] = track_item['page'] + else: + return None + + headers = self.get_req_headers(meta) + print(json.dumps(meta)) + return Request(method="POST", url=self.graphql_api_url, body=json.dumps(req_body), headers=headers, meta=meta, errback=self.error_callback) + + def get_next_page_request(self, meta, next_page_cursor): + variables = meta.get("variables").copy() + new_meta = {"variables": variables, "referrer_policy": "no-referrer", "token_idx": meta.get("token_idx")} + new_meta["page"] = meta.get("page") + 1 + variables['after_cursor'] = next_page_cursor + req_body = { + "query": self.issue_query_template, + "variables": variables + } + return self.get_request(req_body, new_meta, nested_call=True) + + def get_req_headers(self, meta): + token_idx = meta.get("token_idx") + if token_idx: + token = self.github_access_tokens[token_idx]['token'] + user_id = self.github_access_tokens[token_idx]['id'] + else: + rand_idx = randint(0, len(self.github_access_tokens) - 1) + token = self.github_access_tokens[rand_idx]['token'] + user_id = self.github_access_tokens[rand_idx]['id'] + headers = { + "Authorization": "token " + token, + "User-Agent": user_id + } + return headers + + def close(self, spider, reason): + self.output_writer.close() + self.track_writer.close() + self.error_writer.close() + super().close(spider, reason) + + def error_callback(self, error): + print(f"Request failed: {error.value.response}") + if int(error.value.response.status) == 403: + # TODO: This probably isn't working as expected + time.sleep(1) + no_of_retries = error.value.response.request.meta.get("retries", 0) + if no_of_retries < 5: + request_o = error.value.response.request + request_o.meta['retries'] = no_of_retries + 1 + print(f"retrying request: {error.value.response.request.meta}") + yield error.value.response.request + failed_reason = error.value.response.text + failed_variables = json.loads(error.value.response.request.body)['variables'] + failed_variables["reason"] = failed_reason + failed_variables['page'] = error.value.response.request.meta.get("page") + v_str = json.dumps(failed_variables) + "\n" + self.error_writer.write(v_str) + print(json.dumps(failed_reason)) + return None diff --git a/codepile/github_issues/gh_graphql/issue-comments-join.py b/codepile/github_issues/gh_graphql/issue-comments-join.py new file mode 100644 index 0000000..56cdd05 --- /dev/null +++ b/codepile/github_issues/gh_graphql/issue-comments-join.py @@ -0,0 +1,24 @@ +from pyspark.sql import SparkSession +from pyspark.sql.functions import explode, col, filter, size, transform, lit, create_map, collect_list, to_json +spark_dir = "/tmp/" +spark = SparkSession.builder.config("spark.worker.cleanup.enabled", "true").config("spark.local.dir", spark_dir).config("spark.driver.memory", "24G").config("spark.executor.cores", 10).master("local[16]").appName('spark-stats').getOrCreate() +issues = spark.read.parquet("/fsx/shared/codepile/github_issues/github-issues-all-filtered/") +comments_unfiltered = 
spark.read.parquet("/fsx/shared/codepile/github_issues/github-issues-comment-all/") + +comments = comments_unfiltered.orderBy("event_created_at", ascending=False).dropDuplicates(["comment_id"]) + +def create_map_args(df): + args = [] + for c in df.columns: + args.append(lit(c)) + args.append(col(c)) + return args + +comments_dicted = comments.withColumn("dict", + create_map(create_map_args(comments.select(["comment_id", "comment"]))) + ).select(["issue_id", "issue_no", "dict"]) + +comments_grouped = comments_dicted.groupby(["issue_id", "issue_no"]).agg(collect_list("dict").alias("comments")) + +print("Adding comments to issues") +issues_joined = issues.join(comments_grouped, issues.issue_id == comments_grouped.issue_id, "left").select(issues["*"], to_json(comments_grouped["comments"]).alias("comments")) \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/issues-full.graphql b/codepile/github_issues/gh_graphql/issues-full.graphql new file mode 100644 index 0000000..5b5969e --- /dev/null +++ b/codepile/github_issues/gh_graphql/issues-full.graphql @@ -0,0 +1,83 @@ +query($repo_owner: String!, $repo_name: String!, $page_size: Int!, $after_cursor: String) { + repository(owner: $repo_owner, name: $repo_name) { + databaseId, + nameWithOwner, + stargazerCount, + description, + languages(first: 100) { + edges { + node { + name + } + } + }, + issues(first: $page_size, after: $after_cursor) { + pageInfo { + endCursor + hasNextPage + }, + totalCount, + edges { + node { + number, + databaseId, + createdAt, + title, + body, + author { + login, + avatarUrl + __typename + }, + authorAssociation + labels(first: 100) { + edges { + node { + name, + description + + } + } + }, + reactionGroups { + content + reactors { + totalCount + } + + }, + comments(first: 2) { + pageInfo { + hasNextPage, + endCursor + } + nodes { + databaseId + authorAssociation, + author { + login, + avatarUrl, + __typename + } + body + reactionGroups { + content, + reactors { + totalCount + } + } + }, + totalCount + } + } + + } + } + }, + rateLimit { + limit + cost + remaining + resetAt + } +} \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/issues.graphql b/codepile/github_issues/gh_graphql/issues.graphql new file mode 100644 index 0000000..26c09b1 --- /dev/null +++ b/codepile/github_issues/gh_graphql/issues.graphql @@ -0,0 +1,27 @@ +query($repo_owner: String!, $repo_name: String!, $page_size: Int!, $after_cursor: String) { + repository(owner: $repo_owner, name: $repo_name) { + databaseId, + nameWithOwner, + issues(first: $page_size, after: $after_cursor) { + pageInfo { + endCursor + hasNextPage + }, + edges { + node { + number, + databaseId, + createdAt, + title, + body + } + } + } + }, + rateLimit { + limit + cost + remaining + resetAt + } +} \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/run.py b/codepile/github_issues/gh_graphql/run.py new file mode 100644 index 0000000..fb5acd5 --- /dev/null +++ b/codepile/github_issues/gh_graphql/run.py @@ -0,0 +1,10 @@ +from scrapy.crawler import CrawlerProcess + +from gh_graphql.spiders import issues + +process = CrawlerProcess({ + +}) + +process.crawl(issues.IssuesSpider) +process.start() # \ No newline at end of file diff --git a/codepile/github_issues/gh_graphql/scrapy.cfg b/codepile/github_issues/gh_graphql/scrapy.cfg new file mode 100644 index 0000000..c525bb6 --- /dev/null +++ b/codepile/github_issues/gh_graphql/scrapy.cfg @@ -0,0 +1,11 @@ +# Automatically created by: scrapy startproject +# +# For more information 
about the [deploy] section see: +# https://scrapyd.readthedocs.io/en/latest/deploy.html + +[settings] +default = gh_graphql.settings + +[deploy] +#url = http://localhost:6800/ +project = gh_graphql