forked from gousiosg/github-mirror
Setting up a mirroring cluster
gousiosg edited this page May 2, 2012 · 16 revisions
Github's API limit (currently 5000 requests per hour) makes it impossible to retrieve all data linked from events on a single node. For that reason, GHTorrent was designed to work in multiple phases in a distributed fashion. Depending on the data you want to collect, a cluster setup may be necessary.
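To get a rough feel for when a single node stops being enough, the back-of-the-envelope calculation below may help. It is only a sketch: the function name `nodes_needed`, the workload figures, and the assumption that every retrieval node gets its own full request budget are all hypothetical and not part of GHTorrent itself.

```python
import math

# Per-node request budget for one rate-limit window (the 5000 quoted above).
RATE_LIMIT_PER_WINDOW = 5000


def nodes_needed(events_per_window, requests_per_event,
                 rate_limit_per_window=RATE_LIMIT_PER_WINDOW):
    """Estimate how many retrieval nodes a workload needs.

    Assumes (hypothetically) that each node has an independent request
    budget and that every event triggers a fixed number of API requests.
    """
    total_requests = events_per_window * requests_per_event
    # Round up: a partially used node is still a node. Never return 0.
    return max(1, math.ceil(total_requests / rate_limit_per_window))


if __name__ == "__main__":
    # Hypothetical workload: 20000 events per window, 3 linked requests each.
    print(nodes_needed(events_per_window=20000, requests_per_event=3))
```

With these made-up numbers the workload needs 60000 requests per window, which is twelve times the budget of one node. Real workloads vary by event type, so treat the result as an order-of-magnitude estimate rather than a sizing rule.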
A full GHTorrent cluster consists of the following types of nodes:
- Event retrieval nodes: Nodes that query the public Github event API for new events. More than one instance is required to ensure both that no events are lost due to spikes in event generation and that machine or network malfunctions on an event collection machine do not affect the service.
- Linked data retrieval nodes: Retrieving the data linked from events is where Github's API imposes the most significant restrictions.
- MongoDB shards: A MongoDB installation can be sharded (have its data spread across multiple nodes) on a per-collection basis. Sharding MongoDB helps both with distributing the storage requirements and with faster querying.
- RabbitMQ active-active mirrors: RabbitMQ can work in cluster mode to