Commit 0c02488

jeffhostetler authored and dscho committed
survey: expanded TODO list at the bottom of the source file
Signed-off-by: Jeff Hostetler <[email protected]>
1 parent 5d11a62 commit 0c02488

1 file changed: +116 -32 lines changed


builtin/survey.c

Lines changed: 116 additions & 32 deletions
@@ -1687,47 +1687,131 @@ int cmd_survey(int argc, const char **argv, const char *prefix, struct repositor
 }

 /*
- * NEEDSWORK: The following is a bit of a laundry list of things
- * that I'd like to add.
+ * NEEDSWORK: So far, I only have iteration on the requested set of
+ * refs and treewalk/reachable objects on that set of refs. The
+ * following is a bit of a laundry list of things that I'd like to
+ * add.
  *
  * [] Dump stats on all of the packfiles. The number and size of each.
- * Whether each is in the .git directory or in an alternate. The state
- * of the IDX or MIDX files and etc. Delta chain stats. All of this
- * data is relative to the "lived-in" state of the repository. Stuff
- * that may change after a GC or repack.
+ * Whether each is in the .git directory or in an alternate. The
+ * state of the IDX or MIDX files and etc. Delta chain stats. All
+ * of this data is relative to the "lived-in" state of the
+ * repository. Stuff that may change after a GC or repack.
+ *
+ * [] Clone and Index stats. partial, shallow, sparse-checkout,
+ * sparse-index, etc. Hydration stats.
  *
  * [] Dump stats on each remote. When we fetch from a remote the size
- * of the response is related to the set of haves on the server. You
- * can see this in `GIT_TRACE_CURL=1 git fetch`. We get a `ls-refs`
- * payload that lists all of the branches and tags on the server, so
- * at a minimum the RefName and SHA for each. But for annotated tags
- * we also get the peeled SHA. The size of this overhead on every
- * fetch is proporational to the size of the `git ls-remote` response
- * (roughly, although the latter repeats the RefName of the peeled
- * tag). If, for example, you have 500K refs on a remote, you're
- * going to have a long "haves" message, so every fetch will be slow
- * just because of that overhead (not counting new objects to be
- * downloaded).
+ * of the response is related to the set of haves on the server.
+ * You can see this in `GIT_TRACE_CURL=1 git fetch`. We get a
+ * `ls-refs` payload that lists all of the branches and tags on the
+ * server, so at a minimum the RefName and SHA for each. But for
+ * annotated tags we also get the peeled SHA. The size of this
+ * overhead on every fetch is proporational to the size of the `git
+ * ls-remote` response (roughly, although the latter repeats the
+ * RefName of the peeled tag). If, for example, you have 500K refs
+ * on a remote, you're going to have a long "haves" message, so
+ * every fetch will be slow just because of that overhead (not
+ * counting new objects to be downloaded).
  *
- * Note that the local set of tags in "refs/tags/" is a union over all
- * remotes. However, since most people only have one remote, we can
- * probaly estimate the overhead value directly from the size of the
- * set of "refs/tags/" that we visited while building the `ref_info`
- * and `ref_array` and not need to ask the remote.
+ * Note that the local set of tags in "refs/tags/" is a union over
+ * all remotes. However, since most people only have one remote,
+ * we can probaly estimate the overhead value directly from the
+ * size of the set of "refs/tags/" that we visited while building
+ * the `ref_info` and `ref_array` and not need to ask the remote.
  *
  * [] Dump info on the complexity of the DAG. Criss-cross merges.
- * The number of edges that must be touched to compute merge bases.
- * Edge length. The number of parallel lanes in the history that must
- * be navigated to get to the merge base. What affects the cost of
- * the Ahead/Behind computation? How often do criss-crosses occur and
- * do they cause various operations to slow down?
+ * The number of edges that must be touched to compute merge bases.
+ * Edge length. The number of parallel lanes in the history that
+ * must be navigated to get to the merge base. What affects the
+ * cost of the Ahead/Behind computation? How often do
+ * criss-crosses occur and do they cause various operations to slow
+ * down?
  *
  * [] If there are primary branches (like "main" or "master") are they
- * always on the left side of merges? Does the graph have a clean
- * left edge? Or are there normal and "backwards" merges? Do these
- * cause problems at scale?
+ * always on the left side of merges? Does the graph have a clean
+ * left edge? Or are there normal and "backwards" merges? Do
+ * these cause problems at scale?
  *
  * [] If we have a hierarchy of FI/RI branches like "L1", "L2, ...,
- * can we learn anything about the shape of the repo around these FI
- * and RI integrations?
+ * can we learn anything about the shape of the repo around these
+ * FI and RI integrations?
+ *
+ * [] Do we need a no-PII flag to omit pathnames or branch/tag names
+ * in the various histograms? (This would turn off --name-rev
+ * too.)
+ *
+ * [] I have so far avoided adding opinions about individual fields
+ * (such as the way `git-sizer` prints a row of stars or bangs in
+ * the last column).
+ *
+ * I'm wondering if that is a job of this executable or if it
+ * should be done in a post-processing step using the JSON output.
+ *
+ * My problem with the `git-sizer` approach is that it doesn't give
+ * the (casual) user any information on why it has stars or bangs.
+ * And there isn't a good way to print detailed information in the
+ * ASCII-art tables that would be easy to understand.
+ *
+ * [] For example, a large number of refs does not define a cliff.
+ * Performance will drop off (linearly, quadratically, ... ??).
+ * The tool should refer them to article(s) talking about the
+ * different problems that it could cause. So should `git
+ * survey` just print the number and (implicitly) refer them to
+ * the man page (chapter/verse) or to a tool that will interpret
+ * the number and explain it?
+ *
+ * [] Alternatively, should `git survey` do that analysis too and
+ * just print footnotes for each large number?
+ *
+ * [] The computation of the raw survey JSON data can take HOURS on
+ * a very large repo (like Windows), so I'm wondering if we
+ * want to keep the opinion portion separate.
+ *
+ * [] In addition to opinions based on the static data, I would like
+ * to dump the JSON results (or the Trace2 telemetry) into a DB and
+ * aggregate it with other users.
+ *
+ * Granted, they should all see the same DAG and the same set of
+ * reachable objects, but we could average across all datasets
+ * generated on a particular date and detect outlier users.
+ *
+ * [] Maybe someone cloned from the `_full` endpoint rather than
+ * the limited refs endpoint.
+ *
+ * [] Maybe that user is having problems with repacking / GC /
+ * maintenance without knowing it.
+ *
+ * [] I'd also like to dump use the DB to compare survey datasets over
+ * a time. How fast is their repository growing and in what ways?
+ *
+ * [] I'd rather have the delta analysis NOT be inside `git
+ * survey`, so it makes sense to consider having all of it in a
+ * post-process step.
+ *
+ * [] Another reason to put the opinion analysis in a post-process
+ * is that it would be easier to generate plots on the data tables.
+ * Granted, we can get plots from telemetry, but a stand-alone user
+ * could run the JSON thru python or jq or something and generate
+ * something nicer than ASCII-art and it could handle cross-referencing
+ * and hyperlinking to helpful information on each issue.
+ *
+ * [] I think there are several classes of data that we can report on:
+ *
+ * [] The "inherit repo properties", such as the shape and size of
+ * the DAG -- these should be universal in each enlistment.
+ *
+ * [] The "ODB lived in properties", such as the efficiency
+ * of the repack and things like partial and shallow clone.
+ * These will vary, but indicate health of the ODB.
+ *
+ * [] The "index related properties", such as sparse-checkout,
+ * sparse-index, cache-tree, untracked-cache, fsmonitor, and
+ * etc. These will also vary, but are more like knobs for
+ * the user to adjust.
+ *
+ * [] I want to compare these with Matt's "dimensions of scale"
+ * notes and see if there are other pieces of data that we
+ * could compute/consider.
+ *
  */
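
Note on the "Dump stats on each remote" item: the per-fetch overhead described there could be estimated locally along the lines of the sketch below. This is not part of the commit. `struct ref_sample` and `estimate_ls_refs_bytes()` are hypothetical names, and the byte counts assume SHA-1 hex object IDs and one pkt-line per ref in the protocol v2 `ls-refs` response, so treat the result as a rough approximation only.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: SHA-1 repository, so a hex object ID is 40 characters. */
#define HEX_OID_LEN 40
/* Each pkt-line starts with a 4-character hex length prefix. */
#define PKT_LEN_PREFIX 4

/* Hypothetical stand-in for whatever the survey records per ref. */
struct ref_sample {
	const char *refname;
	int is_annotated_tag;
};

/*
 * Roughly estimate the size of an ls-refs response that advertises
 * the given refs: "<oid> <refname>\n" per ref, plus " peeled:<oid>"
 * for annotated tags.
 */
static size_t estimate_ls_refs_bytes(const struct ref_sample *refs, size_t nr)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		/* length prefix + oid + space + refname + newline */
		total += PKT_LEN_PREFIX + HEX_OID_LEN + 1 +
			 strlen(refs[i].refname) + 1;
		if (refs[i].is_annotated_tag)
			total += strlen(" peeled:") + HEX_OID_LEN;
	}
	return total;
}
```

Feeding this the refs visited while building `ref_info` would give the local estimate the comment describes, without contacting the remote.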
