@@ -1690,47 +1690,131 @@ int cmd_survey(int argc, const char **argv, const char *prefix, struct repositor
 }

 /*
- * NEEDSWORK: The following is a bit of a laundry list of things
- * that I'd like to add.
+ * NEEDSWORK: So far, I only have iteration on the requested set of
+ * refs and treewalk/reachable objects on that set of refs. The
+ * following is a bit of a laundry list of things that I'd like to
+ * add.
  *
  * [] Dump stats on all of the packfiles. The number and size of each.
- * Whether each is in the .git directory or in an alternate. The state
- * of the IDX or MIDX files and etc. Delta chain stats. All of this
- * data is relative to the "lived-in" state of the repository. Stuff
- * that may change after a GC or repack.
+ * Whether each is in the .git directory or in an alternate. The
+ * state of the IDX or MIDX files, etc. Delta chain stats. All
+ * of this data is relative to the "lived-in" state of the
+ * repository. Stuff that may change after a GC or repack.
+ *
+ * [] Clone and Index stats. partial, shallow, sparse-checkout,
+ * sparse-index, etc. Hydration stats.
  *
  * [] Dump stats on each remote. When we fetch from a remote the size
- * of the response is related to the set of haves on the server. You
- * can see this in `GIT_TRACE_CURL=1 git fetch`. We get a `ls-refs`
- * payload that lists all of the branches and tags on the server, so
- * at a minimum the RefName and SHA for each. But for annotated tags
- * we also get the peeled SHA. The size of this overhead on every
- * fetch is proporational to the size of the `git ls-remote` response
- * (roughly, although the latter repeats the RefName of the peeled
- * tag). If, for example, you have 500K refs on a remote, you're
- * going to have a long "haves" message, so every fetch will be slow
- * just because of that overhead (not counting new objects to be
- * downloaded).
+ * of the response is related to the set of haves on the server.
+ * You can see this in `GIT_TRACE_CURL=1 git fetch`. We get a
+ * `ls-refs` payload that lists all of the branches and tags on the
+ * server, so at a minimum the RefName and SHA for each. But for
+ * annotated tags we also get the peeled SHA. The size of this
+ * overhead on every fetch is proportional to the size of the `git
+ * ls-remote` response (roughly, although the latter repeats the
+ * RefName of the peeled tag). If, for example, you have 500K refs
+ * on a remote, you're going to have a long "haves" message, so
+ * every fetch will be slow just because of that overhead (not
+ * counting new objects to be downloaded).
  *
- * Note that the local set of tags in "refs/tags/" is a union over all
- * remotes. However, since most people only have one remote, we can
- * probaly estimate the overhead value directly from the size of the
- * set of "refs/tags/" that we visited while building the `ref_info`
- * and `ref_array` and not need to ask the remote.
+ * Note that the local set of tags in "refs/tags/" is a union over
+ * all remotes. However, since most people only have one remote,
+ * we can probably estimate the overhead value directly from the
+ * size of the set of "refs/tags/" that we visited while building
+ * the `ref_info` and `ref_array` and not need to ask the remote.
  *
  * [] Dump info on the complexity of the DAG. Criss-cross merges.
- * The number of edges that must be touched to compute merge bases.
- * Edge length. The number of parallel lanes in the history that must
- * be navigated to get to the merge base. What affects the cost of
- * the Ahead/Behind computation? How often do criss-crosses occur and
- * do they cause various operations to slow down?
+ * The number of edges that must be touched to compute merge bases.
+ * Edge length. The number of parallel lanes in the history that
+ * must be navigated to get to the merge base. What affects the
+ * cost of the Ahead/Behind computation? How often do
+ * criss-crosses occur and do they cause various operations to slow
+ * down?
  *
  * [] If there are primary branches (like "main" or "master") are they
- * always on the left side of merges? Does the graph have a clean
- * left edge? Or are there normal and "backwards" merges? Do these
- * cause problems at scale?
+ * always on the left side of merges? Does the graph have a clean
+ * left edge? Or are there normal and "backwards" merges? Do
+ * these cause problems at scale?
  *
  * [] If we have a hierarchy of FI/RI branches like "L1", "L2, ...,
- * can we learn anything about the shape of the repo around these FI
- * and RI integrations?
+ * can we learn anything about the shape of the repo around these
+ * FI and RI integrations?
+ *
+ * [] Do we need a no-PII flag to omit pathnames or branch/tag names
+ * in the various histograms? (This would turn off --name-rev
+ * too.)
+ *
+ * [] I have so far avoided adding opinions about individual fields
+ * (such as the way `git-sizer` prints a row of stars or bangs in
+ * the last column).
+ *
+ * I'm wondering if that is a job of this executable or if it
+ * should be done in a post-processing step using the JSON output.
+ *
+ * My problem with the `git-sizer` approach is that it doesn't give
+ * the (casual) user any information on why it has stars or bangs.
+ * And there isn't a good way to print detailed information in the
+ * ASCII-art tables that would be easy to understand.
+ *
+ * [] For example, a large number of refs does not define a cliff.
+ * Performance will drop off (linearly, quadratically, ... ??).
+ * The tool should refer them to article(s) talking about the
+ * different problems that it could cause. So should `git
+ * survey` just print the number and (implicitly) refer them to
+ * the man page (chapter/verse) or to a tool that will interpret
+ * the number and explain it?
+ *
+ * [] Alternatively, should `git survey` do that analysis too and
+ * just print footnotes for each large number?
+ *
+ * [] The computation of the raw survey JSON data can take HOURS on
+ * a very large repo (like Windows), so I'm wondering if we
+ * want to keep the opinion portion separate.
+ *
+ * [] In addition to opinions based on the static data, I would like
+ * to dump the JSON results (or the Trace2 telemetry) into a DB and
+ * aggregate it with other users.
+ *
+ * Granted, they should all see the same DAG and the same set of
+ * reachable objects, but we could average across all datasets
+ * generated on a particular date and detect outlier users.
+ *
+ * [] Maybe someone cloned from the `_full` endpoint rather than
+ * the limited refs endpoint.
+ *
+ * [] Maybe that user is having problems with repacking / GC /
+ * maintenance without knowing it.
+ *
+ * [] I'd also like to use the DB to compare survey datasets over
+ * time. How fast is their repository growing and in what ways?
+ *
+ * [] I'd rather have the delta analysis NOT be inside `git
+ * survey`, so it makes sense to consider having all of it in a
+ * post-process step.
+ *
+ * [] Another reason to put the opinion analysis in a post-process
+ * is that it would be easier to generate plots on the data tables.
+ * Granted, we can get plots from telemetry, but a stand-alone user
+ * could run the JSON thru python or jq or something and generate
+ * something nicer than ASCII-art and it could handle cross-referencing
+ * and hyperlinking to helpful information on each issue.
+ *
+ * [] I think there are several classes of data that we can report on:
+ *
+ * [] The "inherent repo properties", such as the shape and size of
+ * the DAG -- these should be universal in each enlistment.
+ *
+ * [] The "ODB lived in properties", such as the efficiency
+ * of the repack and things like partial and shallow clone.
+ * These will vary, but indicate health of the ODB.
+ *
+ * [] The "index related properties", such as sparse-checkout,
+ * sparse-index, cache-tree, untracked-cache, fsmonitor, etc.
+ * These will also vary, but are more like knobs for
+ * the user to adjust.
+ *
+ * [] I want to compare these with Matt's "dimensions of scale"
+ * notes and see if there are other pieces of data that we
+ * could compute/consider.
+ *
  */
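
The remote-stats item in the comment above argues that the per-fetch overhead grows with the number of refs advertised. A rough sketch of that estimate follows, assuming the protocol v2 `ls-refs` line layout (4-byte pkt-line length prefix, object ID, space, refname, an optional ` peeled:<oid>` attribute for annotated tags, then LF) and that the client asked for peeled values. `struct ref_sample`, `ls_refs_line_size()` and `ls_refs_payload_size()` are hypothetical names for illustration only; they are not part of `survey.c` and do not use its `ref_info` or `ref_array` structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PKT_HEADER_LEN 4 /* four hex digits of pkt-line length prefix */

/* Hypothetical record for illustration; not the ref_info in survey.c. */
struct ref_sample {
	const char *refname;
	int is_annotated_tag; /* nonzero if a peeled OID would be sent */
};

/*
 * Estimated bytes for one advertised ref:
 *   <len><oid> SP <refname> [SP "peeled:" <oid>] LF
 * hexsz is the hex length of an object ID (40 for SHA-1).
 */
size_t ls_refs_line_size(const struct ref_sample *ref, size_t hexsz)
{
	size_t n;

	assert(ref && ref->refname);
	n = PKT_HEADER_LEN + hexsz + 1 + strlen(ref->refname) + 1;
	if (ref->is_annotated_tag)
		n += 1 + strlen("peeled:") + hexsz;
	return n;
}

/* Sum over all refs, plus the trailing flush-pkt ("0000"). */
size_t ls_refs_payload_size(const struct ref_sample *refs, size_t nr,
			    size_t hexsz)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		total += ls_refs_line_size(&refs[i], hexsz);
	return total + PKT_HEADER_LEN;
}
```

Under these assumptions, 500K unpeeled refs with refnames averaging 30 bytes come to about 500,000 * (4 + 40 + 1 + 30 + 1) = 38,000,000 bytes per advertisement, before any new objects are transferred. The sketch ignores symref attributes and HTTP or compression overhead, so it is only an order-of-magnitude figure.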