The main issue is that we need to get clone and diff+render operations
back into normal time frames. The salsa workers (e.g. to render a
diff) time out after 60s. Similar time constraints apply to other
rendering front-ends; in fact, you can easily get Apache to segfault
if you do not time-constrain cgi/fcgi type processes.
But that's out of scope here.
Back on topic:
Just splitting the file will not do. Unfortunately, we need to
somehow "get rid" of the history (delta-resolution) walks in git:
# test setup limits: network bandwidth: 200 MBit/s, client system: 4 cores
$ time git clone https://.../debian_security_security-tracker
Cloning into 'debian_security_security-tracker' ...
remote: Counting objects: 334274, done.
remote: Compressing objects: 100% (67288/67288), done.
remote: Total 334274 (delta 211939), reused 329399 (delta 208905)
Receiving objects: 100% (334274/334274), 165.46 MiB | 21.93 MiB/s, done.
Resolving deltas: 100% (211939/211939), done.
real 14m13.159s
user 27m23.980s
sys 0m17.068s
# Run the tool already available to split the main CVE/list
# file into annual files. Thanks Raphael Geissert!
$ bin/split-by-year
# remove the old big CVE/list file
$ git rm data/CVE/list
# get the new files into git
$ git add data/CVE/list.*
$ git commit --all
[master a06d3446ca] Remove list and commit bin/split-by-year results
21 files changed, 342414 insertions(+), 342414 deletions(-)
delete mode 100644 data/CVE/list
create mode 100644 data/CVE/list.1999
create mode 100644 data/CVE/list.2000
create mode 100644 data/CVE/list.2001
create mode 100644 data/CVE/list.2002
create mode 100644 data/CVE/list.2003
create mode 100644 data/CVE/list.2004
create mode 100644 data/CVE/list.2005
create mode 100644 data/CVE/list.2006
create mode 100644 data/CVE/list.2007
create mode 100644 data/CVE/list.2008
create mode 100644 data/CVE/list.2009
create mode 100644 data/CVE/list.2010
create mode 100644 data/CVE/list.2011
create mode 100644 data/CVE/list.2012
create mode 100644 data/CVE/list.2013
create mode 100644 data/CVE/list.2014
create mode 100644 data/CVE/list.2015
create mode 100644 data/CVE/list.2016
create mode 100644 data/CVE/list.2017
create mode 100644 data/CVE/list.2018
# this one is fast:
$ git push
# create a new clone
$ time git clone https://.../debian_security_security-tracker_split_files test-clone
Cloning into 'test-clone' ...
remote: Counting objects: 334298, done.
remote: Compressing objects: 100% (67312/67312), done.
remote: Total 334298 (delta 211943), reused 329399 (delta 208905)
Receiving objects: 100% (334298/334298), 168.91 MiB | 21.28 MiB/s, done.
Resolving deltas: 100% (211943/211943), done.
real 14m35.444s
user 27m45.500s
sys 0m21.100s
--> so splitting alone doesn't help. Git is not clever enough to skip
resolving the deltas of files that will never be checked out.
Git 2.18's v2 wire protocol could be used with server-side filtering,
but that's an awful hack. Telling people to
git clone --depth 1 #(shallow)
as Guido advises is easier and more reliable for the clone use-case.
For the original repo that will take ~1.5s, for a split-by-year repo ~0.2s.
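For illustration, the two client-side variants next to each other; this is
a sketch only, and the --filter variant assumes the server has
uploadpack.allowFilter enabled (salsa may not):
# shallow clone: fetch only the tip commit, no history deltas
$ git clone --depth 1 https://.../debian_security_security-tracker
# partial clone over the v2 protocol: full history, blobs fetched on demand
$ git -c protocol.version=2 clone --filter=blob:none https://.../debian_security_security-tracker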
There are tools to split files in a git repo while keeping the history,
e.g. https://github.com/potherca-bash/git-split-file,
but we'd need (to create) one that also zaps the old deltas -
so really "rewrite history", as the git folks tend to call it.
git filter-branch can do this (rough sketch below). But it would get somewhat
complex and murky with commits that span CVE/list-year and list-year+1, of which
there are at least 21 for 2018+2017, 19 for 2017+2016 and ~10 for earlier year combos.
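An untested sketch of such a rewrite, assuming a copy of bin/split-by-year
(kept outside the repo, since old revisions may predate it) can be pointed
at a historical data/CVE/list and writes the list.* files in place:
$ git filter-branch --tree-filter '
      # /path/to/split-by-year is a copy of bin/split-by-year kept
      # outside the repo; path is made up for this sketch
      if [ -f data/CVE/list ]; then
          /path/to/split-by-year && rm data/CVE/list
      fi' -- --all
Note that a --tree-filter checks out every one of the ~330000 revisions, so
this would run for a very long time; an --index-filter would be faster but
even murkier for the year-spanning commits.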
So I wouldn't put too much effort into that path.
In any case, a repo with just the split files but no maintained history clones
in ~12s in the above test setup. It also brings the (bare) repo down from 3.3 GB
to 189 MB. So the issue is really the data/CVE/list file.
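(To reproduce those numbers on any clone, something along these lines works;
the grep path is just the file we already suspect:)
# total pack size of the local repo
$ git count-objects -vH
# number of stored revisions of the suspect file
$ git rev-list --objects --all | grep -c ' data/CVE/list$'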
That said, data/DSA/list is 14575 lines and does not seem to bother git too much
yet. Still, if things get restructured, this file may be worth a look, too.
To me the most reasonable path forward unfortunately looks like starting a new
repo for 2019+ and "just" importing the split files, or the single-record files
mentioned by pabs, but not the git/svn/cvs history. The old repo would - of
course - stay around, but frozen at a deadline.
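A rough sketch of that bootstrap (repo and directory names are made up):
# import the split files into a fresh repo, leaving all history behind
$ git init security-tracker-2019
$ mkdir -p security-tracker-2019/data/CVE
$ cp old-clone/data/CVE/list.* security-tracker-2019/data/CVE/
$ cd security-tracker-2019
$ git add data && git commit -m 'Start 2019+ repo from split files'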
Corsac also mentioned on IRC that the repo could be hosted outside of GitLab.
That would reduce the pressure for some time.
But cgit and other git front-ends (as well as back-ends) we tested also struggle
with the repo (which is why my company, Faster IT GmbH, used the security-tracker
repo as a very welcome test case in the first place).
So that would buy time but would not be a longer-term solution.
Thanks for reading that much!