Discussion:
Bug#908678: security-tracker - Breaks salsa.d.o
Bastian Blank
2018-09-12 13:10:56 UTC
Package: security-tracker
Severity: critical

The security tracker git repository is in a state which git does not
really like. git clone takes ages, fsck takes ages, repack is reported
to be impossible.

The GitLab on salsa.d.o also chokes on it sometimes during git
operations. Some of that may be attributed to the old diff formatter
problem, which I hope gets fixed soon. But lately it even caused stalls
on git operations.

As the problems caused by the state of this repo now cause user-visible
outages, this needs to be fixed.

Regards,
Bastian
--
I'm a soldier, not a diplomat. I can only tell the truth.
-- Kirk, "Errand of Mercy", stardate 3198.9
Salvatore Bonaccorso
2018-09-13 11:37:35 UTC
Hi Bastian,
Post by Bastian Blank
Package: security-tracker
Severity: critical
The security tracker git repository is in a state which git does not
really like. git clone takes ages, fsck takes ages, repack is reported
to be impossible.
The GitLab on salsa.d.o also chokes on it sometimes during git
operations. Some of that may be attributed to the old diff formatter
problem, which I hope gets fixed soon. But lately it even caused stalls
on git operations.
As the problems caused by the state of this repo now cause user-visible
outages, this needs to be fixed.
Do you have any hints on what we could look at to facilitate/help the
salsa maintainers more?

What actually is this old diff formatter problem you mentioned, which
is going to be solved? Would it help in the meantime to make access
restricted/available only to logged-in users?

Regards,
Salvatore
Paul Wise
2018-09-13 11:44:36 UTC
Post by Salvatore Bonaccorso
Do you have any hints on what we could look at to facilitate/help the
salsa maintainers more?
I think I read on IRC that the main thing is that the design of git is
not optimised for large and growing files that change on every
commit. So splitting them up into one file per CVE/DSA/DLA/etc
might help? Or switching from git to a database, or to something like
restic or borg.
--
bye,
pabs

https://wiki.debian.org/PaulWise
Bastian Blank
2018-09-16 11:22:32 UTC
Hi Salvatore
Post by Salvatore Bonaccorso
Do you have any hints on what we could look at to facilitate/help the
salsa maintainers more?
Please try to fork that repo. Git will take a long time to resolve
deltas. This is due to Git not handling very well the one file that is
appended to in every revision. To fix it for all time, this file needs
to be split up. With that change in place, the repo needs to be
rewritten.

We even have one fork of this repo where blobs are missing.
Post by Salvatore Bonaccorso
What actually is this old diff formatter problem you mentioned, which
is going to be solved? Would it help in the meantime to make access
restricted/available only to logged-in users?
For some requests the diff formatter blocks and runs into the
one-minute hard timeout. This should be fixed with the 11.3 release
next week, so we can ignore that.

Regards,
Bastian
--
Those who hate and fight must stop themselves -- otherwise it is not stopped.
-- Spock, "Day of the Dove", stardate unknown
Salvatore Bonaccorso
2018-09-17 18:34:57 UTC
Hi Bastian,
Post by Bastian Blank
Hi Salvatore
Post by Salvatore Bonaccorso
Do you have any hints on what we could look at to facilitate/help the
salsa maintainers more?
Please try to fork that repo. Git will take a long time to resolve
deltas. This is due to Git not handling very well the one file that is
appended to in every revision. To fix it for all time, this file needs
to be split up. With that change in place, the repo needs to be
rewritten.
Just to say, we got your reply. I see that we need to try to improve
the situation, as it has an impact on other users as well. A split-up
of the data/CVE/list file would need updates to various other tasks and
workflows built on it. I will try to look into that more closely.
Post by Bastian Blank
We have even one fork of this repo where blobs are missing.
Post by Salvatore Bonaccorso
What actually is this old diff formatter problem you mentioned, which
is going to be solved? Would it help in the meantime to make access
restricted/available only to logged-in users?
For some requests the diff formatter blocks and runs into the
one-minute hard timeout. This should be fixed with the 11.3 release
next week, so we can ignore that.
Ok!

Regards,
Salvatore
Salvatore Bonaccorso
2018-09-25 19:00:49 UTC
One suggestion from IRC discussion:

< DLange> summary: suggestions are along the idea of creating list-$year and combine in list for current tools or amend them?
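
For the "combine" half of that suggestion, the glue could be as simple
as this sketch (assuming 'ls -r' reproduces the current newest-first
ordering of the file):

$ cat $(ls -r data/CVE/list.*) > data/CVE/list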
Guido Günther
2018-09-26 07:19:09 UTC
Hi,
Post by Salvatore Bonaccorso
< DLange> summary: suggestions are along the idea of creating list-$year and combine in list for current tools or amend them?
I think that makes sense. An alternative would be to use shallow clones
(--depth=1) for all the tools (and to recommend that in the
docs).
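
Something along these lines, presumably (an untested sketch; the salsa
URL is the obvious candidate):

$ git clone --depth=1 https://salsa.debian.org/security-tracker-team/security-tracker.git
$ git -C security-tracker fetch --depth=1   # stay shallow when updating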

Did somebody contact git upstream yet? It might be worth showing them
this use case.

Cheers,
-- Guido
Daniel Lange
2018-09-26 11:56:16 UTC
The main issue is that we need to get clone and diff+render operations
back into normal time frames. The salsa workers (e.g. to render a
diff) time out after 60s. Similar time constraints are put onto other
rendering front-ends. Actually you can easily get Apache to segfault
if you do not time-constrain cgi/fcgi type processes.
But that's out of scope here.

Back on topic:

Just splitting the file will not do. We need to (unfortunately)
somehow "get rid" of the history (delta-resolution) walks in git:

# test setup limits: Network bw: 200 MBit, client system: 4 core

$ time git clone https://.../debian_security_security-tracker
Cloning into 'debian_security_security-tracker' ...
remote: Counting objects: 334274, done.
remote: Compressing objects: 100% (67288/67288), done.
remote: Total 334274 (delta 211939), reused 329399 (delta 208905)
Receiving objects: 100% (334274/334274), 165.46 MiB | 21.93 MiB/s, done.
Resolving deltas: 100% (211939/211939), done.

real 14m13,159s
user 27m23,980s
sys 0m17,068s

# Run the tool already available to split the main CVE/list
# file into annual files. Thanks Raphael Geissert!
$ bin/split-by-year

# remove the old big CVE/list file
$ git rm data/CVE/list

# get the new files into git
$ git add data/CVE/list.*
$ git commit --all
[master a06d3446ca] Remove list and commit bin/split-by-year results
21 files changed, 342414 insertions(+), 342414 deletions(-)
delete mode 100644 data/CVE/list
create mode 100644 data/CVE/list.1999
create mode 100644 data/CVE/list.2000
create mode 100644 data/CVE/list.2001
create mode 100644 data/CVE/list.2002
create mode 100644 data/CVE/list.2003
create mode 100644 data/CVE/list.2004
create mode 100644 data/CVE/list.2005
create mode 100644 data/CVE/list.2006
create mode 100644 data/CVE/list.2007
create mode 100644 data/CVE/list.2008
create mode 100644 data/CVE/list.2009
create mode 100644 data/CVE/list.2010
create mode 100644 data/CVE/list.2011
create mode 100644 data/CVE/list.2012
create mode 100644 data/CVE/list.2013
create mode 100644 data/CVE/list.2014
create mode 100644 data/CVE/list.2015
create mode 100644 data/CVE/list.2016
create mode 100644 data/CVE/list.2017
create mode 100644 data/CVE/list.2018

# this one is fast:
$ git push

# create a new clone
$ time git clone https://.../debian_security_security-tracker_split_files test-clone
Cloning into 'test-clone' ...
remote: Counting objects: 334298, done.
remote: Compressing objects: 100% (67312/67312), done.
remote: Total 334298 (delta 211943), reused 329399 (delta 208905)
Receiving objects: 100% (334298/334298), 168.91 MiB | 21.28 MiB/s, done.
Resolving deltas: 100% (211943/211943), done.

real 14m35,444s
user 27m45,500s
sys 0m21,100s

--> so splitting alone doesn't help. Git is not clever enough to skip
the deltas of files that are not going to be checked out.

Git 2.18's wire protocol version 2 could be used with server-side
filtering, but that's an awful hack. Telling people to

git clone --depth 1 #(shallow)

like Guido advises is easier and more reliable for the clone use-case.
For the original repo that will take ~1.5s, for a split-by-year repo ~0.2s.

There are tools to split git files and keep the history,
e.g. https://github.com/potherca-bash/git-split-file,
but we'd need (to create) one that also zaps the old deltas.
So really "rewrite history", as the git folks tend to call this.
git filter-branch can do this. But it would get somewhat complex and
murky with commits that span CVE/list-year and list-year+1; there are
at least 21 of those for 2018+2017, 19 for 2017+2016 and ~10 for
previous year combos.
So I wouldn't put too much effort into that path.
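
For reference, a rough way to count such year-spanning commits would be
git's -G pickaxe (an untested sketch):

$ comm -12 <(git log --format=%H -G'CVE-2017-' -- data/CVE/list | sort) \
           <(git log --format=%H -G'CVE-2018-' -- data/CVE/list | sort) | wc -l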

In any case, a repo with just the split files but no maintained history
clones in ~12s in the above test setup. It also brings the (bare) repo
down from 3.3GB to 189MB. So the issue is really the data/CVE/list file.
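
Anyone who wants to verify which paths dominate the pack on their own
clone can use the usual object-size walk (a sketch; paths containing
spaces are not handled):

$ git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { size[$3] += $2 } END { for (p in size) print size[p], p }' |
    sort -rn | head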

That said, data/DSA/list is 14575 lines. That seems to not bother git too much
yet. Still if things get re-structured, this file may be worth a look, too.

To me the most reasonable path forward unfortunately looks like starting
a new repo for 2019+ and "just" importing the split files or
single-record files as mentioned by pabs, but not the git/svn/cvs
history. The old repo would - of course - stay around but frozen at a
deadline.

Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
That would reduce the pressure for some time.
But cgit and other git frontends (as well as backends) we tested also struggle
with the repo (which is why my company, Faster IT GmbH, used the security-tracker
repo as a very welcome test case in the first place).
So that would buy time but not be a solution long(er) term.

Thanks for reading that much!
Guido Günther
2018-09-26 13:15:14 UTC
Hi,
Post by Daniel Lange
The main issue is that we need to get clone and diff+render operations
back into normal time frames. The salsa workers (e.g. to render a
diff) time out after 60s. Similar time constraints are put onto other
I wonder why that is, since "git diff" is pretty fast on a local
checkout. Did we ask the gitlab folks about it?

[..snip..]
Post by Daniel Lange
Just splitting the file will not do. We need to (unfortunately)
Not necessarily. Maybe a graft would do:

https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/

This is IMHO preferable to history rewrites. I've used this to tie
histories together in the past, although with .git/info/grafts rather
than "git replace".
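
Roughly like this, I'd imagine (an untested sketch; it assumes the full
history lives on in a frozen repo and the new repo starts from a
squashed root commit, and the remote name and URL are hypothetical):

git remote add old-history <url-of-frozen-repo>
git fetch old-history
git replace --graft $(git rev-list --max-parents=0 HEAD) old-history/master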

Cheers,
-- Guido
Salvatore Bonaccorso
2018-09-27 06:39:35 UTC
Hi,

[not contributing ideas right now, just giving one datapoint to the
discussion that is important to me]
Post by Guido Günther
https://developer.atlassian.com/blog/2015/08/grafting-earlier-history-with-git/
This is IMHO preferable over history rewrites. I've used this to tie
histories in the past. I've not used "git replace" though but
.git/info/grafts.
FWIW on this point, for the security team members' workflows it is
quite an important aspect (even if admittedly it can be slow) to have
access to the history of commits while working on their own checkouts.
So that is a feature that should be considered in any split-up work
done, either in a rewrite-history situation, or as mentioned above, or
in other possibilities which will arise.

Thank you!

Regards,
Salvatore
Antoine Beaupré
2018-11-09 21:05:06 UTC
On 2018-09-26 14:56:16, Daniel Lange wrote:

[...]
Post by Daniel Lange
In any case, a repo with just the split files but no maintained history clones
in ~12s in the above test setup. It also brings the (bare) repo down from 3.3GB
to 189MB. So the issue is really the data/CVE/list file.
So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach and yielded all
sorts of problems: there were so many files that it created problems
even for the shell (argument list too long).

I hadn't thought of splitting things into "one *file* per year". That
could really help! Unfortunately, it's hard to simulate what it would
look like *14 years* from now (yes, that's how old that repo is
already).

I can think of two ways to simulate that:

1. generate commits to recreate all files from scratch: parse
data/CVE/list, split it up into chunks, and add each CVE in one
separate commit. It's not *exactly* how things are done now, but it
should be a close enough approximation (see the sketch after this
list)

2. do a crazy filter-branch to send commits to the right
files. Considering how long an initial clone takes, I can't even
begin to imagine how long *that* would take. But it would be the
most accurate simulation.
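
For the first approach, here is roughly what I have in mind (an
untested sketch; it assumes every record in data/CVE/list starts with a
^CVE- line, and GNU csplit):

csplit --quiet --elide-empty-files --prefix=cve- -n 6 data/CVE/list '/^CVE-/' '{*}'
for f in cve-*; do
    # the first line of each chunk is the CVE header; take the year from it
    year=$(head -1 "$f" | sed -E 's/^CVE-([0-9]{4}).*/\1/')
    cat "$f" >> "data/CVE/list.$year"
    git add "data/CVE/list.$year"
    git commit -q -m "simulate adding $(head -1 "$f" | cut -d' ' -f1)"
done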

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with history over 14
years and thousands of commits. Intuitively, I think you're right and
that "sharding" the data in yearly packets would help git's performance
a lot. But we won't know until we simulate it, and if we hit that
problem again 5 years from now, all that work will have been for
nothing. (Although it *would* give us 5 years...)
Post by Daniel Lange
That said, data/DSA/list is 14575 lines. That seems to not bother git too much
yet. Still if things get re-structured, this file may be worth a look, too.
Yeah, I haven't had trouble with that one yet either.
Post by Daniel Lange
To me the most reasonable path forward unfortunately looks like start a new repo
for 2019+ and "just" import the split files or single-record files as mentioned
by pabs but not the git/svn/cvs history. The old repo would - of course - stay
around but frozen at a deadline.
In any case, I personally don't think history over those files is that
critical. We rarely dig into that history because it's so
expensive... Any "git annotate" takes forever in this repo, and running
*that* over data/CVE/list takes tens of minutes.

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense of
such a filter.
Post by Daniel Lange
Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
That would reduce the pressure for some time.
But cgit and other git frontends (as well as backends) we tested also struggle
with the repo (which is why my company, Faster IT GmbH, used the security-tracker
repo as a very welcome test case in the first place).
So that would buy time but not be a solution long(er) term.
Agreed. I think the benefits of hosting on gitlab outweigh the trouble
of rearchitecting our datastore. As I said, it's not just gitlab
that's struggling with a 17MB text file: git itself has trouble dealing
with it as well, and I am often frustrated by that in my work...

A.
--
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
- Theo de Raadt
Antoine Beaupré
2018-11-09 23:05:55 UTC
Post by Antoine Beaupré
2. do a crazy filter-branch to send commits to the right
files. Considering how long an initial clone takes, I can't even
begin to imagine how long *that* would take. But it would be the
most accurate simulation.
Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with history over 14
years and thousands of commits. Intuitively, I think you're right and
that "sharding" the data in yearly packets would help a lot git's
performance. But we won't know until we simulate it, and if hit that
problem again 5 years from now, all that work will have been for
nothing. (Although it *would* give us 5 years...)
So I've done that craaaazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.

Cloning the original repo takes a solid 30+ seconds:

[1221]***@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]***@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits and so annotate and all that good stuff
would still work.

The split-by-year bash script was too slow for my purposes: it was
taking a solid 15 seconds for each run, which meant it would have taken
9 *days* to process the entire repository.

So I tried to see if this could be optimized, so that we could split
the file while keeping history without having to shut down the whole
system for days. I first rewrote it in Python, which processed the 1000
commits in 801 seconds. This gives an estimate of 15 hours for the 68278
commits I had locally. Concerned about the Python startup time, I then
tried golang, which processed the tree in 262 seconds, giving a final
estimate of 4.8 hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in that
they have to (naturally) remove the data/CVE/list file itself, as
otherwise it's kept in history.

Here's how to call it:

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD

Also observe how all gpg commit signatures are (obviously) lost. I have
explicitly disabled signing because those signatures actually take a
long time to compute...
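
If someone wants to reproduce this faster: filter-branch's -d flag can
point the scratch tree at a tmpfs, which should cut down on I/O (an
untested suggestion, scratch path hypothetical):

git -c commit.gpgSign=false filter-branch -d /dev/shm/st-scratch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD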

I haven't tested if a graft would improve performance, but I suspect it
would not, given the sheer size of the repository that would effectively
need to be carried over anyway.

A.
--
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
- Ernesto "Che" Guevara
Daniel Lange
2018-11-10 17:56:01 UTC
Antoine,

thank you very much for your filter-branch scripts.

I tested each:

1) the golang version:
It completes after 3h36min:

# git filter-branch --tree-filter '/split-by-year' HEAD
Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

But it doesn't Close() the os.OpenFile handles, so ...
all data/CVE/list.yyyy files are 0 bytes long. Sic!

I can reproduce that just running the golang executable
against a current checkout of data/CVE/list.

# go version
go version go1.10.3 linux/amd64
(Stretch backport golang-go 2:1.10~5~bpo9+1)

2.1) the Python version
You claim #!/usr/bin/python3 in the shebang, so I tried that first:

# git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
Traceback (most recent call last):
File "split-by-year.py", line 13, in <module>
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc

The offending commit is:
* 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now.
Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]

There will be many more like this, so for Python 3 this needs to be
made unicode-agnostic.

Notice I compiled the .py to .pyc, which makes it much faster and thus
quite usable.

2.2) Python, when a string was a string .. Python2
Your code is actually Python2, so why not give that a try:

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
Traceback (most recent call last):
File "split-by-year.py", line 18, in <module>
yearly = 'data/CVE/list.{:d}'.format(year)
NameError: name 'year' is not defined
tree filter failed: /usr/bin/python2 /split-by-year.pyc

The offending commit is:
* b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
| diff --git a/data/CVE/list b/data/CVE/list
| index 7b5d1d21d6..cdf0b74dd0 100644
| --- a/data/CVE/list
| +++ b/data/CVE/list
| @@ -1,3 +1,4 @@
| +begin claimed by jmm
| CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
| TODO: check
| CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
| @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) vulnerabilities in ...)
| TODO: check
| CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
| TODO: check
| +end claimed by jmm
| CVE-2005-XXXX [Insecure caching of user id in mantis]
| - mantis <unfixed> (bug #330682; unknown)
| CVE-2005-XXXX [Filter information disclosure in mantis]

As you can see, the line "+begin claimed by jmm" breaks the too-simplistic parser logic.
Unfortunately, when dry-running against a current version of data/CVE/list, such errors do not show up.
The "violations" of the file format are transient and buried in history.

Best,
Daniel
Antoine Beaupré
2018-11-12 17:22:58 UTC
Post by Daniel Lange
Antoine,
thank you very much for your filter-branch scripts.
you're welcome! glad it can be of use.
Post by Daniel Lange
# git filter-branch --tree-filter '/split-by-year' HEAD
Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten
But it doesn't Close() the os.OpenFile handles so ...
all data/CVE/list.yyyy files are 0 bytes long. Sic!
Well. That explains part of the performance difference. ;)

There were multiple problems with the golang source - variable shadowing
and, yes, a missing Close(). Surprisingly, the fixed version is *slower*
than the equivalent Python code, taking about one second per run, or
1102 seconds for the last 1000 commits. I'm at a loss as to how I
managed to make go run slower than Python here (and can't help but think
C would have been easier, again). Probably poor programming on my part.
New version attached.

[...]
Post by Daniel Lange
2.1) the Python version
# git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
File "split-by-year.py", line 13, in <module>
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc
I suspected this would be a problem, but didn't find any occurrence in
the shallow clone so I forgot about it. Note that the golang version
takes great care to treat the data as binary...
Post by Daniel Lange
* 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now.
Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]
There will be many more like this, so for Python 3 this needs to be
made unicode-agnostic.
... so I rewrote the thing to handle only binary and tested it against
that version of the file. It seems to work fine.
Post by Daniel Lange
Notice I compiled the .py to .pyc which makes it
much faster and thus well usable.
Interesting. I didn't see much difference in performance in my
benchmarks on average, but the worst-case run did improve by 150ms, so I
guess this is worth the trouble. For those who didn't know (like me),
this means running:

python -m compileall bin/split-by-year.py

whenever the .py file changes (right?).
Post by Daniel Lange
2.2) Python, when a string was a string .. Python2
# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
File "split-by-year.py", line 18, in <module>
yearly = 'data/CVE/list.{:d}'.format(year)
NameError: name 'year' is not defined
tree filter failed: /usr/bin/python2 /split-by-year.pyc
* b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
| diff --git a/data/CVE/list b/data/CVE/list
| index 7b5d1d21d6..cdf0b74dd0 100644
| --- a/data/CVE/list
| +++ b/data/CVE/list
| +begin claimed by jmm
| CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
| TODO: check
| CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
| TODO: check
| CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
| TODO: check
| +end claimed by jmm
| CVE-2005-XXXX [Insecure caching of user id in mantis]
| - mantis <unfixed> (bug #330682; unknown)
| CVE-2005-XXXX [Filter information disclosure in mantis]
As you can see, the line "+begin claimed by jmm" breaks the too-simplistic parser logic.
Unfortunately, when dry-running against a current version of data/CVE/list, such errors do not show up.
The "violations" of the file format are transient and buried in history.
Hmm... That's a trickier one. I guess we could just pretend that line
doesn't exist and drop it from history... But I chose to buffer it and
treat it like the CVE line, so it gets attached to the right file. See
if it does what you expect:

git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82
split-by-year.py data/CVE/list.b59da20b82

Performance-wise, I shaved off a surprising 60ms by enclosing all the
code in a function (yes, it's crazy, but local variable lookups are
faster than global ones in CPython), but the buffering to deal with the
above issue added another 40ms, so performance should be similar.

I'll start a run on the whole history to see if I can find any problems,
as soon as a first clone finishes resolving those damn deltas. ;)

Thanks for the review!

A.
--
Premature optimization is the root of all evil
- Donald Knuth
Antoine Beaupré
2018-11-13 15:56:24 UTC
Post by Antoine Beaupré
I'll start a run on the whole history to see if I can find any problems,
as soon as a first clone finishes resolving those damn deltas. ;)
The Python job finished successfully here after 10 hours.

I did some tests on the new git repository. Cloning the repository from
scratch takes around 2 minutes (the original repo: 21 minutes). It is
145MB while the original repo is 1.6GB.

Running git annotate on data/CVE/list.2018 takes about 26 seconds, while
it takes basically forever to annotate the original data/CVE/list. (It's
been running for 10 minutes here already.)

So that's about it. I have not done a thorough job at checking the
actual *integrity* of the results. It's difficult, considering CVE
identifiers are not sequential in the data/CVE/list file, so a naive
diff like this will fail:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} ) data/CVE/list | diffstat
list |106562 +++++++++++++++++++++++++++++++++----------------------------------
1 file changed, 53281 insertions(+), 53281 deletions(-)

But at least the numbers add up: it looks like no line is lost. And
indeed, it looks like all CVEs add up:

$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n ) | diffstat
0 files changed

A cursory look at the diff seems to indicate it is clean, however.
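
A less naive check would be to compare whole records instead of lines,
e.g. by joining each record onto a single line before sorting. Something
like this (a sketch; GNU awk for the \x01 escape, and it assumes every
record starts at a ^CVE line):

normalize() {
    gawk '/^CVE/ { if (rec) print rec; rec = $0; next }
          { rec = rec "\x01" $0 }
          END { if (rec) print rec }' "$@" | sort
}
diff <(normalize ../security-tracker-full-test-filtered-bis/data/CVE/list.*) \
     <(normalize data/CVE/list) | diffstat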

I looked at splitting that file per CVE. That did not scale and just
created new problems. But splitting by *year* seems like a very
efficient switch, and I think it would be worth pursuing that idea
forward.

A.
--
There is no cloud, it's just someone else's computer.
- Chris Watterson
Daniel Lange
2018-11-13 17:14:54 UTC
Post by Antoine Beaupré
The Python job finished successfully here after 10 hours.
6h40 mins here as I ported your improved logic to the python2 version :).

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

The tree-filter blows up the .git/objects store to 13G though.
But nothing a git gc can't fix.
Post by Antoine Beaupré
I did some tests on the new git repository. Cloning the repository from
scratch takes around 2 minutes (the original repo: 21 minutes).
Confirmed.
Post by Antoine Beaupré
So that's about it. I have not done a thorough job at checking the
actual *integrity* of the results. It's difficult, considering CVE
identifiers are not sequential in the data/CVE/list file, so a naive
$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} ) data/CVE/list | diffstat
list |106562 +++++++++++++++++++++++++++++++++----------------------------------
1 file changed, 53281 insertions(+), 53281 deletions(-)
But at least the numbers add up: it looks like no line is lost. And
$ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n ) | diffstat
0 files changed
A cursory look at the diff seems to indicate it is clean, however.
I uploaded "my" version to https://people.debian.org/~dlange/
so people can poke the log and diffs and see whether there are any
issues left.
Post by Antoine Beaupré
I looked at splitting that file per CVE. That did not scale and just
created new problems. But splitting by *year* seems like a very
efficient switch, and I think it would be worth pursuing that idea
forward.
The tools in bin/ would need a brush-through, i.e. throw away the
unused ones and amend the ones that work on data/CVE/* to learn
about the split files.
Antoine Beaupré
2018-11-13 17:22:54 UTC
Post by Daniel Lange
Post by Antoine Beaupré
The Python job finished successfully here after 10 hours.
6h40 mins here as I ported your improved logic to the python2 version :).
# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten
The tree-filter blows up the .git/objects store to 13G though.
But nothing a git gc can't fix.
Ah, but that's because the old repository is still in there. You need
to clone the repo into a clean copy:

git clone file://$PWD/security-tracker security-tracker-filtered

To get the minimal version, I even did that twice, although I'm not
sure that's necessary.
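
The other standard recipe, if you'd rather shrink in place, is to drop
the backup refs that filter-branch leaves behind, then expire the
reflogs and gc (this should be equivalent, though I haven't compared):

git for-each-ref --format='%(refname)' refs/original/ | xargs -rn1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive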

[...]
Post by Daniel Lange
Post by Antoine Beaupré
I looked at splitting that file per CVE. That did not scale and just
created new problems. But splitting by *year* seems like a very
efficient switch, and I think it would be worth pursuing that idea
forward.
The tools in bin/ would need a brush-through, i.e. throw away the
unused ones and amend the ones that work on data/CVE/* to learn
about the split files.
Oh yes, lots of work remains, whether we keep the history or not. That's
probably the *most* work we need to do.

But before going through that trouble, I think we'd need to get approval
from the security team first, as that's quite a lot of work. I figured
we would make a feasibility study first...

a.
--
On reconnait la grandeur et la valeur d'une nation à la façon dont
celle-ci traite ses animaux.
- Mahatma Gandhi
Moritz Muehlenhoff
2018-11-13 22:09:41 UTC
Post by Antoine Beaupré
But before going through that trouble, I think we'd need to get approval
from the security team first, as that's quite a lot of work. I figured
we would make a feasibility study first...
The current data structure works very well for us and splitting the files
has many downsides.

If we can't get the repository to run on salsa in a manner that doesn't
impact other repositories (e.g. by disabling the repository browser or
similar), then moving the security tracker repository out of Salsa is
the more likely solution.

Did anyone follow Guido's suggestion to report this upstream to
get their assessment on possible optimisations?

Cheers,
Moritz
Daniel Lange
2018-11-14 06:34:03 UTC
Post by Moritz Muehlenhoff
The current data structure works very well for us and splitting the files
has many downsides.
Could you detail what those many downsides are, besides the scripts
that need to be amended?
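
At least for searching, a grep across the split files would seem to
cover the same ground as the single file, e.g. (hypothetical package
name):

$ grep -n 'NOT-FOR-US: somepackage' data/CVE/list*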
Guido Günther
2018-11-14 08:28:10 UTC
Hi,
Post by Moritz Muehlenhoff
Post by Antoine Beaupré
But before going through that trouble, I think we'd need to get approval
from the security team first, as that's quite a lot of work. I figured
we would make a feasability study first...
The current data structure works very well for us and splitting the files
has many downsides.
If we can't get the repository to run on salsa in a manner that doesn't
impact other repositories (e.g. by disabling the repository browser or
similar), then moving the security tracker repository out of Salsa is
the more likely solution.
Did anyone follow Guido's suggestion to report this upstream to
get their assessment on possible optimisations?
Just in case someone takes this upstream: I filed

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913124

against git a couple of days ago.
Cheers,
-- Guido
Moritz Muehlenhoff
2018-11-14 18:45:59 UTC
Post by Daniel Lange
Post by Moritz Muehlenhoff
The current data structure works very well for us and splitting the files
has many downsides.
Could you detail what those many downsides are besides the scripts that
need to be amended?
Nearly all the tasks of actually editing the data require a look at the complete
data, e.g. to check whether something was tracked before, whether there's an ITP
for something, whether something was tracked as NFU in the past and lots more.

Cheers,
Moritz
Holger Levsen
2018-11-14 19:32:02 UTC
Post by Moritz Muehlenhoff
Nearly all the tasks of actually editing the data require a look at the complete
data, e.g. to check whether something was tracked before, whether there's an ITP
for something, whether something was tracked as NFU in the past and lots more.
According to git log, the data goes back to 2004. Do you really need all
those 15 years of history, or could we maybe make a yearly split for
(now) the first 10 years and have the last 5 years in "one"?

And then when we move into 2019 we would move 2014 into the then-first
11 years, and so on... same in 2020 with 2015...

IMHO we should do something, else dealing with security-tracker.git will
be even more cumbersome 5 or 10 years from now.
--
cheers,
Holger

-------------------------------------------------------------------------------
holger@(debian|reproducible-builds|layer-acht).org
PGP fingerprint: B8BF 5413 7B09 D35C F026 FE9D 091A B856 069A AA1C
Salvatore Bonaccorso
2018-11-14 20:48:17 UTC
Hi,
Post by Moritz Muehlenhoff
Post by Daniel Lange
Post by Moritz Muehlenhoff
The current data structure works very well for us and splitting the files
has many downsides.
Could you detail what those many downsides are besides the scripts that
need to be amended?
Nearly all the tasks of actually editing the data require a look at the complete
data, e.g. to check whether something was tracked before, whether there's an ITP
for something, whether something was tracked as NFU in the past and lots more.
Agreed from my point of view as well: the history is and contains
valuable data, and we do not want to lose that, even if researching
older items and past changes takes time. You will even see that, as
time passed, people started to put more information into the respective
changes/commits, giving rationales, notes, and additional information.

And if all that is going to be too much hassle for the salsa
infrastructure, we would need to/could move the repository somewhere
else, with the unfortunate downside for contributors from the whole
community. But admittedly the set of people regularly contributing is
small enough to keep track of.

On the agreement side, I fully agree that initial clones of the repo
are a problem. It would also be interesting to see what git upstream
thinks of that use case and of #913124 raised by Guido.

Regards,
Salvatore
