- There is no need to materialize all the content of a given Node when we can
simply compute its length directly. This saves a lot of memory: several
megabytes on my instance, with peaks at a couple of dozen megabytes.
- One might object to the use of a recursive construct, but this is a direct
port of goquery's Text method, so this change doesn't make anything worse.
- linkLength could be computed in a similar way, but that can go in another
commit, as it's a bit trickier: we need to get the length of every Node that
has an `a` Node as parent, without iterating over the whole parent chain
every time.
Before
```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8         34      86102474 ns/op
BenchmarkGetWeight-8           10573        103045 ns/op
PASS
ok miniflux.app/v2/internal/reader/readability 5.409s
```
After
```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8         56      83130924 ns/op
BenchmarkGetWeight-8          246541          5241 ns/op
PASS
ok miniflux.app/v2/internal/reader/readability 6.026s
```
This should make ProcessFeedEntries marginally faster, while saving
some memory.
- Use chained strings.Contains instead of a regex for
blacklistCandidatesRegexp, as this is a bit faster
- Simplify a Find.Each.Remove to Find.Remove
- Don't concatenate id and class for removeUnlikelyCandidates, as it makes no
sense to match on a substring that overlaps both attributes. It might also
marginally improve performance, as the regex now runs on two shorter strings
instead of one concatenated string.
- Add a small benchmark
- Replace a completely overkill regex
- Use `.Remove()` instead of a hand-rolled loop
- Use a strings.Builder instead of a bytes.NewBufferString
- Replace a call to Fprintf with string concatenation, as the latter is much
faster
- Remove a superfluous cast
- Delay some computations
- Add some tests