1
0
Fork 0
mirror of https://github.com/miniflux/v2.git synced 2025-08-06 17:41:00 +00:00
Commit graph

12 commits

Author SHA1 Message Date
jvoisin
69a74c4abf refactor(readability): minor clean up
Remove a now-useless regex and its associated test.
2025-07-02 16:50:49 -07:00
Frédéric Guillot
8c3f280f32 test(readability): add test case for ExtractContent with broken reader 2025-07-01 20:14:52 -07:00
jvoisin
2f7b2e7375 perf(readability): improve getLinkDensity
- There is no need to materialize all the content of a given Node when we can
  simply compute its length directly, saving a lot of memory, on the order of
  several megabytes on my instance, with peaks at a couple of dozen.
- One might object to the usage of a recursive construct, but this is a direct
  port of goquery's Text method, so this change doesn't make anything worse.
- The computation of linkLength can be similarly computed, but this can go in
  another commit, as it's a bit trickier, since we need to get the length of
  every Node that has a `a` Node as parent, without iterating on the whole
  parent chain every time.
2025-07-01 19:40:47 -07:00
Frédéric Guillot
6eeccae7cd test(readability): increase test coverage 2025-06-30 21:29:07 -07:00
jvoisin
aed99e65c1 perf(readability): improve getClassWeight speed
Before

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     34	 86102474 ns/op
BenchmarkGetWeight-8        	  10573	    103045 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	5.409s
```

After

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     56	 83130924 ns/op
BenchmarkGetWeight-8        	 246541	     5241 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	6.026s
```

This should make ProcessFeedEntries marginally faster, while saving
some memory.
2025-06-30 19:28:20 -07:00
Frédéric Guillot
a68de4ee6a test(readability): add tests for getArticle function 2025-06-29 16:03:17 -07:00
Frédéric Guillot
5129f53d58 test(readability): add tests for removeUnlikelyCandidates function 2025-06-29 15:23:56 -07:00
Frédéric Guillot
e60f0fd142 test(readability): add tests for getClassWeight function 2025-06-29 13:24:06 -07:00
Frédéric Guillot
6d58052504
fix(readability): do not remove elements within code blocks
`<span class="hljs-comment"># exit 1</span>` will match the `unlikelyCandidatesRegexp` because it contains the `comment` string.
2025-06-19 21:03:53 -07:00
jvoisin
2df59b4865 Refactor internal/reader/readability/testdata
- Use chained strings.Contains instead of a regex for
  blacklistCandidatesRegexp, as this is a bit faster
- Simplify a Find.Each.Remove to Find.Remove
- Don't concatenate id and class for removeUnlikelyCandidates, as it makes no
  sense to match on overlaps. It might also marginally improve performances, as
  regex now have to run on two strings separately, instead of both.
- Add a small benchmark
2024-12-15 20:52:32 -08:00
Julien Voisin
6ad5ad0bb2
refactor(readability): various improvements and optimizations
- Replace a completely overkill regex
- Use `.Remove()` instead of a hand-rolled loop
- Use a strings.Builder instead of a bytes.NewBufferString
- Replace a call to Fprintf with string concatenation, as the latter are much
  faster
- Remove a superfluous cast
- Delay some computations
- Add some tests
2024-12-12 20:41:56 -08:00
Frédéric Guillot
29387f2d60 feat: implement base element handling in content scraper 2024-07-25 20:36:56 -07:00