mirror of https://github.com/miniflux/v2.git synced 2025-08-06 17:41:00 +00:00
Commit graph

25 commits

Author SHA1 Message Date
jvoisin
69a74c4abf refactor(readability): minor clean up
Remove a now-useless regex and its associated test.
2025-07-02 16:50:49 -07:00
jvoisin
766d4ab834 refactor(readability): make use of getSelectionLength 2025-07-02 16:47:27 -07:00
Frédéric Guillot
8c3f280f32 test(readability): add test case for ExtractContent with broken reader 2025-07-01 20:14:52 -07:00
jvoisin
8a98926674 refactor(readability): add a getSelectionLength function
When we're only interested in the length of the contained text, there is no
need to materialize it fully and then call len() on the result: we can simply
iterate over the text elements and sum their lengths instead.
2025-07-01 19:52:53 -07:00
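The idea in the commit message above can be sketched as follows. This is a minimal illustration, not the code from the commit: the `node` type is a hypothetical stand-in for the HTML nodes that a goquery selection wraps, and `materializedLength` and `textLength` are made-up names.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, stripped-down stand-in for an HTML node:
// a text node carries its content in text, an element has children.
type node struct {
	text     string
	children []*node
}

// materializedLength builds the full text first, then measures it.
func materializedLength(n *node) int {
	var sb strings.Builder
	var walk func(*node)
	walk = func(n *node) {
		sb.WriteString(n.text)
		for _, c := range n.children {
			walk(c)
		}
	}
	walk(n)
	return len(sb.String())
}

// textLength sums the lengths of the text nodes without allocating
// the concatenated string.
func textLength(n *node) int {
	total := len(n.text)
	for _, c := range n.children {
		total += textLength(c)
	}
	return total
}

func main() {
	doc := &node{children: []*node{
		{text: "Hello, "},
		{children: []*node{{text: "world"}}},
		{text: "!"},
	}}
	fmt.Println(materializedLength(doc), textLength(doc))
}
```

Both functions return the same number; the second one only adds integers while walking the tree.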
jvoisin
89c32d518d perf(readability): significantly improve transformMisusedDivsIntoParagraphs 2025-07-01 19:44:58 -07:00
jvoisin
2f7b2e7375 perf(readability): improve getLinkDensity
- There is no need to materialize all the content of a given Node when we can
  simply compute its length directly, saving a lot of memory, on the order of
  several megabytes on my instance, with peaks at a couple of dozen.
- One might object to the usage of a recursive construct, but this is a direct
  port of goquery's Text method, so this change doesn't make anything worse.
- linkLength can be computed in a similar way, but this can go in another
  commit, as it's a bit trickier: we need the length of every Node that has
  an `a` Node as an ancestor, without walking the whole parent chain every
  time.
2025-07-01 19:40:47 -07:00
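One possible way to handle the trickier link-length case mentioned in the last bullet is to hand an "inside a link" flag down the recursion, so that no parent chain is walked. The sketch below is an assumption about how this could look, not the implementation that landed in miniflux; `node`, `lengths` and `linkDensity` are hypothetical names.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML node.
type node struct {
	tag      string // element name, empty for text nodes
	text     string // content, only set on text nodes
	children []*node
}

// lengths returns the total text length under n and the part of it
// that sits inside an <a> element. The insideLink flag is passed down
// the recursion, so ancestors never have to be inspected.
func lengths(n *node, insideLink bool) (total, linked int) {
	if n.tag == "a" {
		insideLink = true
	}
	total = len(n.text)
	if insideLink {
		linked = total
	}
	for _, c := range n.children {
		t, l := lengths(c, insideLink)
		total += t
		linked += l
	}
	return total, linked
}

// linkDensity is the share of the text that belongs to links.
func linkDensity(n *node) float64 {
	total, linked := lengths(n, false)
	if total == 0 {
		return 0
	}
	return float64(linked) / float64(total)
}

func main() {
	doc := &node{tag: "p", children: []*node{
		{text: "123456"},
		{tag: "a", children: []*node{{text: "ab"}}},
	}}
	fmt.Println(linkDensity(doc))
}
```

The recursion mirrors the "direct port of goquery's Text method" argument from the message: it visits each node once and keeps only two counters.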
Frédéric Guillot
6eeccae7cd test(readability): increase test coverage 2025-06-30 21:29:07 -07:00
jvoisin
aed99e65c1 perf(readability): improve getClassWeight speed
Before

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     34	 86102474 ns/op
BenchmarkGetWeight-8        	  10573	    103045 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	5.409s
```

After

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     56	 83130924 ns/op
BenchmarkGetWeight-8        	 246541	     5241 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	6.026s
```

This should make ProcessFeedEntries marginally faster, while saving
some memory.
2025-06-30 19:28:20 -07:00
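To put the two benchmark runs above side by side, the ns/op columns can be turned into ratios. The small program below only does that arithmetic; `speedup` is an illustrative helper and the numbers are copied from the output above. It works out to roughly 19.7x for BenchmarkGetWeight and about 1.04x for BenchmarkExtractContent.

```go
package main

import "fmt"

// speedup returns how many times faster the "after" measurement is.
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}

func main() {
	// ns/op values copied from the benchmark output above.
	fmt.Printf("BenchmarkGetWeight:      %.1fx\n", speedup(103045, 5241))
	fmt.Printf("BenchmarkExtractContent: %.2fx\n", speedup(86102474, 83130924))
}
```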
jvoisin
4e1f836266 refactor(readability): simplify getArticle a bit
- Use a proper division instead of multiplying by a float.
- Extract a condition in the parent scope
- Use an else-if construct instead of a simple if
2025-06-29 16:06:34 -07:00
jvoisin
c064891314 perf(readability): Simplify removeUnlikelyCandidates
- Use an array of strings instead of a regex, like done in ef13756b1a7a7ba30fd34174a5367381fd8b4849
- Extract the `shouldRemove` function from `removeUnlikelyCandidates`, as there
  is no reason to have it there instead of being a proper standalone function.
- Fix a condition where the goquery selection would have its `id` attribute
  left unchecked if a `class` attribute was present, regardless of whether
  `class` was a candidate for removal.
- Add some comments
2025-06-29 15:31:01 -07:00
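The first and third bullets above can be illustrated with a short sketch. It assumes a simplified signature taking the attribute values as plain strings, whereas the commit talks about a goquery selection; the word list is a shortened, illustrative one, and `containsAny` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// unlikelyCandidates is a shortened, illustrative list, not the one
// used in readability.go.
var unlikelyCandidates = []string{"banner", "comment", "footer", "sidebar"}

// containsAny reports whether s contains one of the given substrings.
func containsAny(s string, needles []string) bool {
	for _, needle := range needles {
		if strings.Contains(s, needle) {
			return true
		}
	}
	return false
}

// shouldRemove checks the class and the id attributes independently,
// so a harmless class cannot hide a suspicious id.
func shouldRemove(class, id string) bool {
	return containsAny(strings.ToLower(class), unlikelyCandidates) ||
		containsAny(strings.ToLower(id), unlikelyCandidates)
}

func main() {
	fmt.Println(shouldRemove("article-body", "sidebar-left"))
	fmt.Println(shouldRemove("article-body", "main"))
}
```

A plain slice of substrings keeps the check readable and avoids running a regex engine for what is a fixed set of words.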
Frédéric Guillot
6d58052504
fix(readability): do not remove elements within code blocks
`<span class="hljs-comment"># exit 1</span>` will match the `unlikelyCandidatesRegexp` because it contains the `comment` string.
2025-06-19 21:03:53 -07:00
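The false positive described above is easy to reproduce with a substring check. The sketch below assumes the element's ancestor tag names are available as a slice, which is a simplification made for illustration; `insideCodeBlock` and `shouldRemove` are hypothetical names and this is not necessarily how the fix is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// insideCodeBlock reports whether one of the ancestor tags is a
// <pre> or a <code> element.
func insideCodeBlock(ancestors []string) bool {
	for _, tag := range ancestors {
		if tag == "pre" || tag == "code" {
			return true
		}
	}
	return false
}

// shouldRemove flags classes containing "comment", except for
// elements that live inside a code block.
func shouldRemove(class string, ancestors []string) bool {
	if insideCodeBlock(ancestors) {
		return false
	}
	return strings.Contains(class, "comment")
}

func main() {
	// A syntax-highlighting span: the class contains "comment".
	fmt.Println(shouldRemove("hljs-comment", []string{"body", "pre", "code"}))
	// A real comment section outside of any code block.
	fmt.Println(shouldRemove("comment-list", []string{"body", "div"}))
}
```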
jvoisin
8a014c6abc perf(readability): minor regex improvement
- Improve the check for tags by matching a tag only if its name is followed
  by a space, a slash, or a closing angle bracket
- Use an anonymous group
2025-06-12 19:13:58 -07:00
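Both bullets above can be shown on a small pattern. The tag list below is shortened and the variable names are made up, so treat it as a sketch of the technique rather than the regex from readability.go.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose uses a capturing group and accepts any tag whose name
	// merely starts with one of the alternatives.
	loose = regexp.MustCompile(`<(a|p|pre|ul)`)
	// strict uses a non-capturing group and requires the tag name to
	// be followed by a space, a slash or a closing angle bracket.
	strict = regexp.MustCompile(`<(?:a|p|pre|ul)[ />]`)
)

func main() {
	for _, html := range []string{
		"<article>text</article>",
		"<p>text</p>",
		`<pre class="x">text</pre>`,
	} {
		fmt.Println(html, loose.MatchString(html), strict.MatchString(html))
	}
}
```

With the loose pattern, `<article>` counts as an `<a>` tag; the strict one rejects it because `a` is followed by `r`.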
jvoisin
2df59b4865 Refactor internal/reader/readability/testdata
- Use chained strings.Contains instead of a regex for
  blacklistCandidatesRegexp, as this is a bit faster
- Simplify a Find.Each.Remove to Find.Remove
- Don't concatenate id and class for removeUnlikelyCandidates, as it makes no
  sense to match on overlaps. It might also marginally improve performance, as
  the regex now runs on the two strings separately instead of on their
  concatenation.
- Add a small benchmark
2024-12-15 20:52:32 -08:00
Julien Voisin
6ad5ad0bb2
refactor(readability): various improvements and optimizations
- Replace a completely overkill regex
- Use `.Remove()` instead of a hand-rolled loop
- Use a strings.Builder instead of a bytes.NewBufferString
- Replace a call to Fprintf with string concatenation, as the latter is much
  faster
- Remove a superfluous cast
- Delay some computations
- Add some tests
2024-12-12 20:41:56 -08:00
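Two of the bullets above, the switch to strings.Builder and the replacement of Fprintf by concatenation, can be sketched like this. The function names and the output format are assumptions chosen for illustration, not taken from the commit.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// withBuffer formats the output through fmt.Fprintf into a buffer
// created with bytes.NewBufferString.
func withBuffer(tag, content string) string {
	buf := bytes.NewBufferString("")
	fmt.Fprintf(buf, "<%s>%s</%s>", tag, content, tag)
	return buf.String()
}

// withBuilder produces the same output with a strings.Builder and
// plain string concatenation, without parsing a format string.
func withBuilder(tag, content string) string {
	var sb strings.Builder
	sb.WriteString("<" + tag + ">" + content + "</" + tag + ">")
	return sb.String()
}

func main() {
	fmt.Println(withBuffer("p", "hi"))
	fmt.Println(withBuilder("p", "hi"))
}
```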
Julien Voisin
e6185b1393
refactor: use min/max instead of math.Min/math.Max
This saves a couple of back'n'forth casts.
2024-12-11 19:43:14 -08:00
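The casts mentioned above come from math.Min and math.Max operating on float64 only. A minimal sketch of the difference, assuming Go 1.21 or later for the built-in `min` and `max`; the `clampOld` and `clampNew` helpers are hypothetical examples, not functions from miniflux.

```go
package main

import (
	"fmt"
	"math"
)

// clampOld needs to convert to float64 and back to use the math package.
func clampOld(v, lo, hi int) int {
	return int(math.Max(float64(lo), math.Min(float64(v), float64(hi))))
}

// clampNew uses the built-in min and max, which work on ints directly.
func clampNew(v, lo, hi int) int {
	return max(lo, min(v, hi))
}

func main() {
	fmt.Println(clampOld(15, 0, 10), clampNew(15, 0, 10))
	fmt.Println(clampOld(-3, 0, 10), clampNew(-3, 0, 10))
}
```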
Julien Voisin
1b0b8b9c42
refactor: use a better construct than doc.Find(…).First()
As mentioned in goquery's documentation (https://pkg.go.dev/github.com/PuerkitoBio/goquery#Single):

> By default, Selection.Find and other functions that accept a selector string
> to select nodes will use all matches corresponding to that selector. By using
> the Matcher returned by Single, at most the first match will be selected.
>
> The one using Single is optimized to be potentially much faster on large documents.
2024-12-11 19:40:55 -08:00
Julien Voisin
2671f57edd
refactor(readability): simplify the regexes in internal/reader/readability/readability.go
- Use strings.ToLower() instead of having case-insensitive regex
- Remove overlapping words in the regex
- Split a condition to increase readability
2024-12-07 16:56:19 -08:00
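The first bullet above trades a case-insensitive regex for a lowercased input. The patterns below are shortened, illustrative ones and the function names are made up; the sketch only shows that both approaches accept the same inputs.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// insensitive relies on the (?i) flag to ignore case.
	insensitive = regexp.MustCompile(`(?i)(?:article|body|content)`)
	// lowerOnly expects its input to be lowercased by the caller.
	lowerOnly = regexp.MustCompile(`article|body|content`)
)

func matchInsensitive(s string) bool {
	return insensitive.MatchString(s)
}

func matchLowered(s string) bool {
	return lowerOnly.MatchString(strings.ToLower(s))
}

func main() {
	for _, class := range []string{"Main-Content", "Sidebar"} {
		fmt.Println(class, matchInsensitive(class), matchLowered(class))
	}
}
```

Lowercasing once also allows the same string to be reused across several checks.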
Frédéric Guillot
29387f2d60 feat: implement base element handling in content scraper 2024-07-25 20:36:56 -07:00
Frédéric Guillot
b1e73fafdf Enable go-critic linter and fix various issues detected 2024-03-17 13:52:34 -07:00
jvoisin
347740dce1 Speed up removeUnlikelyCandidates
`.Not` returns a brand new Selection, copied element by element.
2024-02-29 19:38:43 -08:00
Frédéric Guillot
97765b93a9 Revert "Minor internal/reader/readability/readability.go speedup"
This reverts commit 4db138d4b8.

```
panic: runtime error: index out of range [-1]

goroutine 49 [running]:
miniflux.app/v2/internal/reader/readability.getArticle.func1(0x8?, 0xc000b56570)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:120 +0x2ac
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000b56510, 0xc000892fa8)
        /home/fred/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.9.0/iteration.go:10 +0x62
miniflux.app/v2/internal/reader/readability.getArticle(0xc00044f1f0, 0xc000a04a50)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:101 +0x15d
miniflux.app/v2/internal/reader/readability.ExtractContent({0x1005d00?, 0xc0001522d0?})
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:91 +0x211
miniflux.app/v2/internal/reader/scraper.ScrapeWebsite(0xc000893688?, {0xc0007ce720, 0x54}, {0x0, 0x0})
        /home/fred/repos/miniflux/v2/internal/reader/scraper/scraper.go:63 +0x859
miniflux.app/v2/internal/reader/processor.ProcessFeedEntries(0xc000133188, 0xc000502c40, 0xc0003e6360, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/processor/processor.go:77 +0x8ea
miniflux.app/v2/internal/reader/handler.RefreshFeed(0xc000133188, 0x10cf, 0x52d5c, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/handler/handler.go:301 +0x1485
miniflux.app/v2/internal/cli.refreshFeeds.func1(0x0)
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:59 +0x2d7
created by miniflux.app/v2/internal/cli.refreshFeeds in goroutine 1
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:50 +0x5d5
```
2024-02-29 19:06:03 -08:00
jvoisin
4db138d4b8 Minor internal/reader/readability/readability.go speedup
- Don't use a capturing group in `divToPElementsRegexp`
- Remove a duplicate condition
- Replace a regex with a fixed-comparison and a `Contains`
2024-02-28 20:03:14 -08:00
jvoisin
61af08a721 Use .WriteString( instead of .Write([]byte(… 2024-02-28 19:47:30 -08:00
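The commit title above does not say which writer is involved; the sketch below assumes a bytes.Buffer purely for illustration, with hypothetical function names. Both variants produce the same output, but WriteString skips the explicit conversion to a byte slice, which may cost a copy.

```go
package main

import (
	"bytes"
	"fmt"
)

// viaWrite converts each string to a byte slice before writing it.
func viaWrite(parts ...string) string {
	var buf bytes.Buffer
	for _, p := range parts {
		buf.Write([]byte(p))
	}
	return buf.String()
}

// viaWriteString hands the strings over directly.
func viaWriteString(parts ...string) string {
	var buf bytes.Buffer
	for _, p := range parts {
		buf.WriteString(p)
	}
	return buf.String()
}

func main() {
	fmt.Println(viaWrite("<p>", "text", "</p>"))
	fmt.Println(viaWriteString("<p>", "text", "</p>"))
}
```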
Frédéric Guillot
c0e954f19d Implement structured logging using log/slog package 2023-09-24 22:37:33 -07:00
Frédéric Guillot
168a870c02 Move internal packages to an internal folder
For reference: https://go.dev/doc/go1.4#internalpackages
2023-08-10 20:29:34 -07:00
Renamed from reader/readability/readability.go