1
0
Fork 0
mirror of https://github.com/miniflux/v2.git synced 2025-08-06 17:41:00 +00:00
Commit graph

19 commits

Author SHA1 Message Date
Frédéric Guillot
6eeccae7cd test(readability): increase test coverage 2025-06-30 21:29:07 -07:00
jvoisin
aed99e65c1 perf(readability): improve getClassWeight speed
Before

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     34	 86102474 ns/op
BenchmarkGetWeight-8        	  10573	    103045 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	5.409s
```

After

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     56	 83130924 ns/op
BenchmarkGetWeight-8        	 246541	     5241 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	6.026s
```

This should make ProcessFeedEntries marginally faster, while saving
some memory.
2025-06-30 19:28:20 -07:00
jvoisin
4e1f836266 refactor(readability): simplify a bit getArticle
- Use a proper division instead of multiplying by a float.
- Extract a condition in the parent scope
- Use an else-if construct instead of a simple if
2025-06-29 16:06:34 -07:00
jvoisin
c064891314 perf(readability): Simplify removeUnlikelyCandidates
- Use an array of strings instead of a regex, like done in ef13756b1a7a7ba30fd34174a5367381fd8b4849
- Extract the `shouldRemove` function from `removeUnlikelyCandidates`, as there
  is no reason to have it there instead of being a proper standalone function.
- Improve a condition, where the goquery selection would have its `id`
  attribute left unchecked if a `class` one was present, regardless of if
  `class` was a candidate to removal or not.
- Add some comments
2025-06-29 15:31:01 -07:00
Frédéric Guillot
6d58052504
fix(readability): do not remove elements within code blocks
`<span class="hljs-comment"># exit 1</span>` will match the `unlikelyCandidatesRegexp` because it contains the `comment` string.
2025-06-19 21:03:53 -07:00
jvoisin
8a014c6abc perf(readability): minor regex improvement
- Improve the check for tags by matching only if its name is followed either by
  a space, a slash or a closing angle
- Use an anonymous group
2025-06-12 19:13:58 -07:00
jvoisin
2df59b4865 Refactor internal/reader/readability/testdata
- Use chained strings.Contains instead of a regex for
  blacklistCandidatesRegexp, as this is a bit faster
- Simplify a Find.Each.Remove to Find.Remove
- Don't concatenate id and class for removeUnlikelyCandidates, as it makes no
  sense to match on overlaps. It might also marginally improve performances, as
  regex now have to run on two strings separately, instead of both.
- Add a small benchmark
2024-12-15 20:52:32 -08:00
Julien Voisin
6ad5ad0bb2
refactor(readability): various improvements and optimizations
- Replace a completely overkill regex
- Use `.Remove()` instead of a hand-rolled loop
- Use a strings.Builder instead of a bytes.NewBufferString
- Replace a call to Fprintf with string concatenation, as the latter are much
  faster
- Remove a superfluous cast
- Delay some computations
- Add some tests
2024-12-12 20:41:56 -08:00
Julien Voisin
e6185b1393
refactor: use min/max instead of math.Min/math.Max
This saves a couple of back'n'forth casts.
2024-12-11 19:43:14 -08:00
Julien Voisin
1b0b8b9c42
refactor: use a better construct than doc.Find(…).First()
As mentioned in goquery's documentation (https://pkg.go.dev/github.com/PuerkitoBio/goquery#Single):

> By default, Selection.Find and other functions that accept a selector string
to select nodes will use all matches corresponding to that selector. By using
the Matcher returned by Single, at most the first match will be selected.
>
> The one using Single is optimized to be potentially much faster on large documents.
2024-12-11 19:40:55 -08:00
Julien Voisin
2671f57edd
refactor(readability): simplify the regexes in internal/reader/readability/readability.go
- Use strings.ToLower() instead of having case-insensitive regex
- Remove overlapping words in the regex
- Split a condition to increase readability
2024-12-07 16:56:19 -08:00
Frédéric Guillot
29387f2d60 feat: implement base element handling in content scraper 2024-07-25 20:36:56 -07:00
Frédéric Guillot
b1e73fafdf Enable go-critic linter and fix various issues detected 2024-03-17 13:52:34 -07:00
jvoisin
347740dce1 Speed up removeUnlikelyCandidates
`.Not` returns a brand new Selection, copied element by element.
2024-02-29 19:38:43 -08:00
Frédéric Guillot
97765b93a9 Revert "Minor internal/reader/readability/readability.go speedup"
This reverts commit 4db138d4b8.

```
panic: runtime error: index out of range [-1]

goroutine 49 [running]:
miniflux.app/v2/internal/reader/readability.getArticle.func1(0x8?, 0xc000b56570)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:120 +0x2ac
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000b56510, 0xc000892fa8)
        /home/fred/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.9.0/iteration.go:10 +0x62
miniflux.app/v2/internal/reader/readability.getArticle(0xc00044f1f0, 0xc000a04a50)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:101 +0x15d
miniflux.app/v2/internal/reader/readability.ExtractContent({0x1005d00?, 0xc0001522d0?})
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:91 +0x211
miniflux.app/v2/internal/reader/scraper.ScrapeWebsite(0xc000893688?, {0xc0007ce720, 0x54}, {0x0, 0x0})
        /home/fred/repos/miniflux/v2/internal/reader/scraper/scraper.go:63 +0x859
miniflux.app/v2/internal/reader/processor.ProcessFeedEntries(0xc000133188, 0xc000502c40, 0xc0003e6360, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/processor/processor.go:77 +0x8ea
miniflux.app/v2/internal/reader/handler.RefreshFeed(0xc000133188, 0x10cf, 0x52d5c, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/handler/handler.go:301 +0x1485
miniflux.app/v2/internal/cli.refreshFeeds.func1(0x0)
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:59 +0x2d7
created by miniflux.app/v2/internal/cli.refreshFeeds in goroutine 1
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:50 +0x5d5
```
2024-02-29 19:06:03 -08:00
jvoisin
4db138d4b8 Minor internal/reader/readability/readability.go speedup
- Don't use a capturing group in `divToPElementsRegexp`
- Remove a duplicate condition
- Replace a regex with a fixed-comparison and a `Contains`
2024-02-28 20:03:14 -08:00
jvoisin
61af08a721 Use .WriteString( instead of .Write([]byte(… 2024-02-28 19:47:30 -08:00
Frédéric Guillot
c0e954f19d Implement structured logging using log/slog package 2023-09-24 22:37:33 -07:00
Frédéric Guillot
168a870c02 Move internal packages to an internal folder
For reference: https://go.dev/doc/go1.4#internalpackages
2023-08-10 20:29:34 -07:00
Renamed from reader/readability/readability.go (Browse further)