mirror of https://github.com/miniflux/v2.git synced 2025-08-06 17:41:00 +00:00
Commit graph

280 commits

Author SHA1 Message Date
Frédéric Guillot
8c3f280f32 test(readability): add test case for ExtractContent with broken reader 2025-07-01 20:14:52 -07:00
jvoisin
8a98926674 refactor(readability): add a getSelectionLength function
When we're only interested in the length of the contained text, there is no
need to materialize it fully and then call len() on the result: we can simply
iterate over the text elements and sum their lengths instead.
2025-07-01 19:52:53 -07:00
jvoisin
435a950d64 refactor(sanitizer): minor refactorization
Use a proper switch-case instead of a bunch of if.
2025-07-01 19:48:55 -07:00
jvoisin
89c32d518d perf(readability): significantly improve transformMisusedDivsIntoParagraphs 2025-07-01 19:44:58 -07:00
jvoisin
2f7b2e7375 perf(readability): improve getLinkDensity
- There is no need to materialize all the content of a given Node when we can
  simply compute its length directly. This saves a lot of memory: on the order
  of several megabytes on my instance, with peaks at a couple of dozen.
- One might object to the usage of a recursive construct, but this is a direct
  port of goquery's Text method, so this change doesn't make anything worse.
- linkLength could be computed similarly, but this can go in another commit,
  as it's a bit trickier: we need to get the length of every Node that has an
  `a` Node as parent, without iterating over the whole parent chain every time.
2025-07-01 19:40:47 -07:00
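
The idea behind this commit can be sketched as follows. This is a minimal illustration, not the project's code: `node` is a hypothetical stand-in for a parsed HTML node, and `textLength` is an assumed name. It walks the tree recursively and adds up text lengths instead of building the concatenated string first.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML node (not the real html.Node).
type node struct {
	text     string  // non-empty only for text nodes
	children []*node // child nodes, in document order
}

// textLength sums the byte length of every text node under n
// without concatenating the text into a new string.
func textLength(n *node) int {
	if n == nil {
		return 0
	}
	total := len(n.text)
	for _, child := range n.children {
		total += textLength(child)
	}
	return total
}

func main() {
	doc := &node{children: []*node{
		{text: "Hello, "},
		{children: []*node{{text: "world"}}},
	}}
	fmt.Println(textLength(doc)) // 12
}
```

The recursion only keeps an integer per stack frame, so memory use no longer grows with the amount of text in the document.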
Frédéric Guillot
6eeccae7cd test(readability): increase test coverage 2025-06-30 21:29:07 -07:00
jvoisin
aed99e65c1 perf(readability): improve getClassWeight speed
Before

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     34	 86102474 ns/op
BenchmarkGetWeight-8        	  10573	    103045 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	5.409s
```

After

```console
$ go test -bench=.
goos: linux
goarch: arm64
pkg: miniflux.app/v2/internal/reader/readability
BenchmarkExtractContent-8   	     56	 83130924 ns/op
BenchmarkGetWeight-8        	 246541	     5241 ns/op
PASS
ok  	miniflux.app/v2/internal/reader/readability	6.026s
```

This should make ProcessFeedEntries marginally faster, while saving
some memory.
2025-06-30 19:28:20 -07:00
jvoisin
d1a3f98df9 perf(fetcher): save 8 bytes in the RequestBuilder struct
before:

```
  // request_builder.go:25 | Size: 64 (Optimal: 56)
  type RequestBuilder struct {
    headers          http.Header                 ■ ■ ■ ■ ■ ■ ■ ■
    clientProxyURL   *url.URL                    ■ ■ ■ ■ ■ ■ ■ ■
    useClientProxy   bool                        ■ □ □ □ □ □ □ □
    clientTimeout    int                         ■ ■ ■ ■ ■ ■ ■ ■
    withoutRedirects bool                        ■
    ignoreTLSErrors  bool                          ■
    disableHTTP2     bool                            ■ □ □ □ □ □
    proxyRotator     *proxyrotator.ProxyRotator  ■ ■ ■ ■ ■ ■ ■ ■
    feedProxyURL     string                      ■ ■ ■ ■ ■ ■ ■ ■
                                                 ■ ■ ■ ■ ■ ■ ■ ■
  }
```

after:

```
  // request_builder.go:25 | Size: 56
  type RequestBuilder struct {
    headers          http.Header                 ■ ■ ■ ■ ■ ■ ■ ■
    clientProxyURL   *url.URL                    ■ ■ ■ ■ ■ ■ ■ ■
    clientTimeout    int                         ■ ■ ■ ■ ■ ■ ■ ■
    useClientProxy   bool                        ■
    withoutRedirects bool                          ■
    ignoreTLSErrors  bool                            ■
    disableHTTP2     bool                              ■ □ □ □ □
    proxyRotator     *proxyrotator.ProxyRotator  ■ ■ ■ ■ ■ ■ ■ ■
    feedProxyURL     string                      ■ ■ ■ ■ ■ ■ ■ ■
                                                 ■ ■ ■ ■ ■ ■ ■ ■
  }
```
2025-06-29 16:10:35 -07:00
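
The effect of the reordering can be checked with `unsafe.Sizeof`. The sketch below uses simplified stand-in field types (a plain map and `*string` pointers instead of `http.Header`, `*url.URL`, and `*proxyrotator.ProxyRotator`); these stand-ins are assumptions chosen to have the same sizes on common platforms, and `before`, `after`, and `sizes` are hypothetical names.

```go
package main

import (
	"fmt"
	"unsafe"
)

// before mirrors the original field order with stand-in types.
type before struct {
	headers          map[string][]string
	clientProxyURL   *string
	useClientProxy   bool // followed by padding before the int
	clientTimeout    int
	withoutRedirects bool
	ignoreTLSErrors  bool
	disableHTTP2     bool // followed by padding before the pointer
	proxyRotator     *string
	feedProxyURL     string
}

// after groups the four bools together so they share one padded word.
type after struct {
	headers          map[string][]string
	clientProxyURL   *string
	clientTimeout    int
	useClientProxy   bool
	withoutRedirects bool
	ignoreTLSErrors  bool
	disableHTTP2     bool
	proxyRotator     *string
	feedProxyURL     string
}

// sizes returns the size in bytes of both layouts.
func sizes() (uintptr, uintptr) {
	return unsafe.Sizeof(before{}), unsafe.Sizeof(after{})
}

func main() {
	b, a := sizes()
	fmt.Println(b, a) // 64 56 on 64-bit platforms
}
```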
jvoisin
4e1f836266 refactor(readability): simplify getArticle a bit
- Use a proper division instead of multiplying by a float.
- Extract a condition in the parent scope
- Use an else-if construct instead of a simple if
2025-06-29 16:06:34 -07:00
Frédéric Guillot
a68de4ee6a test(readability): add tests for getArticle function 2025-06-29 16:03:17 -07:00
jvoisin
c064891314 perf(readability): Simplify removeUnlikelyCandidates
- Use an array of strings instead of a regex, like done in ef13756b1a7a7ba30fd34174a5367381fd8b4849
- Extract the `shouldRemove` function from `removeUnlikelyCandidates`, as there
  is no reason to have it there instead of being a proper standalone function.
- Fix a condition where the goquery selection would have its `id` attribute
  left unchecked if a `class` one was present, regardless of whether `class`
  was a candidate for removal or not.
- Add some comments
2025-06-29 15:31:01 -07:00
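
A minimal sketch of the approach described in this commit, under stated assumptions: the marker list below is illustrative rather than the project's actual list, `containsMarker` is a hypothetical helper, and the function takes plain strings instead of a goquery selection. Both attributes are checked, so a harmless `class` no longer hides a matching `id`.

```go
package main

import (
	"fmt"
	"strings"
)

// unlikelyMarkers is an illustrative list, not the project's actual one.
var unlikelyMarkers = []string{"banner", "comment", "sidebar", "sponsor"}

// containsMarker reports whether value contains any marker, ignoring case.
func containsMarker(value string) bool {
	value = strings.ToLower(value)
	for _, marker := range unlikelyMarkers {
		if strings.Contains(value, marker) {
			return true
		}
	}
	return false
}

// shouldRemove checks both attributes: the id is examined even when
// a class is present and does not match.
func shouldRemove(class, id string) bool {
	return containsMarker(class) || containsMarker(id)
}

func main() {
	fmt.Println(shouldRemove("article-body", "sidebar-left")) // true
	fmt.Println(shouldRemove("article-body", "main"))         // false
}
```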
Frédéric Guillot
5129f53d58 test(readability): add tests for removeUnlikelyCandidates function 2025-06-29 15:23:56 -07:00
Frédéric Guillot
e60f0fd142 test(readability): add tests for getClassWeight function 2025-06-29 13:24:06 -07:00
Julien Voisin
2b26a345cd perf(processor): minify content even further
There is no need to keep comments (conditional or not, as IE isn't a thing
anymore), nor default attribute values.
2025-06-29 12:55:34 -07:00
Frédéric Guillot
3de31a1a4d test(processor): add more unit tests for minifyContent function 2025-06-29 12:53:23 -07:00
jvoisin
560be66147 refactor(misc): Use proper slog.XXX instead of slog.Any
This has close to no impact for now, as our slog.Debug/Info/... are leaking
their parameters to the heap, but using proper typing instead of Any allows
to skip some reflection-based computation, making things marginally faster,
and removing the corresponding heap leak.
2025-06-29 12:30:17 -07:00
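
For illustration, typed attributes look like this with the standard `log/slog` package. The message text, attribute keys, and the `render` helper are assumptions made for this sketch; the timestamp is dropped through `ReplaceAttr` only to keep the output deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// render logs one record with typed attributes and returns the text output.
func render(feedID int64, feedURL string) string {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{
		Level: slog.LevelDebug,
		// Drop the top-level time attribute so the output is stable.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	})
	logger := slog.New(handler)
	// slog.Int64 and slog.String carry their type, unlike slog.Any.
	logger.Debug("Fetching feed",
		slog.Int64("feed_id", feedID),
		slog.String("feed_url", feedURL),
	)
	return buf.String()
}

func main() {
	fmt.Print(render(42, "https://example.org/feed.xml"))
}
```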
Frédéric Guillot
6d58052504 fix(readability): do not remove elements within code blocks
`<span class="hljs-comment"># exit 1</span>` will match the `unlikelyCandidatesRegexp` because it contains the `comment` string.
2025-06-19 21:03:53 -07:00
Frédéric Guillot
db49e41acf refactor(processor): move FilterEntryMaxAgeDays filter to filter package 2025-06-19 17:56:45 -07:00
Frédéric Guillot
e6b814199b feat(filter): add EntryDate=max-age:duration filter
Example: `EntryDate=max-age:30d` or `EntryDate=max-age:1h`

Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h", "d".
2025-06-19 17:25:19 -07:00
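
Go's `time.ParseDuration` does not accept a "d" unit, so a day suffix needs separate handling. The sketch below shows one way this could be done; `parseMaxAge` and `isTooOld` are hypothetical names, the restriction to whole, non-negative day counts is a simplification, and the assumption that entries older than the max age are the ones filtered out is mine, not the commit's.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseMaxAge accepts everything time.ParseDuration does,
// plus a whole-number "d" (day) suffix such as "30d".
func parseMaxAge(s string) (time.Duration, error) {
	if days, ok := strings.CutSuffix(s, "d"); ok {
		n, err := strconv.Atoi(days)
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid day count %q", days)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

// isTooOld reports whether an entry published at the given time
// exceeds maxAge relative to now.
func isTooOld(published, now time.Time, maxAge time.Duration) bool {
	return now.Sub(published) > maxAge
}

func main() {
	maxAge, err := parseMaxAge("30d")
	if err != nil {
		panic(err)
	}
	now := time.Now()
	fmt.Println(maxAge)                                           // 720h0m0s
	fmt.Println(isTooOld(now.Add(-31*24*time.Hour), now, maxAge)) // true
}
```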
Frédéric Guillot
9c05c3c493 feat(filter): merge user and feed entry filter rules 2025-06-19 16:24:57 -07:00
Frédéric Guillot
2a9d91c783 feat: add entry filters at the feed level 2025-06-19 15:15:16 -07:00
Frédéric Guillot
cb59944d6b refactor(processor): move RewriteEntryURL function to rewrite package 2025-06-19 13:22:29 -07:00
Frédéric Guillot
c12476c1a9 refactor(filter): avoid code duplication between IsBlockedEntry and IsAllowedEntry functions 2025-06-19 12:55:00 -07:00
Frédéric Guillot
bc6ab44ff2 fix(filter): skip invalid rules instead of exiting the loop 2025-06-19 12:36:35 -07:00
Frédéric Guillot
6282ac1f38 refactor(processor): move filters to a filter package 2025-06-19 12:06:30 -07:00
jvoisin
96c0ef4efd refactor(processor): massive refactoring of filters.go
- Use proper variable names for `key=value` strings parts
- Explicitly assign false to the `match` boolean
- Use an explicit `len(parts) == 2` assertion to help the compiler remove
  `isSliceInBounds` calls.
- Refactor identical code into a containsRegexPattern function.
- Early exit when parsing the first date fails when using the `Between`
  operator, instead of trying to parse the second one.
2025-06-19 11:43:47 -07:00
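
The `key=value` handling mentioned in this commit can be sketched as below. This is an assumption-laden illustration: `parseRules` is a hypothetical name, the one-rule-per-line format and the `EntryTitle` key are assumptions, and invalid rules are skipped rather than ending the loop. The explicit `len(parts) == 2` check also lets the compiler see that both index accesses are in bounds.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRules splits each line into a key and a value on the first "=".
// Lines without "=" are skipped so one bad rule does not hide the rest.
func parseRules(text string) [][2]string {
	var rules [][2]string
	for _, line := range strings.Split(text, "\n") {
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue // invalid rule: skip it and keep going
		}
		key := strings.TrimSpace(parts[0])
		value := strings.TrimSpace(parts[1])
		rules = append(rules, [2]string{key, value})
	}
	return rules
}

func main() {
	rules := parseRules("EntryDate=max-age:30d\nbroken rule\nEntryTitle=a=b")
	for _, r := range rules {
		fmt.Printf("%s -> %s\n", r[0], r[1])
	}
}
```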
jvoisin
b139ac4a2c refactor(youtube): Remove a regex and make use of fetchWatchTime 2025-06-19 11:43:47 -07:00
jvoisin
c818d5bbb8 refactor(youtube): initialize two maps to the proper length 2025-06-19 11:43:47 -07:00
jvoisin
e366710529 refactor(processor): remove a useless type declaration 2025-06-19 11:43:47 -07:00
jvoisin
5cff4d7117 refactor(processor): remove a duplicate function call
As youtubeVideoID is assigned to getVideoIDFromYouTubeURL(entry.URL),
there is no need to call the latter again when we can simply use youtubeVideoID
instead.
2025-06-19 11:43:47 -07:00
jvoisin
f31a784eaa refactor(processor): refactor common code into a fetchWatchTime function
Both nebula and odysee were using the same function to parse time.
2025-06-19 11:43:47 -07:00
jvoisin
7edfcc3cf7 refactor(processor): remove a useless type declaration 2025-06-19 11:43:47 -07:00
jvoisin
fe4b00b9f8 refactor(processor): extract some functions into a utils.go file 2025-06-19 11:43:47 -07:00
jvoisin
46b159ac58 refactor(processor): simplify bilibili processing
- Use strings.Contains instead of a regex
- Use strings concatenation instead of a call to fmt.Sprintf
- Use `any` instead of `interface{}`
2025-06-19 11:43:47 -07:00
jvoisin
86c58e11f6 perf(reader): use a non-cryptographic hash when possible
There is no need to use SHA256 everywhere, especially on small inputs where we
don't care about its cryptographic properties. We're using FNV as it's the
fastest available hash in Go's standard library, and we're picking its "a"
version as it has slightly better avalanche characteristics, which are
relevant for small inputs.

This commit has the side effect of invalidating all favicons saved in the
database, which is desirable to benefit from the resize process implemented in
777d0dd2, as it didn't apply retroactively.

We're also making use of hex.EncodeToString instead of fmt.Sprintf, as it's
marginally faster.

Note that we can't change the usage of sha256 for feed.Hash as it's used to
deduplicate entries in the database.
2025-06-18 20:28:23 -07:00
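
A minimal sketch of the combination described above, using only the standard library. The 64-bit variant and the `shortHash` name are assumptions; the commit message does not say which FNV size is used.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/fnv"
)

// shortHash returns the hex-encoded FNV-1a hash of data.
// It is not suitable where cryptographic properties matter.
func shortHash(data []byte) string {
	h := fnv.New64a()
	h.Write(data) // hash.Hash writes never return an error
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(shortHash(nil))                    // cbf29ce484222325 (the FNV-1a 64-bit offset basis)
	fmt.Println(len(shortHash([]byte("favicon")))) // 16
}
```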
jvoisin
43546976d2 refactor(tests): use b.Loop() instead of for range b.N
See https://tip.golang.org/doc/go1.24#new-benchmark-function
2025-06-18 20:12:55 -07:00
Frédéric Guillot
6af4d69c39 test(sanitizer): add test case to cover Vimeo iframe rewrite without query string 2025-06-17 17:55:39 -07:00
Frédéric Guillot
27015a5e34 test(sanitizer): add unit test for 0x0 pixel tracker 2025-06-17 17:42:55 -07:00
jvoisin
cdb57b3843 perf(sanitizer): minor simplifications of the sanitizer
- Factorize some conditions
- Remove useless `default` case and move the return at the end of the functions
- Use strings.CutPrefix instead of strings.HasPrefix + strings.TrimPrefix
- Use switch-case constructs instead of slices.Contains, as this reduces the
  complexity of the functions and allows them to be inlined, as well as helping
  the compiler to optimize them, as it sucks at interprocedural optimizations.
2025-06-17 17:42:45 -07:00
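
Two of the points above, sketched with the standard library. The function names, the data-URL use case, and the list of allowed types are assumptions made for illustration and are not taken from the sanitizer itself.

```go
package main

import (
	"fmt"
	"strings"
)

// mediaTypeOfDataURL extracts the media type from a data URL.
// strings.CutPrefix tests for the prefix and strips it in one call.
func mediaTypeOfDataURL(u string) (string, bool) {
	rest, ok := strings.CutPrefix(u, "data:")
	if !ok {
		return "", false
	}
	end := strings.IndexAny(rest, ";,")
	if end < 0 {
		return "", false
	}
	return rest[:end], true
}

// isAllowedImageType uses a switch over constants instead of
// slices.Contains, with the final return at the end of the function.
func isAllowedImageType(mediaType string) bool {
	switch mediaType {
	case "image/gif", "image/jpeg", "image/png", "image/webp":
		return true
	}
	return false
}

func main() {
	mediaType, ok := mediaTypeOfDataURL("data:image/png;base64,AAAA")
	fmt.Println(mediaType, ok, isAllowedImageType(mediaType)) // image/png true true
}
```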
jvoisin
152ef578d2 feat(sanitizer): consider images of size 0x0 as pixel trackers 2025-06-17 17:32:00 -07:00
jvoisin
72486b9bd1 refactor(processor): minor simplification of a loop
This makes the code a tad clearer.
2025-06-17 17:30:13 -07:00
jvoisin
81df0b2a16 perf(rewrite): make getPredefinedRewriteRules O(1) 2025-06-17 17:27:36 -07:00
jvoisin
b296f21e98 refactor(internal): add a urllib.DomainWithoutWWW function 2025-06-17 17:27:36 -07:00
jvoisin
af15032145 perf(fetcher): pre-allocate the cipherSuites 2025-06-17 16:53:00 -07:00
jvoisin
8660f5e3c7 perf(media): minor regex simplification
The previous regex was using the [ABC..D]*[ABC] pattern, resulting in a lot of
backtracking. The new regex is stopping the matching at the first space or end
of text (and removes the trailing `.` should one be present).

The backtracking was taking around 50% of the CPU time spent in atom.Parse
2025-06-17 16:49:07 -07:00
Frédéric Guillot
da4ab4263c feat(rewrite): add parkablogs.com to the referer override list 2025-06-16 20:28:11 -07:00
jvoisin
237672a62c perf(sanitizer): use a switch-case instead of a map
This removes a heap allocation, and should be way faster. It also makes the
code shorter/simpler.
2025-06-16 14:54:48 -07:00
jvoisin
e9d4a130fd refactor(sanitizer): remove two useless www. prefixes
No need to have those prefixes, as the check is for substrings, so removing
them will increase the number of matches.
2025-06-16 14:53:15 -07:00
Frédéric Guillot
b95c9023ee refactor(sanitizer): make isValidAttribute() check O(1) 2025-06-13 21:44:25 -07:00
Frédéric Guillot
3538c4271b refactor(sanitizer): use global variables to avoid recreating slices on every call 2025-06-13 21:34:07 -07:00