- Use an array of strings instead of a regex, as done in ef13756b1a7a7ba30fd34174a5367381fd8b4849
- Extract the `shouldRemove` function from `removeUnlikelyCandidates`, as there
is no reason to keep it there rather than as a proper standalone function.
- Improve a condition where the goquery selection's `id` attribute was left
unchecked whenever a `class` attribute was present, regardless of whether
`class` was a candidate for removal.
- Add some comments
- Use chained strings.Contains instead of a regex for
blacklistCandidatesRegexp, as this is a bit faster
- Simplify a Find.Each.Remove to Find.Remove
- Don't concatenate id and class for removeUnlikelyCandidates, as it makes no
sense to match across the boundary between the two. It might also marginally
improve performance, as the regexes now run on two strings separately instead
of on their concatenation.
- Add a small benchmark
- Replace a completely overkill regex
- Use `.Remove()` instead of a hand-rolled loop
- Use a strings.Builder instead of a buffer created with bytes.NewBufferString
- Replace a call to Fprintf with string concatenation, as the latter is much
faster
- Remove a superfluous cast
- Delay some computations
- Add some tests
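
The `id`/`class` changes listed above can be illustrated with a minimal sketch. It is not the actual implementation: the keyword lists, the `containsAny` and `isUnlikely` helpers, and the `shouldRemove(class, id string)` signature are assumptions made for illustration only. It shows keyword matching with `strings.Contains` instead of a regex, and `class` and `id` being checked independently rather than concatenated.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical keyword lists; the real ones in the code base may differ.
var (
	unlikelyKeywords = []string{"banner", "comment", "sidebar", "footer"}
	keepKeywords     = []string{"article", "body", "main"}
)

// containsAny reports whether s contains at least one of the keywords.
func containsAny(s string, keywords []string) bool {
	for _, k := range keywords {
		if strings.Contains(s, k) {
			return true
		}
	}
	return false
}

// isUnlikely checks a single attribute value: it must match an unlikely
// keyword and must not match any keyword that marks real content.
func isUnlikely(value string) bool {
	if value == "" {
		return false
	}
	v := strings.ToLower(value)
	return containsAny(v, unlikelyKeywords) && !containsAny(v, keepKeywords)
}

// shouldRemove checks class and id separately, so id is still inspected
// when a class attribute is present but is not itself a removal candidate.
func shouldRemove(class, id string) bool {
	return isUnlikely(class) || isUnlikely(id)
}

func main() {
	fmt.Println(shouldRemove("content", "sidebar")) // true: id matches although class does not
	fmt.Println(shouldRemove("main-article", ""))   // false: nothing unlikely
	fmt.Println(shouldRemove("side", "bar"))        // false: no match across the class/id boundary
}
```

The last call shows why matching on overlaps is undesirable: joining `side` and `bar` without a separator would yield `sidebar` and trigger a spurious match.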

As mentioned in goquery's documentation (https://pkg.go.dev/github.com/PuerkitoBio/goquery#Single):
> By default, Selection.Find and other functions that accept a selector string
to select nodes will use all matches corresponding to that selector. By using
the Matcher returned by Single, at most the first match will be selected.
>
> The one using Single is optimized to be potentially much faster on large documents.