mirror of https://github.com/go-gitea/gitea.git synced 2025-01-10 05:17:43 +03:00
gitea/modules/indexer/internal/bleve
Bruno Sofiato f64fbd9b74
Updated tokenizer to better matching when search for code snippets ()
This PR improves the accuracy of Gitea's code search. 

Currently, Gitea does not count statements such as
`console.log("hello")` as hits when the user searches for `log`. The
culprit is how both ES and Bleve tokenize file contents (in
both cases, `console.log` is a single token).
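
To illustrate the problem, here is a minimal sketch using only the Go standard library. `coarseTokens` and `hasToken` are hypothetical helpers, not Gitea, Bleve, or ES code: the first stands in for any tokenizer that keeps dotted identifiers together, the second for an exact term lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// coarseTokens is a hypothetical stand-in for a tokenizer that keeps
// dotted identifiers together: it splits only on whitespace, quotes
// and parentheses, so "console.log" survives as one token.
func coarseTokens(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		switch r {
		case ' ', '\t', '\n', '(', ')', '"':
			return true
		}
		return false
	})
}

// hasToken reports whether term matches one of the tokens exactly
// (ignoring case), which is roughly how a term query behaves.
func hasToken(tokens []string, term string) bool {
	for _, t := range tokens {
		if strings.EqualFold(t, term) {
			return true
		}
	}
	return false
}

func main() {
	tokens := coarseTokens(`console.log("hello")`)
	fmt.Println(tokens)                          // [console.log hello]
	fmt.Println(hasToken(tokens, "log"))         // false: "log" is buried inside "console.log"
	fmt.Println(hasToken(tokens, "console.log")) // true
}
```

Because `log` never appears as a token of its own, a search for it cannot match the statement.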

For ES, we changed the tokenizer to
[simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html#:~:text=The%20simple_pattern_split%20tokenizer%20uses%20a,the%20tokenization%20is%20generally%20faster.).
With this tokenizer, tokens are words made up of letters and digits.
Bleve, in turn, uses a
[letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.
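
The difference between the two splitting rules can be sketched with the Go standard library. This only approximates the behavior described above and does not call Bleve or ES; `letterTokens` and `alnumTokens` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// letterTokens approximates a letter tokenizer: every maximal run of
// letters is a token, and anything else (digits included) separates tokens.
func letterTokens(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
}

// alnumTokens approximates splitting on any character that is neither
// a letter nor a digit, so tokens are runs of letters and digits.
func alnumTokens(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	src := `console.log("hello")`
	fmt.Println(letterTokens(src)) // [console log hello]
	fmt.Println(alnumTokens(src))  // [console log hello]

	// The two rules differ once digits are involved.
	fmt.Println(letterTokens("sha256.Sum(b1)")) // [sha Sum b]
	fmt.Println(alnumTokens("sha256.Sum(b1)"))  // [sha256 Sum b1]
}
```

Under either rule `log` becomes a token of its own, so a search for `log` can now match `console.log("hello")`.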

Resolves 

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
2024-11-06 20:51:20 +00:00
..
batch.go Replace interface{} with any () 2023-07-04 18:36:08 +00:00
indexer.go Add io.Closer guidelines () 2024-02-25 13:05:23 +00:00
query.go Determine fuzziness of bleve indexer by keyword length () 2024-03-23 16:45:13 +01:00
util_test.go Updated tokenizer to better matching when search for code snippets () 2024-11-06 20:51:20 +00:00
util.go Updated tokenizer to better matching when search for code snippets () 2024-11-06 20:51:20 +00:00