Elasticsearch token filters

Elasticsearch provides a range of token filters that modify the token stream during indexing and searching. A token filter receives the tokens produced by a tokenizer and can change them (for example, by lowercasing), remove them (for example, stop words), or add new ones (for example, synonyms). Here are some commonly used Elasticsearch token filters:

  1. Lowercase Filter: Converts tokens to lowercase. Useful for case-insensitive searches.

  2. Uppercase Filter: Converts tokens to uppercase.

  3. ASCII Folding Filter: Replaces characters outside the basic ASCII range with their ASCII equivalents, where one exists. For example, "é" becomes "e".

  4. Stop Filter: Removes common words (stop words) from the token stream. Stop words are typically frequently occurring words that add little meaning to the search, such as "a," "an," "the," etc.

  5. Stemmer Filter: Applies a stemming algorithm to reduce words to their root form. For example, "running" and "runs" are both stemmed to "run." Irregular forms such as "ran" are generally left unchanged by algorithmic stemmers.

  6. Synonym Filter: Expands or replaces tokens according to a configured set of synonym rules. Useful for widening the search scope to include similar terms.

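  7. Word Delimiter Filter: Splits tokens into subwords at non-alphanumeric characters, case transitions, and letter-number transitions. For example, "HelloWorld" can be split into "Hello" and "World." The Elasticsearch documentation recommends the graph variant (word_delimiter_graph) over the original word_delimiter filter.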

  8. Length Filter: Filters out tokens based on their length. Tokens shorter or longer than specified thresholds can be excluded.

  9. Pattern Replace Filter: Uses a regular expression to match substrings within each token and replaces them with a replacement string.

  10. Phonetic Token Filters: Use phonetic algorithms like Soundex or Metaphone to generate tokens based on their phonetic representation. This allows for approximate matching based on pronunciation. These filters are not built in; they come from the analysis-phonetic plugin.
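
To make the idea of a filter chain more concrete, here is a minimal Python sketch that imitates four of the filters above (lowercase, ASCII folding, stop, and length). It is only an illustration of how each filter transforms a list of tokens, not how Elasticsearch implements them. The function names, the tiny stop word set, and the whitespace split standing in for a tokenizer are all assumptions made for this example.

```python
import unicodedata

# Tiny illustrative stop word set; NOT Elasticsearch's default list.
STOP_WORDS = {"a", "an", "the"}


def lowercase(tokens):
    """Imitates the lowercase filter."""
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    """Rough imitation of ASCII folding: decompose, then drop combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def stop(tokens):
    """Imitates the stop filter by dropping tokens found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def length(tokens, min_len=2, max_len=20):
    """Imitates the length filter; the thresholds are arbitrary example values."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def analyze(text):
    """Splits on whitespace (a stand-in tokenizer), then applies filters in order."""
    tokens = text.split()
    for token_filter in (lowercase, ascii_fold, stop, length):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Café serves a crème brûlée"))
# prints ['cafe', 'serves', 'creme', 'brulee']
```

The order matters: lowercasing runs before the stop filter here, so "The" is matched against the lowercase stop word "the".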

These are just a few examples, and Elasticsearch provides many more token filters that can be customized and combined to suit specific requirements. Token filters are configured as part of an analyzer, which applies them in the listed order to the output of a single tokenizer. Together they control how text is indexed and searched in Elasticsearch.
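
As a rough sketch of how filters are combined, the following Python snippet builds the settings body for a custom analyzer and prints it as JSON. The analyzer name folded_english and the filter names english_stop and english_stemmer are hypothetical, chosen for this example; the filter types (stop, stemmer) and the built-in lowercase and asciifolding filters are the ones described above. Sending the body to a cluster is left out, since that depends on your client and version.

```python
import json

# Hypothetical names: "english_stop", "english_stemmer", "folded_english".
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Stop filter using the predefined English stop word list.
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                # Stemmer filter configured for English.
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                "folded_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in the order listed here.
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "english_stop",
                        "english_stemmer",
                    ],
                }
            },
        }
    }
}

# This JSON could be used as the request body when creating an index.
print(json.dumps(index_settings, indent=2))
```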
