Whitespace Analyzer

The whitespace analyzer divides text into searchable terms (tokens) wherever it finds a whitespace character. It leaves all text in its original letter case.
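
The behavior described above can be approximated in a few lines of plain JavaScript. This is only an illustrative sketch, not the analyzer's actual implementation: `whitespaceTokens` is a hypothetical helper name, and JavaScript's `\s` character class is assumed to be close enough to the analyzer's definition of whitespace for ordinary text.

```javascript
// Illustrative approximation of whitespace tokenization (not the real analyzer):
// split on runs of whitespace and leave case and punctuation untouched.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Tokens: The, Lion's, Mouth, Opens (case and apostrophe preserved)
console.log(whitespaceTokens("The Lion's Mouth Opens"));
```

Because nothing is lowercased or stripped, a token such as `Lion's` stays exactly as it appears in the source text.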

You can see the tokens that the whitespace analyzer creates for a built-in sample document and query string when you create or edit an index in the Atlas UI Visual Editor. If you select Refine Your Index, the Atlas UI displays a section titled View text analysis of your selected index configuration within the Index Configurations section. Expand this section to see the index and search tokens that the whitespace analyzer generates for each sample string.

Important

MongoDB Search doesn't index string fields whose analyzer tokens exceed 32766 bytes in size. With the keyword analyzer, which emits the entire field value as a single token, string fields that exceed 32766 bytes aren't indexed.
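
If you want to check your data against this limit before indexing, a rough client-side check might look like the following. It is a sketch under two assumptions: that the limit applies to the UTF-8 encoded size of each token, and that splitting on whitespace approximates the tokens that the whitespace analyzer would produce. `oversizedTokens` and `MAX_TOKEN_BYTES` are hypothetical names, not part of any MongoDB API.

```javascript
// Limit quoted in the note above; measuring UTF-8 bytes is an assumption.
const MAX_TOKEN_BYTES = 32766;

// Hypothetical helper: returns the whitespace-delimited tokens whose
// UTF-8 encoding is larger than the limit.
function oversizedTokens(text, limit = MAX_TOKEN_BYTES) {
  const encoder = new TextEncoder();
  return text
    .split(/\s+/)
    .filter((token) => encoder.encode(token).length > limit);
}

const longToken = "a".repeat(40000);
console.log(oversizedTokens(`short ${longToken} words`).length); // 1
console.log(oversizedTokens("The Lion's Mouth Opens").length); // 0
```

Ordinary prose rarely comes close to the limit under the whitespace analyzer. Long unbroken strings such as encoded blobs are the more likely cause.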

Example

The following example index definition specifies an index on the title field in the sample_mflix.movies collection using the whitespace analyzer. To follow along with this example, load the sample data on your cluster and either use mongosh or navigate to the Create a Search Index page in the Atlas UI following the steps in the Create a MongoDB Search Index tutorial.

Then, using the movies collection as your data source, follow the example procedure to create an index from mongosh, the Atlas UI Visual Editor, or the Atlas UI JSON Editor.


In the Visual Editor, click Refine Your Index to configure your index.
In the Index Configurations section, toggle Dynamic Mapping to off.
In the Field Mappings section, click Add Field to open the Add Field Mapping window.
Select title from the Field Name dropdown.
Click Customized Configuration.
Click the Data Type dropdown and select String if it isn't already selected.

Expand String Properties and make the following changes:

Index Analyzer	Select `lucene.whitespace` from the dropdown.
Search Analyzer	Select `lucene.whitespace` from the dropdown.
Index Options	Use the default `offsets`.
Store	Use the default `true`.
Ignore Above	Keep the default setting.
Norms	Use the default `include`.

Click Add.
Click Save Changes.
Click Create Search Index.

In the JSON Editor, replace the default index definition with the following index definition.

{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "lucene.whitespace",
        "searchAnalyzer": "lucene.whitespace"
      }
    }
  }
}

Click Next.
Click Create Search Index.

db.movies.createSearchIndex(
  "default",
  {
    "mappings": {
      "fields": {
        "title": {
          "type": "string",
          "analyzer": "lucene.whitespace",
          "searchAnalyzer": "lucene.whitespace"
        }
      }
    }
  }
)

The following query searches for the term Lion's in the title field.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find:

[
  {
    "$search": {
      "text": {
        "query": "Lion's",
        "path": "title"
      }
    }
  }
]

SCORE: 3.7370920181274414  _id:  "573a13ebf29313caabdcfc8d"
   awards: Object
   cast: Array (4)
   countries: Array (1)
   directors: Array (1)
   fullplot: "A documentary on young actress, Marianna Palka, as she confronts her r…"
   genres: Array (3)
   imdb: Object
   languages: Array (1)
   lastupdated: "2015-09-03 00:37:45.227000000"
   num_mflix_comments: 0
   plot: "A documentary on young actress, Marianna Palka, as she confronts her r…"
   poster: "https://m.media-amazon.com/images/M/MV5BMTgzMTc2OTg2N15BMl5BanBnXkFtZT…"
   released: 2014-01-18T00:00:00.000+00:00
   runtime: 15
   title: "The Lion's Mouth Opens"
   type: "movie"
   writers: Array (1)
   year: 2014

db.movies.aggregate([
  {
    "$search": {
      "text": {
         "query": "Lion's",
         "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "title": 1
    }
  }
])

[ { title: "The Lion's Mouth Opens" } ]

MongoDB Search returns this document because the lucene.whitespace analyzer does the following for the text in the title field:

Retains the original letter case of the text.
Divides the text into tokens wherever it finds a whitespace character.

The following table shows the tokens (searchable terms) that MongoDB Search creates using the Whitespace Analyzer and, by contrast, the Simple Analyzer and Keyword Analyzer for the document in the results:

Title	Whitespace Analyzer Tokens	Simple Analyzer Tokens	Keyword Analyzer Tokens
`The Lion's Mouth Opens`	`The`, `Lion's`, `Mouth`, `Opens`	`the`, `lion`, `s`, `mouth`, `opens`	`The Lion's Mouth Opens`
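
The contrast in the table can be reproduced with rough JavaScript approximations of the three analyzers. These are sketches, not the real implementations. The `simple` rule used here (lowercase the text, then split on any non-letter character) is an assumption about that analyzer, chosen because it reproduces the tokens in the table. The `analyzers` object is a hypothetical name.

```javascript
// Rough approximations of the three analyzers compared in the table.
const analyzers = {
  // whitespace: split on whitespace only, keep case and punctuation
  whitespace: (text) => text.split(/\s+/).filter(Boolean),
  // simple (assumed rule): lowercase, then split on any non-letter character
  simple: (text) => text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean),
  // keyword: the whole string is a single token
  keyword: (text) => [text],
};

for (const [name, analyze] of Object.entries(analyzers)) {
  console.log(name, analyze("The Lion's Mouth Opens"));
}
```

Only the whitespace and keyword approximations keep the apostrophe and capital letters, and only the whitespace one yields `Lion's` as a standalone token.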

An index that uses the whitespace analyzer is case-sensitive and keeps punctuation, such as the apostrophe, inside each token. Therefore, MongoDB Search is able to match the query term Lion's to the token Lion's created by the whitespace analyzer.
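
To make the case-sensitivity concrete, the following sketch models a text match as "at least one query token equals an indexed token exactly". This is a deliberately simplified model that ignores scoring and every other query option. `whitespaceTokens` and `matchesTitle` are hypothetical helpers, and the same whitespace splitting is assumed for both the indexed text and the query, mirroring the analyzer and searchAnalyzer settings in the example index.

```javascript
// Illustrative approximation of whitespace tokenization.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// Simplified matching model: true if any query token is identical
// (same case, same punctuation) to a token from the indexed title.
function matchesTitle(title, query) {
  const indexed = new Set(whitespaceTokens(title));
  return whitespaceTokens(query).some((token) => indexed.has(token));
}

const title = "The Lion's Mouth Opens";
console.log(matchesTitle(title, "Lion's")); // true: identical token
console.log(matchesTitle(title, "lion's")); // false: case differs
console.log(matchesTitle(title, "Lion")); // false: "Lion's" is one token
```

Under this model, a query must reproduce both the capitalization and the apostrophe of the indexed token to match.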
