This is the documentation for LucidWorks Platform v2.0. The latest release is v2.1.


The LucidWorks Platform contains a schema.xml file for each collection, which defines the fields for the index (among other things). It is the same schema.xml file that is used with a Solr installation, but Lucid Imagination has added fields to support various LucidWorks features and to make it easier for users to get up and running. Not all users will need all of these fields, so some may want to trim the schema.xml file to make it easier to read. The table at the end of this page shows the default fields, how they are used, and whether they can be removed for local installations.

One of the primary added values of LucidWorks is the integration of content crawlers for web sites, filesystems and other content repositories. Many of the fields added to schema.xml support these crawlers and should be retained if the crawlers are in use. If you are not using the LucidWorks crawlers, these fields can generally be removed safely. In many cases, a field that has been removed from the schema will be recreated the next time a crawler that uses it crawls new content. Fields are recreated through a dynamic rule (the "*" rule) in the schema.xml file, which should be retained to avoid unexpected crawler failures. If this rule is left in place, nearly any field in the schema can be removed, because it will be added back when it is needed.

This functionality is available in LucidWorks Enterprise but not LucidWorks Cloud.
Only delete the "*" rule if you are absolutely positive other deleted fields will not be needed in your specific implementation. Deleting this rule may also complicate future upgrades, as it is not possible to predict when Lucid Imagination will add new fields to the schema.xml file to support future functionality.
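
The effect of the "*" rule can be illustrated with a small Python sketch. This is a simplified model rather than Solr's actual implementation: the function name, the sample field names, and the assumption that longer (more specific) patterns are tried before "*" are all illustrative.

```python
from fnmatch import fnmatchcase


def resolve_field(name, explicit_fields, dynamic_patterns):
    """Return the declaration that would cover `name`, or None if none does."""
    if name in explicit_fields:
        return name
    # Assumption: more specific (longer) patterns are tried before "*".
    for pattern in sorted(dynamic_patterns, key=len, reverse=True):
        if fnmatchcase(name, pattern):
            return pattern
    return None


# Hypothetical, trimmed-down schema contents for illustration only.
explicit = {"id", "text_all", "timestamp"}
with_catch_all = ["*", "attr_*", "*_s"]
without_catch_all = ["attr_*", "*_s"]

for field in ("id", "attr_content_type", "byteSize"):
    print(field,
          resolve_field(field, explicit, with_catch_all),
          resolve_field(field, explicit, without_catch_all))
```

In this model, a removed field such as byteSize is still covered while the "*" rule is present, but has no matching declaration once the rule is deleted, which is the situation the warning above describes.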

Guidelines for Removing Fields from the Schema

Essential Fields

There are six essential fields that must be retained in schema.xml for LucidWorks to continue to function. These are:

  • id
  • data_source
  • data_source_name
  • data_source_type
  • text_all
  • timestamp

The text_all field is required because schema.xml declares it as the default search field for the Lucene query parser, and it is also the default for the basic Solr query parser. If you use only the lucid or DisMax query parsers and will never use the Lucene or Solr query parsers, the field could be deleted, although it may be best to retain it.

We have created a sample schema, containing only the essential fields listed above, that can be used for collection creation. See Using Collection Templates for more information.
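
Before trimming a schema, it may help to confirm that the essential fields are still declared. The following Python sketch shows one way to do that, assuming fields are declared as <field name="..."/> elements as in a standard Solr schema.xml. The function name and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

# The six essential fields listed above.
ESSENTIAL_FIELDS = [
    "id", "data_source", "data_source_name",
    "data_source_type", "text_all", "timestamp",
]


def missing_essential_fields(schema_xml):
    """Return the essential fields not declared in the given schema text."""
    root = ET.fromstring(schema_xml.strip())
    declared = {field.get("name") for field in root.iter("field")}
    return [name for name in ESSENTIAL_FIELDS if name not in declared]


# Hypothetical trimmed schema that went too far (illustration only).
SAMPLE_SCHEMA = """
<schema name="example" version="1.4">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="data_source" type="string" indexed="true" stored="true"/>
    <field name="data_source_name" type="string" indexed="true" stored="true"/>
    <field name="data_source_type" type="string" indexed="true" stored="true"/>
    <field name="title" type="text_en" indexed="true" stored="true"/>
  </fields>
</schema>
"""

print(missing_essential_fields(SAMPLE_SCHEMA))
```

To check a real file, read its contents and pass the text to missing_essential_fields; an empty list means all essential fields are present.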

Built-In Search UI Fields

LucidWorks includes a default search UI that can be used as-is or replaced with a fully local interface. If using it as-is, even for testing or during initial implementation, the following fields must also be retained in schema.xml:

  • url
  • title
  • body
  • author
  • keywords
  • description
  • dateCreated
  • lastModified
  • pageCount
  • mimeType
  • author_display
  • keywords_display
  • timestamp

The Search UI includes these fields for results display and default faceting, so for it to work properly, these fields should be retained.

Fields to Support Specific Features

Several fields are included in schema.xml in support of specific LucidWorks features. They can be removed if those features are disabled or not in use.

| Feature | Fields |
| --- | --- |
| Click Scoring Relevance Framework | click, click_terms, click_val |
| ACL | acl |
| Spell Check | spell |
| Auto Completion | autocomplete |
| Enterprise Alerts | timestamp |
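
The feature-to-field mapping above can be turned into a small lookup that lists removal candidates for a set of disabled features. This Python sketch is illustrative only: the function name is hypothetical, and it deliberately skips any field that is essential or still needed by an enabled feature (for example, timestamp is essential even when Enterprise Alerts is disabled).

```python
# Feature-to-field mapping taken from the table above.
FEATURE_FIELDS = {
    "Click Scoring Relevance Framework": ["click", "click_terms", "click_val"],
    "ACL": ["acl"],
    "Spell Check": ["spell"],
    "Auto Completion": ["autocomplete"],
    "Enterprise Alerts": ["timestamp"],
}

# Essential fields must never be removed, whatever features are disabled.
ESSENTIAL_FIELDS = {
    "id", "data_source", "data_source_name",
    "data_source_type", "text_all", "timestamp",
}


def removable_fields(disabled_features):
    """Return fields that only support the given disabled features."""
    still_needed = set(ESSENTIAL_FIELDS)
    for feature, fields in FEATURE_FIELDS.items():
        if feature not in disabled_features:
            still_needed.update(fields)
    removable = []
    for feature in disabled_features:
        for field in FEATURE_FIELDS[feature]:
            if field not in still_needed and field not in removable:
                removable.append(field)
    return removable


print(removable_fields(["Spell Check", "Enterprise Alerts"]))
```

Note that this sketch does not account for the built-in Search UI, which has its own list of required fields.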

Crawler Fields

The crawlers included with LucidWorks create fields in schema.xml that begin with attr_ and are used to store document-specific metadata during the crawl processes. They are not generally used otherwise by LucidWorks (such as in search results or other computations). Due to the dynamic "*" rule, they will be added back to schema.xml if not in place. If not using the LucidWorks crawlers, they can be removed, but it is recommended to retain them if possible.

Other Dynamic Fields

Several other dynamic fields (all including an '*', such as *_i, *_s, *_l, etc.) are defined in schema.xml. These can be removed if they will not be used. The only two we recommend that you retain are the "*" rule and the attr_* fields.
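
To see which dynamic rules a schema currently defines, and which of them fall under the recommendation to retain the "*" rule and the attr_* fields, a short Python sketch along these lines could be used. It assumes dynamic rules are declared as <dynamicField name="..."/> elements, as in a standard Solr schema.xml. The function name and the sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

# Dynamic rules this page recommends retaining.
RECOMMENDED_TO_RETAIN = {"*", "attr_*"}


def classify_dynamic_fields(schema_xml):
    """Split dynamicField patterns into (retain, optional) lists."""
    root = ET.fromstring(schema_xml.strip())
    patterns = [rule.get("name") for rule in root.iter("dynamicField")]
    retain = [p for p in patterns if p in RECOMMENDED_TO_RETAIN]
    optional = [p for p in patterns if p not in RECOMMENDED_TO_RETAIN]
    return retain, optional


# Hypothetical schema fragment for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.4">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <dynamicField name="attr_*" type="string" indexed="true" stored="true"/>
    <dynamicField name="*" type="string" indexed="true" stored="true"/>
  </fields>
</schema>
"""

retain, optional = classify_dynamic_fields(SAMPLE_SCHEMA)
print("retain:", retain)
print("optional:", optional)
```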

Table of Fields

The table below notes whether a field will be indexed, stored, used for facets or included in results. This is default behavior, and can be modified locally. After customization, this table may not reflect the state of your schema.xml file.
| Field Name | Type | Indexed | Stored | Used for Facets | Included in Results | Used for | Can Be Deleted |
| --- | --- | --- | --- | --- | --- | --- | --- |
| acl | string | X | X | | | Storing Access Control List information. | Only if never using Access Control List (ACL) query-time document security. |
| attr_* (any field starting with 'attr_') | string | X | X | | | Created by the crawlers and used for a wide array of document-specific metadata. Not specifically declared in the schema.xml file, but dynamically created during crawls. | Yes, but automatically created by LucidWorks crawlers, so will be recreated at next crawl run. |
| author | text_en | X | X | | X | Raw author pulled from documents. Used by default in the built-in Search UI. | Only if never using built-in Search UI. |
| author_display | string | X | | X | | Used for display of authors in facets. Used by default in the built-in Search UI. | Only if never using built-in Search UI. |
| autocomplete | textSpell | X | X | | | Stores terms for the auto-complete index. By default, it is created by copying terms from the title, body, description and author fields. | Only if never using built-in auto-complete functionality. |
| batch_id | string | X | X | | | Identifies the batch that added the document. | Yes. |
| bcc | text_en | X | X | | | Used in processing email messages. | Yes. Will be added dynamically if an indexed document contains this field. |
| belongsToContainer | text_en | X | X | | | Used to store the URL of the archive file (.zip, .mbox, etc.) which contains the file. | Yes. |
| body | text_en | X | X | | | The body of a document (generally, the main text). Used by default for display in the built-in Search UI. | Only if never using built-in Search UI. |
| byteSize | int | | X | | | The size of the document. | Yes. Will be added dynamically if an indexed document contains this field and was crawled by the lucid.aperture crawler (local file systems and web sites). |
| cc | text_en | X | X | | | Used in processing email messages. | Yes. Will be added dynamically if an indexed document contains this field. |
| characterSet | string | | X | | | The character set used for the document. Only populated if it is declared in the document (most commonly with HTML files). | Yes. Will be added dynamically if an indexed document contains this field. |
| click | string | X | X | | | Used with the Click Scoring Relevance Framework and contains the boost value. | Only if Click Scoring will not be used. |
| click_terms | text_ws | X | X | | | Used with the Click Scoring Relevance Framework and contains the top terms associated with the document. | Only if Click Scoring will not be used. |
| click_val | string | X | X | | | Used with the Click Scoring Relevance Framework and contains a string representation for the boost value for the document. The format allows it to be used for processing function queries. | Only if Click Scoring will not be used. |
| contentCreated | date | X | X | | | The creation date for the document, if available. | Yes. Will be added dynamically if an indexed document contains this field. However, it will not be added as a date, but a string, which may cause sorting issues if the field is used again later. |
| crawl_uri | string | X | | | | A copy of the URL for the document. | Yes. |
| creator | text_en | X | X | | | The creator of the document, if available. | Yes. Will be added dynamically if an indexed document contains this field. |
| data_source | string | X | X | | | The ID of the data source that crawled this document. | No. Field is essential. |
| data_source_name | string | X | X | X | | The name of the data source that crawled this document. | No. Field is essential. |
| data_source_type | string | X | X | | X | The type of data source that crawled this document. | No. Field is essential. |
| dateCreated | date | X | X | | X | The date the content was created, if available. | Only if never using built-in Search UI. |
| description | text_en | X | X | | X | The description from a document, if it exists in the document. For example, Microsoft Office document properties contains a description field that can be filled in by the user. | Only if never using built-in Search UI. |
| email | text_en | X | X | | | Not currently used by any LucidWorks crawlers. | Yes. Will be added dynamically if an indexed document contains this field. |
| fileName | text_en | X | X | | | The name of the file. | Yes. |
| fileSize | int | X | X | | | The size of the file. | Yes. |
| from | text_en | X | X | | | Used in processing email messages. | Yes. Will be created dynamically if indexing a document that contains this field. |
| fullname | text_en | X | X | | | Data in this field is mapped to "author". | Yes. |
| generator | text_en | X | X | | | The name of the software that generated the document, if available. | Yes. |
| id | string | X | X | | X | Unique ID for the document. | No. Field is essential. |
| id_highlight | text_en | X | X | | | No longer used by LucidWorks and will be removed in a later version. | Yes. |
| incubationdate_dt | date | X | X | | | Used in older Solr example documents. | Yes. |
| keywords | text_en | X | X | | X | The keyword list from a Microsoft Office document. | Only if never using built-in Search UI. |
| keywords_display | comma-separated | X | | X | | Terms from the keyword field formatted for display to users. | Only if never using built-in Search UI. |
| lastModified | date | X | X | | X | Date the content was last modified. | Only if never using built-in Search UI. |
| mimeType | string | X | X | X | X | The type of document (PDF, Microsoft Office, etc.). | Only if never using built-in Search UI. |
| name | text_en | X | X | | | Data in this field is mapped to "title". | Yes. |
| otherDates | date | X | X | | | Dates other than dateCreated or lastModified would be mapped to this field. | Yes. |
| pageCount | int | X | X | | X | The number of pages in a Microsoft Office document such as Word or PowerPoint. | Only if never using built-in Search UI. |
| partOf | string | X | X | | | Typically used for an email attachment, this points to the larger document of which this document is a part. | |
| price | float | X | X | | | Example field that could be used for processing e-commerce data. | Yes. |
| retrievalDate | date | X | X | | | Not currently used, but could be used for the date a web document was retrieved from its server. | Yes. |
| rootElementOf | text_en | X | X | | | Populated only for the root or initial document of a crawl. | Yes. |
| signatureField | string | X | X | | | Part of Solr's default schema. | Yes. |
| spell | textSpell | X | | | | Stores the terms to be used in creating the spell check index. Created by copying terms from the title, body, description and author fields. | Only if never using built-in spelling checker. |
| text_all | text_en | X | | | | Used to combine text fields for faster searching. Created by copying terms from the id, url, title, description, keywords, author and body fields. | No. Field is essential. |
| text_medium | text_en | X | X | | | Not currently used. | Yes. |
| text_small | text_en | X | X | | | Not currently used. | Yes. |
| timestamp | date | X | X | X | X | Time the document was crawled and used for date faceting and display of activities in the LucidWorks Admin UI. Also used for Enterprise Alerts to know when the document was added to the index for alerts processing. | No, field is considered essential. |
| title | text_en | X | X | | | The title of the document. | Only if never using built-in Search UI. |
| to | text_en | X | X | | | Used in processing email messages. | Yes. Will be created dynamically if indexing a document that contains this field. |
| type | text_en | X | X | | | Used by the lucid.aperture crawler to store Aperture's classification of an information object, separate from its MIME type. | Yes. |
| url | string | | X | | X | The URL to access the document. | Only if never using built-in Search UI. |
| username | text_en | X | X | | | No longer used and may be removed in a later version. | Yes. |
| weight | float | X | X | | | Example field that could be used for processing e-commerce data. | Yes. |
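
Because the table above describes default behavior only, it can be useful to regenerate the Type, Indexed and Stored columns from a customized schema.xml. The Python sketch below is one way to do so, assuming fields are declared as <field/> elements with optional indexed and stored attributes, as in a standard Solr schema. The function names and sample schema are hypothetical. Facet and results usage are not recorded in schema.xml, so those columns cannot be derived this way.

```python
import xml.etree.ElementTree as ET


def as_flag(value):
    """Map "true"/"false" to a bool; None means the attribute is not set
    on the field itself (it would then come from the field type)."""
    if value is None:
        return None
    return value.lower() == "true"


def summarize_fields(schema_xml):
    """Return (name, type, indexed, stored) tuples sorted by field name."""
    root = ET.fromstring(schema_xml.strip())
    rows = [
        (f.get("name"), f.get("type"),
         as_flag(f.get("indexed")), as_flag(f.get("stored")))
        for f in root.iter("field")
    ]
    return sorted(rows, key=lambda row: row[0])


# Hypothetical schema fragment for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.4">
  <fields>
    <field name="url" type="string" indexed="false" stored="true"/>
    <field name="spell" type="textSpell" indexed="true" stored="false"/>
    <field name="title" type="text_en"/>
  </fields>
</schema>
"""

for name, ftype, indexed, stored in summarize_fields(SAMPLE_SCHEMA):
    print(name, ftype, indexed, stored)
```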
