The LucidWorks Platform contains a schema.xml file for each collection, which defines the fields for the index (among other things). It is the same schema.xml file that is used with a Solr installation, but Lucid Imagination has added fields to support various features of LucidWorks and to make it easier for users to get up and running. Not all users will need all fields, so they may want to trim the schema.xml file to make it easier to read. The Table of Fields below shows the default fields, how they are used, and whether they can be removed for local installations.
One of the primary added values of LucidWorks is the integration of content crawlers for web sites, filesystems and other repositories of content. Many of the fields added to schema.xml are for this purpose and should be retained. In many cases, if they are removed from the schema, they will be recreated the next time a crawler that uses them crawls new content. If you are not using the LucidWorks crawlers, they can generally be safely removed.
Removed fields are recreated based on a dynamic rule (the "*" rule) in the schema.xml file, which should be retained to avoid unexpected failures of the crawlers. If this rule is left in place, nearly any field in the schema can be removed, as it will be added back if it is needed.
| This functionality is available in LucidWorks Enterprise but not LucidWorks Cloud. |
| Only delete the "*" rule if you are absolutely positive other deleted fields will not be needed in your specific implementation. Deleting this rule may also complicate future upgrades, as it is not possible to predict when Lucid Imagination will add new fields to the schema.xml file to support future functionality. |
Guidelines for Removing Fields from the Schema
Essential Fields
There are six essential fields which must be retained in schema.xml for LucidWorks to continue to function. These are:
- id
- data_source
- data_source_name
- data_source_type
- text_all
- timestamp
The text_all field is required because schema.xml declares it as the default search field for the Lucene RequestHandler (query parser), and it is also the default for the basic Solr query parser. If you are using the lucid or DisMax query parser and will never use the Lucene or Solr query parsers, the field could be deleted, but it is safer to retain it.
| We have created a sample schema that includes only the essential fields listed above that can be used for collection creation. See Using Collection Templates for more information. |
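As a quick check before trimming a schema, the list of essential fields above can be compared against the fields a schema.xml actually declares. The following is a minimal sketch, not part of LucidWorks: the function name `missing_essential_fields` and the sample schema are hypothetical, and it assumes fields are declared as `<field name="...">` elements, as in a standard Solr schema.xml.

```python
import xml.etree.ElementTree as ET

# The six fields listed above as essential for LucidWorks.
ESSENTIAL_FIELDS = [
    "id",
    "data_source",
    "data_source_name",
    "data_source_type",
    "text_all",
    "timestamp",
]


def missing_essential_fields(schema_xml):
    """Return the essential field names that have no <field> declaration."""
    root = ET.fromstring(schema_xml.strip())
    # iter() walks the whole tree, so this works whether or not the
    # <field> elements are nested inside a <fields> wrapper element.
    declared = {field.get("name") for field in root.iter("field")}
    return [name for name in ESSENTIAL_FIELDS if name not in declared]


# Hypothetical, heavily trimmed schema used only for illustration.
SAMPLE_SCHEMA = """
<schema name="example" version="1.4">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="data_source" type="string" indexed="true" stored="true"/>
    <field name="data_source_name" type="string" indexed="true" stored="true"/>
    <field name="text_all" type="text_en" indexed="true" stored="false"/>
    <field name="title" type="text_en" indexed="true" stored="true"/>
  </fields>
</schema>
"""

if __name__ == "__main__":
    print(missing_essential_fields(SAMPLE_SCHEMA))
```

To check a real collection, read its schema.xml into a string and pass it to the function; an empty list means all essential fields are still declared.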
Built-In Search UI Fields
LucidWorks includes a default search UI that can be used as-is or replaced with a fully local interface. If using it as-is, even for testing or during initial implementation, the following fields must also be retained in schema.xml:
- url
- title
- body
- author
- keywords
- description
- dateCreated
- lastModified
- pageCount
- mimeType
- author_display
- keywords_display
- timestamp
The Search UI includes these fields for results display and default faceting, so for it to work properly, these fields should be retained.
Fields to Support Specific Features
Several fields are included in schema.xml in support of specific LucidWorks features. They can be removed if those features are disabled or not in use.
| Feature | Fields |
|---|---|
| Click Scoring Relevance Framework | click click_terms click_val |
| ACL | acl |
| Spell Check | spell |
| Auto Completion | autocomplete |
| Enterprise Alerts | timestamp |
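The feature table above can be turned into a small lookup that suggests which fields are candidates for removal once a feature is switched off. This is an illustrative sketch only: the function name `removable_feature_fields` is hypothetical, and the rule that essential fields are never suggested reflects the essential-fields list earlier in this section (timestamp supports Enterprise Alerts but is also essential).

```python
# Feature-to-field mapping copied from the table above.
FEATURE_FIELDS = {
    "Click Scoring Relevance Framework": ["click", "click_terms", "click_val"],
    "ACL": ["acl"],
    "Spell Check": ["spell"],
    "Auto Completion": ["autocomplete"],
    "Enterprise Alerts": ["timestamp"],
}

# Fields that must stay in schema.xml regardless of which features are used.
ESSENTIAL_FIELDS = {
    "id",
    "data_source",
    "data_source_name",
    "data_source_type",
    "text_all",
    "timestamp",
}


def removable_feature_fields(disabled_features):
    """List fields that only support the given disabled features."""
    removable = []
    for feature in disabled_features:
        for field in FEATURE_FIELDS[feature]:
            # Skip essential fields and avoid listing a field twice.
            if field not in ESSENTIAL_FIELDS and field not in removable:
                removable.append(field)
    return removable


if __name__ == "__main__":
    print(removable_feature_fields(["Spell Check", "Enterprise Alerts"]))
```

Fields used by the built-in Search UI are not covered by this mapping, so the result should still be checked against the Search UI list before anything is deleted.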
Crawler Fields
The crawlers included with LucidWorks create fields in schema.xml that begin with attr_ and are used to store document-specific metadata during the crawl processes. They are not generally used otherwise by LucidWorks (such as in search results or other computations). Due to the dynamic "*" rule, they will be added back to schema.xml if not in place. If not using the LucidWorks crawlers, they can be removed, but it is recommended to retain them if possible.
Other Dynamic Fields
Several other dynamic fields (all including an '*', such as *_i, *_s, *_l, etc.) are defined in schema.xml. These can be removed if they will not be used. The only two we recommend that you retain are the "*" rule and the attr_* fields.
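To see why the catch-all "*" rule lets removed fields reappear, it can help to model how a field name is matched against dynamic rules. The sketch below is a simplified illustration, not Solr's actual implementation: the rule list is a hypothetical subset, glob matching is done with Python's `fnmatch`, and trying longer patterns first is an assumption made here so that the catch-all rule is the last resort.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of the dynamic rules mentioned in this section.
DYNAMIC_RULES = ["*_i", "*_s", "*_l", "attr_*", "*"]


def matching_rule(field_name, rules=DYNAMIC_RULES):
    """Return the dynamic rule that would claim field_name, or None."""
    # Assumption: longer (more specific) patterns are tried first,
    # so the bare "*" rule only matches when nothing else does.
    for pattern in sorted(rules, key=len, reverse=True):
        if fnmatchcase(field_name, pattern):
            return pattern
    return None


if __name__ == "__main__":
    for name in ["attr_creator", "pageCount_i", "byteSize"]:
        print(name, "->", matching_rule(name))
```

With the "*" rule present, every field name matches something, which is why a deleted field can be added back when a document contains it. Without that rule, a name such as byteSize matches nothing in this model.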
Table of Fields
| The table below notes whether a field will be indexed, stored, used for facets or included in results. This is default behavior, and can be modified locally. After customization, this table may not reflect the state of your schema.xml file. |
| Field Name | Type | Indexed | Stored | Used for Facets | Included in Results | Used for | Can Be Deleted |
|---|---|---|---|---|---|---|---|
| acl | string | X | X | Storing Access Control List information. | Only if never using Access Control List (ACL) query-time document security. | ||
| attr_* (any field starting with 'attr_') | string | X | X | Created by the crawlers and used for a wide array of document-specific metadata. Not specifically declared in the schema.xml file, but dynamically created during crawls. | Yes, but automatically created by LucidWorks crawlers, so will be recreated at next crawl run. | ||
| author | text_en | X | X | X | Raw author pulled from documents. Used by default in the built-in Search UI. | Only if never using built-in Search UI. | |
| author_display | string | X | X | Used for display of authors in facets. Used by default in the built-in Search UI. | Only if never using built-in Search UI. | ||
| autocomplete | textSpell | X | X | Stores terms for the auto-complete index. By default, it is created by copying terms from the title, body, description and author fields. | Only if never using built-in auto-complete functionality. | ||
| batch_id | string | X | X | Identifies the batch that added the document. | Yes. | ||
| bcc | text_en | X | X | Used in processing email messages. | Yes. Will be added dynamically if an indexed document contains this field. | ||
| belongsToContainer | text_en | X | X | Used to store the URL of the archive file (.zip, .mbox, etc.) which contains the file. | Yes. | ||
| body | text_en | X | X | The body of a document (generally, the main text). Used by default for display in the built-in Search UI. | Only if never using built-in Search UI. | ||
| byteSize | int | X | The size of the document. | Yes. Will be added dynamically if an indexed document contains this field and was crawled by the lucid.aperture crawler (local file systems and web sites). | |||
| cc | text_en | X | X | Used in processing email messages. | Yes. Will be added dynamically if an indexed document contains this field. | ||
| characterSet | string | X | The character set used for the document. Only populated if it is declared in the document (most commonly with HTML files). | Yes. Will be added dynamically if an indexed document contains this field. | |||
| click | string | X | X | Used with the Click Scoring Relevance Framework and contains the boost value. | Only if Click Scoring will not be used. | ||
| click_terms | text_ws | X | X | Used with the Click Scoring Relevance Framework and contains the top terms associated with the document. | Only if Click Scoring will not be used. | ||
| click_val | string | X | X | Used with the Click Scoring Relevance Framework and contains a string representation for the boost value for the document. The format allows it to be used for processing function queries. | Only if Click Scoring will not be used. | ||
| contentCreated | date | X | X | The creation date for the document, if available. | Yes. Will be added dynamically if an indexed document contains this field. However, it will not be added as a date, but a string, which may cause sorting issues if the field is used again later. | ||
| crawl_uri | string | X | A copy of the URL for the document. | Yes. | |||
| creator | text_en | X | X | The creator of the document, if available. | Yes. Will be added dynamically if an indexed document contains this field. | ||
| data_source | string | X | X | The ID of the data source that crawled this document. | No. Field is essential. | ||
| data_source_name | string | X | X | X | The name of the data source that crawled this document. | No. Field is essential. | |
| data_source_type | string | X | X | X | The type of data source that crawled this document. | No. Field is essential. | |
| dateCreated | date | X | X | X | The date the content was created, if available. | Only if never using built-in Search UI. | |
| description | text_en | X | X | X | The description from a document, if it exists in the document. For example, Microsoft Office document properties contains a description field that can be filled in by the user. | Only if never using built-in Search UI. | |
| text_en | X | X | Not currently used by any LucidWorks crawlers. | Yes. Will be added dynamically if an indexed document contains this field. | |||
| fileName | text_en | X | X | The name of the file. | Yes. | ||
| fileSize | int | X | X | The size of the file. | Yes. | ||
| from | text_en | X | X | Used in processing email messages. | Yes. Will be created dynamically if indexing a document that contains this field. | ||
| fullname | text_en | X | X | Data in this field is mapped to "author". | Yes. | ||
| generator | text_en | X | X | The name of the software that generated the document, if available. | Yes. | ||
| id | string | X | X | X | Unique ID for the document. | No. Field is essential. | |
| id_highlight | text_en | X | X | No longer used by LucidWorks and will be removed in a later version. | Yes. | ||
| incubationdate_dt | date | X | X | Used in older Solr example documents. | Yes. | ||
| keywords | text_en | X | X | X | The keyword list from a Microsoft Office document. | Only if never using built-in Search UI. | |
| keywords_display | comma-separated | X | X | Terms from the keyword field formatted for display to users. | Only if never using built-in Search UI. | ||
| lastModified | date | X | X | X | Date the content was last modified. | Only if never using built-in Search UI. | |
| mimeType | string | X | X | X | X | The type of document (PDF, Microsoft Office, etc.). | Only if never using built-in Search UI. |
| name | text_en | X | X | Data in this field is mapped to "title". | Yes. | ||
| otherDates | date | X | X | Dates other than dateCreated or lastModified would be mapped to this field. | Yes. | ||
| pageCount | int | X | X | X | The number of pages in a Microsoft Office document such as Word or PowerPoint. | Only if never using built-in Search UI. | |
| partOf | string | X | X | Typically used for an email attachment, this points to the larger document of which this document is a part. | |||
| price | float | X | X | Example field that could be used for processing e-commerce data. | Yes. | ||
| retrievalDate | date | X | X | Not currently used, but could be used for the date a web document was retrieved from its server. | Yes. | ||
| rootElementOf | text_en | X | X | Populated only for the root or initial document of a crawl. | Yes. | ||
| signatureField | string | X | X | Part of Solr's default schema. | Yes. | ||
| spell | textSpell | X | Stores the terms to be used in creating the spell check index. Created by copying terms from the title, body, description and author fields. | Only if never using built-in spelling checker. | |||
| text_all | text_en | X | Used to combine text fields for faster searching. Created by copying terms from the id, url, title, description, keywords, author and body fields. | No. Field is essential. | |||
| text_medium | text_en | X | X | Not currently used. | Yes. | ||
| text_small | text_en | X | X | Not currently used. | Yes. | ||
| timestamp | date | X | X | X | X | Time the document was crawled and used for date faceting and display of activities in the LucidWorks Admin UI. Also used for Enterprise Alerts to know when the document was added to the index for alerts processing. | No, field is considered essential. |
| title | text_en | X | X | The title of the document. | Only if never using built-in Search UI. | ||
| to | text_en | X | X | Used in processing email messages. | Yes. Will be created dynamically if indexing a document that contains this field. | ||
| type | text_en | X | X | Used by the lucid.aperture crawler to store Aperture's classification of an information object, separate from its MIME type. | Yes. | ||
| url | string | X | X | The URL to access the document. | Only if never using built-in Search UI. | ||
| username | text_en | X | X | No longer used and may be removed in a later version. | Yes. | ||
| weight | float | X | X | Example field that could be used for processing e-commerce data. | Yes. |
