The crawler works by going to the seed page (the "URL" specified in the configuration form), collecting the content for indexing, and extracting any links to other pages. It then follows those links to collect content from the linked pages, extracts the links on those pages, and so on. When creating a Web data source, pay special attention to the "Crawl depth" and "Constrain to" parameters. If left unchanged from their defaults, they will cause an "unbounded" crawl that could continue for a long time, particularly if you are crawling a site with links to many pages outside the main site. An unbounded crawl may also cause out-of-memory errors in your system. At a minimum, set the "Constrain to" parameter to "tree", which will keep the crawler within the base URL you specify.
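The effect of "Crawl depth" and "Constrain to" can be illustrated with a minimal sketch. This is not the LucidWorks or Aperture implementation: the {{LINKS}} graph, the {{crawl}} function, and its parameter names are hypothetical, and the "tree" check is modeled as a simple URL prefix test.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical in-memory link graph standing in for real pages (no network access).
LINKS: Dict[str, List[str]] = {
    "http://example.com/docs/": ["http://example.com/docs/a", "http://other.example.org/"],
    "http://example.com/docs/a": ["http://example.com/docs/b", "http://example.com/blog/"],
    "http://example.com/docs/b": [],
    "http://example.com/blog/": [],
    "http://other.example.org/": ["http://other.example.org/x"],
    "http://other.example.org/x": [],
}


def crawl(seed: str, max_depth: Optional[int] = None,
          constrain_to_tree: bool = False) -> List[str]:
    """Breadth-first crawl from the seed; returns pages in the order visited.

    max_depth=None models an empty "Crawl depth" field (no limit);
    max_depth=0 visits only the seed page.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in LINKS.get(url, []):
            if link in seen:
                continue
            if constrain_to_tree and not link.startswith(seed):
                continue  # "tree": stay under the seed URL
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited
```

In this toy graph, the unconstrained crawl with no depth limit reaches every page, including those on the other host, while {{constrain_to_tree=True}} keeps it to the three pages under the seed URL.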
Aperture extracts fields from the HTML of a crawled page and passes the extracted data to LucidWorks for further processing, including any field mapping that has been defined. As of LucidWorks v2.1, the Web data source is able to extract all fields from the <META> section of an HTML page (prior versions were only able to extract the "author", "description", and "keywords" fields, so mapping of custom <META> fields would fail). These custom <META> fields are added to the index with a "meta_" prefix before the tag name (for example, a custom field of "date" would be inserted in the index as "meta_date"). If you wish to map these fields to another field, use the "meta\_\*" field name; if you do not map them, and they do not exist in your {{schema.xml}} for the collection, they will be added to the index as "attr_meta\_\*" because of a default [dynamic rule|help:Dynamic Fields] that adds "attr_" to any field that does not exist in the {{schema.xml}}.
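The naming rules for <META> fields can be summarized in a short sketch. It is a simplified model of the behavior described above, not LucidWorks code: the function {{index_field_names}} and its arguments are hypothetical, and it assumes the "attr_" rule is applied after any field mapping.

```python
from typing import Dict, Iterable, Set


def index_field_names(meta_tags: Iterable[str],
                      field_mapping: Dict[str, str],
                      schema_fields: Set[str]) -> Dict[str, str]:
    """Return the index field name each <META> tag would end up in."""
    result = {}
    for tag in meta_tags:
        name = "meta_" + tag                  # prefix added by the Web data source
        name = field_mapping.get(name, name)  # optional user-defined mapping
        if name not in schema_fields:
            name = "attr_" + name             # default dynamic rule for unknown fields
        result[tag] = name
    return result


# "date" is unmapped and not in the schema; "author" is mapped to a schema field.
names = index_field_names(
    ["date", "author"],
    field_mapping={"meta_author": "author"},
    schema_fields={"author"},
)
```

Here {{names\["date"\]}} is "attr_meta_date" and {{names\["author"\]}} is "author".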
To configure a Web site as a data source for the index, select Web from the Data Source Overview screen and click *Create*.
| Crawl depth | The number of levels of links from the base URL to be crawled. Enter 0 to crawl only the URL entered, or 1 or higher to crawl that number of links from the base URL. Leaving this field empty will crawl everything linked from the base URL and linked to those links, even if it is 10 or more levels away. If the base URL is a public internet site, unless you constrain the crawl using "Constrain to" or define Include or Exclude Paths, the LucidWorks crawler may run forever. Note that the LucidWorks crawler is not designed to create an index of the entire internet, and there may be severe performance or index space problems if you do not stop it manually. |
| Constrain to | If you choose *tree*, the crawler will only access pages using the URL as the base path. It is recommended to always change this to tree, unless you are sure your site does not contain a lot of links to the broader internet. For example, if crawling [http://www.cnn.com/US], the tree option is the equivalent of [http://www.cnn.com/US\*|http://www.cnn.com/US*] and LucidWorks would not crawl [http://www.cnn.com/WORLD], [http://www.cnnmexico.com/], or [http://us.cnn.com/]. \\
\\
If you require more advanced crawl limiting, you should choose *none* and use the Include Paths or Exclude Paths options to limit the crawl to the specific site. |
| Include paths | The directories on the site that should be crawled for indexing. If you leave this field blank, all paths will be followed (except when tree is chosen as a constraint), even if they lead away from the original URL entered. To limit crawling to a specific site, repeat the URL in this field with a regular expression to indicate all pages from the site (that is, if you entered the URL {{[http://www.lucidimagination.com]}}, the entry in the Include Paths would be {{[http://www\.lucidimagination\.com/.\*|http://www.lucidimagination.com]}}). See [help:Using Regular Expressions] for more information. |
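The "tree" constraint and an Include Paths expression from the table above can be compared with a small sketch. It rests on two assumptions that the table does not confirm: that "tree" behaves like a plain URL prefix test, and that an Include Paths expression must match the whole URL. The helper names {{within_tree}} and {{included}} are hypothetical.

```python
import re
from typing import List, Pattern


def within_tree(base_url: str, candidate: str) -> bool:
    """Model "Constrain to: tree" as a prefix test on the base URL."""
    return candidate.startswith(base_url)


def included(url: str, include_paths: List[Pattern[str]]) -> bool:
    """Model Include Paths: an empty list follows everything."""
    if not include_paths:
        return True
    return any(p.fullmatch(url) for p in include_paths)


# The Include Paths expression from the table, with the dots escaped.
SITE_ONLY = [re.compile(r"http://www\.lucidimagination\.com/.*")]

base = "http://www.cnn.com/US"
tree_results = {
    url: within_tree(base, url)
    for url in (
        "http://www.cnn.com/US/politics",
        "http://www.cnn.com/WORLD",
        "http://www.cnnmexico.com/",
        "http://us.cnn.com/",
    )
}
```

Only the first URL in {{tree_results}} passes the prefix test. Under the whole-URL matching assumption, the expression requires the trailing slash, so a link written as {{http://www.lucidimagination.com}} without it would not be included.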