Use the Crawling menu to set date and URL masks, passwords, content types, connections, form definitions, and URL entrypoints.
Most websites have one primary entry point or home page that a customer initially visits. This main entry point is the URL address from which the search robot begins index crawling. However, if your website has multiple domains or subdomains, or if portions of your site are not linked from the primary entry point, you can use URL Entrypoints to add more entry points.
All website pages below each specified URL entry point are indexed. You can combine URL entry points with masks to control exactly which portions of a website that you want to index. You must rebuild your website index before the effects of URL Entrypoints settings are visible to customers.
The main entry point is typically the URL of the website that you want to index and search. You configure this main entry point in Account Settings.
See Configuring your account settings.
After you have specified the main URL entry point, you can optionally specify additional entry points, which are crawled in the order listed. Most often, you specify additional entry points for web pages that are not linked from pages under the main entry point. Specify additional entry points when your website spans more than one domain, as in the following example:
https://www.domain.com/
https://www.domain.com/not_linked/but_search_me_too/
https://more.domain.com/
You can qualify each entry point with one or more of the space-separated keywords in the following table. These keywords affect how the page is indexed.

Important: Be sure that you separate a given keyword from the entry point, and from other keywords, with a space; a comma is not a valid separator.

| Keyword | Description |
|---|---|
| noindex | If you do not want to index the text on the entry point page, but you do want to follow the page's links, add noindex after the entry point. Separate the keyword from the entry point with a space, as in the following example: https://www.mydomain.com/ noindex. This keyword is equivalent to a robots meta tag with content="noindex". |
| nofollow | If you want to index the text on the entry point page, but you do not want to follow any of the page's links, add nofollow after the entry point. Separate the keyword from the entry point with a space, as in the following example: https://www.mydomain.com/ nofollow. This keyword is equivalent to a robots meta tag with content="nofollow". |
| form | When the entry point is a login page, form is typically used so that the search robot can submit the login form and receive the appropriate cookies before crawling the website. |
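As a sketch, an entry-point list that combines these keywords might look like the following (all URLs hypothetical):

https://www.mydomain.com/
https://www.mydomain.com/printable/archive.html noindex
https://www.mydomain.com/login.html form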
See also About Content Types.
See also About Index Connector.
If your website has multiple domains or subdomains and you want them crawled, you can use URL Entrypoints to add more URLs.
To set your website’s main URL entry point, you use Account Settings.
See Configuring your account settings.
To add multiple URL entry points that you want indexed
On the product menu, click Settings > Crawling > URL Entrypoints.
On the URL Entrypoints page, in the Entrypoints field, enter one URL address per line.
(Optional) In the Add Index Connector Configurations drop-down list, select an index connector that you want to add as an entry point for indexing.
The drop-down list is only available if you have previously added one or more index connector definitions.
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
URL masks are patterns that determine which of your website documents the search robot does or does not index.
Be sure that you rebuild your site index so that the results of your URL masks are visible to your customers.
See Configuring an incremental index of a staged website.
The following are two kinds of URL masks that you can use:
- Include URL masks tell the search robot to index any documents that match the mask's pattern.
- Exclude URL masks tell the search robot not to index matching documents.
As the search robot travels from link to link through your website, it encounters URLs and looks for masks that match those URLs. The first match determines whether to include or exclude that URL from the index. If no mask matches an encountered URL, that URL is discarded from the index.
Include URL masks for your entrypoint URLs are automatically generated. This behavior ensures that all encountered documents on your website are indexed. It also conveniently excludes links that "leave" your website. For example, if an indexed page links to https://www.yahoo.com, the search robot does not index that URL because it does not match the include mask that is automatically generated from the entrypoint URL.
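For example, an entrypoint of https://www.mydomain.com/ behaves as though the following mask were listed at the top of your mask list:

include https://www.mydomain.com/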
Each URL mask that you specify must be on a separate line.
The mask can specify any of the following:

- A full path, as in https://www.mydomain.com/products.html.
- A partial path, as in https://www.mydomain.com/products.
- A URL that uses wildcards, as in https://www.mydomain.com/*.html.
- A regular expression (for advanced users). To make a mask a regular expression, insert the keyword regexp between the mask type (exclude or include) and the URL mask.
The following is a simple exclude URL mask example:
exclude https://www.mydomain.com/photos
Because this example is an exclude URL mask, any document that matches the pattern is not indexed. The pattern matches any encountered item, both files and folders, so https://www.mydomain.com/photos.html and https://www.mydomain.com/photos/index.html, both of which match the exclude URL, are not indexed. To match only files in the /photos/ folder, the URL mask must contain a trailing slash, as in the following example:
exclude https://www.mydomain.com/photos/
The following exclude mask example uses a wild card. It tells the search robot to overlook files with the “.pdf” extension. The search robot does not add these files to your index.
exclude *.pdf
A simple include URL mask is the following:
include https://www.mydomain.com/news/
Only documents that are linked by way of a series of links from a URL entrypoint, or that are used as a URL entrypoint themselves, are indexed. Solely listing a document’s URL as an include URL mask does not index an unlinked document. To add unlinked documents to your index, you can use the URL Entrypoints feature.
Include masks and exclude masks can work together. You can exclude a large portion of your website from indexing by creating an exclude URL mask yet include one or more of those excluded pages with an include URL mask. For example, suppose your entrypoint URL is the following:
https://www.mydomain.com/photos/
The search robot crawls and indexes all of the pages under /photos/summer/, /photos/spring/, and /photos/fall/ (assuming that there are links to at least one page in each directory from the photos folder). This behavior occurs because the link paths enable the search robot to find the documents in the /summer/, /spring/, and /fall/ folders, and the folder URLs match the include mask that is automatically generated from the entrypoint URL.
You can choose to exclude all pages in the /fall/
folder with an exclude URL mask as in the following example:
exclude https://www.mydomain.com/photos/fall/
Or, selectively include only /photos/fall/redleaves4.html
as part of the index with the following URL mask:
include https://www.mydomain.com/photos/fall/redleaves4.html
For the above two mask examples to work as intended, the include mask must be listed first, as in the following:
include https://www.mydomain.com/photos/fall/redleaves4.html
exclude https://www.mydomain.com/photos/fall/
Because the search robot follows directions in the order that they are listed, the search robot first includes /photos/fall/redleaves4.html, and then excludes the rest of the files in the /fall/ folder.
If the instructions are specified in the opposite order, as in the following:
exclude https://www.mydomain.com/photos/fall/
include https://www.mydomain.com/photos/fall/redleaves4.html
Then /photos/fall/redleaves4.html is not included, even though the mask specifies that it should be.
A URL mask that appears first always takes precedence over a URL mask that appears later in the mask settings. Additionally, if the search robot encounters a page that matches an include URL mask and an exclude URL mask, the mask that is listed first always takes precedence.
See Configuring an incremental index of a staged website.
You can qualify each include mask with one or more space-separated keywords, which affect how the matched pages are indexed.
A comma is not valid as a separator between the mask and the keyword; you can only use spaces.
| Keyword | Description |
|---|---|
| noindex | If you do not want to index the text on the pages that match the URL mask, but you do want to follow the matched pages' links, add noindex after the mask, as in the following example: include *.swf noindex. The example specifies that the search robot follow all links from files with the ".swf" extension, but disables indexing of all text in those files. The noindex keyword is equivalent to a robots meta tag with content="noindex". |
| nofollow | If you want to index the text on the pages that match the URL mask, but you do not want to follow the matched page's links, add nofollow after the mask, as in the following example: include https://www.mydomain.com/photos nofollow. The nofollow keyword is equivalent to a robots meta tag with content="nofollow". |
| regexp | Used for both include and exclude masks. Any URL mask preceded with regexp is treated as a regular expression. For example, suppose you have the following exclude regular expression URL mask: exclude regexp .*\.pdf$. The search robot excludes matching files such as https://www.mydomain.com/docs/manual.pdf. If you had the following exclude regular expression URL mask: exclude regexp .*\?.*, the search robot does not include any URL that contains a CGI parameter, such as https://www.mydomain.com/cgi/program?arg1=val1. If you had the following include regular expression URL mask: include regexp .*\.swf$ noindex, the search robot follows all links from files with the ".swf" extension; the noindex keyword also specifies that the text of the matched files is not indexed. See Regular Expressions. |
You can use URL Masks to define which parts of your website that you want or do not want crawled and indexed.
Use the Test URL Masks field to test whether a document is or is not included after you index.
Be sure that you rebuild your site index so that the results of your URL masks are visible to your customers.
See Configuring an incremental index of a staged website.
To add URL masks to index or not index parts of your website
On the product menu, click Settings > Crawling > URL Masks.
(Optional) On the URL Masks page, in the Test URL Masks field, enter a test URL mask from your website, and then click Test.
In the URL Masks field, type include (to add a website that you want crawled and indexed) or exclude (to block a website from being crawled and indexed), followed by the URL mask address.
Enter one URL mask address per line, as in the following example:
include https://www.mycompany.com/summer
include https://www.mycompany.com/spring
exclude regexp .*\.xml
exclude https://www.mycompany.com/fall
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Date Masks to include or exclude files from your search results based on the age of the file.
Be sure that you rebuild your site index so that the results of your date masks are visible to your customers.
See Configuring an incremental index of a staged website.
The following are the two kinds of date masks that you can use:

- Include date masks (include-days and include-date) index files that are dated on or before the specified date.
- Exclude date masks (exclude-days and exclude-date) exclude files that are dated on or before the specified date from the index.
By default, the file date is determined from meta tag information. If no meta tag is found, the date of a file is determined from the HTTP header that the server returns when the search robot downloads the file.
Each date mask that you specify must be on a separate line.
The mask can specify any of the following:
- A full path, as in https://www.mydomain.com/products.html.
- A partial path, as in https://www.mydomain.com/products.
- A URL that uses wildcards, as in https://www.mydomain.com/*.html.
- A regular expression. To make a mask a regular expression, insert the keyword regexp before the URL.

Both include and exclude date masks can specify a date in one of the two following ways. The masks are only applied if the matched files were created on or before the specified date:
A number of days. For example, suppose your date mask is the following:
exclude-days 30 https://www.mydomain.com/docs/archive/
The specified number of days is counted back from today to arrive at a date. If the file is dated on or before that date, the mask is applied.
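For example, if today were 2011-03-15, the mask above would apply to any matching file dated on or before 2011-02-13, that is, 30 days earlier.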
An actual date, using the format YYYY-MM-DD. For example, suppose your date mask is the following:
include-date 2011-02-15 https://www.mydomain.com/docs/archive/
If the matched document is dated on or before the specified date, the date mask is applied.
The following is a simple exclude date mask example:
exclude-days 90 https://www.mydomain.com/docs/archive
Because this is an exclude date mask, any file that matches the pattern and is 90 days old or older is not indexed. When you exclude a document, no text is indexed and no links are followed from that file; the file is effectively ignored. In this example, both files and folders might match the specified URL pattern. Notice that both https://www.mydomain.com/docs/archive.html and https://www.mydomain.com/docs/archive/index.html match the pattern and are not indexed if they are 90 days old or older. To match only files in the /docs/archive/ folder, the date mask must contain a trailing slash, as in the following:
exclude-days 90 https://www.mydomain.com/docs/archive/
Date masks can also be used with wild cards. The following exclude mask tells the search robot to overlook files with the “.pdf” extension that are dated on or before 2011-02-15. The search robot does not add any matched files to your index.
exclude-date 2011-02-15 *.pdf
An include date mask looks similar, except that matched files are added to the index. The following include date mask example tells the search robot to index the text from any files that are zero days old or older in the /docs/archive/manual/ area of the website:
include-days 0 https://www.mydomain.com/docs/archive/manual/
Include masks and exclude masks can work together. For example, you can exclude a large portion of your website from indexing by creating an exclude date mask yet include one or more of those excluded pages with an include URL mask. If your entrypoint URL is the following:
https://www.mydomain.com/archive/
The search robot crawls and indexes all of the pages under /archive/summer/, /archive/spring/, and /archive/fall/ (assuming that there are links to at least one page in each folder from the archive folder). This behavior occurs because the link paths enable the search robot to "find" the files in the /summer/, /spring/, and /fall/ folders, and the folder URLs match the include mask that is automatically generated from the entrypoint URL.
See Configuring your account settings.
You may choose to exclude all pages over 90 days old in the /fall/
folder with an exclude date mask as in the following:
exclude-days 90 https://www.mydomain.com/archive/fall/
You can selectively include only /archive/fall/index.html (regardless of its age, because any file that is 0 days old or older is matched) as part of the index with the following date mask:
include-days 0 https://www.mydomain.com/archive/fall/index.html
For the above two mask examples to work as intended, you must list the include mask first as in the following:
include-days 0 https://www.mydomain.com/archive/fall/index.html
exclude-days 90 https://www.mydomain.com/archive/fall/
Because the search robot follows directions in the order they are specified, the search robot first includes /archive/fall/index.html, and then excludes the rest of the files in the /fall/ folder.
If the instructions are specified in the opposite order, as in the following:
exclude-days 90 https://www.mydomain.com/archive/fall/
include-days 0 https://www.mydomain.com/archive/fall/index.html
Then /archive/fall/index.html is not included, even though the mask specifies that it should be. A date mask that appears first always takes precedence over a date mask that appears later in the mask settings. Additionally, if the search robot encounters a page that matches both an include date mask and an exclude date mask, the mask that is listed first always takes precedence.
See Configuring an incremental index of a staged website.
You can qualify each include mask with one or more space-separated keywords, which affect how the matched pages are indexed.
A comma is not valid as a separator between the mask and the keyword; you can only use spaces.
| Keyword | Description |
|---|---|
| noindex | If you do not want to index the text on the pages that are dated on or before the date that is specified by the include mask, but you do want to follow the matched pages' links, add noindex after the mask, as in the following example: include-days 10 *.swf noindex. Be sure that you separate the keyword from the mask with a space. The example specifies that the search robot follow all links from files with the ".swf" extension that are 10 days old or older, but disables indexing of all text contained in those files. If you want to make sure that the text of older files is not indexed while still following all links from those files, use an include date mask with the noindex keyword instead of an exclude date mask. |
| nofollow | If you want to index the text on the pages that are dated on or before the date that is specified by the include mask, but you do not want to follow the matched page's links, add nofollow after the mask, as in the following example: include-days 10 https://www.mydomain.com/archive/ nofollow. Be sure that you separate the keyword from the mask with a space. The nofollow keyword is equivalent to a robots meta tag with content="nofollow". |
| server-date | Used for both include and exclude masks. The search robot generally downloads and parses every file before checking the date masks. This behavior occurs because some file types can specify a date inside the file itself; for example, an HTML document can include meta tags that set the date of the file. If you are going to exclude many files based on their date, and you do not want to put an unnecessary load on your servers, you can use server-date. This keyword instructs the search robot to trust the date of the file that is returned by your server instead of parsing each file. For example, the following exclude date mask ignores pages that match the URL if the documents are 90 days old or older, according to the date that is returned by the server in the HTTP headers: exclude-days 90 https://www.mydomain.com/docs/archive/ server-date. If the date that is returned by the server is 90 days or more in the past, server-date specifies that the matched documents are not downloaded, and their links are not followed. You should not use server-date if your files contain date information that overrides the date returned by the server. |
| regexp | Used for both include and exclude masks. Any date mask that is preceded by regexp is treated as a regular expression. If the search robot encounters files that match an exclude regular expression date mask, it does not index those files. If the search robot encounters files that match an include regular expression date mask, it indexes those documents. For example, suppose that you have the following date mask: exclude-days 180 regexp .*archive.*. The mask tells the search robot to exclude matching files that are 180 days old or older, that is, files that contain the word "archive" in their URL. See Regular Expressions. |
You can use Date Masks to include or exclude files from customer search results based on the age of the files.
Use the Test Date and Test URL fields to test whether a file is or is not included after you index.
Be sure that you rebuild your site index so that the results of your date masks are visible to your customers.
See Configuring an incremental index of a staged website.
To add date masks to index or not index parts of your website
On the product menu, click Settings > Crawling > Date Masks.
(Optional) On the Date Masks page, in the Test Date field, enter a date formatted as YYYY-MM-DD (for example, 2011-07-25
); in the Test URL field, enter a URL mask from your website, and then click Test.
In the Date Masks field, enter one date mask address per line.
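The following is a hypothetical example of date masks entered one per line:

include-days 0 https://www.mycompany.com/products/
exclude-days 90 https://www.mycompany.com/archive/
exclude-date 2011-02-15 *.pdf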
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
To access portions of your website that are protected with HTTP Basic Authentication, you can add one or more passwords.
Before the effects of the Password settings are visible to customers, you must rebuild your site index.
See Configuring an incremental index of a staged website.
On the Passwords page, you type each password on a single line. The password consists of a URL or realm, a user name, and a password, as in the following example:
https://www.mydomain.com/ myname mypassword
Instead of using a URL path, as shown above, you can specify a realm.
To determine the correct realm to use, open a password-protected web page with a browser and look at the "Enter Network Password" dialog box. The realm name appears in that dialog box; suppose, in this case, that it is "My Site Realm."
Using that realm name, your password entry might look like the following:
My Site Realm myusername mypassword
If your web site has multiple realms, you can create multiple passwords by entering a user name and password for each realm on a separate line as in the following example:
Realm1 name1 password1
Realm2 name2 password2
Realm3 name3 password3
You can intermix passwords that contain URLs or realms so that your password list might look like the following:
Realm1 name1 password1
https://www.mysite.com/path1/path2 name2 password2
Realm3 name3 password3
Realm4 name4 password4
https://www.mysite.com/path1/path5 name5 password5
https://www.mysite.com/path6 name6 password6
In the list above, the first password that contains a realm or URL matching the server's authentication request is used. Even if the file at https://www.mysite.com/path1/path2/index.html is in Realm3, for example, name2 and password2 are used, because the password that is defined with the URL is listed above the one that is defined with the realm.
You can use Passwords to access password-protected areas of your website for crawling and indexing purposes.
Before the effects of your password additions are visible to customers, be sure that you rebuild your site index.
See Configuring an incremental index of a staged website.
To add passwords for accessing areas of your website that require authentication
On the product menu, click Settings > Crawling > Passwords.
On the Passwords page, in the Passwords field, enter a realm or URL, followed by its associated user name and password, separated by spaces.
Example of a realm password and a URL password on separate lines:
Realm1 name1 password1
https://www.mysite.com/path1/path2 name2 password2
Only add one password per line.
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Content Types to select which types of files you want to crawl and index for this account.
Content types that you can choose to crawl and index include PDF documents, text documents, Adobe Flash movies, files from Microsoft Office applications such as Word, Excel, and PowerPoint, and text in MP3 files. The text that is found within the selected content types is searched along with all of the other text on your website.
Before the effects of the Content Types settings are visible to customers, you must rebuild your site index.
See Configuring an incremental index of a staged website.
If you select the option Text in MP3 Music Files on the Content Types page, an MP3 file is crawled and indexed in one of two ways. The first and most common way is from an anchor href tag in an HTML file as in the following:
<a href="MP3-file-URL"></a>
The second way is to enter the URL of the MP3 file as a URL entrypoint.
An MP3 file is recognized by its MIME type “audio/mpeg”.
Be aware that MP3 music file sizes can be quite large, even though they usually contain only a small amount of text. For example, MP3 files can optionally store such things as the album name, artist name, song title, song genre, year of release, and a comment. This information is stored at the very end of the file in what is called the TAG. Any TAG information that is found in an MP3 file is indexed along with the file.
Note that each MP3 file that is crawled and indexed on your website counts as one page.
If your website contains many large MP3 files, you may exceed the indexing byte limit for your account. If this happens, you can deselect Text in MP3 Music Files on the Content Types page to prevent the indexing of all MP3 files on your website.
If you only want to prevent the indexing of certain MP3 files on your website, you can do one of the following:
- Surround the anchor tags that link to the MP3 files with <nofollow> and </nofollow> tags. The search robot does not follow links between those tags.
- Add the URLs of the MP3 files as exclude masks.
See About URL Masks.
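For example, either of the following exclude masks would keep MP3 files out of the index (the specific URL is hypothetical):

exclude *.mp3
exclude https://www.mydomain.com/music/song1.mp3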
To crawl and index Chinese, Japanese, or Korean MP3 files, complete the steps below. Then, in Settings > Metadata > Injections, specify the character set that is used to encode the MP3 files.
See About Injections.
To select content types to crawl and index
On the product menu, click Settings > Crawling > Content Types.
On the Content Types page, check the file types that you want to crawl and index on your website.
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Connections to add up to ten HTTP connections that the search robot uses to index your website.
Increasing the number of connections can significantly reduce the amount of time that it takes to complete a crawl and index. However, be aware that each additional connection increases the load on your server.
You can reduce the amount of time it takes to index your website by using Connections to increase the number of simultaneous HTTP connections that the crawler uses. You can add up to ten connections.
Be aware that each additional connection increases the load that is placed on your server.
To add connections to increase indexing speed
On the product menu, click Settings > Crawling > Connections.
On the Parallel Indexing Connections page, in the Number of Connections field, enter the number of connections (1-10) that you want to add.
Click Save Changes.
(Optional) Do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Form Submission to help you recognize and process forms on your website.
During the crawling and indexing of your website, each encountered form is compared with the form definitions that you have added. If a form matches a form definition, the form is submitted for indexing. If a form matches more than one definition, the form is submitted once for each matched definition.
You can use Form Submission to help process forms that are recognized on your website for indexing purposes.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
See Configuring an incremental index of a staged website.
To add form definitions for indexing forms on your website
On the product menu, click Settings > Crawling > Form Submission.
On the Form Submission page, click Add New Form.
On the Add Form Definition page, set the Form Recognition and Form Submission options.
The five options in the Form Recognition section of the Form Definition page are used to identify forms in your web pages that can be processed.
The three options in the Form Submission section are used to specify the parameters and values that are submitted with a form to your web server.
Enter one recognition or submission parameter per line. Each parameter must include a name and a value.
| Option | Description |
|---|---|
| Form Recognition | |
| Page URL Mask | Identifies the web page or pages that contain the form. To identify a form that appears on a single page, enter the URL for that page, as in the following example: https://www.mydomain.com/login.html. To identify forms that appear on multiple pages, specify a URL mask that uses wildcards to describe the pages; to identify forms encountered on any ASP page under https://www.mydomain.com/register/, for example, you could specify https://www.mydomain.com/register/*.asp. You can also use a regular expression to identify multiple pages. Just specify the regexp keyword before the URL mask. |
| Action URL Mask | Identifies the action attribute of the <form> tag. Like the page URL mask, the action URL mask can take the form of a single URL, a URL with wildcards, or a regular expression. If you do not want to index the text on pages that are identified by a page URL mask or by an action URL mask, or if you do not want links followed on those pages, you can use the noindex and nofollow keywords. See About URL Entrypoints. See About URL Masks. |
| Form Name Mask | Identifies forms if the <form> tag contains a name attribute. You can use a simple name (login_form), a name with a wildcard (form*), or a regular expression (regexp ^log.*$). You can usually leave this field empty because forms typically do not have a name attribute. |
| Form ID Mask | Identifies forms if the <form> tag contains an id attribute. You can use a simple name (login_form), a name with a wildcard (form*), or a regular expression (regexp ^log.*$). You can usually leave this field empty because forms typically do not have an id attribute. |
| Parameters | Identifies forms that contain, or do not contain, a named parameter or a named parameter with a specific value. For example, to identify a form that contains an e-mail parameter that is preset to rick_brough@mydomain.com, a password parameter, but not a first-name parameter, you would specify the following parameter settings, one per line: email=rick_brough@mydomain.com, password, not first-name. |
| Form Submission | |
| Override Action URL | Specify when the target of the form submission is different from what is specified in the form's action attribute. For example, you might use this option when the form is submitted by way of a JavaScript function that constructs a URL value that is different from what is found in the form. |
| Override Method | Specify when the target of the form submission is different from what is used in the form's action attribute and the submitting JavaScript has changed the method. |
| Parameters | The default values for all form parameters (<input> tags, including hidden ones) are read from the form's page and submitted with the form. You can prefix form submission parameters with the not keyword. When you prefix a parameter with not, that parameter is not submitted as part of the form submission; this is useful for check boxes that must be submitted unselected. For example, suppose you want to submit the following parameters: email with the value searchrobot@mydomain.com, password with the value tryme, and mycheckbox as unselected, with all other form parameters at their default values. Your form submission parameters would look like the following, one per line: email=searchrobot@mydomain.com, password=tryme, not mycheckbox. The method attribute of the <form> tag is used to decide whether the data is submitted to your server with the GET method or the POST method. If the <form> tag does not contain a method attribute, the form data is submitted with the GET method. |
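As a sketch, a definition for a hypothetical login form might combine these options as follows (all values illustrative):

Page URL Mask: https://www.mydomain.com/login.html
Action URL Mask: /cgi-bin/login.cgi
Parameters (Form Recognition): email, password
Parameters (Form Submission): email=searchrobot@mydomain.com, password=tryme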
Click Add.
(Optional) Do any of the following:
Click Live.
Click Push Live.
You can edit an existing form definition if a form on your website has changed or if you just need to change the definition.
Be aware that there is no History feature on the Form Submission page to revert any changes that you make to a form definition.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
See Configuring an incremental index of a staged website.
To edit a form definition
On the product menu, click Settings > Crawling > Form Submission.
On the Form Submission page, click Edit to the right of a form definition that you want to update.
On the Edit Form Definition page, set the Form Recognition and Form Submission options.
See the table of options under Adding form definitions for indexing forms on your website.
Click Save Changes.
(Optional) Do any of the following:
Click Live.
Click Push Live.
You can delete an existing form definition if the form no longer exists on your website, or if you no longer want to process and index a particular form.
Be aware that there is no History feature on the Form Submission page to revert any changes that you make to a form definition.
Be sure that you rebuild your site index so that the results of your changes are visible to your customers.
See Configuring an incremental index of a staged website.
To delete a form definition
On the product menu, click Settings > Crawling > Form Submission.
On the Form Submission page, click Delete to the right of a form definition that you want to remove.
Make sure you choose the right form definition to delete. There is no delete confirmation dialog box when you click Delete in the next step.
On the Delete Form Definition page, click Delete.
(Optional) Do any of the following:
Click Live.
Click Push Live.
Use Index Connector to define additional input sources for indexing XML pages or any kind of feed.
You can use a data feed input source to access content that is stored in a form that is different from what is typically discovered on a website using one of the available crawl methods. Each document that is crawled and indexed directly corresponds to a content page on your website. However, a data feed either comes from an XML document or from a comma- or tab-delimited text file, and contains the content information to index.
An XML data source consists of XML stanzas, or records, that contain information that corresponds to individual documents. These individual documents are added to the index. A text data feed contains individual new-line-delimited records that correspond to individual documents. These individual documents are also added to the index. In either case, an index connector configuration describes how to interpret the feed. Each configuration describes where the file resides and how the servers access it. The configuration also describes “mapping” information. That is, how each record’s items are used to populate the metadata fields in the resulting index.
After you add an Index Connector definition to the Staged Index Connector Definitions page, you can change any configuration setting, except for the Name or Type values.
The Index Connector page shows you the following information:

- The name of each index connector that you have configured and added.
- The data source type for each connector that you have added: Text, Feed, or XML.
- Whether the connector is enabled for the next crawl and index.
- The address of the data source.
See also About Index Connector.
| Step | Process | Description |
|---|---|---|
| 1 | Download the data source. | For Text and Feed configurations, it is a simple file download. |
| 2 | Break down the downloaded data source into individual pseudo-documents. | For Text, each newline-delimited line of text corresponds to an individual document and is parsed using the specified delimiter, such as a comma or tab. For Feed, each document's data is extracted using a regular expression pattern of the form <Itemtag>(.*?)</Itemtag>. Using Map on the Index Connector Add page, a cached copy of the data is created, and then a list of links for the crawler is created. The data is stored in a local cache and is populated with the configured fields. The parsed data is written to the local cache; this cache is read later to create the simple HTML documents that the crawler needs. The <title> element is only generated when a mapping exists to the Title metadata field. Similarly, the <body> element is only generated when a mapping exists to the Body metadata field. Important: There is no support for the assignment of values to the pre-defined URL meta tag. For all other mappings, <meta> tags are generated for each field that has data found in the original document. The fields for each document are added to the cache. For each document that is written to the cache, a link of the form index:<configuration_name>?key=<primary-key-value> is also generated. The configuration's mapping must have one field identified as the Primary Key; this mapping forms the key that is used when data is fetched from the cache. The crawler recognizes the index: URL scheme prefix, which can then access the locally cached data. |
| 3 | Crawl the cached document set. | The index: links are added to the crawler's pending list and are processed in the normal crawl sequence. |
| 4 | Process each document. | Each link's key value corresponds to an entry in the cache, so crawling each link results in that document's data being fetched from the cache. It is then "assembled" into an HTML image that is processed and added to the index. |
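For illustration, assuming a configuration that maps fields to the Title and Body metadata fields plus a hypothetical price field, the cached record for one document might be assembled into a simple HTML image like the following:

<html>
  <head>
    <title>Example product</title>
    <meta name="price" content="19.99">
  </head>
  <body>Example product description</body>
</html>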
The indexing process for XML configuration is similar to the process for Text and Feed configurations with the following minor changes and exceptions.
Because the documents for XML crawls are already separated into individual files, steps 1 and 2 in the table above do not directly apply. If you specify a URL in the Host Address and File Path fields of the Index Connector Add page, it is downloaded and processed as a normal HTML document. The expectation is that the downloaded document contains a collection of <a href="{url}"> links, each of which points to an XML document that is processed. Such links are converted to the following form:

<a href="index:<ic_config_name>?url={url}">
For example, if the Adobe setup returned the following links:

<a href="https://www.adobe.com/somepath/doc1.xml">doc 1</a>
<a href="https://www.adobe.com/otherpath/doc2.xml">doc 2</a>

they would be converted to the following:

<a href="index:Adobe?url=https://www.adobe.com/somepath/doc1.xml">doc 1</a>
<a href="index:Adobe?url=https://www.adobe.com/otherpath/doc2.xml">doc 2</a>

In the table above, step 3 does not apply, and step 4 is completed at the time of crawling and indexing.
Alternately, you can intermix your XML documents with other documents that were discovered naturally through the crawl process. In such cases, you can use rewrite rules ( Settings > Rewrite Rules > Crawl List Retrieve URL Rules) to change the XML documents’ URLs to direct them to Index Connector.
See About Crawl List Retrieve URL Rules.
For example, suppose you have the following rewrite rule:
RewriteRule (^http.*[.]xml$) index:Adobe?key=$1
This rule translates any URL ending with .xml
into an Index Connector link. The crawler recognizes and rewrites the index:
URL scheme. The download process is redirected through the Index Connector Apache server on the primary. Each downloaded document is examined using the same regular expression pattern that is used with Feeds. In this case, however, the manufactured HTML document is not saved in the cache. Instead, it is handed directly to the crawler for index processing.
You can define multiple Index Connector configurations for any account. The configurations are automatically added to the drop-down list in Settings > Crawling > URL Entrypoints.
Selecting a configuration from the drop-down list adds the value to the end of the list of URL entry points.
While disabled Index Connector configurations are added to the drop-down list, you cannot select them. If you select the same Index Connector configuration a second time, it is added to the end of the list, and the previous instance is deleted.
To specify an Index Connector entry point for an incremental crawl, you can add entries using the following format:
index:<indexconnector_configuration_name>
The crawler processes each added entry if it is found on the Index Connectors page and is enabled.
Note: Because each document's URL is constructed using the Index Connector configuration name and the document's primary key, be sure that you use the same Index Connector configuration name when performing incremental updates. Doing so permits Adobe Search&Promote to correctly update previously indexed documents.
See also About URL Entrypoints.
The use of Setup Maps when you add an Index Connector
At the time you add an Index Connector, you can optionally use the feature Setup Maps to download a sample of your data source. The data is examined for indexing suitability.
| If you chose the Index Connector type... | The Setup Maps feature... |
|---|---|
| Text | Determines the delimiter value by trying tabs first, then vertical bars ( \| ), and finally commas ( , ). If you already specified a delimiter value before you clicked Setup Maps, that value is used instead. The best-fit scheme results in the Map fields being filled out with guesses at the appropriate Tag and Field values. Additionally, a sampling of the parsed data is displayed. Be sure to select Headers in First Row if you know that the file includes a header row. The setup function uses this information to better identify the resulting map entries. |
| Feed | Downloads the data source and performs simple XML parsing. The resulting XPath identifiers are displayed in the Tag rows of the Map table, and similar values in Fields. These rows only identify the available data; they do not generate the more complicated XPath definitions. However, the feature is still helpful because it describes the XML data and identifies Itemtag values. Note: The Setup Maps function downloads the entire XML source to perform its analysis. If the file is large, this operation could time out. When successful, this function identifies all possible XPath items, many of which are not desirable to use. Be sure that you examine the resulting Map definitions and remove the ones that you do not need or want. |
| XML | Downloads the URL of a representative individual document, not the primary link list. This single document is parsed using the same mechanism that is used with Feeds, and the results are displayed. Before you click Add to save the configuration, be sure that you change the URL back to the primary link-list document. |
Important: The Setup Maps feature may not work for large XML data sets because its file parser attempts to read the entire file into memory. As a result, you could experience an out-of-memory condition. However, when the same document is processed at the time of indexing, it is not read into memory. Instead, large documents are processed “on the go,” and are not read entirely into memory first.
The use of Preview when you add an Index Connector
At the time you add an Index Connector, you can optionally use the feature Preview to validate the data, as though you were saving it. It runs a test against the configuration, but without saving the configuration to the account. The test accesses the configured data source. However, it writes the download cache to a temporary location; it does not conflict with the main cache folder that the indexing crawler uses.
Preview only processes a default of five documents, as controlled by Acct:IndexConnector-Preview-Max-Documents. The previewed documents are displayed in source form, as they are presented to the indexing crawler. The display is similar to a "View Source" feature in a web browser. You can navigate the documents in the preview set using standard navigation links.
Preview does not support XML configurations because such documents are processed directly and not downloaded to the cache.
Each Index Connector configuration defines a data source and mappings to relate the data items defined for that source to metadata fields in the index.
Before the effects of the new and enabled definition are visible to customers, rebuild your site index.
To add an Index Connector definition
On the product menu, click Settings > Crawling > Index Connector.
On the Stage Index Connector Definitions page, click Add New Index Connector.
On the Index Connector Add page, set the connector options that you want. The options that are available depend on the Type that you selected.
| Option | Description |
|---|---|
| Name | The unique name of the Index Connector configuration. You can use alphanumeric characters. The characters "_" and "-" are also allowed. |
| Type | The source of your data. The data source type that you select affects the resulting options that are available on the Index Connector Add page. You can choose from the following: Text, Feed, or XML. |
| Data source type: Text | |
| Enabled | Turns the configuration "on" to crawl and index. Or, you can turn "off" the configuration to prevent crawling and indexing. Note: Disabled Index Connector configurations are ignored if they are found in an entrypoint list. |
| Host Address | Specifies the IP address or the URL address of the host system where the data source file is found. If desired, you can specify a full URI (Uniform Resource Identifier) path to the data source document, as in the following hypothetical examples: ftp://ftp.mydomain.com/path/feed.txt or sftp://ftp.mydomain.com/path/feed.txt. The URI is broken down into the appropriate entries for the Host Address, File Path, Protocol, and, optionally, Username and Password fields. |
| File Path | Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file. The path is relative to the root of the host address. |
| Incremental File Path | Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Incremental Index operations. If no file is specified, the file listed under File Path is used instead. |
| Vertical File Path | Specifies the path to the simple flat text file, comma-delimited, tab-delimited, or other consistently delimited format file, to be used during a Vertical Update. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Vertical Update operations. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. |
| Deletes File Path | Specifies the path to the simple flat text file that contains a single document identifier value per line. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Incremental Index operations. The values found in this file are used to construct "delete" requests to remove previously indexed documents. The values in this file must correspond to the values found in the Full or Incremental File Path files, in the column identified as the Primary Key. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. |
| Protocol | Specifies the protocol that is used to access the file. You can choose from the following: FTP, SFTP, HTTP, or HTTPS. |
| Timeout | Specifies the timeout, in seconds, for FTP, SFTP, HTTP, or HTTPS connections. This value must be between 30 and 300. |
| Retries | Specifies the maximum number of retries for failed FTP, SFTP, HTTP, or HTTPS connections. This value must be between 0 and 10. A value of zero (0) prevents retry attempts. |
| Encoding | Specifies the character encoding system that is used in the specified data source file. |
| Delimiter | Specifies the character that you want to use to delineate each field in the specified data source file. The comma character ( , ) is an example of a delimiter. The comma acts as a field delimiter that helps to separate data fields in your specified data source file. Select Tab? to use the horizontal-tab character as the delimiter. |
| Headers in First Row | Indicates that the first row in the data source file contains header information only, not data. |
| Minimum number of documents for indexing | If set to a positive value, specifies the minimum number of records expected in the downloaded file. If fewer records are received, the index operation is aborted. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. Note: This feature is only used during full Index operations. |
| Map | Specifies column-to-metadata mappings, using column numbers. |
| Data source type: Feed | |
| Enabled | Turns the configuration "on" to crawl and index. Or, you can turn "off" the configuration to prevent crawling and indexing. Note: Disabled Index Connector configurations are ignored if they are found in an entrypoint list. |
| Host Address | Specifies the IP address or the URL address of the host system where the data source file is found. |
| File Path | Specifies the path to the primary XML document that contains multiple "rows" of information. The path is relative to the root of the host address. |
| Incremental File Path | Specifies the path to the incremental XML document that contains multiple "rows" of information. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Incremental Index operations. If no file is specified, the file listed under File Path is used instead. |
| Vertical File Path | Specifies the path to the XML document that contains multiple sparse "rows" of information, to be used during a Vertical Update. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Vertical Update operations. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. |
| Deletes File Path | Specifies the path to the simple flat text file that contains a single document identifier value per line. The path is relative to the root of the host address. This file, if specified, is downloaded and processed during Incremental Index operations. The values found in this file are used to construct "delete" requests to remove previously indexed documents. The values in this file must correspond to the values found in the Full or Incremental File Path files, in the column identified as the Primary Key. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. |
| Protocol | Specifies the protocol that is used to access the file. You can choose from the following: FTP, SFTP, HTTP, or HTTPS. |
| Itemtag | Identifies the XML element that you can use to identify individual XML "rows" in the data source file that you specified. For example, in a Feed fragment of an Adobe XML document, the Itemtag value might be record. |
| Minimum number of documents for indexing | If set to a positive value, specifies the minimum number of records expected in the downloaded file. If fewer records are received, the index operation is aborted. Note: This feature is not enabled by default. Contact Technical Support to activate the feature for your use. Note: This feature is only used during full Index operations. |
| Map | Lets you specify XML-element-to-metadata mappings, using XPath expressions. |
| Data source type: XML | |
| Enabled | Turns the configuration "on" to crawl and index. Or, you can turn "off" the configuration to prevent crawling and indexing. Note: Disabled Index Connector configurations are ignored if they are found in an entrypoint list. |
| Host Address | Specifies the URL address of the host system where the data source file is found. |
| File Path | Specifies the path to the primary XML document that contains links (<a>) to individual XML documents. The path is relative to the root of the host address. |
| Protocol | Specifies the protocol that is used to access the file. You can choose from the following: FTP, SFTP, HTTP, or HTTPS. Note: The Protocol setting is only used when there is information specified in the Host Address and/or File Path fields. Individual XML documents are downloaded using either HTTP or HTTPS, according to their URL specifications. |
| Itemtag | Identifies the XML element that defines a "row" in the data source file that you specified. |
| Map | Lets you specify column-to-metadata mappings, using column numbers. |
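For example, a minimal comma-delimited Text data source (all field names and values hypothetical) might look like the following, with Headers in First Row selected and the id column mapped as the Primary Key:

id,title,body,price
101,Blue Widget,A small blue widget,9.99
102,Red Widget,A large red widget,19.99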
(Optional) Click Setup Maps to download a sample of your data source. The data is examined for indexing suitability. This feature is available for Text and Feed Types, only.
(Optional) Click Preview to test the actual working of the configuration. This feature is available for Text and Feed Types, only.
Click Add to add the configuration to the Index Connector Definitions page and to the Index Connector Configurations drop-down list on the URL Entrypoints page.
On the Index Connector Definitions page, click rebuild your staged site index.
(Optional) On the Index Connector Definitions page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can edit an existing Index Connector that you have defined.
Not all options are available for you to change; for example, you cannot change the Index Connector Name or the Type that is selected in the Type drop-down list.
To edit an Index Connector definition
On the product menu, click Settings > Crawling > Index Connector.
On the Index Connector page, under the Actions column heading, click Edit for an Index Connector definition name whose settings you want to change.
On the Index Connector Edit page, set the options you want.
See the table of options under Adding an Index Connector definition.
Click Save Changes.
(Optional) On the Index Connector Definitions page, click rebuild your staged site index.
(Optional) On the Index Connector Definitions page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can review the configuration settings of an existing index connector definition.
After an Index Connector definition is added to the Index Connector Definitions page, you cannot change its Type setting. Instead, you must delete the definition and then add a new one.
To view the settings of an Index Connector definition
You can copy an existing Index Connector definition to use as the basis for a new Index Connector that you want to create.
When copying an Index Connector definition, the copied definition is disabled by default. To enable or “turn on” the definition, you must edit it from the Index Connector Edit page, and select Enable.
See Editing an Index Connector definition.
To copy an Index Connector definition
On the product menu, click Settings > Crawling > Index Connector.
On the Index Connector page, under the Actions column heading, click Copy for an Index Connector definition name whose settings you want to duplicate.
On the Index Connector Copy page, enter the new name of the definition.
Click Copy.
(Optional) On the Index Connector Definitions page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can change the name of an existing Index Connector definition.
After you rename the definition, check Settings > Crawling > URL Entrypoints. You want to ensure that the new definition name is reflected in the drop-down list on the URL Entrypoints page.
See Adding multiple URL entry points that you want indexed.
To rename an Index Connector definition
On the product menu, click Settings > Crawling > Index Connector.
On the Index Connector page, under the Actions column heading, click Rename for the Index Connector definition name that you want to change.
On the Index Connector Rename page, enter the new name of the definition in the Name field.
Click Rename.
Click Settings > Crawling > URL Entrypoints. If the previous Index Connector’s name is present in the list, remove it, and add the newly renamed entry.
See Adding multiple URL entry points that you want indexed.
(Optional) On the Index Connector Definitions page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can delete an existing Index Connector definition that you no longer need or use.
To delete an Index Connector definition