Use the Filtering menu to set up scripts that change the content of a web document before it is indexed.
You can use Filtering Script to change the content of a web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document’s URL, MIME type, and existing content. The filtering script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the filtering script with an initialization script, termination script, URL masks script, and test URL.
The filtering script is run each time a document is read from your website. The script runs as a standard filter: it reads data from STDIN, transforms that data in some way, and writes the results to STDOUT. You can also print status messages from the filtering script to the index log, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
Some GNU diff options that you can use in Expert (diff) mode on the Staged Filtering Script page include the following:
GNU diff option | Description |
---|---|
-b | Ignores changes in the amount of white space. |
-B | Ignores changes that insert or delete blank lines. |
-c | Uses the context output format, showing three lines of context. |
-C lines | Uses the context output format, showing the number of context lines given by lines (an integer), or three if lines is not given. |
-i | Ignores changes in case; considers uppercase and lowercase letters equivalent. |
-f | Makes output that looks similar to an ed script but has changes in the order that they appear in the file. |
-n | Outputs RCS-format diffs; like -f except that each command specifies the number of lines that are affected. |
-u | Uses the unified output format, showing three lines of context. |
-U lines | Uses the unified output format, showing the number of context lines given by lines (an integer), or three if lines is not given. |
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace “main::”. When the filtering script is started, its environment contains the standard file handles STDIN (the downloaded document), STDOUT (the filtered document), and STDERR (messages written to the index log).
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log
_search_debug_log("Done processing document: " . $main::search_url);
These messages appear with the word DEBUG as a preface, and are not logged as errors.
The following is an example of filtering. Web page <title> fields often begin with the company name. Although this information is useful for site navigation, it is not relevant when searching. Suppose that the titles of all MegaCorp web pages start with a common string, such as the following:
<title>MegaCorp -- meaningful title here</title>
To remove "MegaCorp -- " from the beginning of each document title, and to count each document processed by the filtering script, you can use the following script:
# Make sure this is an HTML document.
if ($main::search_content_type =~ /^text\/html/) {
    # Read the entire document into a local scalar variable.
    my @docarray = <>;
    my $doc = join("", @docarray);
    # Remove "MegaCorp -- " from the title (case-insensitive match on the tag).
    $doc =~ s/(<TITLE>)MegaCorp -- /$1/gis;
    # Print the resulting document.
    print $doc;
    # Count that we've filtered one more document.
    # $main::doc_count is set to 0 in the initialization script.
    $main::doc_count++;
}
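The overview also mentions creating new HTML metadata from a document's URL. The following is a minimal sketch of that idea, not a product-supplied routine. The helper name `add_section_meta` and the `section` meta tag are hypothetical, and the sample values stand in for the document on STDIN and for `$main::search_url`.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a "section" value from the first path segment
# of the URL and insert it as a <meta> tag right after the opening <head> tag.
sub add_section_meta {
    my ($doc, $url) = @_;

    # Capture the first path segment, e.g. "faqs" in https://host/faqs/a.html.
    # Only word characters and hyphens are accepted, so the value is safe to
    # place inside an HTML attribute.
    my ($section) = $url =~ m{^https?://[^/]+/([\w-]+)/};
    return $doc unless defined $section;

    # Insert the tag once, after the first <head ...> tag (case-insensitive).
    $doc =~ s{(<head\b[^>]*>)}{$1<meta name="section" content="$section">}i;
    return $doc;
}

# Sample values standing in for STDIN and $main::search_url.
my $sample_doc = "<html><head><title>FAQ</title></head><body>Hi</body></html>";
my $sample_url = "https://www.mysite.com/faqs/shipping.html";
print add_section_meta($sample_doc, $sample_url), "\n";
```

In a real filtering script, the result would be printed to STDOUT only for documents whose content type is HTML, as in the MegaCorp example.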
You can use the following variables in any filtering script:
Variable | Description |
---|---|
$main::search_crawl_type | The value of $main::search_crawl_type indicates the type of index operation underway. Deprecated form: $main::ws_crawl_type. The index operations and associated values include the following: |
$main::search_clear_cache | The value indicates whether the “Clear index cache” indexing option was requested for the current index operation. If “Clear index cache” was requested, the value of $main::search_clear_cache is "1". Deprecated form: $main::ws_clear_cache |
$main::search_fields | The value contains a tab-separated list of the metadata fields that are defined in the account. By default, the value is: url title desc keys target body alt date charset language. Deprecated form: $main::ws_fields |
$main::search_collections | The value contains a tab-separated list of the Collections that are defined in the account. Deprecated form: $main::ws_collections |
$main::search_url | The value is the fully qualified URL of the document. Deprecated form: $main::ws_url |
$main::search_content_type | The value is the content-type of the document as fetched from the http-equiv meta tag. A typical value is “text/html; charset=iso-8859-1”. Deprecated form: $main::ws_content_type |
$main::search_content_class | The value is the content class of the document, as derived from the content-type field. Deprecated form: $main::ws_content_class |
$main::search_syntax_check | The value reflects the use of the “Check Syntax” button. If clicked, the value is 1 (one); otherwise, its value is 0 (zero). Deprecated form: $main::ws_syntax_check |
$main::search_last_mod_date | If provided by the web server, this value contains the Epoch representation (seconds since January 1, 1970) of the document’s last-modified date. You can format this value by using the Perl localtime() library call. |
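As an illustration of two of these variables, the sketch below splits the tab-separated `$main::search_fields` value into a list and formats `$main::search_last_mod_date` as a readable date. The helper names `field_names` and `last_modified_string` are hypothetical, and the values are assigned here only so that the sketch runs on its own; in a real script the indexer sets them.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Sample values; in a real script the indexer sets these before the script runs.
$main::search_fields        = join("\t", qw(url title desc keys target body alt date charset language));
$main::search_last_mod_date = 86400;   # one day after the Epoch

# Split the tab-separated field list into a Perl list.
sub field_names {
    return split /\t/, $main::search_fields;
}

# Format the Epoch seconds as a date string. gmtime keeps the output
# independent of the server's time zone; localtime is called the same way.
sub last_modified_string {
    return undef unless defined $main::search_last_mod_date;
    return strftime("%Y-%m-%d %H:%M:%S", gmtime($main::search_last_mod_date));
}

my @fields = field_names();
print "Number of fields: ", scalar(@fields), "\n";
print "Last modified: ", last_modified_string(), " UTC\n";
```

Because the web server does not always provide a last-modified date, the sketch checks that the value is defined before formatting it.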
All global variables are prefaced with the namespace “main::”: $main::doc_count = 0;
All local variables are declared with “my”: my $i = 0;
Subroutines are defined in the initialization script. They do not need an explicit “main::” namespace: sub my_sub {
...
}
Test the $main::search_content_type before you make changes to a file. Testing can help you avoid making careless changes to binary files, like SWF files or PDF files:
if ($main::search_content_type =~ /^text\/html/) { ...
The $main::search_content_type is the full Content-Type header delivered by your server. It can sometimes contain a simple MIME type, such as “text/html”. Or, it can contain a MIME type followed by other information, like the document’s character set encoding, such as “text/html; charset=iso-8859-1”.
For each type of non-HTML document, $main::search_content_type can take various values. Testing for each value in your script becomes cumbersome. For example, some Word documents have content type values of “application/msword”, “application/vnd.ms-word” or “application/x-msword”. In such cases, $main::search_content_class can take the following values:
In the example, testing $main::search_content_class for “word” would match any of the three possible content-type values.
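The two checks described above can be sketched as small helper subroutines. The names `is_html` and `is_word` are hypothetical; such helpers would typically be defined in the initialization script. The sample values stand in for `$main::search_content_type` and `$main::search_content_class`.

```perl
use strict;
use warnings;

# True when the Content-Type value starts with text/html, with or without
# a trailing charset parameter.
sub is_html {
    my ($content_type) = @_;
    return $content_type =~ m{^text/html}i ? 1 : 0;
}

# True when the content class is "word", the class named in the text above.
# One comparison replaces a test for every Word content-type variant.
sub is_word {
    my ($content_class) = @_;
    return $content_class eq 'word' ? 1 : 0;
}

# Sample values for two different documents; the indexer sets the real values.
my $html_doc_type  = 'text/html; charset=iso-8859-1';
my $word_doc_class = 'word';

print "HTML document\n" if is_html($html_doc_type);
print "Word document\n" if is_word($word_doc_class);
```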
If nothing is printed to STDOUT from the filtering script, then the document is used exactly as it was downloaded. That is, if you do not need to change anything in a document, then you do not need to copy STDIN to STDOUT for that document.
If you want to remove all text from a document, print a valid file to STDOUT. For example, to completely remove all text from an HTML document, you do the following: print "<html></html>";
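The STDOUT behavior described above amounts to a three-way decision per document: print nothing, print a changed document, or print an empty but valid document. The following sketch shows one way to structure that decision. The subroutine name `filtered_output` and both rules (the `/internal/` path and the comment stripping) are hypothetical examples, not product behavior.

```perl
use strict;
use warnings;

# Decide what a filtering script should print for one document.
# Returns undef to mean "print nothing", which leaves the document
# exactly as it was downloaded.
sub filtered_output {
    my ($url, $doc) = @_;

    # Hypothetical rule: remove all text from documents under /internal/.
    return "<html></html>" if $url =~ m{/internal/};

    # Hypothetical rule: strip HTML comments; report a result only if
    # the document actually changed.
    (my $changed = $doc) =~ s/<!--.*?-->//gs;
    return $changed ne $doc ? $changed : undef;
}

# Sample run; a real script would read the document from STDIN instead.
my $result = filtered_output("https://www.mysite.com/a.html", "<p>Hi<!-- note --></p>");
print $result if defined $result;
print "\n";
```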
The filtering script is a Perl script that is run for each document that is downloaded from your website.
You use the filtering script in conjunction with an initialization script, termination script, and URL masks script.
Be sure that you rebuild your site index so that the results of your filtering script are visible to your customers.
See Configuring an incremental index of a staged website.
To add a filtering script
On the product menu, click Settings > Filtering > Filtering Script.
(Optional) On the Filtering Script page, in the Test URL field, enter the URL of a document on your website.
Click a testing option to see changes to the raw HTML text.
Option | Description |
---|---|
Test URL field | Lets you enter the URL of a document on your website. |
Test | Tests the URL against the filtering scripts and URL masks. The test URL document is downloaded, which is then used as the STDIN input to the filtering script. The initialization, filtering, and termination scripts are then run. If there is any STDOUT output from the filtering script, that output is displayed in a new browser window. |
Test only | Tests the script's operation only. |
Preview | Lets you view the page. |
Full visual | Generates a full before-and-after table view of the documents. |
Short visual | Shows only the differences between the before-and-after views. |
Expert (diff) | Displays the raw output of the GNU diff command that is used to compare the files, using the supplied command line options. |
Filtering Script | Lets you paste your filtering script in the field provided. |
Save Changes | Saves the filtering script. |
Check Syntax | Lets you do a quick syntax check of your script by running the initialization, filtering, and termination scripts. It does not update and save your script. All Perl compiler errors and warnings, and all STDERR output are printed. Before the effects of the script are visible to customers, you must rebuild your site index. |
Click Test to test against the filtering scripts and URL masks.
Clicking Test does not update and save your filtering script.
In the Filtering Script field, paste your script.
(Optional) Click Check Syntax to perform a quick syntax check of your script by running the filtering, initialization, and termination scripts.
Check Syntax does not update and save your script.
Click Save Changes.
(Optional) Rebuild your staged site index if you want to preview the results.
(Optional) On the Filtering Script page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Initialization Script to change the content of a Web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document’s URL, MIME type, and existing content. The initialization script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the initialization script with a filtering script, termination script, URL masks script, and test URL.
The initialization script is run once before indexing begins. Use this script to initialize any global variables and subroutines that are used by your filtering script. You can also use the initialization script to print status messages to the index log, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace “main::”. When the initialization script is started, its environment contains the standard file handles; messages printed to STDERR are written to the index log.
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log
_search_debug_log("Done processing document: " . $main::search_url);
These messages appear with the word DEBUG as a preface, and are not logged as errors.
An example of an initialization script is the following:
# My subroutine to do something.
sub my_sub_for_the_filtering_script {
    my ($param1, $param2) = @_;
    ...
}
# Initialize the document counter.
$main::doc_count = 0;
See Global Variables
The initialization script is a Perl script that is run once before any documents are indexed.
You use the initialization script in conjunction with a filtering script, termination script, and URL masks script.
Be sure that you rebuild your site index so that the results of your initialization script are visible to your customers.
See Configuring an incremental index of a staged website.
To add an initialization script
On the product menu, click Settings > Filtering > Initialization Script.
(Optional) On the Initialization Script page, in the Test URL field, enter the URL of a document on your website.
Click a testing option to see changes to the raw HTML text.
See the filtering options table under Adding a filtering script.
Click Test to test against the filtering scripts and URL masks.
Clicking Test does not update and save your initialization script.
In the Initialization Script field, paste your script.
(Optional) Click Check Syntax to perform a quick syntax check of your script by running the filtering, initialization, and termination scripts.
Check Syntax does not update and save your script.
Click Save Changes.
(Optional) Rebuild your staged site index if you want to preview the results.
(Optional) On the Initialization Script page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
You can use Termination Script to change the content of a Web document before it is indexed.
You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document’s URL, MIME type, and existing content. The termination script is a Perl script, which provides powerful string handling and the flexibility of regular expression matching. You use the termination script with an initialization script, filtering script, URL masks script, and test URL.
The termination script is run once after all the documents are indexed. You can also use the termination script to print status messages to the index log, either by printing the messages to STDERR or by calling the _search_debug_log() subroutine.
You can use local variables, global variables, or both in these scripts. All global variables are prefaced with the namespace “main::”. When the termination script is started, its environment contains the standard file handles; messages printed to STDERR are written to the index log.
Additionally, you can write custom messages to the index log using the _search_debug_log() subroutine, as in the following example:
# Log information to the Index Log
_search_debug_log("Done processing document: " . $main::search_url);
These messages appear with the word DEBUG as a preface, and are not logged as errors.
To display the number of documents that were processed by your filtering script as an error line in the index log, you can use the following termination script:
# Print the value of the document counter.
print STDERR "Total docs: $main::doc_count\n";
# Or, using the log subroutine:
_search_debug_log("Total docs: " . $main::doc_count);
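To show how the three scripts share the `$main::doc_count` global, the sketch below simulates one index run outside the product. The `run_*` subroutine names are hypothetical wrappers around what each script would contain, and the `_search_debug_log` defined here is only a stand-in that collects messages in an array; the exact layout of a DEBUG line in the real index log is an assumption.

```perl
use strict;
use warnings;

# Stand-in for the product's _search_debug_log(); it only collects messages
# in an array so that the sketch runs outside the indexer.
my @log;
sub _search_debug_log { push @log, "DEBUG $_[0]"; return; }

# Initialization script: runs once before indexing begins.
sub run_initialization { $main::doc_count = 0; return; }

# Filtering script: runs once per document; here it only counts HTML documents.
sub run_filtering {
    my ($content_type) = @_;
    $main::doc_count++ if $content_type =~ m{^text/html};
    return;
}

# Termination script: runs once after all documents are indexed.
sub run_termination {
    _search_debug_log("Total docs: " . $main::doc_count);
    return;
}

# Simulate an index run over three documents.
run_initialization();
run_filtering($_) for ('text/html', 'application/pdf', 'text/html; charset=utf-8');
run_termination();
print "$_\n" for @log;
```

The counter survives between the per-document runs only because it is a `main::` global; a variable declared with `my` inside the filtering script would not.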
See Global Variables
The termination script is a Perl script that is run once after all documents are indexed.
You use the termination script in conjunction with a filtering script, initialization script, and URL masks script.
Be sure that you rebuild your site index so that the results of your termination script are visible to your customers.
See Configuring an incremental index of a staged website.
To add a termination script
On the product menu, click Settings > Filtering > Termination Script.
(Optional) On the Termination Script page, in the Test URL field, enter the URL of a document on your website.
Click a testing option to see changes to the raw HTML text.
See the table of filtering options under Adding a filtering script.
Click Test to test against the filtering scripts and URL masks.
Clicking Test does not update and save your termination script.
In the Termination Script field, paste your script.
(Optional) Click Check Syntax to perform a quick syntax check of your script by running the initialization, filtering, and termination scripts.
Check Syntax does not update and save your script.
Click Save Changes.
(Optional) Rebuild your staged site index if you want to preview the results.
(Optional) On the Termination Script page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
With filtering, you can change the content of a web document before it is indexed. You can insert HTML tags, remove irrelevant content, and even create new HTML metadata based on a document’s URL, MIME type, and existing content. The URL masks script is a Perl script that provides powerful string handling and the flexibility of regular expression matching.
To change the contents of documents that exist only in a specific portion of your website, you can specify include URL masks, exclude URL masks, or both, to define the appropriate pages.
If you want to change only the documents under "https://www.mysite.com/faqs/", you can use the following set of masks:
include https://www.mysite.com/faqs/
exclude *
You can also use a regular expression in a URL masks script, as in the following example:
include regexp ^https://www\.mysite\.com.*/faqs/.*$
exclude *
See Regular Expressions.
Scripted URL masks are considered in the order that you entered them in the URL Masks field. When a document URL matches a mask, that document is included or excluded based on the type of mask. If a document’s URL does not match any URL mask, the document is included only if its MIME type is “text/html”. All other MIME types are excluded.
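The ordering rule above (the first matching mask decides, with a MIME-type fallback) can be sketched as follows. This is an illustration of the rule, not the product's implementation. The subroutine names `parse_masks` and `is_included` are hypothetical, and the sketch assumes that a plain mask matches as a URL prefix in which `*` matches any text.

```perl
use strict;
use warnings;

# Parse mask lines of the form "include <pattern>", "exclude <pattern>", or
# "include regexp <regex>" into [type, compiled regex] pairs.
# Assumption: a plain pattern matches as a URL prefix and "*" matches anything.
sub parse_masks {
    my (@lines) = @_;
    my @masks;
    for my $line (@lines) {
        my ($type, $is_regexp, $pattern) =
            $line =~ /^\s*(include|exclude)\s+(regexp\s+)?(\S+)\s*$/ or next;
        my $re = $is_regexp
            ? qr/$pattern/
            : do {
                # Quote the pattern literally, then turn each "*" into ".*".
                (my $quoted = quotemeta $pattern) =~ s/\\\*/.*/g;
                qr/^$quoted/;
            };
        push @masks, [ $type, $re ];
    }
    return @masks;
}

# The first matching mask decides. With no match, only text/html is included.
sub is_included {
    my ($url, $mime_type, @masks) = @_;
    for my $mask (@masks) {
        my ($type, $re) = @$mask;
        return $type eq 'include' ? 1 : 0 if $url =~ $re;
    }
    return $mime_type eq 'text/html' ? 1 : 0;
}

my @masks = parse_masks(
    'include https://www.mysite.com/faqs/',
    'exclude *',
);
for my $url ('https://www.mysite.com/faqs/a.html', 'https://www.mysite.com/about.html') {
    printf "%s => %s\n", $url, is_included($url, 'text/html', @masks) ? 'filtered' : 'skipped';
}
```

Because evaluation stops at the first match, placing `exclude *` before the include mask would exclude every document.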
Specify URL include masks and exclude masks to change the contents of documents that exist only in a specific portion of your website.
Before the effects of the URL Masks settings are visible to visitors, rebuild your site index.
To add a URL mask script
On the product menu, click Settings > Filtering > URL Masks.
(Optional) On the URL Masks page, in the Test URL field, enter a URL of a document on your website, and then click Test to test the URL against the filtering scripts and masks.
The test URL document is downloaded and used as the STDIN input to the filtering script. Then the filtering, initialization, and termination scripts are run. If there is any STDOUT output from the filtering script, that output is displayed in a new browser window.
Clicking Test does not update and save your script.
In the URL Masks field, enter one URL mask per line.
(Optional) Click Check Syntax to perform a quick syntax check of your URL masks by running the filtering, initialization, and termination scripts.
Check Syntax does not update and save your script.
Click Save Changes.
(Optional) Rebuild your staged site index if you want to preview the results.
(Optional) On the URL Masks page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.
Use Content Types to select which content types you want filtered for this account.
The text found within the selected content types is converted to HTML and then processed using the script that is specified in Filtering Script.
The Content Types that you can select from include the following:
Before the effects of the Content Types settings or changes to the settings are visible to customers, you must rebuild your site index.
Select the content types that you want to pass to the script that is specified in Filtering Script.
To select the content types that are filtered
On the product menu, click Settings > Filtering > Content Types.
On the Content Types page, check the content types that you want to pass to the filtering script.
Click Save Changes.
(Optional) Rebuild your staged site index if you want to preview the results.
(Optional) On the Content Types page, do any of the following:
Click History to revert any changes that you have made.
Click Live.
Click Push Live.