The Indexer

Indexing is a relatively slow process: its speed ranges from 30Kb to 250Kb per second, depending on index size and computer power. The indexer should not be run more often than necessary; how often to run it depends on how frequently the web site is updated. For static sites, a single run of the indexer is enough.

During the indexing process three files are created:

The indexer can also create a statistics file, stats.log, which can be processed right after the server has been indexed in order to store the information in a database.

Two indexing modes are available: HTTP mode and local drive mode.

Starting Indexer

To start the Indexer, run indexer(.exe) with the following options:

Example for Windows

C:\indexer.exe localhost
or
C:\indexer.exe --config=D:\www\search.conf disk

Example for Unix/Linux

./indexer name_of_task
or
./indexer --config=/home/www/search.conf disk

Working with 'search.conf' file

All indexer settings are stored in the 'search.conf' file. The file has the following structure:

[Job name_of_task]
[Index]
Parameter1	Value1
Parameter2	Value2
Parameter3	Value3
[Index]
Parameter1	Value1
Parameter2	Value2
Parameter3	Value3

Each action has its own parameters and values, one per line. A parameter and its value are separated by spaces or tabs.

You may use single-line comments in the configuration file. Each comment starts with the "#" symbol.
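The layout described above can be read with a few lines of code. The following Python sketch is only an illustration and not the indexer's own parser: the function name parse_search_conf and the returned structure are hypothetical, and it assumes that comments occupy whole lines and that section headers look exactly like '[Job name_of_task]' and '[Index]'.

```python
def parse_search_conf(text):
    """Parse search.conf-style text into {job_name: [{param: value}, ...]}.

    Hypothetical helper for illustration; not part of the indexer.
    """
    jobs = {}
    current_actions = None   # list of actions belonging to the current job
    current_params = None    # parameters of the current action
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments (assumption)
        if line.startswith("[") and line.endswith("]"):
            header = line[1:-1].split(None, 1)
            if header and header[0] == "Job" and len(header) == 2:
                current_actions = jobs.setdefault(header[1], [])
                current_params = None
            elif current_actions is not None:
                # Any other section is treated as an action, e.g. [Index]
                current_params = {}
                current_actions.append(current_params)
            continue
        if current_params is not None:
            # A parameter and its value are separated by spaces or tabs
            parts = line.split(None, 1)
            current_params[parts[0]] = parts[1] if len(parts) > 1 else ""
    return jobs


sample = """\
# indexing task for the local copy of the site
[Job disk]
[Index]
URL\t/pub/home/frisbee/
Extensions htm,html,shtml,shtm
MaxFiles   50
"""
print(parse_search_conf(sample))
```

In this sketch a parameter repeated inside one action overwrites the earlier value; whether the indexer behaves the same way is not stated in this manual.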


Description of parameters

URL <url>

URL	url

An address starting with 'http://...' in HTTP mode, or a local path in local drive mode.

Example:

For HTTP:
URL	http://www.novgorod.ru/frisbee/

For disk (Windows): 
URL	c:/pub/home/frisbee/

For disk (Unix): 
URL	/pub/home/frisbee/

Extensions <ext>

Extensions ext1,ext2,ext3

Sets the list of file extensions to be indexed. Can be used in local drive mode only; it is ignored in HTTP indexing mode. Extensions are separated by "," (comma).

Example:

Extensions htm,html,shtml,shtm
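As a rough illustration of how such a list can be applied to file names, here is a minimal Python sketch. The helper names are hypothetical, and the case-insensitive comparison is an assumption; this manual does not say how the indexer compares extensions.

```python
import os


def parse_extensions(value):
    """Turn 'htm,html,shtml' into a set of lower-case extensions."""
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def has_listed_extension(filename, extensions):
    """Return True if the file's extension is in the set (case-insensitive, an assumption)."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext in extensions


extensions = parse_extensions("htm,html,shtml,shtm")
files = ["index.html", "logo.gif", "news.shtml", "README"]
selected = [f for f in files if has_listed_extension(f, extensions)]
print(selected)  # ['index.html', 'news.shtml']
```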

Type <typ>

Type typ

Sets the type of the search index:

Default value: Normal

Example:

Type Strict

Path <path>

Path path

Specifies the working directory. Index files and the log file are saved to this directory.

Example:

Path c:\www\novgorod
or
Path /home/www/novgorod

CharSet <cset>

CharSet cset

Sets how the character encoding of the files to be indexed is identified. Possible values:

Example:

CharSet ByHTTPHeader

MaxFiles <num>

MaxFiles num

Sets the maximum number of files to be indexed (10000 by default). Choose the value carefully, because many servers contain huge numbers of links, for example http://news.novgorod.ru/

Example:

MaxFiles 50

Statistic <stat>

Statistic stat

Sets how reports are saved. Reports are generated at the end of the Index action. Available options:

Statistics are saved to the file stats.log.

Example:

Statistic Append

Exclude <excl>

Exclude excl1,excl2,excl3

Sets a list of words to be excluded. Addresses containing at least one of the excluded words are not added to the indexing queue. Words are separated by "," (comma).

Example:

Exclude editpost.php?,reply.php?,admin/
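The effect of this parameter can be pictured as a simple substring filter on the queue. The Python sketch below is an illustration under assumptions, not the indexer's code: the helper names are hypothetical, the example.com addresses are made up, and a plain case-sensitive substring match is assumed.

```python
def parse_exclude(value):
    """Split the comma-separated Exclude value into a list of words."""
    return [word.strip() for word in value.split(",") if word.strip()]


def is_excluded(address, excluded_words):
    """An address is skipped if it contains at least one excluded word."""
    return any(word in address for word in excluded_words)


excluded = parse_exclude("editpost.php?,reply.php?,admin/")
addresses = [
    "http://www.example.com/forum/viewtopic.php?t=1",
    "http://www.example.com/forum/reply.php?t=1",
    "http://www.example.com/admin/login.php",
]
queue = [a for a in addresses if not is_excluded(a, excluded)]
print(queue)  # ['http://www.example.com/forum/viewtopic.php?t=1']
```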

AddOption <opt>

AddOption opt

Sets the indexing method. Can be used in HTTP indexing mode only. The following values are available:

Example:

AddOption SubPages

StopWordsFile <file>

StopWordsFile file

Sets the name of the file in which stop-words are stored.

Example:

StopWordsFile stop.txt

Language <lng>

Language lng

Sets the language. If this parameter is specified, an 'Accept-Language' field is included in the HTTP header. This setting may affect document content on some sites.

Example:

Language ru

AFrom <path>

AFrom path

Sets the substring that will be replaced in the URL by the string specified in the ATo parameter.

Example:

AFrom  /home/dir/mysite/
ATo    http://search.codenet.ru/

ATo <url>

ATo url

Sets the string that replaces the AFrom substring in the URL. Used together with AFrom.

Example:

AFrom http://127.0.0.1/
ATo   http://www.codenet.ru/

or

AFrom c:/documents/www/www.codenet.ru/
ATo   http://www.codenet.ru/
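The AFrom/ATo pair amounts to a substring replacement applied to each indexed address. A minimal Python sketch follows; the function name rewrite_address is hypothetical, the sample file paths are made up, and replacing only the first occurrence is an assumption.

```python
def rewrite_address(address, a_from, a_to):
    """Replace the AFrom substring with the ATo string (first occurrence only, an assumption)."""
    return address.replace(a_from, a_to, 1)


# Local drive mode: map a directory on disk to the public site address
local = rewrite_address(
    "/home/dir/mysite/docs/index.html",
    "/home/dir/mysite/",
    "http://search.codenet.ru/",
)
print(local)  # http://search.codenet.ru/docs/index.html

# HTTP mode: map a local test server to the public host name
remote = rewrite_address(
    "http://127.0.0.1/articles/page.html",
    "http://127.0.0.1/",
    "http://www.codenet.ru/",
)
print(remote)  # http://www.codenet.ru/articles/page.html
```

This is why the two parameters are useful when a site is indexed from a local copy but search results must link to the public address.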

StartWord <word>

StartWord word

Sets the starting word. The page description is composed of the words that follow the starting word, which makes it possible to exclude menus and the like from the description. The starting word is obligatory.

Example:

StartWord about
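To make the idea concrete, the Python sketch below builds a description from the words that follow the starting word. It is an illustration only: the function name, the max_words limit, the case-insensitive match, the whitespace tokenisation, and the fallback to the first words of the document are all assumptions, not documented indexer behaviour.

```python
def build_description(text, start_word, max_words=20):
    """Compose a description from the words following start_word.

    max_words is a hypothetical limit; the manual does not state one.
    """
    words = text.split()
    lowered = [w.lower() for w in words]
    try:
        start = lowered.index(start_word.lower()) + 1
    except ValueError:
        start = 0  # starting word not found: fall back to the first words (assumption)
    return " ".join(words[start:start + max_words])


page_text = "Home News Contacts About CNSearch is a site search engine"
description = build_description(page_text, "about")
print(description)  # CNSearch is a site search engine
```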

MetaDescription <yesno>

MetaDescription yesno

Sets the page description method. The description can be displayed in search results using the special symbol %E. Available values are "Yes" and "No"; the default is "No". If "Yes" is used, the system attempts to get the description from the '<META name="description...' tag. If the tag cannot be found or the value is "No", the description is composed of the first words in the document (see StartWord).

Example:

MetaDescription Yes

MetaRobots <yesno>

MetaRobots yesno

If the parameter has the value "No", the '<META name="robots"...' tag is ignored; otherwise the tag is analysed for the presence of NOINDEX, NOFOLLOW, NONE. More details can be found in the section Use of "Robots" META-tags. The default value is "Yes".

Example:

MetaRobots No

UseRobotsTxt <yesno>

UseRobotsTxt yesno

If set to "Yes", indexing rules are taken from the 'robots.txt' file stored in the web server root directory. The default value is "No". More information about working with 'robots.txt' is available in the section robots.txt - Exclusions Standard for Robots. The robot's name is "CNSearch".

Example:

UseRobotsTxt yes

Working through proxy-server

Starting with version 0.91, the indexer can work through a proxy server. Four new directives were added: ProxyServer, ProxyPort, ProxyLogin, and ProxyPassword.


ProxyServer <serv>

ProxyServer server

Specifies the proxy server. By default the indexer connects directly. Used together with ProxyPort.

Example:

ProxyServer proxy.domain.ru

ProxyPort <port>

ProxyPort port

Sets the proxy port. Used together with ProxyServer.

Example:

ProxyPort 8080

ProxyLogin <login>

ProxyLogin login

Sets the proxy login. Used only if the proxy server requires authorization. Used together with ProxyPassword.

Example:

ProxyLogin alex

ProxyPassword <password>

ProxyPassword password

Sets the proxy password. Used only if the proxy server requires authorization. Used together with ProxyLogin.

Example:

ProxyPassword qwerty

Morphology Support (testing mode).

To distinguish between morphological forms, you need to create the file 'lang.cns' and save it in the directory where the index files are stored (or will be created). The file 'lang.cns' is not included in this distribution because of its size (16 Mb).

If file 'lang.cns' is not found, the search and indexing process will be performed without taking morphology into account.

We have developed a special utility that builds 'lang.cns' from ispell dictionaries. You can find the necessary dictionaries at http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html.

An ispell dictionary consists of two files: a list of words (lang.dict) and a set of word formation rules (lang.aff). These files may have other names in the downloaded archives, in which case you will have to rename them to 'lang.dict' and 'lang.aff'.

ATTENTION!!! If you have built the index with morphology support, you must also search with morphology support, using the same dictionary.


Stop-words.

Starting with version 1.3, CNSearch Pro can skip frequently used words (articles, pronouns, prepositions) during indexing to increase search speed and reduce the volume of information stored in the search index. These words are called 'stop-words'.

Stop-words are defined at the indexing stage, using a file that contains one stop-word per line. For example:

- file: stopwords.txt ---------------
a
an
is
the
this
-------------------------------------

The name of the file containing stop-words is given in the Indexer configuration file in the StopWordsFile option, for example:

StopWordsFile	stopwords.txt
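The one-word-per-line format is easy to consume. The Python sketch below is only an illustration of the idea, not CNSearch code: the function names are hypothetical, and the case-insensitive comparison and whitespace tokenisation are assumptions. It loads a stop-word list and splits a phrase into the words that are kept and the words that are ignored.

```python
def load_stop_words(text):
    """Read one stop-word per line, ignoring blank lines."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def split_phrase(phrase, stop_words):
    """Return (kept, ignored) word lists for a phrase."""
    kept, ignored = [], []
    for word in phrase.split():
        if word.lower() in stop_words:
            ignored.append(word)
        else:
            kept.append(word)
    return kept, ignored


# Contents of the stopwords.txt file shown above
stop_words = load_stop_words("a\nan\nis\nthe\nthis\n")
kept, ignored = split_phrase("this is the indexer manual", stop_words)
print(kept)     # ['indexer', 'manual']
print(ignored)  # ['this', 'is', 'the']
```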

To let your visitors know which words from their search phrase have been ignored, these words can be listed using the special symbol "%P", as shown in the picture:

The label "Stop Words" can be replaced with other text (for example, when translating into another language) by changing the StopWords parameter in the Frontend configuration file.
