mnoGoSearch has a built-in
parser for MP3 files. It can extract
the Album,
the Artist,
the Song as well as
the Year MP3 tags from an MP3 file.
You can create a full-featured MP3 search
engine using mnoGoSearch.
To activate indexing of MP3 tags, use
the CheckMP3
and CheckMP3Only
commands in indexer.conf, and
enable processing of the MP3 sections (they are disabled
by default). This is an example of an indexer.conf
file with MP3-related commands:
Section MP3.Song 21 128
Section MP3.Album 22 128
Section MP3.Artist 23 128
Section MP3.Year 24 128
CheckMP3 *.mp3
Hrefonly *
With the above configuration,
indexer will
check all
*.mp3 files for MP3 tags,
and will collect new links from other file types without
indexing.
When you use the CheckMP3
command, indexer downloads only
128 bytes from the files with the given
extension(s) to detect and parse MP3 tags.
Note:
indexer downloads MP3 files
efficiently from FTP servers, as well as from HTTP
servers supporting HTTP/1.1 protocol
with the Range request header,
to request partial content. Old HTTP servers
not supporting the Range HTTP header
may not work well together with mnoGoSearch.
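The 128-byte figure matches the size of an ID3v1 tag, which is stored at the
very end of an MP3 file. The source does not say which tag format indexer
parses, so the following Python sketch is only an illustration under the
assumption that the tags are ID3v1; the function name parse_id3v1 is
hypothetical and is not part of mnoGoSearch.

```python
def parse_id3v1(tail):
    """Parse a 128-byte ID3v1 block (assumed to be the last 128 bytes of a file).

    Returns a dict keyed by the MP3 section names used in this chapter,
    or None when the block does not carry an ID3v1 tag.
    """
    if len(tail) != 128 or tail[:3] != b"TAG":
        return None

    def text(chunk):
        # ID3v1 fields are fixed-width and padded with NUL bytes or spaces.
        return chunk.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "Song": text(tail[3:33]),     # 30 bytes: title
        "Artist": text(tail[33:63]),  # 30 bytes: artist
        "Album": text(tail[63:93]),   # 30 bytes: album
        "Year": text(tail[93:97]),    # 4 bytes: year
    }
```

A block that does not start with the "TAG" marker yields None, which
corresponds to a file without tags.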
If you want to restrict searches by Author,
Album, Song or
Year, you can use the standard
mnoGoSearch ways to restrict
searches described in the Section called Changing weights of the different document parts at search time in Chapter 10 and
the Section called Restricting search words to a section
in Chapter 10.
For example,
if you want to restrict search by song and author name,
you use the standard mnoGoSearch way
to specify sections: Song: help Author:Beatles.
With the default sections given in indexer.conf-dist,
you may find it useful to add this HTML form element to
search.htm to restrict the search area:
Search in:
<SELECT NAME="wf">
<OPTION VALUE="111100000000000000000000" SELECTED="$(wf)">All MP3 sections</OPTION>
<OPTION VALUE="000100000000000000000000" SELECTED="$(wf)">MP3 Song name</OPTION>
<OPTION VALUE="001000000000000000000000" SELECTED="$(wf)">MP3 Album</OPTION>
<OPTION VALUE="010000000000000000000000" SELECTED="$(wf)">MP3 Artist</OPTION>
<OPTION VALUE="100000000000000000000000" SELECTED="$(wf)">MP3 Year</OPTION>
</SELECT>
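The option values above follow a visible pattern: with sections 21 to 24,
the character for section N sits N positions from the right end of a
24-character string. The Python sketch below reproduces that pattern. It is
inferred from the example values only; the helper name wf_mask and the
default width of 24 are assumptions made for illustration.

```python
def wf_mask(sections, weight="1", width=24):
    """Build a wf string in which only the given section numbers are enabled.

    Position rule (inferred from the form above): section N is the N-th
    character counted from the right.
    """
    digits = ["0"] * width
    for number in sections:
        if not 1 <= number <= width:
            raise ValueError("section number out of range: %r" % number)
        digits[width - number] = weight
    return "".join(digits)
```

For example, wf_mask([21]) gives the "MP3 Song name" value and
wf_mask([21, 22, 23, 24]) gives the "All MP3 sections" value.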
mnoGoSearch can index
SQL tables with long text
columns with the help of the so-called
htdb:/ virtual URL scheme.
Using the htdb:/ virtual scheme,
you can build a full-text index for your SQL
tables as well as index your database driven Web servers.
Note: The table you want to index with HTDB
must have a PRIMARY KEY or a
UNIQUE INDEX.
HTDB is implemented using the following
indexer.conf commands:
HTDBAddr, HTDBList,
HTDBLimit, HTDBDoc.
The purpose of the HTDBAddr command
is to specify a database connection string. It uses the same
syntax as DBAddr. If no HTDBAddr
command is specified, the data will be fetched using the
connection specified in DBAddr.
The HTDBList command is used to specify
an SQL query which generates a list of documents
using either absolute or relative URL
notation, for example:
HTDBList "SELECT CONCAT('htdb:/',id) FROM messages"
or
HTDBList "SELECT id FROM messages"
Note:
HTDBList can also return
non-htdb URLs.
This gives another way to use HTDB:
you can store a list of "real URLs"
(e.g. HTTP-style URLs)
in the database and fetch them with the help of HTDB.
HTDBList "SELECT url FROM mytable"
Server urllist htdb:/
Realm page *
The SQL query given in
HTDBList is used for all
documents that have the '/' sign
at the end of their URL. This query
is analogous to a file system directory listing.
The HTDBLimit command is
used to specify the maximum number of records fetched
by a single SELECT query given in the
HTDBList command.
HTDBLimit helps to reduce
memory consumption when indexing large SQL
tables. For example:
HTDBLimit 512
The HTDBDoc command specifies an
SQL query to get a single document
from the database using its PRIMARY KEY
value. The HTDBDoc query is
executed for all HTDB documents that do not have
the '/' at the end of their URL.
An SQL query given in the
HTDBDoc command
must return a single row result.
If the HTDBDoc query
returns an empty set or multiple records,
the HTDB retrieval system generates
an HTTP 404 Not Found response.
This can happen at re-indexing time if the record
was deleted from the table since the last re-indexing.
You can use
HoldBadHrefs 0
to remove the deleted records from the mnoGoSearch
tables as well.
mnoGoSearch understands
three types of HTDBDoc SQL
queries.
A single-column result with a fully
formatted HTTP response,
including standard HTTP
response status line. See the Section called HTTP response codes mnoGoSearch understands in Chapter 3
to learn how indexer handles
various HTTP status codes.
A HTDBDoc SQL
query can also optionally include HTTP
headers understood by indexer,
such as Content-Type,
Last-Modified,
Content-Encoding and other headers.
So you can build a very flexible indexing system by returning
different HTTP status codes and headers.
Example:
HTDBDoc "SELECT CONCAT(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"
A multiple-column result, with the status line
starting with the "HTTP/"
substring at the beginning of the first column.
All columns are concatenated using the
Carriage-Return + New-Line
(\r\n) delimiters to generate
an HTTP-like response.
The first column that returns an empty string is
treated as the delimiter between the headers
and the content part of the HTTP
response, and is replaced with "\r\n\r\n".
This type of query is a simpler alternative to the
previous type. It avoids the need for concatenation
operators and functions, and for the "\r\n"
header delimiters.
Example:
HTDBDoc "SELECT 'HTTP/1.0 200 OK','Content-type: text/plain','',msg \
FROM messages WHERE id='$1'"
A single- or a multiple-column result without the
"HTTP/" header.
This is the simplest HTDBDoc
response type. The SQL column names
returned by the query are associated with the
Section names configured
in indexer.conf.
Example:
Section body 1 256
Section title 2 256
HTDBDoc "SELECT title, body FROM messages WHERE id='$1'"
In this example, the values of the columns
title and body
are associated with the sections
title and body
respectively.
The columns with the names status
and last_mod_time have a special
meaning - the HTTP status code,
and the document modification time respectively.
Status should be an integer code according
to HTTP notation,
and the modification time should be in Unix timestamp format,
that is the number of seconds since
January 1, 1970.
Example:
HTDBDoc "SELECT title, body, \
CASE WHEN messages.deleted THEN 404 ELSE 200 END as status,\
timestamp as last_mod_time FROM messages WHERE id='$1'"
The above example demonstrates how to use the special columns.
The SQL query will return
status "404 Not found" for
all documents marked as deleted, which will
make indexer
remove these documents from the search database
when re-indexing the data. Also, this query
makes indexer use
the column timestamp
as the document modification time.
If a column contains data in HTML format,
you can specify the html keyword in
the corresponding Section command,
which will make indexer apply
the HTML parser to this column and
therefore remove all HTML tags and comments:
Example:
Section title 1 256
Section wiki_text 2 16000 html
HTDBDoc "SELECT title, wiki_text FROM messages WHERE id='$1'"
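The three HTDBDoc result types described above can be told apart by the
number of columns and by whether the first column starts with "HTTP/".
The Python sketch below is one reading of that description, not the actual
indexer code; the function name interpret_result is hypothetical, and the
special status and last_mod_time columns are left as ordinary entries.

```python
CRLF = "\r\n"

def interpret_result(names, values):
    """Classify one HTDBDoc result row and normalize it.

    names  -- column names returned by the query
    values -- column values of the single result row, as strings
    Returns ("http", response_text) or ("sections", {name: value}).
    """
    first = values[0]
    if first.startswith("HTTP/"):
        if len(values) == 1:
            # Type 1: the single column already holds a full HTTP response.
            return ("http", first)
        # Type 2: the first empty column separates headers from content.
        if "" in values:
            split = values.index("")
            head, body = values[:split], values[split + 1:]
        else:
            head, body = values, []
        return ("http", CRLF.join(head) + CRLF + CRLF + CRLF.join(body))
    # Type 3: column names are matched against Section names.
    return ("sections", dict(zip(names, values)))
```

With the multiple-column example query, the row
('HTTP/1.0 200 OK', 'Content-type: text/plain', '', msg) becomes a status
line, one header, a blank line and the message text.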
The path parts
of a URL can be passed as
parameters to the HTDBList and
HTDBDoc SQL queries.
The parts are referred to as $1,
$2, ... $N, where
the number represents the N-th path part,
that is the part of the URL after
the N-th slash sign:
htdb:/part1/part2/part3/part4/part5
      $1    $2    $3    $4    $5
For example, you have this indexer.conf command:
HTDBList "SELECT id FROM catalog WHERE category='$1'"
When mnoGoSearch prepares to fetch
a document with the URL htdb:/cars/,
$1 will be replaced with "cars":
SELECT id FROM catalog WHERE category='cars'
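The substitution step can be sketched in Python as follows. This is an
illustration of the rule stated above, not the indexer implementation: the
function names are hypothetical, and the sketch performs no SQL escaping,
which real code handling untrusted values would need.

```python
import re

def path_parts(url):
    """Split an htdb:/ URL into its path parts ($1 is the first element)."""
    if not url.startswith("htdb:/"):
        raise ValueError("not an htdb:/ URL: %r" % url)
    return url[len("htdb:/"):].split("/")

def substitute(query, url):
    """Replace $1, $2, ... $N in query with the matching path parts of url."""
    parts = path_parts(url)

    def replace(match):
        index = int(match.group(1))
        # A placeholder without a matching path part becomes an empty string.
        return parts[index - 1] if 1 <= index <= len(parts) else ""

    return re.sub(r"\$(\d+)", replace, query)
```

Matching the whole number after the dollar sign keeps $10 from being read
as $1 followed by a zero.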
You can use long URLs to
pass multiple parameters into both
HTDBList and
HTDBDoc queries.
For example:
HTDBList "SELECT column4 FROM table WHERE column1='$1' AND column2='$2' AND column3='$3'"
HTDBDoc "SELECT title, body FROM table WHERE column1='$1' AND column2='$2' AND column3='$3' AND column4='$4'"
Server htdb:/path1/path2/path3/
Using multiple parameters helps to refer
to a certain record using parts of
a compound
PRIMARY KEY
or
UNIQUE INDEX.
It's possible to index multiple HTDB sources
using multiple HTDBList,
HTDBDoc and Server
commands in the same indexer.conf.
Section body 1 256
Section title 2 256
HTDBList "SELECT id FROM t1"
HTDBDoc "SELECT title, body FROM t1 WHERE id=$2"
Server htdb:/t1/
HTDBList "SELECT id FROM t2"
HTDBDoc "SELECT title, body FROM t2 WHERE id=$2"
Server htdb:/t2/
HTDBList "SELECT id FROM t3"
HTDBDoc "SELECT title, body FROM t3 WHERE id=$2"
Server htdb:/t3/
With the help of the htdb:/ scheme
you can quickly create a full-text index and use it
further in your SQL application.
Imagine you have a large SQL
table which stores Web board messages in plain text format,
and you want to add search functionality to your Web board.
Say, the messages are stored in the table messages
with two columns id
and msg, where id
is an integer PRIMARY KEY and
msg
is a long text column containing messages.
Using a usual SQL LIKE
search may take a very long time to return a result:
SELECT id, msg FROM messages WHERE msg LIKE '%someword%'
With the help of the htdb:/ scheme provided by
mnoGoSearch you can create
a full-text index on the table messages.
To do so,
edit your indexer.conf as follows:
DBAddr mysql://foo:bar@localhost/mnogosearch/?dbmode=single
Section msg 1 256
HTDBAddr mysql://foofoo:barbar@localhost/database/
HTDBList "SELECT id FROM messages"
HTDBDoc "SELECT msg FROM messages WHERE id='$1'"
Server htdb:/
When started, indexer will insert
the URL htdb:/
into the database and will execute the SQL
query given in HTDBList, which
will produce the values 1,
2, 3,..., N
in the result. The values will be interpreted as links relative
to htdb:/. A list of new URLs
in the form htdb:/1, htdb:/2,
..., htdb:/N will be added into the database.
Then the HTDBDoc SQL
query will be executed for every added URL.
HTDBDoc will return the column
msg as the document content, which will be associated
with the section msg and parsed.
Word information will be stored in the table dict
(assuming the single storage mode).
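The crawl sequence described above can be imitated in a few lines of Python.
This is a rough simulation for illustration only: it uses an in-memory
SQLite table in place of MySQL, the function names are hypothetical, and it
stops at fetching the document text (no parsing or word storage).

```python
import sqlite3

def crawl_htdb(conn, list_sql, doc_sql, base="htdb:/"):
    """Follow the HTDBList and HTDBDoc steps against an open connection."""
    # Step 1: the list query yields values that act as links relative to base.
    urls = [base + str(row[0]) for row in conn.execute(list_sql)]
    documents = {}
    for url in urls:
        # Step 2: the first path part of each URL replaces $1 in the doc query.
        key = url[len(base):]
        rows = conn.execute(doc_sql.replace("$1", key)).fetchall()
        # Anything other than exactly one row counts as "not found" (None).
        documents[url] = rows[0][0] if len(rows) == 1 else None
    return documents

def demo_connection():
    """Create a tiny stand-in for the messages table used in this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.executemany("INSERT INTO messages VALUES (?, ?)",
                     [(1, "first message"), (2, "second message")])
    return conn
```

Calling crawl_htdb with the HTDBList and HTDBDoc queries from the
configuration above maps htdb:/1 and htdb:/2 to the two message texts.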
After indexing is done, you can use mnoGoSearch
tables to perform search:
SELECT url.url
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word='someword';
The table dict has an index
on the column word, so the above
query will be executed much faster than the queries
using the LIKE operator on the
table messages.
You can also use multiple words in search:
SELECT url.url, count(*) as c
FROM url,dict
WHERE dict.url_id=url.rec_id
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;
Both queries will return htdb:/XXX
values from the url.url field.
Then your application can cut the "htdb:/"
prefix from the returned values to get the
PRIMARY KEY values from the table
messages.
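As a sketch of that last step, the Python function below runs the
multiple-word query and strips the prefix. It assumes a DB-API connection
with "?" placeholders and a much simplified two-table layout (only the
columns named in the queries above); the names search_ids and demo_db are
hypothetical.

```python
import sqlite3

def search_ids(conn, words, prefix="htdb:/"):
    """Return PRIMARY KEY values of matching messages, best match first."""
    marks = ",".join("?" * len(words))
    sql = ("SELECT url.url, count(*) AS c FROM url, dict "
           "WHERE dict.url_id = url.rec_id AND dict.word IN (%s) "
           "GROUP BY url.url ORDER BY c DESC" % marks)
    # Cut the "htdb:/" prefix to recover the key from the messages table.
    return [int(url[len(prefix):]) for url, _count in conn.execute(sql, words)]

def demo_db():
    """Build a minimal in-memory imitation of the url and dict tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute("CREATE TABLE dict (url_id INTEGER, word TEXT)")
    conn.executemany("INSERT INTO url VALUES (?, ?)",
                     [(1, "htdb:/1"), (2, "htdb:/2")])
    conn.executemany("INSERT INTO dict VALUES (?, ?)",
                     [(1, "some"), (2, "some"), (2, "word")])
    return conn
```

The returned keys can then be used in an ordinary SELECT on the messages
table to display the matching rows.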
You can also use HTDB to
index your database driven Web server. It lets you
index your documents without having to invoke
the Web server at indexing time,
which should require less CPU
resources than direct HTTP
indexing and should therefore reduce the load on the Web
server machine.
The main idea of indexing a database driven Web
server is to map HTTP requests
into HTDB requests at indexing time.
So indexer will fetch the
source data directly from the SQL
database, meanwhile search.cgi
will return real URLs in usual
HTTP notation.
This can be achieved using the aliasing mechanisms
provided by mnoGoSearch.
Take a look at the sample file
doc/samples/htdb.conf,
which is included in the
mnoGoSearch source distribution.
It is the indexer.conf file used
to index the Web board at the
mnoGoSearch site.
The HTDBList command
generates URLs in the form:
http://www.mnogosearch.org/board/message.php?id=XXX
where XXX is
a PRIMARY KEY value
from the table messages.
For every PRIMARY KEY value
a fully formatted HTTP
response is generated, containing a text/html
document with headers and this content:
<HTML>
<HEAD>
<TITLE>Subject goes here</TITLE>
<META NAME="Description" Content="Author name goes here">
</HEAD>
<BODY>
Message text goes here
</BODY>
At the end of doc/samples/htdb.conf
you can find these commands:
Server htdb:/
Realm http://www.mnogosearch.org/board/message.php?id=*
Alias http://www.mnogosearch.org/board/message.php?id= htdb:/
The first command tells indexer to execute
the HTDBList query,
which generates a list of messages in the form:
http://www.mnogosearch.org/board/message.php?id=XXX
The second command tells indexer
to allow messages matching the given
pattern using string match with the '*'
wildcard at the end.
The third command replaces the substring
http://www.mnogosearch.org/board/message.php?id=
in the URL with
htdb:/ before a message is downloaded,
which forces indexer to
use the SQL table as the data source
for a document instead of sending an HTTP
request to the Web server.
After indexing is done, search.cgi
will display search result using the usual HTTP
notation, for example:
http://www.mnogosearch.org/board/message.php?id=1000
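The effect of the Alias command in this setup is a prefix replacement that
happens only when a document is fetched. The Python sketch below shows the
idea; the function name apply_alias is hypothetical, and taking the first
matching prefix is an assumption about how several Alias commands interact.

```python
def apply_alias(url, aliases):
    """Rewrite url using the first (prefix, replacement) pair that matches."""
    for prefix, replacement in aliases:
        if url.startswith(prefix):
            return replacement + url[len(prefix):]
    # No alias applies: the document is fetched from its original URL.
    return url

BOARD_ALIASES = [
    ("http://www.mnogosearch.org/board/message.php?id=", "htdb:/"),
]
```

The URL stored in the search database stays in HTTP notation; only the
fetch address changes, for example to htdb:/1000 for message 1000.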
mnoGoSearch offers the special
virtual URL methods
exec:/ and cgi:/.
These methods allow using the output of an external program
as a source for indexing. mnoGoSearch
can work with any executable program that writes its results
to STDOUT. The result must conform to the
HTTP standard and contain full HTTP response headers
(including the HTTP status line and at least the Content-Type
HTTP response header) followed by the document content.
For example, when indexing both
cgi:/usr/local/bin/myprog and
exec:/usr/local/bin/myprog,
indexer will execute
the /usr/local/bin/myprog program.
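A program suitable for these schemes only has to print a status line, a
Content-Type header, a blank line and the content. The following Python
script is a minimal sketch of such a program; the helper name
build_response and the sample text are made up for illustration.

```python
import sys

def build_response(body, content_type="text/plain"):
    """Return a complete HTTP-style response: status line, header, content."""
    return ("HTTP/1.0 200 OK\r\n"
            "Content-Type: %s\r\n"
            "\r\n"
            "%s" % (content_type, body))

if __name__ == "__main__":
    # indexer reads whatever the program writes to STDOUT.
    sys.stdout.write(build_response("Hello from an external program\n"))
```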
When executing a program given in a cgi:/ URL,
indexer emulates the environment
the program would have when run under an HTTP server. It
creates the REQUEST_METHOD=GET environment variable,
and the QUERY_STRING variable according to the HTTP
standards. For example, if
cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e
is being indexed, indexer creates
QUERY_STRING with
the value a=b&d=e. The cgi:/ virtual
URL scheme allows indexing your site without having to invoke a web
server even if you want to index CGI scripts. For example, you have
a web site with static documents under
/usr/local/apache/htdocs/ and with CGI scripts
under
/usr/local/apache/cgi-bin/. You can use the following
configuration:
Server http://localhost/
Alias http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias http://localhost/ file:///usr/local/apache/htdocs/
In the case of an exec:/ URL, indexer
does not create the QUERY_STRING variable; instead,
it passes all parameters on the command line. For example, when indexing
exec:/usr/local/bin/myprog?a=b&d=e, this
command will be executed:
/usr/local/bin/myprog "a=b&d=e"
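The difference between the two schemes can be summarized in a short Python
sketch that turns a URL into a command line and a set of environment
variables. It only restates the behaviour described above; the function
name plan_invocation is hypothetical, and any further variables indexer may
set are not covered.

```python
def plan_invocation(url):
    """Return (argv, extra_env) for a cgi:/ or exec:/ URL."""
    scheme, _, rest = url.partition(":")
    path, _, query = rest.partition("?")
    if scheme == "cgi":
        # CGI style: parameters travel in environment variables.
        return [path], {"REQUEST_METHOD": "GET", "QUERY_STRING": query}
    if scheme == "exec":
        # exec style: the whole query string becomes one argument.
        return [path] + ([query] if query else []), {}
    raise ValueError("unsupported scheme: %r" % scheme)
```

Passing the query as a single list element corresponds to the quoted
"a=b&d=e" argument in the command shown above.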
The exec:/ virtual scheme can be used as an
external retrieval system. It allows using protocols which are not
supported natively by mnoGoSearch.
For example, you can use the curl program, which is available
from http://curl.haxx.se/,
to index HTTPS sites when mnoGoSearch
is compiled without built-in HTTPS support.
Put this short script into
/usr/local/mnogosearch/etc/ under
the name curl.sh.
#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null
This script takes a URL given as a command line parameter
and executes curl to download the given URL.
The -i argument tells curl
to output the result together with the HTTP response headers.
Add these commands into indexer.conf:
Server https://some.https.site/
Alias https:// exec:/usr/local/mnogosearch/etc/curl.sh?https://
When indexing
https://some.https.site/path/to/page.html,
indexer will translate this URL to
exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html
then execute the curl.sh script:
/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"
and load its output for indexing.
Note:
indexer loads up to
MaxDocSize bytes
when executing an exec:/ or
cgi:/ URL.
mnoGoSearch supports some mirroring functionality.
To enable mirroring, specify the path where indexer
will create the mirrors of your sites using the
MirrorRoot command. For example:
MirrorRoot /path/to/mirror
You can also configure indexer to store HTTP headers on disk.
This can be helpful if you want to use the local mirror for quick
reindexing of the remote site. Use the MirrorHeadersRoot command
to activate storing the HTTP headers. For example:
MirrorHeadersRoot /path/to/headers
Note: indexer
does not download more than MaxDocSize
bytes of any document. If a document is larger,
it will be only partially downloaded. Make sure
that MaxDocSize is large enough if you
want to use the mirror created by indexer as a real
site mirror.
mnoGoSearch can use a previously created
mirror as a crawler cache. It can be useful when you do experiments
with mnoGoSearch to find the best
configuration: you modify your indexer.conf,
then clear the database and index the same sites again.
To reduce Internet traffic you can activate loading documents
from the mirror using the MirrorPeriod command.
For example:
MirrorPeriod 2h
MirrorPeriod specifies the period of time
during which indexer considers the local mirrored copy
of a document valid. If indexer finds that
the local mirrored copy is fresh enough, it will not download
the same document again and will use the local copy instead.
If the local copy is older than MirrorPeriod allows,
then indexer will download the document
from its original location again, and update the locally mirrored copy.
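The freshness rule amounts to comparing the age of the mirrored copy with
the configured period. The Python sketch below illustrates that comparison;
the function name choose_source is hypothetical, and treating a copy whose
age equals the period exactly as still fresh is an assumption.

```python
def choose_source(copy_mtime, now, period_seconds):
    """Decide where a document should be loaded from.

    copy_mtime     -- modification time of the mirrored copy (Unix timestamp)
    now            -- current time (Unix timestamp)
    period_seconds -- MirrorPeriod expressed in seconds
    """
    age = now - copy_mtime
    # A fresh copy is reused; a stale one triggers a new download,
    # after which the mirrored copy is updated.
    return "mirror" if age <= period_seconds else "network"
```

With MirrorPeriod 2h (7200 seconds), a copy saved one hour ago is reused,
while a copy saved three hours ago is downloaded again.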
If MirrorHeadersRoot is not specified
and therefore the original HTTP headers are not available,
then indexer will detect Content-Type
of a document using the AddType commands.
The parameter MirrorPeriod
should be in the form: xxxA[yyyB[zzzC]], where
xxx, yyy,
zzz are numbers (can be negative!).
Spaces are allowed between xxx and
A, between A and yyy, and so on.
A, B,
C can be one of the following:
s - second
M - minute
h - hour
d - day
m - month
y - year
Note: The letters are similar to the
descriptors understood by the
strptime()
and strftime() C functions.
Examples:
15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and 6 months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second
If you specify only a number without any characters,
it is assumed that the time is given in seconds.
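The period syntax can be illustrated with a small Python parser. It is a
sketch of the format described above rather than the code used by indexer,
and it makes two assumptions that the text does not state: a month is
counted as 30 days and a year as 365 days. The name parse_period is
hypothetical.

```python
import re

# Seconds per unit; the month and year lengths are assumptions.
UNIT_SECONDS = {
    "s": 1,
    "M": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "m": 30 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}

TOKEN = re.compile(r"\s*([+-]?\d+)\s*([sMhdmy])\s*")

def parse_period(text):
    """Convert a period such as '1h-10M+1s' into a number of seconds."""
    text = text.strip()
    if not text:
        raise ValueError("empty period")
    if re.fullmatch(r"[+-]?\d+", text):
        # A bare number is taken as seconds.
        return int(text)
    total, position = 0, 0
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None:
            raise ValueError("cannot parse period: %r" % text)
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        position = match.end()
    return total
```

Under these assumptions, "4h30M" evaluates to 16200 seconds and
"1h-10M+1s" to 3001 seconds.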
Note: If you start mirroring in an already existing
database, indexer will not
create the mirror immediately because of the
traffic optimization method described in
the Section called Crawling time optimization in Chapter 3.
You can run indexer -am once
to turn off the optimization, or clear the database
using indexer -C and then
run indexer without any arguments.
It is possible to dump and restore a mnoGoSearch SQL database
using standard tools supplied with the database software,
such as mysqldump or
pg_dump. This approach works fine
in case of a single SQL database.
However, if you use multiple SQL databases to store mnoGoSearch data,
or use the mnoGoSearch cluster solution and
want to re-distribute data between more SQL databases
(say, when adding a new machine to the cluster), or
want to reduce the number of separate SQL databases (say, when removing
a machine from the cluster), the standard method of dumping and restoring
SQL data will not work because of conflicts in auto-generated values
(auto_increment values, SEQUENCE
values, IDENTITY values and so on).
Starting from version 3.3.9, mnoGoSearch
includes dump and restore tools that work around this problem.
Note:
As of version 3.3.9, mnoGoSearch dump and restore
tools work only with MySQL. Support for the other databases
will be added in the future releases.
In order to create a dump of your
mnoGoSearch database, you can run:
indexer -Edumpdata > dumpfile.sql
or pipe data to
gzip:
indexer -Edumpdata | gzip > dumpfile.sql.gz
to reduce the dump size.
The dump file created by indexer -Edump
is an ordinary SQL dump file which does not include auto-generated
values. A piece of a dump file for a MySQL database
looks like this:
--seed=39
INSERT INTO url (...all columns except rec_id...) VALUES (...);
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'body','Modules Directives FAQ...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'CachedCopy','eNrtWc1v2zgWv+ev...');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Charset','utf-8');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Language','en');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'Content-Type','text/html');
INSERT INTO urlinfo (url_id,sname,sval) VALUES(last_insert_id(),'title','Apache HTTP Server Ver...');
INSERT INTO bdicti VALUES(last_insert_id(),1,0x6B6F00011EC296170000726577726974696E6700017E4D,0...');
The dump file consists of chunks of
INSERT instructions for every document.
The structure of the dump file forces
MySQL to assign a new auto-increment
value for the column
url.rec_id and use this value to insert data
into the child tables
urlinfo and
bdicti at restore time.
Additionally, every chunk contains the comment --seed=xxx, which
is used to distribute data properly between multiple databases at restore time.
By default, indexer -Edump dumps data from all databases
specified in the indexer.conf file. You can use the -D command
line argument to dump data from a certain database only. For example:
indexer -Edump -D2
will dump data from the database described by the second
DBAddr command in
indexer.conf.
To restore a search database from a dump file, use:
indexer -Esql -v2 < dumpfile.sql
or, for a
.gz file:
zcat dumpfile.sql.gz | indexer -Esql -v2
indexer will load the data back to the
SQL database.
If you have two or more
DBAddr
commands in the current
indexer.conf file,
indexer will also properly
distribute the data between the corresponding
SQL databases.