MediaWiki search bugs

From Exterior Memory

Short Words are Never Found

Problem

MediaWiki search returns no results for words of three letters or fewer.

For example, searching for "PHP", "XML" or "IP" returns no hits, even though those words appear on one of the wiki pages.

Reason

MediaWiki uses MySQL full text search, which by default only indexes words of 4 characters or longer.
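
As a rough illustration of the effect, the following Python sketch models an index that drops words shorter than a minimum length. It is a simplified stand-in, not MySQL's actual parser: the function name indexable_words, the regular expression and the sample sentence are assumptions made for this example, and stopword handling is ignored.

```python
import re

def indexable_words(text, min_word_len=4):
    """Return the lowercased words that a length-filtered index would keep."""
    # Assumption: a word is a run of letters, digits, underscores or apostrophes.
    words = re.findall(r"[A-Za-z0-9_']+", text)
    return [w.lower() for w in words if len(w) >= min_word_len]

page = "This page covers PHP, XML and IP configuration"

# Minimum length 4: the short words never reach the index.
print(indexable_words(page))
# ['this', 'page', 'covers', 'configuration']

# Minimum length 2: "php", "xml" and "ip" become searchable.
print(indexable_words(page, 2))
# ['this', 'page', 'covers', 'php', 'xml', 'and', 'ip', 'configuration']
```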

Resolution

You can change this in the MySQL server configuration.

Open /etc/my.cnf (/etc/mysql/my.cnf on Debian) and add the following parameter to the "[mysqld]" section:

[mysqld]
ft_min_word_len=2

Then restart MySQL, for example with:

sudo /etc/init.d/mysql restart
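
If the change seems to have no effect after the restart, one thing to check is whether the line really ended up in the [mysqld] section. The sketch below does that check on sample text with Python's configparser. The helper name mysqld_option and both sample strings are made up for this example, and a real configuration file that uses !include directives would need extra handling that is not shown here.

```python
import configparser

def mysqld_option(cnf_text, option):
    """Return the value of `option` in the [mysqld] section, or None."""
    # allow_no_value: the file may contain bare flags without "=value".
    # strict=False: tolerate repeated sections or options.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    return parser.get("mysqld", option, fallback=None)

# Hypothetical file contents, not taken from a real server.
good = "[mysqld]\nft_min_word_len=2\n"
misplaced = "[client]\nft_min_word_len=2\n\n[mysqld]\nport=3306\n"

print(mysqld_option(good, "ft_min_word_len"))       # 2
print(mysqld_option(misplaced, "ft_min_word_len"))  # None
```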

More information

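The result of this change is that the index for full text searches can now also contain 2 and 3 letter long words. According to the MySQL manual, existing FULLTEXT indexes are not rebuilt automatically when this setting changes; after the restart they have to be rebuilt, for example with REPAIR TABLE searchindex QUICK.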

http://dev.mysql.com/doc/refman/4.1/en/fulltext-fine-tuning.html

Drawbacks

The ft_min_word_len setting can only be set per server, not per database, so this change also applies to every other database on the same server. The only disadvantage I can think of is that the index files grow needlessly in some cases.

I filed a feature request with MySQL to allow a per-database configuration. See: MySQL bug #12657 and MySQL bug #18695.

Duplicate Pages in Search Results

Problem

The search results of this very wiki listed the same page multiple times. E.g. searching for LaTeX returned the same pages several times, instead of each page just once.

Reason

For some reason, my searchindex table contained duplicate entries, one for each time the page was updated.

This should not be possible, as each entry in the table should be for a unique page id. Obviously, this was not the case for me.
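
To confirm that a table is in this state, it helps to count the rows per page id. The sketch below runs such a query against an in-memory SQLite table that stands in for the MySQL searchindex table. The function name duplicate_pages and the sample rows are assumptions for this example; the same GROUP BY ... HAVING query should also work in the mysql client.

```python
import sqlite3

def duplicate_pages(conn):
    """Return (si_page, row_count) for every page id stored more than once."""
    query = (
        "SELECT si_page, COUNT(*) FROM searchindex "
        "GROUP BY si_page HAVING COUNT(*) > 1 ORDER BY si_page"
    )
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
# Stand-in table WITHOUT a unique key, to mimic the broken state.
conn.execute(
    "CREATE TABLE searchindex (si_page INTEGER, si_title TEXT, si_text TEXT)"
)
conn.executemany(
    "INSERT INTO searchindex VALUES (?, ?, ?)",
    [
        (1, "latex", "first revision"),
        (1, "latex", "second revision"),
        (2, "php", "only revision"),
    ],
)

print(duplicate_pages(conn))  # [(1, 2)]
```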

Resolution

I deleted and recreated the searchindex table, and repopulated it with data.

To delete and recreate the table (the SQL can be found in the file maintenance/tables.sql):

DROP TABLE searchindex;
CREATE TABLE searchindex (
  si_page int unsigned NOT NULL,
  si_title varchar(255) NOT NULL default '',
  si_text mediumtext NOT NULL,
  UNIQUE KEY (si_page),
  FULLTEXT si_title (si_title),
  FULLTEXT si_text (si_text)
) ENGINE=MyISAM;

Reindex Search Table

The updateSearchIndex.php script reindexes recently changed pages. Unfortunately, the recent changes table is purged after 13 weeks, so you first need to repopulate it.

In LocalSettings.php, set the recentchanges expiry time to a very long period:

$wgRCMaxAge = 10*365*24*3600;  # 10 years
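
The value is a number of seconds. As a quick check of the arithmetic, this small Python sketch compares the 13 weeks mentioned above with the ten years set here. The variable names are chosen for this example, and leap days are ignored, as in the PHP expression.

```python
SECONDS_PER_DAY = 24 * 3600

thirteen_weeks = 13 * 7 * SECONDS_PER_DAY
ten_years = 10 * 365 * SECONDS_PER_DAY  # same product as 10*365*24*3600

print(thirteen_weeks)  # 7862400
print(ten_years)       # 315360000
```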

Rebuild the recentchanges table:

php maintenance/rebuildrecentchanges.php

To repopulate the search index, run the updateSearchIndex.php script:

php maintenance/updateSearchIndex.php -s 19930101000000 20930101000000

(This covers every page changed between 1993 and 2093.)
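
The two arguments are 14-digit timestamps in YYYYMMDDHHMMSS form. If you want a narrower range, a small helper such as the one sketched below can produce them from ordinary dates; the function name mw_timestamp is made up for this example.

```python
from datetime import datetime

def mw_timestamp(moment):
    """Format a datetime as a 14-digit YYYYMMDDHHMMSS string."""
    return moment.strftime("%Y%m%d%H%M%S")

start = mw_timestamp(datetime(1993, 1, 1))
end = mw_timestamp(datetime(2093, 1, 1))

print(start)  # 19930101000000
print(end)    # 20930101000000
```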

Test

To test if everything works as it should, create and then update a searchindex row (this example adds an entry with page id 99999; make sure this id does not exist in your wiki before trying this):

REPLACE INTO searchindex (si_page,si_title,si_text) VALUES (99999,'sandbox search test page','list of keywords number one');
REPLACE INTO searchindex (si_page,si_title,si_text) VALUES (99999,'sandbox search test page','list of keywords number two');

If all works as planned, only the last entry with a given si_page should be present:

mysql> SELECT si_page,si_title,si_text FROM searchindex WHERE si_page=99999;
+---------+--------------------------+-----------------------------+
| si_page | si_title                 | si_text                     |
+---------+--------------------------+-----------------------------+
|   99999 | sandbox search test page | list of keywords number two | 
+---------+--------------------------+-----------------------------+
1 row in set (0.00 sec)

If you get two rows, the test has failed: the unique key on si_page is missing or not enforced.

To delete the test entry, simply execute:

DELETE FROM searchindex WHERE si_page=99999;
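
The test relies on REPLACE removing the old row that conflicts on the unique key. The sketch below repeats the test with Python's sqlite3 module, once with and once without a unique si_page column, to show the difference. SQLite is only a stand-in here (its REPLACE behaves similarly for this case), and the function name rows_after_two_replaces is an assumption for this example.

```python
import sqlite3

def rows_after_two_replaces(with_unique_key):
    """Run the two REPLACE statements and return the rows left for page 99999."""
    conn = sqlite3.connect(":memory:")
    key = "UNIQUE" if with_unique_key else ""
    conn.execute(
        f"CREATE TABLE searchindex (si_page INTEGER {key}, si_title TEXT, si_text TEXT)"
    )
    for text in ("list of keywords number one", "list of keywords number two"):
        conn.execute(
            "REPLACE INTO searchindex (si_page, si_title, si_text) VALUES (?, ?, ?)",
            (99999, "sandbox search test page", text),
        )
    return conn.execute(
        "SELECT si_text FROM searchindex WHERE si_page = 99999 ORDER BY rowid"
    ).fetchall()

# With the unique key, only the last entry survives.
print(rows_after_two_replaces(True))   # [('list of keywords number two',)]

# Without it, REPLACE acts like a plain INSERT and both rows remain.
print(len(rows_after_two_replaces(False)))  # 2
```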

Host names are Never Found

Problem

Searching in MediaWiki for `hostname` does not list pages that include `hostname.example.org`.

Reason

My hypothesis:

MediaWiki seems to treat `hostname.example.org` as a single word.

Resolution

Search as if `hostname` were a partial word, e.g. search for `hostname*`.

Note: I hope it is also possible to change the word delimiter (word separator) for the indexer. However, I have not yet found a way to do that.
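
Why the trailing `*` helps can be sketched in a few lines of Python. This only models the hypothesis above, that the full host name is stored as a single token; it is not how MediaWiki or MySQL actually implement matching, and the function name page_matches is made up for this example.

```python
def page_matches(query, indexed_tokens):
    """Whole-word match, or prefix match when the query ends with '*'."""
    query = query.lower()
    if query.endswith("*"):
        prefix = query[:-1]
        return any(token.startswith(prefix) for token in indexed_tokens)
    return query in indexed_tokens

# Hypothesis: the full host name is indexed as one word.
tokens = ["hostname.example.org"]

print(page_matches("hostname", tokens))   # False
print(page_matches("hostname*", tokens))  # True
```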