| |
|
 |
Building a fulltext search engine with PHP
|
In this article I will show you how you can use the database library from eZ publish to build database independent fulltext search engine. I will show you a relatively simple algorithm to get search result ranking and stop words. You will get the full source code for a complete working program to get you started.
Database structure
The database structure used in this example is shown below. You have three tables; object, word and object_word_link. The object table contains the actual data which are indexed. The word table continans all the words found in indexed objects. The object_word_link table are used to get keep track of which words are in which objects.
DROP TABLE object;
CREATE TABLE object
(
id int primary key auto_increment,
data text
);
DROP TABLE object_word_link;
CREATE TABLE object_word_link
(
id int primary key auto_increment,
object_id int not null,
word_id int not null,
frequency float not null
);
drop table word;
CREATE TABLE word
(
id int primary key auto_increment,
word varchar( 150 ),
object_count int not null
);
CREATE INDEX object_word_link_object_id ON object_word_link (object_id);
CREATE INDEX object_word_link_wird_id ON object_word_link (word_id);
CREATE INDEX object_word_link_frequency ON object_word_link (frequency);
CREATE INDEX word_object_count ON word (object_count);
|
Creating index
When you have the text you want to index you first need to filter out unwanted character like multiple whitespace and characters like:,.'" and then you need to create an array of the words in the text. The code snippet below shows how you can do this.
// Strip multiple whitespaces
$IndexText = str_replace(".", " ", $IndexText );
$IndexText = str_replace(",", " ", $IndexText );
$IndexText = str_replace("'", " ", $IndexText );
$IndexText = str_replace("\"", " ", $IndexText );
$IndexText = str_replace("\n", " ", $IndexText );
$IndexText = str_replace("\r", " ", $IndexText );
$IndexText = preg_replace("(\s+)", " ", $IndexText );
// Split text on whitespace
$indexArray =& split( " ", $IndexText );
// Count the total words in index text
$totalWordCount = count( $indexArray );
// Count the number of instances of each word
$wordCountArray = array_count_values( $indexArray );
// Strip double words
$indexArray = array_unique( $indexArray );
// Count unique words
$uniqueWordCount = count( $indexArray ); | When you have all the words in the text you want to index you need to store each word in the word table and store the reference to the object. If the word already exists you increment the object_count value for this word.
To get relevance ranking of your result you need to know how relevant the given word is in this context. The algorighm used here is to divide the number of occorances of the word divided by the total number of words in an article. For example if a word is mentioned once in a 100 word text you will get a frequency value of 1/100. Compared to a text with 50 occorances of the word, which would get a frequency value of 0.5. In a search result you would sort by the frequency value and get the most relevant item at the top of the search result.
Below you will find the code to store the words and create the reference to the object.
foreach ( $indexArray as $indexWord )
{
// Store word if it does not exist.
$wordRes = array();
$db->array_query( $wordRes, "SELECT * FROM word WHERE word='$indexWord'" );
if ( count( $wordRes ) == 1 )
{
$wordID = $wordRes[0]["id"];
$db->query( "UPDATE word SET object_count=( object_count + 1 ) WHERE id='$wordID'" );
}
else
{
$wordID = $db->nextID( "word", "id" );
$db->query( "INSERT INTO word ( id, word, object_count ) VALUES ( '$wordID', '$indexWord', '1' )" );
}
print( "Indexing word: '$indexWord'" );
// Calculate the relevans ranking for this word
$frequency = ( $wordCountArray[$indexWord] / $totalWordCount );
print( "Internal normalized word frequency: $frequency" );
$linkID = $db->nextID( "object_word_link", "id" );
$db->query( "INSERT INTO
object_word_link ( id, word_id, object_id, frequency )
VALUES ( '$nextID', '$wordID', '$objectID', '$frequency' )" );
} |
Searching the text
To query the indexed data you first need to split up the input text into words. The code below shows how you can create an array of the input text.
// Strip multiple whitespace
$SearchText = preg_replace("(\s+)", " ", $db->escapeString( $SearchText ) );
// Split text on whitespace
$searchArray =& split( " ", $SearchText ); | In this example we've stored the number of objects in which a word is present. This value can be used to filter out words which are present in to many objects. In this example we calculate the percentage of objects for which the given word are present. For performance issues it could be a good idea to pre calculate this value. The code snippet below shows how you build an search query on the indexed data using logical OR for multiple word searches.
// Get the total number of objects
$objectCount = array();
$db->array_query( $objectCount, "SELECT COUNT(*) AS count FROM object" );
$totalObjectCount = $objectCount[0]["count"];
// Search words can at most be present in 70% of the objects
$stopWordFrequency = 0.7;
$wordSQL = "";
$i = 0;
// Build the word query string
foreach ( $searchArray as $searchWord )
{
if ( $i == 0 )
$wordSQL .= "word.word='" .strToLower( $searchWord ) ."' ";
else
$wordSQL .= " OR word.word='" .strToLower( $searchWord ) ."' ";
$i++;
}
// Build the full search query using logical OR if multiple words are searched on
$searchQuery = "SELECT object.id, object_word_link.frequency
FROM object, object_word_link, word
WHERE object.id=object_word_link.object_id
AND word.id=object_word_link.word_id
AND ( $wordSQL )
AND ( ( word.object_count / $totalObjectCount ) < $stopWordFrequency )
ORDER BY object_word_link.frequency DESC";
$objectRes = array();
// Execute the query
$db->array_query( $objectRes, $searchQuery ); |
Live example
You can test this example search engine live here.
Source code
The attached file, search_engine.tar.gz contains the complete sourcecode to a simple fulltext search engine. The source is licenced with GPL.
Attached files:
Comment List
| Topic: |
Author: |
Time: |
|
Search query in PHP
|
Viper P
|
19.02.2006 19:58
|
|
This database that has been created, where was it created and how can i replicate it???
|
|
could not get it to work
|
Leblanc Meneses
|
22.07.2004 18:50
|
|
sorry guys...umm I think i'm the only one that couldn't get it to work.....
Where am I supposed to put my files I want the search script to index?
- I put them inside the folder containing
searchengine.php
I created the database manually then dragging the schema.sql into my c:mysqlbin>
then I typed: then typed:mysql -h localhost -u root -p search < schema.sql
or just mysql database_name < database.sql
my site.ini is:
[site]
DatabaseImplementation=mysql
Server=localhost
Database=search
User=root
Password=mypassword
# If you need to specify the socket to use with mysql use this variable
MySQLSocket=disabled
Does this matter in windows?
The folder classes/cache/ must be writeable for the webserver user.
Leblanc M.
http://www.xposure.us/Leblanc/V1/
|
|
Check may be this will be usefull if developed
|
Ayodele Okeowo
|
02.07.2004 18:17
|
|
I developed this search form using Kevin Yank's Book. It's really working as I ain't thought of. the codes below:
The Mysql table is:
CREATE TABLE acc (
acc_id int primary key,
username varchar(64),
password varchar(64),
ts timestamp,
act_ts timestamp);
the .html file
<html>
<head><title>Searching username</title>
</head>
<body>
<img src="web.gif" align="absbottom"> <font size="4"><b>Welcome to my DataBase Search Engine</b></font><br />
<font size="2"> This search engine is all about finding<br />
the users present in my database and knowing thier various<br /> informations you need to know. Happy searching.</font></p>
<form action="searchresult.php" method="post">
Enter word:
<input name="searchtext" type="text" size="30" />
<input type="submit" name="submit" value="Search" /></br>
</form>
Enjoy Searching. <font size="1" face="arial" color="#cd4532">Powered by HARYOH.com</font>
</body>
</html>
the .php files
<html>
<head>
<title> Search </title>
</head>
<body>
<h1>Your Search Result</h1>
<p><a href="searchusername.html">New Search</a></p>
<?php
$dbcnx = mysql_connect('localhost', 'root', 'derinola');
mysql_select_db('accounts');
// The basic SELECT statement
$select = 'SELECT DISTINCT acc_id, username';
$from = ' FROM acc';
$where = ' WHERE 1';
$searchtext = $_POST['searchtext'];
if ($searchtext != '' || $searchtext = '') { // Some search text was specified
$where .= " AND username LIKE '%$searchtext%'";
}else{
die('<p>Please, specify your search entering word.</br>' .
'date("l, F dS Y.")' . ' <a href="searchusername.html">Click </a>'.
' here to visit our database center.</p>');
}
?>
<table border="0">
<tr><th>Search Texts</th><th>Matches</th></tr>
<?php
$search = @mysql_query($select . $from . $where);
if (!$search) {
echo('</table>');
die('<p>Error retrieving username from database!<br />'.
'Error: ' . mysql_error() . '</p>');
}
while ($user = mysql_fetch_array($search)) {
echo("<tr valign='top'>n");
$id = $user['acc_id'];
$username = htmlspecialchars($user['username']);
echo("<td>$username</td>n");
echo("<td>$id</td>n");
echo("</tr>n");
}
?>
</table>
</body>
</html>
these scripts are working quite alright but remains something that I want to add;
I want that when I enter a wrong construction of words, it should complain that I've entered a wrong groups of words and also it should give me the COUNT of each words it finds.
AYo okeowo
|
|
Help me with a better \"Search code\" for seacrc
|
Ali Mokoni
|
26.03.2005 01:10
|
|
> I developed this search form using Kevin Yank's Book. It's
> really working as I ain't thought of. the codes below:
> The Mysql table is:
> CREATE TABLE acc (
> acc_id int primary key,
> username varchar(64),
> password varchar(64),
> ts timestamp,
> act_ts timestamp);
> the .html file
> <html>
> <head><title>Searching username</title>
> </head>
> <body>
> <img src="web.gif"
> align="absbottom"> <font
> size="4"><b>Welcome to my DataBase Search
> Engine</b></font><br />
> <font size="2"> This search engine is all
> about finding<br />
> the users present in my database and knowing thier
> various<br /> informations you need to know. Happy
> searching.</font></p>
> <form action="searchresult.php"
> method="post">
> Enter word:
> <input name="searchtext"
> type="text" size="30" />
> <input type="submit" name="submit"
> value="Search" /></br>
> </form>
> Enjoy Searching. <font size="1"
> face="arial" color="#cd4532">Powered
> by HARYOH.com</font>
> </body>
> </html>
>
> the .php files
> <html>
> <head>
> <title> Search </title>
> </head>
> <body>
> <h1>Your Search Result</h1>
> <p><a href="searchusername.html">New
> Search</a></p>
> <?php
> $dbcnx = mysql_connect('localhost', 'root', 'derinola');
> mysql_select_db('accounts');
> // The basic SELECT statement
> $select = 'SELECT DISTINCT acc_id, username';
> $from = ' FROM acc';
> $where = ' WHERE 1';
> $searchtext = $_POST['searchtext'];
> if ($searchtext != '' || $searchtext = '') { // Some search
> text was specified
> $where .= " AND username LIKE
> '%$searchtext%'";
> }else{
> die('<p>Please, specify your search entering
> word.</br>' .
> 'date("l, F dS Y.")' . ' <a
> href="searchusername.html">Click </a>'.
>
> ' here to visit our database center.</p>');
> }
> ?>
> <table border="0">
> <tr><th>Search
> Texts</th><th>Matches</th></tr>
> <?php
> $search = @mysql_query($select . $from . $where);
> if (!$search) {
> echo('</table>');
> die('<p>Error retrieving username from
> database!<br />'.
> 'Error: ' . mysql_error() . '</p>');
> }
> while ($user = mysql_fetch_array($search)) {
> echo("<tr valign='top'>n");
> $id = $user['acc_id'];
> $username = htmlspecialchars($user['username']);
> echo("<td>$username</td>n");
> echo("<td>$id</td>n");
> echo("</tr>n");
> }
> ?>
> </table>
> </body>
> </html>
> these scripts are working quite alright but remains
> something that I want to add;
> I want that when I enter a wrong construction of words, it
> should complain that I've entered a wrong groups of words
> and also it should give me the COUNT of each words it
> finds.
>
> AYo okeowo
|
|
Why not implement an in-memory algorithm
|
mattias waldau
|
14.06.2002 16:37
|
|
One way of handling partial words is creating an in-memory solution.
The algorithm is as follows
Preprocessing:
1. append all texts (will probably be smaller than 1-2 MBs). Remove all HTML-formatting and blanks.
2. create an array of pointers so that the first pointer points at the first char, the second at the second char etc.
(If the size of the text is 1 MB, this array will be 4 MB).
3. Sort the arrray of pointers
Searching:
1. Binary search in the array.
If you want more info, search for "suffix array" at citeseer, for example
http://citeseer.nj.nec.com/sadakane98fast.html
The searching can easily be extended to and, just search for each key at a time, and take the intersection.
An alternative algorithm is to write out the file to disk and use grep. Grep is very fast.
|
|
what about partial words
|
Klaus Hansen
|
13.06.2002 11:19
|
|
well
works as expected with fulltext whole-words but doesn't find partial words or words that are not exact. for a "modern" search-engine in my opinion its not enough.
reg. klaus
|
|
RE: what about partial words
|
Faisal Khan
|
15.08.2002 20:42
|
|
> well
> works as expected with fulltext whole-words but doesn't find
> partial words or words that are not exact. for a
> "modern" search-engine in my opinion its not
> enough.
>
> reg. klaus
Jesus man! haven't you found out that php is *not* a language for building anything like a *modern search engine*?
|
|
RE: what about partial words
|
David Benjamin
|
19.09.2002 18:58
|
|
> Jesus man! haven't you found out that php is *not* a
> language for building anything like a *modern search
> engine*?
That's a matter of opinion. In one sense, I think you're correct, because to build a sufficiently fast, complex, fulltext search in PHP (without cheating with MySQL) you'd want to put your most sensitive logic in C/C++. Otherwise, you just can't compete with the "industry leaders" or what-have-you.
But I think your argument is more one of ideology, as in, PHP is too lenient, has weak typing, poor OOP support, simplistic syntax, no exception handling, etc. Perhaps you would prefer a language crippled by software engineering like Java. Or maybe you're a Python nut like me. But Python doesn't do strong typing or private data.
I've done a lot with PHP, and I think that with discipline it can be made into a fine language for building large applications. It's fast, supports a large array of commonly needed web functionality, has a helpful and strong user base, and can be extended with native code. It may not do distributed systems like Java (actually, with the use of WDDX or XML-RPC, it can do a bit of that, too) but when was the last time you saw a distributed Java application fast enough to compete with, say, a "modern search engine"?
I'm open to corrections. =)
Peace,
Dave
|
|
 |
|
|