Make a Search Engine in PHP and MySQL

 
Simon Byholm

Why would you want to make a search engine anyway? There is already a search engine to rule them all. You can use Google to find just about anything on the Internet, and I doubt you will ever have the same computing and storage capabilities as the big G.

So why then make your own search engine?

To make money of course!

... and to become famous as the creator of the next big search engine, or because, as a programmer or engineer, you like challenges. Making a search engine for the public Internet is tricky, and if you're like me, you like to solve tricky problems.

The third application is a customized, high-speed site search for your large website with thousands of pages. An indexed search engine will be a lot faster than a full-text search function, and if Google's site search isn't flexible enough for your site, you can make your own search functionality.

THE BASICS OF SEARCH

The basis of any BIG search engine is a word to web page index, basically a long list of words and how well they relate to different web pages.

To make a search engine you have to do four things:

Decide what pages to fetch and fetch them
Parse out words, phrases and links from the page
Give a score to every keyword or key phrase indicating how well the phrase relates to that page, and store the scores in the search engine index
Provide a way for users to query the index and get a list of matching web pages
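
The four steps above can be sketched in a few lines. This is only a minimal illustration, written in Python for brevity rather than PHP, and it makes several assumptions: the pages are a hard-coded sample instead of fetched documents, the index lives in memory instead of a database, and the score is a plain word count. The names tokenize, build_index and search are made up for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample pages standing in for fetched documents (step 1).
PAGES = {
    "example.com/a": "PHP search engine tutorial: build a search index",
    "example.com/b": "MySQL tutorial for beginners",
}

def tokenize(text):
    """Step 2: split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Step 3: map each word to {url: score}; the score here is a raw word count."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index

def search(index, query):
    """Step 4: add up the scores of every query word and sort best first."""
    totals = Counter()
    for word in tokenize(query):
        for url, score in index.get(word, {}).items():
            totals[url] += score
    return [url for url, _ in totals.most_common()]

index = build_index(PAGES)
print(search(index, "search tutorial"))
```

In a real engine the word-to-page mapping would be stored in a database table so that it survives restarts and can grow beyond memory.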

This is not hard for a seasoned programmer. It can be done in a day if you know regular expressions and have some experience with HTML and databases.

Now you have a working search engine. Just add a lot of computers and hard drives and you'll soon index all of the Internet. If you're not prepared to go that far, a one-terabyte disk will hold an index of about 50 million pages, which works out to roughly 20 KB of index data per page.

HOW TO SCORE PAGES

After completing basic search functionality there's a lot of work before anyone will want to use your new machine.

An index is not enough. The challenge is scoring pages so that the end user gets the search results most relevant to what he or she is searching for.

You'll need to decide how much weight to put on keywords in the title tag, the description and the main web page contents. To score well you will also want to boost keywords found in the URL of the page and check the anchor text of inbound links.
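
One simple way to combine these signals is a weighted sum per page. The sketch below, in Python for brevity, assumes each page has already been split into named fields. The weights in FIELD_WEIGHTS are invented for illustration and would need tuning against real queries; score_keyword is a hypothetical helper name.

```python
import re

# Hypothetical weights: these numbers are assumptions to tune, not recommendations.
FIELD_WEIGHTS = {
    "title": 10.0,
    "anchor_text": 8.0,
    "url": 6.0,
    "description": 4.0,
    "body": 1.0,
}

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_keyword(keyword, fields):
    """Sum weight * occurrences of the keyword over every field of a page."""
    keyword = keyword.lower()
    score = 0.0
    for name, text in fields.items():
        occurrences = tokenize(text).count(keyword)
        score += FIELD_WEIGHTS.get(name, 0.0) * occurrences
    return score

page = {
    "title": "Cheap Flights to Helsinki",
    "url": "example.com/flights/helsinki",
    "description": "Compare flights and book online.",
    "body": "Find flights from many airlines. Flights depart daily.",
    "anchor_text": "helsinki flights",  # text of links pointing at this page
}
print(score_keyword("flights", page))
```

A raw count rewards keyword stuffing, so a production scorer would probably cap or dampen the number of occurrences counted per field.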

Keeping track of inbound links is the most useful and the most challenging of the above. You'll need to keep a separate database table with information on all links between the pages you index.
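
Such a link table could look something like the sketch below. It uses Python's built-in SQLite module as a stand-in for MySQL so that it runs without a server; the table name, column names and sample URLs are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE links (
        from_url    TEXT NOT NULL,
        to_url      TEXT NOT NULL,
        anchor_text TEXT NOT NULL,
        PRIMARY KEY (from_url, to_url)
    )
""")
# Inbound-link lookups filter on to_url, so index that column.
db.execute("CREATE INDEX links_to_url ON links (to_url)")

# Hypothetical links found while parsing three pages.
db.executemany(
    "INSERT INTO links (from_url, to_url, anchor_text) VALUES (?, ?, ?)",
    [
        ("a.example/", "c.example/", "search engine guide"),
        ("b.example/", "c.example/", "build a search engine"),
        ("c.example/", "a.example/", "home"),
    ],
)

def inbound_links(url):
    """Return (number of inbound links, their anchor texts) for a page."""
    rows = db.execute(
        "SELECT anchor_text FROM links WHERE to_url = ? ORDER BY from_url",
        (url,),
    ).fetchall()
    return len(rows), [text for (text,) in rows]

print(inbound_links("c.example/"))
```

The anchor texts returned here can then be scored like any other field of the target page.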

WHAT TO INDEX AND NOT TO INDEX

Another obstacle you will find when you start indexing real Internet content is that there are vast amounts of useless junk floating around everywhere. Eventually your index will become full of spam, affiliate pages, parked domains, work-in-progress homepages without content, link farms used by search engine optimizers, mirror sites using data feeds to create thousands of pages with product listings or other reproduced content, and so on.

When indexing from the Internet you will have to find ways to filter out the junk content from what people are actually reading and searching for.

To start with, you could limit how deep into subdirectories you crawl, how many link hops from a domain's index page you follow, and how many links per web page you allow.
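
These three limits are easy to express as small checks in the crawler. The sketch below is in Python for brevity; the limit values and function names are assumptions for illustration, and it guesses that a last path segment without a trailing slash is a file name rather than a directory.

```python
from urllib.parse import urlparse

# Hypothetical limits; tune them for the sites you crawl.
MAX_PATH_DEPTH = 3        # subdirectory levels below the domain root
MAX_HOPS = 4              # link hops from the domain's index page
MAX_LINKS_PER_PAGE = 100  # links followed from any single page

def path_depth(url):
    """Count directory levels: /a/b/page.html has depth 2."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # Assume the last segment is a file name unless the path ends with a slash.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)

def should_crawl(url, hops):
    """Apply the depth and hop limits to a candidate URL."""
    return hops <= MAX_HOPS and path_depth(url) <= MAX_PATH_DEPTH

def links_to_follow(links):
    """Keep only the first MAX_LINKS_PER_PAGE links found on a page."""
    return links[:MAX_LINKS_PER_PAGE]

print(should_crawl("http://example.com/a/b/page.html", hops=2))
```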

PARSING WEBSITES

There are a million ways, both right and wrong, to write HTML, and when you index from the Internet you will need to handle all of them.

When parsing keywords from pages you not only need to handle the complete HTML standard but also all the non-standard ways that are unofficially supported by Internet browsers.

To be able to read all pages you will also need to parse client-side JavaScript and handle frames, iframes and CSS.

Being able to read all sorts of content is a large part of the work on a general search engine.
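
As a starting point, a tolerant event-based parser copes with much of the sloppy markup found in the wild. The sketch below uses Python's standard html.parser module, in Python rather than PHP for brevity, to pull the title, visible text and link targets out of a page with mismatched tag case, an unquoted attribute and an unclosed paragraph. The PageParser class and the sample markup are made up for this example. It does not execute JavaScript or interpret CSS; it only skips script and style content.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect the title, visible text and links from possibly sloppy HTML."""

    SKIPPED = {"script", "style"}  # content that is not visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.text = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, whatever case the page used.
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in ("a", "frame", "iframe"):
            attr = "href" if tag == "a" else "src"
            target = dict(attrs).get(attr)
            if target:
                self.links.append(target)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())

html = """<html><head><TITLE>Demo page</title>
<script>var tracking = 1;</script></head>
<body><p>Unclosed paragraph <a href=/about>About us</a>
<iframe src="frame.html"></iframe></body>"""

parser = PageParser()
parser.feed(html)
parser.close()
print(parser.title.strip(), parser.text, parser.links)
```

The frame and iframe targets are collected as links so that the crawler can fetch and index the framed documents separately.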

WHY SO MANY URLS?

Finally, you'll need to deal with the fact that many websites have many URLs pointing to the same web page. Just look at this example:

dmoz.org

www.dmoz.org

dmoz.org/index.html

www.dmoz.org/index.html

All those URLs point to the same web page. If you don't write special code to handle that, you'll soon have four results in your search engine (one for every URL), all going to the same page. Users will not like you.
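
A common remedy is to reduce every URL to one canonical form before storing it. The sketch below, in Python for brevity, handles the four variants listed above. It assumes that the www and non-www hosts serve the same site and that the names in DEFAULT_DOCUMENTS refer to the directory index, neither of which is guaranteed for every website; canonical_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of file names that usually mean "the directory itself".
DEFAULT_DOCUMENTS = {"index.html", "index.htm", "index.php", "default.asp"}

def canonical_url(url):
    """Reduce equivalent spellings of a URL to one canonical form."""
    if "://" not in url:
        url = "http://" + url  # bare host such as "dmoz.org"
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www and non-www are the same site
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"  # /index.html becomes /
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

variants = ["dmoz.org", "www.dmoz.org", "dmoz.org/index.html", "www.dmoz.org/index.html"]
print({canonical_url(u) for u in variants})
```

Storing the canonical form as the unique key of the page table means all four spellings collapse into a single search result.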

There is also the problem of query strings, where a session ID after the question mark in the URL will create an almost infinite number of URLs for the same web page.

google.com?SID=4434324325325

google.com?SID=4387483748377

google.com?SID=7654565644466

To the search engine there will be a really big number of pages all containing the same content.

The quick fix, of course, is to not index pages that include a query string, or to strip the query string from the URL. This works, but it will also remove a lot of legitimate content (think forums) from your index.
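
A gentler variant is to remove only the parameters that look like session IDs and keep the rest of the query string, so that forum topics and similar pages stay in the index. The sketch below is in Python for brevity. The parameter names in SESSION_PARAMS are an assumed starting list, not a complete one, and strip_session_id is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameter names that usually carry a session ID.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    """Drop session parameters but keep the rest of the query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    # Sort so that the same parameters in a different order give the same URL.
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

print(strip_session_id("http://google.com?SID=4434324325325"))
print(strip_session_id("http://forum.example/viewtopic.php?t=42&sid=abc"))
```

The first call collapses every session variant to the bare URL, while the second keeps the topic number that identifies the forum page.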

You now have all the information you need to make a site search engine. If you're going for a general Internet search engine, there are a lot more details you need to handle, such as robots.txt, site maps, redirects, proxies, recognizing content types, advanced ranking algorithms and handling terabytes of data.

I'll cover more detail in a future article. Good luck with your next search engine project.


Simon Byholm is building a new search engine where he will test and describe new search algorithms. Simon is a software engineer living on the west coast of Finland and has a B.Sc degree in telecommunication and a burning interest for search engine algorithms.

Article published on October 16, 2009 at Isnare.com
 