What is the best PHP web crawler class?: Get content from one site to store in my database

What is the best PHP web crawler class? #web crawler class

by kash - 9 years ago (2016-05-26)

+2	I want an efficient Web crawler that can get the contents of a Web site.

1. by Manuel Lemos - 9 years ago (2016-05-27)
Do you want the Web site page contents or just the links to the crawled page URLs?

2. by kash - 9 years ago (2016-05-27) in reply to comment 1 by Manuel Lemos
I want the website content.

3. by Axel Hahn - 9 years ago (2016-05-27) in reply to comment 2 by kash
You can use a non-blocking crawler. I used "rolling-curl" for my own crawler (but it is too early to make my crawler public). In a callback function you get the fetched content, from which you can extract the title and meta description from the head, and the body for the content.
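
As a rough illustration of that callback step, here is a minimal sketch using PHP's bundled DOM extension. It is not rolling-curl's API and not the code of the crawler mentioned above. The function name extractPageContent() is hypothetical, and the sketch assumes the callback hands you a non-empty HTML string whose meta tag is written as name="description" in lower case.

```php
<?php
// Hypothetical helper (name is an assumption): pulls the title, the meta
// description and the readable body text out of an already fetched HTML string.
function extractPageContent(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $title = trim($xpath->evaluate('string(//head/title)'));
    $description = trim($xpath->evaluate('string(//head/meta[@name="description"]/@content)'));

    // Remove script and style elements so only readable text is left.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }
    // Collapse runs of whitespace in the remaining body text.
    $body = trim(preg_replace('/\s+/', ' ', $xpath->evaluate('string(//body)')));

    return ['title' => $title, 'description' => $description, 'body' => $body];
}
```

The returned array could then be passed to whatever storage layer you use. Note that loadHTML() guesses the character encoding, so pages that do not declare a charset may need converting first.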

If you want to follow links (recursive scan), you must parse the content to find new crawlable links. There you need to:
- check that you stay on the domain
- make relative links absolute
- check whether the found URL was already added for crawling
- check the depth of the URL path (maybe)

If you want to fetch foreign domains, you should respect the robots.txt as well as the index and follow rules in the HTML head and in a tags.
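
The four link checks above can be sketched roughly as follows. This is only an illustration under stated assumptions: resolveUrl() and extractCrawlableLinks() are hypothetical names, the resolver is simplified rather than a full RFC 3986 implementation, "same domain" is taken to mean an identical host name, and robots.txt and meta robots handling are left out.

```php
<?php
// Hypothetical, simplified resolver: turns an href into an absolute http(s) URL.
function resolveUrl(string $base, string $href): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // nothing to crawl
    }
    $b = parse_url($base);
    if (!isset($b['scheme'], $b['host'])) {
        return null;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    if (preg_match('#^([a-z][a-z0-9+.-]*):#i', $href, $m)) {
        // Already absolute: skip mailto:, javascript:, tel: and similar schemes.
        if (!in_array(strtolower($m[1]), ['http', 'https'], true)) {
            return null;
        }
        $url = $href;
    } elseif (strpos($href, '//') === 0) {
        $url = $b['scheme'] . ':' . $href;      // protocol-relative link
    } elseif ($href[0] === '/') {
        $url = $root . $href;                   // root-relative link
    } else {
        $basePath = $b['path'] ?? '/';          // path-relative link
        $url = $root . substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    $p = parse_url($url);
    if ($p === false || !isset($p['host'])) {
        return null;
    }
    // Resolve "." and ".." segments; the fragment is dropped on purpose.
    $segments = [];
    foreach (explode('/', ltrim($p['path'] ?? '/', '/')) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }
    return strtolower($p['scheme']) . '://' . strtolower($p['host'])
        . (isset($p['port']) ? ':' . $p['port'] : '')
        . '/' . implode('/', $segments)
        . (isset($p['query']) ? '?' . $p['query'] : '');
}

// Hypothetical helper: returns new, same-host links found in $html.
// $seen is the caller's set of URLs already queued (URL => true).
function extractCrawlableLinks(string $html, string $pageUrl, array &$seen, int $maxDepth = 5): array
{
    $host = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $url = resolveUrl($pageUrl, $a->getAttribute('href'));   // make absolute
        if ($url === null || parse_url($url, PHP_URL_HOST) !== $host) {
            continue;                                            // stay on the domain
        }
        $path = (string) parse_url($url, PHP_URL_PATH);
        if (count(array_filter(explode('/', $path), 'strlen')) > $maxDepth) {
            continue;                                            // path too deep
        }
        if (isset($seen[$url])) {
            continue;                                            // already queued
        }
        $seen[$url] = true;
        $found[] = $url;
    }
    return $found;
}
```

A crawl loop would call extractCrawlableLinks() for each fetched page and push the returned URLs onto its queue, keeping one shared $seen array for the whole run.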

4. by Manuel Lemos - 9 years ago (2016-05-27) in reply to comment 2 by kash
I have seen crawlers that extract all of a site's page links and store them in a database, but I have not yet found any that store the actual page content.

5. by Manuel Lemos - 9 years ago (2016-05-27) in reply to comment 3 by Axel Hahn
Axel, there are some crawlers that retrieve the content. If your package is going to be able to store content in a database, that seems to be what kash wants.

6. by scott Winterstein - 9 years ago (2016-05-31) in reply to comment 5 by Manuel Lemos
You may not find one. I have used HTTrack over my years of development.
