Hi,
I have a Perl script that uses a spider to crawl certain websites, gathers information from them, and then inserts this information into a database. The script is almost done; it just needs a few small finishing touches, which are outlined below.
Remember, this script is about 85-90% complete; we just need someone to put the finishing touches on it.
1) Grabbing the category we are searching for and inserting it into the database.
We search through a Perl admin section where we first select the site, then enter the category (for example, books), and then the location. When the spider gathers results from a site, we want it to also insert the category we entered into the corresponding row in the database.
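As a rough illustration of what we mean, the insert might look something like the following. This is only a sketch: the table name, column names, and connection details are assumptions, not the script's actual schema.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical connection details -- replace with the script's real DSN.
my $dbh = DBI->connect('dbi:mysql:database=listings', 'user', 'password',
                       { RaiseError => 1 });

# $category is the value entered in the admin section (e.g. "books");
# it is stored in the same row as the scraped listing.
sub insert_listing {
    my ($name, $city, $category) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO listings (name, city, category) VALUES (?, ?, ?)'
    );
    $sth->execute($name, $city, $category);
}
```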
2) Removing special characters from some of the words/cities.
Some of the cities/words are coming in with special characters. For example, a city such as Montreal may be spelled Montre@l, etc. We need these characters not to show up and the actual letters to be used instead.
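A minimal sketch of the kind of cleanup we have in mind, assuming a simple character-substitution map; the real mapping depends on which characters the sites actually emit. (If the stray characters turn out to be mis-decoded accents, e.g. Montréal, a module such as Text::Unidecode may be the better fix.)

```perl
use strict;
use warnings;

# Hypothetical substitution map -- only '@' appears in our examples so far;
# extend it as more bad characters turn up in the scraped data.
my %fix = ( '@' => 'a' );

sub clean_word {
    my ($word) = @_;
    my $chars = join '', map { quotemeta } keys %fix;
    $word =~ s/([$chars])/$fix{$1}/g;
    return $word;
}

print clean_word('Montre@l'), "\n";  # Montreal
```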
3) Fixing certain sites because they were only searching one state/province.
There are a few sites where the scraper only searches one state or province, even if the city we enter isn't in that province. We need this fixed.
4) If the website address ends in .[login to view URL] or [login to view URL], etc., then the script should not insert it into the database.
We also check whether the listing has a website, and if it does, the address is inserted into the database. The problem is that some listings have generic Yellow Pages addresses that are actually online advertisements for them because they don't have their own website. We don't want these inserted into the database; only actual websites should be inserted.
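The filter could work along these lines. Note the domain list below is purely illustrative, since the actual generic addresses are hidden behind the "[login to view URL]" placeholders above.

```perl
use strict;
use warnings;

# Hypothetical list of generic ad domains -- replace with the real ones.
my @generic_domains = ('yellowpages.example', 'yp-ads.example');

# Returns true only for listings with a genuine website of their own.
sub is_real_website {
    my ($url) = @_;
    for my $domain (@generic_domains) {
        return 0 if $url =~ /\Q$domain\E/i;
    }
    return 1;
}
```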
5) We have a PHP programmer building a browser-based admin section in PHP. One of the problems we run into is that some of the addresses don't have a postal/zip code. So we want to set it up so that when we select certain addresses (in the PHP admin section) to download, another spider runs against the sites below, grabs the zip/postal codes, and inserts them into the database corresponding to the addresses we are going to download.
[login to view URL]
[login to view URL]
[login to view URL];pageId=pcaf_pc_search&gear=postcode
[login to view URL]
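The postcode spider could be sketched as follows. Everything here is an assumption: the lookup URL is a stand-in for the sites listed above, and the extraction regex would need to match each site's actual markup.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my $ua = LWP::UserAgent->new(timeout => 15);

# Hypothetical lookup: fetch a postcode-search page for a street/city pair
# and pull the first thing that looks like a postal/zip code out of it.
sub lookup_postcode {
    my ($street, $city) = @_;
    my $url = 'https://postcode-lookup.example/search?street='
            . uri_escape($street) . '&city=' . uri_escape($city);
    my $res = $ua->get($url);
    return unless $res->is_success;
    # Canadian postal codes look like "H2X 1Y4"; US ZIP codes like "90210".
    my ($code) = $res->decoded_content
        =~ /\b([A-Z]\d[A-Z]\s?\d[A-Z]\d|\d{5})\b/;
    return $code;  # undef if nothing matched
}
```

The returned code would then be written back to the database row for the address being downloaded.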
6) Scraping [login to view URL]
If possible, we would like the site above to be scraped as well.
If you have any questions or need any clarification, please PM me. Thanks.
--Anthony