Project Brief: Blog Crawler Tool
To create a tool that can analyse multiple blog URLs (supplied in a .txt document or a .csv/.xls spreadsheet), extract all the outgoing blog links, note them down, and return the following information.
All outgoing links pointing to other blog URLs. It is vitally important that these are blogs and not ordinary websites. The tool must not simply take down every URL on the page: we are looking for the outgoing links in the part of the blog labelled "Favourite Blogs" or "Related Blogs" (the blogroll), rather than all of the hyperlinks embedded within the text as deep links. The tool will need to recognise this section and select hyperlinks accordingly. These links are usually listed on the first page and replicated on the other pages within the blog. The tool should also be able to do the following:
Removal of any duplicate URLs
Inclusion of homepage URLs only – as opposed to individual blog posts
Google PageRank of each blog URL found (there are restrictions on the number of requests that can be made per day for Google PageRank scores; therefore, the tool will have to utilise different proxy addresses to circumvent this problem)
Technorati Authority of each blog URL (there are restrictions on the number of requests that can be made per day for Technorati Authority scores; therefore, the tool will have to utilise different proxy addresses to circumvent this problem)
Blog Title – this is contained in the metadata of almost every blog. As mentioned earlier, you are already aware of how to source this data, and that capability needs to be built into the tool. If the title is not available in the metadata, the tool should accommodate this so that the title can still be extracted from the page itself.
Blog Description – this is contained in the source code of almost every website. If it is not available within the source code, the tool should find this information in the "About" section, which is very common on blogs.
Blog Keywords – these are contained in the source code of almost every website for SEO purposes. As discussed earlier, we realistically need the title and description of the blog (this information can normally be found on the blog as text), and we also require the keywords so that the blog can be categorised by subject. This could be achieved in either of the following ways: the tool collects all the tags from all of the posts and removes the duplicates, or it picks out the meta keywords tag (or both).
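The "Favourite Blogs" / "Related Blogs" requirement above could be sketched roughly as follows. This is a minimal illustration using Python's standard-library HTML parser; the set of heading keywords and the heuristic of "collect links after a blogroll heading, until the next heading" are assumptions, and a production tool would likely need per-platform rules.

```python
from html.parser import HTMLParser

# Headings that typically label a blogroll section (assumed keywords).
BLOGROLL_HEADINGS = {"favourite blogs", "favorite blogs", "related blogs", "blogroll"}

class BlogrollParser(HTMLParser):
    """Collect <a href> links that appear after a blogroll-style heading
    and before the next heading, ignoring deep links in the post body."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_blogroll = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.in_heading = True
        elif tag == "a" and self.in_blogroll:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.in_heading = False

    def handle_data(self, data):
        # A heading's text decides whether the links that follow it
        # belong to the blogroll section.
        if self.in_heading:
            self.in_blogroll = data.strip().lower() in BLOGROLL_HEADINGS
```

Because blogrolls are usually in a sidebar replicated across pages, running this on the first page alone is normally enough, as noted above.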
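The duplicate-removal and homepage-only requirements above can be handled together, since reducing every URL to its homepage makes duplicates obvious. A minimal sketch using only the standard library:

```python
from urllib.parse import urlsplit

def to_homepage(url):
    """Reduce any post URL to the blog's homepage (scheme + host + '/')."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

def dedupe_homepages(urls):
    """Normalise every URL to its homepage, then drop duplicates
    while preserving the order of first appearance."""
    seen, out = set(), []
    for url in urls:
        home = to_homepage(url)
        if home not in seen:
            seen.add(home)
            out.append(home)
    return out
```

This works well for subdomain-hosted blogs (each blog is its own host); path-hosted blogs would need extra rules.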
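The proxy requirement for the PageRank and Technorati lookups above amounts to rotating through a pool of addresses loaded from the proxies .txt file. A sketch of the rotation itself (the actual PageRank / Technorati request code is left abstract, since those endpoints are not specified here):

```python
import itertools
import urllib.request

class ProxyRotator:
    """Cycle through a pool of proxy addresses so that rate-limited
    lookups (PageRank / Technorati Authority) are spread across
    different IPs. The pool would be loaded from the proxies .txt file."""

    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._pool)

    def opener(self):
        """Build a urllib opener that routes the next request
        through the next proxy in the pool."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
```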
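The title, description, and keywords requirements above could be sketched as below. Regular expressions are used here purely for illustration (a real tool should use a proper HTML parser); returning None when a field is missing lets the caller fall back to the "About" page, and the keyword de-duplication is the same logic that would merge tags collected from individual posts.

```python
import re

def extract_title(html):
    """The <title> element is present on almost every blog."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return m.group(1).strip() if m else None

def extract_meta(html, name):
    """Read a named meta tag ('description' or 'keywords'); returns
    None when absent so the caller can fall back to the About page."""
    pattern = r'<meta\s+name=["\']%s["\']\s+content=["\'](.*?)["\']' % re.escape(name)
    m = re.search(pattern, html, re.I | re.S)
    return m.group(1).strip() if m else None

def keywords(html):
    """Split the meta keywords into a clean, de-duplicated list."""
    raw = extract_meta(html, "keywords") or ""
    seen, out = set(), []
    for kw in raw.split(","):
        kw = kw.strip()
        if kw and kw.lower() not in seen:
            seen.add(kw.lower())
            out.append(kw)
    return out
```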
Input the proxies into an appropriately titled .txt file
Input multiple URLs into a .txt file
The Blog Tool goes through each URL one by one, finds all of the outgoing blog URLs, and for each URL found populates an Excel spreadsheet in the following way:
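The file-handling side of the workflow above might be wired together as below. The column names are assumptions (the brief leaves the exact layout open), and CSV is used here because Excel opens it directly; producing a native .xls file would need a third-party library.

```python
import csv

# Assumed column layout; the brief leaves the exact columns open.
COLUMNS = ["Source URL", "Outgoing Blog URL", "PageRank",
           "Technorati Authority", "Title", "Description", "Keywords"]

def read_lines(path):
    """Load URLs or proxies from a .txt file, one entry per line,
    skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def write_results(rows, path):
    """Write one spreadsheet row per outgoing blog URL found."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
```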