I am looking for web-scraping code written in Python (to run on a Windows 10 PC) that pulls results from Google Scholar Case Law, scrapes data from those results, and writes the data into a .csv file readable by Excel. The same data should also be written to a local database running on Windows 10 (MySQL or MongoDB); see the database sketch at the end of this post.
I need the following:
1. A simple GUI that accepts a "topic" to be used to scrape Google Scholar Case Law, plus a start year and an end year. The GUI should have a "Submit" button (see the GUI sketch at the end of this post).
2. Ability to automatically (internally) create the correct search URL based upon the Federal or State court selection, the topic keywords, and the first year. It will then "step" through each year one by one until it hits the end year.
The URLs always follow a standard format, so this should not be difficult to implement (see the URL-builder sketch at the end of this post).
For example, if the topic is "Trademark" and the year range is 2012-2015, the program creates a URL for a Trademark search covering 2012-2012 only (a single-year sub-range keeps the data set manageable) and performs the steps below. Once this is done, it steps to Trademark for 2013-2013 and repeats, advancing one year at a time until it hits 2015-2015. This prevents far too many search results from showing up at once and becoming unmanageable.
In the above example, four separate URLs would be created (one for each single-year range) and run separately.
3. Navigate to each URL (no visual display is required for this; see the fetching/scraping sketch at the end of this post).
4. Grab the entire list of sub-URLs (the individual case links) from the search results, navigate to each one individually, and scrape and save the data listed in a)-d) below, each in its own auto-named CSV file. Each sub-URL then gets its own filename and data.
5. Ability to turn off saving the .CSV files (mainly needed for testing/debugging, to make it easier to verify the program is working properly).
The data to capture for each case page (item 4) is:
a) Exact URL of the case
b) Header info: contains the name of the case, court name, district, and year
c) *** If the case page contains the phrase "NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS", this must be reflected in the CSV naming convention by adding _NOT at the end of the filename.
d) Every case page will have a section called "DISCUSSION." Inside this section we need to search for and save the following (see the parsing sketch at the end of this post):
Sub-titles inside the DISCUSSION section (write each sub-title to the file and save the text associated with it)
aa) Sentences within a sub-title section that are followed by a citation (text inside parentheses): save the preceding sentence and the citation
bb) Any text inside a sub-title section that is inside double quotes, including the citation afterwards: save the entire quoted text and the citation
Continue the above until the end of the page, then get the next sub-URL and repeat. Then step forward one year (if the date range allows) and repeat again until all results are parsed (see the overall-flow sketch at the end of this post).
I have attached an image of the scraped HTML of a Google Scholar article for reference.
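To make the requirements concrete, here are rough sketches of how I picture each piece working. For item 1, a minimal GUI sketch using tkinter; the Federal/State dropdown is my assumption based on item 2, and all widget and callback names are placeholders:

```python
import tkinter as tk
from tkinter import ttk

def on_submit():
    # Placeholder: hand the form values to the scraper (items 2 onward).
    print(topic_var.get(), start_var.get(), end_var.get(), court_var.get())

root = tk.Tk()
root.title("Scholar Case Law Scraper")

tk.Label(root, text="Topic:").grid(row=0, column=0, sticky="e")
topic_var = tk.StringVar()
tk.Entry(root, textvariable=topic_var, width=30).grid(row=0, column=1)

tk.Label(root, text="Start year:").grid(row=1, column=0, sticky="e")
start_var = tk.StringVar()
tk.Entry(root, textvariable=start_var, width=6).grid(row=1, column=1, sticky="w")

tk.Label(root, text="End year:").grid(row=2, column=0, sticky="e")
end_var = tk.StringVar()
tk.Entry(root, textvariable=end_var, width=6).grid(row=2, column=1, sticky="w")

tk.Label(root, text="Court type:").grid(row=3, column=0, sticky="e")
court_var = tk.StringVar(value="Federal")
ttk.Combobox(root, textvariable=court_var, values=["Federal", "State"],
             state="readonly", width=10).grid(row=3, column=1, sticky="w")

tk.Button(root, text="Submit", command=on_submit).grid(row=4, column=1, sticky="w")
root.mainloop()
```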
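For item 2, a sketch of the URL builder and year stepping. q, as_ylo, and as_yhi are standard Scholar query parameters; the as_sdt values that select Federal vs. State courts are placeholders and should be copied from a real case-law search URL built in the browser:

```python
from urllib.parse import urlencode

# Placeholder as_sdt codes for court selection; copy the real values
# from a case-law URL produced by Scholar's "Select courts" page.
COURT_CODES = {"Federal": "4", "State": "2"}

def build_urls(topic, start_year, end_year, court_type):
    """Yield one search URL per single-year sub-range (e.g. 2012-2012)."""
    for year in range(start_year, end_year + 1):
        params = {
            "q": topic,                        # topic keywords
            "as_sdt": COURT_CODES[court_type], # court selection (assumed)
            "as_ylo": year,                    # range start
            "as_yhi": year,                    # range end: same year
        }
        yield "https://scholar.google.com/scholar?" + urlencode(params)

# "Trademark", 2012-2015 -> four URLs, one per year
for url in build_urls("Trademark", 2012, 2015, "Federal"):
    print(url)
```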
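For items 3-5 and rules a)-c), a sketch using requests and BeautifulSoup. The h3.gs_rt selector is my reading of Scholar's result markup and should be checked against the attached HTML; Scholar is also likely to reject the default requests user agent, hence the header:

```python
import csv
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SAVE_CSV = True  # item 5: set False while testing/debugging
HEADERS = {"User-Agent": "Mozilla/5.0"}  # Scholar tends to block the default UA

def result_links(search_url):
    """Return absolute case links from one results page (item 4)."""
    soup = BeautifulSoup(requests.get(search_url, headers=HEADERS).text,
                         "html.parser")
    # 'h3.gs_rt a' is an assumption about the result-title markup.
    return [urljoin(search_url, a["href"]) for a in soup.select("h3.gs_rt a")]

def csv_name(case_title, page_text):
    """Auto-name the CSV from the case title, applying rule c)."""
    stem = re.sub(r"[^\w-]+", "_", case_title).strip("_")
    if "NOT TO BE PUBLISHED IN THE OFFICIAL REPORTS" in page_text:
        stem += "_NOT"  # rule c): flag unpublished opinions in the name
    return stem + ".csv"

def save_case(case_url):
    """Scrape one case page and write rules a) and b) to its own CSV."""
    soup = BeautifulSoup(requests.get(case_url, headers=HEADERS).text,
                         "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else "case"
    rows = [("url", case_url),  # a) exact URL of the case
            ("header", title)]  # b) placeholder; refine per page layout
    if SAVE_CSV:
        with open(csv_name(title, soup.get_text()), "w",
                  newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
```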
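For rule d), a sketch of the text extraction inside a DISCUSSION sub-section. Locating the section and its sub-titles depends on the page markup in the attached HTML (e.g. heading or bold tags); the regexes below are a starting point for aa) and bb), not a full legal-citation parser:

```python
import re

def parse_discussion(section_text):
    """Extract the aa) and bb) patterns from one DISCUSSION sub-section."""
    # bb) double-quoted text followed by a parenthetical citation
    quoted = re.findall(r'"([^"]+)"\s*(\([^)]*\))', section_text)
    # aa) a sentence immediately followed by a parenthetical citation
    cited = re.findall(r'([A-Z][^.!?]*[.!?])\s*(\([^)]*\))', section_text)
    return {"quoted_with_citation": quoted, "sentence_with_citation": cited}

# Hypothetical snippet to show what each rule captures
sample = ('The mark was distinctive. (Smith v. Jones, 2013.) The court held '
          'it was "likely to cause confusion" (15 U.S.C. § 1114).')
print(parse_discussion(sample))
```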
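For the database requirement, a sketch using MongoDB via pymongo; the database, collection, and field names are placeholders, and a MySQL version with mysql-connector-python would follow the same pattern:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # local Windows 10 instance
db = client["scholar_cases"]  # placeholder database name
cases = db["cases"]           # placeholder collection name

def store_case(record):
    """Upsert one scraped case, keyed by its exact URL (rule a)."""
    cases.update_one({"url": record["url"]}, {"$set": record}, upsert=True)

# Hypothetical record just to show the shape of the data
store_case({
    "url": "https://scholar.google.com/scholar_case?case=123",
    "header": "Example v. Example, 2013",
    "discussion": [],
})
```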
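Finally, an overall-flow sketch wiring the pieces above together, from the year stepping in item 2 through the per-case saving in items 4-5 (it assumes the functions from the earlier sketches):

```python
def run(topic, start_year, end_year, court_type):
    """One pass over every single-year URL and every case it returns."""
    for search_url in build_urls(topic, start_year, end_year, court_type):
        for case_url in result_links(search_url):
            save_case(case_url)  # one CSV per case (items 4-5)
            # parse_discussion(...) and store_case(...) would hook in here

run("Trademark", 2012, 2015, "Federal")
```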