This work is for an adult movie review website. If you are not comfortable with that (or are under 18), please don't bid, and don't go to our site or click on any link in this posting.
This project is to complete and tune a page scrape, a data import, and a little data manipulation. I have described what we are looking for below and summarized it in the numbered items.
Sending the UPC from our form to the scrape URL no longer works (sending the title to another URL still works, for comparison). If we build the URL manually, with the UPC appended at the end the way the PHP code builds it, the page does load in the browser.
The product ID from this scrape is imported without its first several letters. For example, UPC 657447006982 saves "WSENS785" as the title_id, while the full ID on the site is "DVDNEWSENS785". This is a consistent problem: 5 characters are always dropped. Many records have been imported incorrectly, so we need to reconstruct the ID from the product type (they all start with "DVD") and the movie studio ("NEWSENSATION" in this example), adding the first two characters of the studio name. Another example: "DVDGREEDY16" saves as "EEDY16". In our DB the company name is in one field and the title_id in another, so prepend "DVD" and the first two characters of the company field to the title_id field.
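A minimal sketch of that reconstruction, written in Python for illustration only (the live code is PHP). The function name and the company value "Greedy" are assumptions made for the example, and the update on #3 at the end of this posting notes that the company field does not always match the ID, so treat this as a cross-check rather than the fix.

```python
def rebuild_title_id(saved_id: str, company: str, prefix: str = "DVD") -> str:
    """Prepend the product prefix and the first two characters of the company."""
    # Leave IDs alone if they already carry the prefix (assumed to be complete).
    if saved_id.startswith(prefix):
        return saved_id
    # Ignore spaces and punctuation so "New Sensation" would still give "NE".
    letters = "".join(ch for ch in company.upper() if ch.isalnum())
    return prefix + letters[:2] + saved_id


print(rebuild_title_id("WSENS785", "NEWSENSATION"))  # DVDNEWSENS785
print(rebuild_title_id("EEDY16", "Greedy"))          # DVDGREEDY16
```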
Also, for some 2-disc set titles, we need the parentheses stripped before import. For example, when the data from the scrape contains "(2-disc set)", the routine strips out "2-disc set" but leaves the empty parentheses "()".
Before the scraped data is brought into the HTML fields for editing, we have a preview function. Clicking on an item to bring its data into the fields no longer works. One other feed still works, so you can use that as a comparison for what we are looking for.
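One way to handle this, sketched in Python for illustration only (the live code is PHP): remove the whole parenthesized note in a single pass, and also clean up any empty "()" that is already there. The function name and the sample title are hypothetical.

```python
import re

# Matches a note such as "(2-disc set)" together with the space before it.
_DISC_NOTE = re.compile(r"\s*\(\s*\d+-disc set\s*\)", re.IGNORECASE)
# Matches empty parentheses left behind by the current routine.
_EMPTY_PARENS = re.compile(r"\s*\(\s*\)")


def strip_disc_note(title: str) -> str:
    """Remove the disc-set note and any leftover empty parentheses."""
    cleaned = _DISC_NOTE.sub("", title)
    cleaned = _EMPTY_PARENS.sub("", cleaned)
    return cleaned.strip()


print(strip_disc_note("Example Title (2-disc set)"))  # Example Title
print(strip_disc_note("Example Title ()"))            # Example Title
```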
1. Fix the function that sends the UPC to the URL for the scrape.
2. Fix scrape to bring in full title_id.
3. Reconstruct title_id from "DVD" and the company name from DB.
4. Fix scrape routine to remove parentheses.
5. Fix field population from data preview.
Data Import & Scrape
We need an import fixed so that it recognizes a different delimited file format (attached). We can get this file delimited by tab, comma, pipe, or semicolon. The problem is that the data goes into the wrong fields, so the bulk of the work is already done. The import is triggered by placing the file manually on our FTP server.
6. Fix import to map data to the right fields in database.
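A rough sketch of one approach to item 6, in Python for illustration only (the live code is PHP): try each of the four delimiters on the header row, keep the one that produces the most columns, then map columns to database fields by header name instead of by position. It assumes the file has a header row; the header names and field names below are hypothetical, since the attached file's layout is not shown here.

```python
import csv
import io

CANDIDATE_DELIMITERS = ["\t", ",", "|", ";"]

# Hypothetical mapping: file header name -> database field name.
COLUMN_MAP = {"UPC": "upc", "Title": "title", "Studio": "company"}


def detect_delimiter(header_line: str) -> str:
    """Pick the candidate delimiter that splits the header into the most columns."""
    def column_count(delimiter: str) -> int:
        return len(next(csv.reader([header_line], delimiter=delimiter)))
    return max(CANDIDATE_DELIMITERS, key=column_count)


def read_rows(text: str, column_map: dict) -> list:
    """Parse delimited text and return one dict per row, keyed by database field."""
    if not text.strip():
        return []
    delimiter = detect_delimiter(text.splitlines()[0])
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        {db_field: row[source] for source, db_field in column_map.items()}
        for row in reader
    ]


sample = "Title|Studio|UPC\nExample Title|NEWSENSATION|657447006982\n"
print(read_rows(sample, COLUMN_MAP))
```

Mapping by header name means a change in column order or delimiter between exports should not push values into the wrong fields, as long as the header names stay the same.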
Auto Search Function
Problem: the auto-search feature is triggered after we submit the data from a scraped search to the database. After we submit the data for a scraped title, an auto-search starts for another title that is in the DB. The auto-search should only happen after we load a title from the DB, add data to that record, and re-save it to the DB.
7. Stop the auto-search from running after data for a new record is submitted.
8. Another data import: change the URL the data is retrieved from (we have an old one in place now).
In short, this should be straightforward; the biggest task is just figuring out how this works. If you are interested and need more info, just ask and we can provide the URLs for our interface as well as the URL for the scrape, etc.
***I am willing to pay more to automate #6. Right now we have to go to a web site interface, choose the date range and category of the information we wish to have, and save the file to our FTP server for import.
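The rule we want for item 7 can be sketched as a single check, shown in Python for illustration only (the live code is PHP). The names are hypothetical; the point is that the save handler needs to know where the record came from before deciding whether to start the auto-search.

```python
from enum import Enum


class SaveSource(Enum):
    SCRAPE = "scrape"      # new record submitted from scraped search results
    EXISTING = "existing"  # record loaded from the DB, edited, and re-saved


def should_auto_search(source: SaveSource) -> bool:
    """Auto-search runs only after an existing DB record is re-saved."""
    return source is SaveSource.EXISTING


print(should_auto_search(SaveSource.SCRAPE))    # False
print(should_auto_search(SaveSource.EXISTING))  # True
```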
This screenshot shows how the UPC search does not yield results from our site search.
This screenshot shows that the UPC search that we use actually does return content that can be scraped. Circled is the data that we need to fully capture for the "title_id" value.
This screenshot shows results from the same site based on a Title search. Clicking on the result that we want does not populate the fields on the left as it should.
N.B.- we have code on a test page that works, so you should be able to combine that with our live code.
This screenshot was taken on our test page (which allows us to populate the fields on the left by clicking on the search results on the right). It shows the parentheses problem.
For #3: we have found that the title_id can't be rebuilt from the company name, because the company name does not consistently appear in the title_id (sometimes an alternate company name is used). We need a one-time scrape routine that takes the UPC code from our DB for each record, goes to the site, and scrapes just the title_id. You could run this on your own and give us the updated records for us to manually import back into the database.
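A minimal sketch of such a one-time routine, in Python for illustration only. Everything here is an assumption: the page fetch is passed in as a function because the scrape URL is not given in this posting, and the pattern only relies on the IDs starting with "DVD", so it would need adjusting to the real page markup. The output is a CSV of UPC and title_id that could be imported manually, plus a list of UPCs with no ID found.

```python
import csv
import io
import re
from typing import Callable, Iterable, List, Optional, TextIO

# Placeholder pattern: "DVD" followed by capital letters or digits.
TITLE_ID = re.compile(r"\bDVD[A-Z0-9]+\b")


def extract_title_id(html: str) -> Optional[str]:
    """Return the first ID-looking token on the page, or None if absent."""
    match = TITLE_ID.search(html)
    return match.group(0) if match else None


def rescrape_title_ids(
    upcs: Iterable[str],
    fetch_page: Callable[[str], str],  # hypothetical: returns the page HTML for a UPC
    out: TextIO,
) -> List[str]:
    """Write upc,title_id rows to `out`; return the UPCs with no ID found."""
    writer = csv.writer(out)
    writer.writerow(["upc", "title_id"])
    missing = []
    for upc in upcs:
        title_id = extract_title_id(fetch_page(upc))
        if title_id is None:
            missing.append(upc)
            continue
        writer.writerow([upc, title_id])
    return missing


# Usage with a stand-in fetcher instead of a real HTTP request.
fake_pages = {"657447006982": "<td>Item: DVDNEWSENS785</td>"}
buffer = io.StringIO()
not_found = rescrape_title_ids(
    ["657447006982", "000000000000"],
    lambda upc: fake_pages.get(upc, "<p>No results</p>"),
    buffer,
)
print(buffer.getvalue())
print(not_found)
```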