We need a program that runs on a Linux (openSUSE) server and collects a given set of XML sources. The database connection and all other settings or values should be stored in an INI or XML file.
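As an illustration of the settings file, here is a minimal sketch that reads INI-style `key=value` lines with `java.util.Properties`. The class name `FeedConfig` and the key names (`db.url`, `db.user`, `db.password`, `fetch.parallelism`) are assumptions made for this example; the brief does not prescribe them.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical settings holder; all key names are assumptions, not part of the brief.
final class FeedConfig {
    final String dbUrl;
    final String dbUser;
    final String dbPassword;
    final int parallelism; // the "x" simultaneous downloads

    private FeedConfig(Properties p) {
        this.dbUrl = require(p, "db.url");
        this.dbUser = require(p, "db.user");
        this.dbPassword = p.getProperty("db.password", "");
        this.parallelism = Integer.parseInt(p.getProperty("fetch.parallelism", "8").trim());
        if (this.parallelism < 1) {
            throw new IllegalArgumentException("fetch.parallelism must be >= 1");
        }
    }

    // Reads key=value lines; lines starting with '#' or '!' are comments.
    static FeedConfig load(Reader reader) throws IOException {
        Properties p = new Properties();
        p.load(reader);
        return new FeedConfig(p);
    }

    private static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing setting: " + key);
        }
        return value.trim();
    }
}
```

A file for this sketch could contain lines such as `db.url=jdbc:mysql://localhost/feeds` (a placeholder URL) and `fetch.parallelism=4`, and be loaded with `FeedConfig.load(new FileReader(path))`.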
1. Get all RSS URLs from the database (each row contains rss_id, url, category_id and some other fields; the table layout will be provided).
2. Iterate asynchronously through all entries and retrieve their contents. The program should fetch x feeds simultaneously, where x is set in the config.
3. Parse contents:
This should be done in 3 steps:
First: Try to parse the RSS/Atom format (all versions and all the different fields, e.g. content|description|summary or pubDate, updated, dc:date, pwd:timestamp etc.).
Second: If this fails, the script should try to parse the elements by RegExp or string matching.
Third: Write all entries of the current RSS URL to the database (into one temp table and another table) together with some information about the media item (e.g. category).
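The first two parsing steps can be sketched roughly as follows: try a real XML parser, and only if the document is not well-formed fall back to regular expressions. This is a simplified illustration, not the algorithm of the existing PHP script. The class `FeedParser`, the `Entry` type and the exact tag lists are assumptions, and the database write (third step) is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class FeedParser {
    /** Hypothetical value type; the real fields depend on the table layout. */
    static final class Entry {
        final String title, body, date;
        Entry(String title, String body, String date) {
            this.title = title; this.body = body; this.date = date;
        }
    }

    // Candidate element names, tried in order; extend the lists as needed.
    private static final String[] BODY = {"content", "description", "summary"};
    private static final String[] DATE = {"pubDate", "updated", "dc:date"};
    private static final Pattern ITEM = Pattern.compile(
            "<(item|entry)(?:\\s[^>]*)?>(.*?)</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Step one, then step two if the XML parser rejects the document. */
    static List<Entry> parse(String xml) {
        try {
            return parseStrict(xml);
        } catch (Exception notWellFormed) {
            return parseLoose(xml);
        }
    }

    static List<Entry> parseStrict(String xml) throws Exception {
        // Not namespace-aware, so "dc:date" is matched by its qualified name.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Limits entity expansion; further hardening is omitted in this sketch.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new DefaultHandler()); // throw instead of printing to stderr
        Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        NodeList items = root.getElementsByTagName("item");                     // RSS
        if (items.getLength() == 0) items = root.getElementsByTagName("entry"); // Atom
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element e = (Element) items.item(i);
            out.add(new Entry(domText(e, "title"), domText(e, BODY), domText(e, DATE)));
        }
        return out;
    }

    static List<Entry> parseLoose(String xml) {
        List<Entry> out = new ArrayList<>();
        Matcher m = ITEM.matcher(xml);
        while (m.find()) {
            String block = m.group(2);
            out.add(new Entry(looseText(block, "title"), looseText(block, BODY), looseText(block, DATE)));
        }
        return out;
    }

    // Text of the first descendant element matching any of the names, or null.
    private static String domText(Element parent, String... names) {
        for (String name : names) {
            NodeList hits = parent.getElementsByTagName(name);
            if (hits.getLength() > 0) return hits.item(0).getTextContent().trim();
        }
        return null;
    }

    // Regex equivalent of domText; also unwraps a surrounding CDATA section.
    private static String looseText(String block, String... names) {
        for (String name : names) {
            String q = Pattern.quote(name);
            Matcher m = Pattern.compile("<" + q + "(?:\\s[^>]*)?>(.*?)</" + q + "\\s*>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE).matcher(block);
            if (m.find()) {
                return m.group(1).replaceAll("(?s)^\\s*<!\\[CDATA\\[(.*)\\]\\]>\\s*$", "$1").trim();
            }
        }
        return null;
    }
}
```

Calling `FeedParser.parse(xml)` returns one `Entry` per item. Fields that could not be found are `null`, so the caller can decide whether such an entry is still worth writing to the database.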
We have a PHP script that already implements all of these features; it can be provided as a detailed reference for what the program should do.
IMPORTANT: The script must handle all encodings 100% correctly! Sometimes an XML document declares "utf-8" but its contents are encoded in ISO..., or only one item is UTF-8 and the others are in different encodings. This has to be absolutely safe!
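One common heuristic for mixed encodings is to ignore the declared charset and decode each chunk of bytes (for example, each item) strictly as UTF-8, falling back to a single-byte charset only when the bytes are not valid UTF-8. The sketch below shows that idea. The class name `LenientDecoder` and the choice of ISO-8859-1 as the fallback are assumptions, and this heuristic on its own cannot tell different single-byte legacy encodings apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientDecoder {
    // Assumption: legacy content is ISO-8859-1, which accepts every byte value.
    private static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise as FALLBACK. */
    static String decode(byte[] bytes) {
        try {
            return strict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException notUtf8) {
            return new String(bytes, FALLBACK);
        }
    }

    /** Strict decoding: fails on invalid input instead of inserting U+FFFD. */
    static String strict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

To cover the case where only one item differs from the rest, `decode` would be applied to the raw bytes of each item separately rather than to the whole document at once.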
- 200 RSS-Items for testing
- PHP Script with the current algorithm
- All source code and files
- Runnable, bug-free Java program (or C++) for Linux
- Short instructions on how to use it
The program should be as performant as possible, so please keep this in mind during development.
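Much of the achievable throughput will likely come from downloading feeds in parallel, as required in step 2. A minimal sketch of bounded parallelism with a fixed-size thread pool is shown below. The class `BoundedFetcher` and the `fetch` function (which would perform the actual HTTP request) are hypothetical placeholders.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class BoundedFetcher {
    /**
     * Runs fetch for every URL with at most `parallelism` calls in flight.
     * Returns url -> body; a URL whose fetch threw maps to null.
     */
    static Map<String, String> fetchAll(List<String> urls, int parallelism,
                                        Function<String, String> fetch) {
        // The pool size is the "x" from the config file.
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                pending.put(url, CompletableFuture
                        .supplyAsync(() -> fetch.apply(url), pool)
                        .exceptionally(error -> null)); // one bad feed must not abort the run
            }
            Map<String, String> results = new LinkedHashMap<>();
            pending.forEach((url, future) -> results.put(url, future.join()));
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A fixed pool keeps the design simple and caps the number of open connections at x. In a real run, each result would be handed to the parser as soon as it completes instead of waiting for all downloads to finish.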