Design a pipeline to ingest data into an operational data store. The design should account for monitoring and logging, auditing for completeness (source records should not be dropped), and the ability to configure and replicate the pipeline for different sources with minimal changes. Implement a specific part of the pipeline using Apache Airflow and/or Spark.
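As a minimal sketch of the "replicate for different sources with minimal changes" requirement, one possible approach is a configuration-driven pipeline factory. All names, fields, and the URL below are illustrative assumptions, not part of the task:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class SourceConfig:
    """Per-source settings; onboarding a new source means adding one config."""
    name: str
    url: str            # where to fetch the raw dump from (placeholder)
    file_format: str    # e.g. "json", "csv"
    target_table: str   # destination table in the operational data store

def build_pipeline(cfg: SourceConfig) -> List[str]:
    """Return the ordered task names for one source's pipeline.

    In a real Airflow deployment these would become DAG tasks; here we
    only sketch the shared structure that every source reuses.
    """
    return [
        f"extract_{cfg.name}",
        f"validate_{cfg.name}",   # completeness audit: no dropped records
        f"load_{cfg.name}_into_{cfg.target_table}",
    ]

# Hypothetical registry of sources; TMDb is just the demo case.
SOURCES: Dict[str, SourceConfig] = {
    "tmdb_movies": SourceConfig(
        name="tmdb_movies",
        url="https://example.invalid/tmdb/movie_dump.json.gz",  # placeholder
        file_format="json",
        target_table="ods_movies",
    ),
}

if __name__ == "__main__":
    for cfg in SOURCES.values():
        print(build_pipeline(cfg))
```

Because the task graph is derived entirely from `SourceConfig`, replicating the pipeline for a new source is a one-object change rather than a code fork.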
Document the critical choices and decisions, preferably using Git.
Data for this procedure:
TMDB - The Movie Database (TMDb) is a community-built movie and TV database. It can be used to demonstrate the design and implementation.
[login to view URL]
File dumps - [login to view URL]
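TMDb's daily file dumps are, to my knowledge, gzipped newline-delimited JSON (one record per line). A hedged sketch of parsing such a dump follows; the record shape (`id`, `original_title`) is an assumption, and the actual export URL stays behind the placeholder above:

```python
import gzip
import io
import json
from typing import Iterator, List, Tuple

def iter_records(raw: bytes) -> Iterator[dict]:
    """Yield one dict per line from a gzipped newline-delimited JSON dump."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

def count_and_collect_ids(raw: bytes) -> Tuple[int, List[int]]:
    """Return (record count, ids); the count feeds the completeness audit."""
    ids = [rec["id"] for rec in iter_records(raw)]
    return len(ids), ids

if __name__ == "__main__":
    # Build a tiny fake dump in memory (assumed TMDb-like shape).
    lines = b'{"id": 1, "original_title": "A"}\n{"id": 2, "original_title": "B"}\n'
    dump = gzip.compress(lines)
    n, ids = count_and_collect_ids(dump)
    print(n, ids)  # -> 2 [1, 2]
```

Streaming line by line keeps memory flat even for large dumps, and counting during extraction gives the audit step a source-side record count for free.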
Design and standards:
1. I am looking for a pipeline implementation that works on either a standard or a high-availability setup. Given that requirement, design a solution that ingests data with integrity when run on a tuned production setup.
2. Use Python 3 (latest) as the programming language, together with any frameworks or libraries of your choice.
3. You may use Docker (Dockerfile, docker-compose) to set up the environment; if you do, add those files to the repository.
4. You are free to choose any flavor of Git workflow, ideally one that a team can extend as well.
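The completeness requirement above (source records should not be dropped) is commonly enforced by an audit step that compares the source-side record count against the loaded count. A minimal sketch, with hypothetical names:

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest_audit")

@dataclass
class AuditResult:
    source_count: int
    loaded_count: int

    @property
    def complete(self) -> bool:
        return self.source_count == self.loaded_count

def audit_completeness(source_count: int, loaded_count: int) -> AuditResult:
    """Fail loudly if the load dropped records; log the outcome either way.

    In an Airflow deployment this would run as the final task of each DAG,
    so a failed audit fails the run and surfaces through monitoring/alerting.
    """
    result = AuditResult(source_count, loaded_count)
    if not result.complete:
        log.error("Dropped %d records (source=%d, loaded=%d)",
                  source_count - loaded_count, source_count, loaded_count)
        raise ValueError("Completeness audit failed: records were dropped")
    log.info("Audit passed: %d records ingested", loaded_count)
    return result
```

Running the audit as the terminal task makes monitoring straightforward: a partial load shows up as a failed pipeline run rather than a silently smaller table.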
Please email me your solution, containing a summary of the design and the implementation code (i.e. DDLs, Dockerfiles, pipeline implementation). Ideally, submit a link to a public Git repository with all docs and code described by a README.