Looking to build a pipeline that retrieves data from a source via API call, transforms it, and loads it into BigQuery on GCP.
What we are doing:
Building a web app that combines features (i.e., data fields) from many different sources for many different records. There are over 600 records, and each record has tens of data fields drawn from various sources (e.g., US Census, FBI crime data, NOAA weather). The data will be combined into a single, rank-ordered list based on user input: the user fills out a questionnaire, and based on those answers the static data for each record is scored and combined into a single number.
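To illustrate the scoring described above, here is a minimal sketch of one possible approach: a weighted sum of pre-normalized features, with weights derived from the questionnaire. The feature names, weights, and normalization are placeholders, not part of the actual spec.

```python
# Hypothetical scoring sketch: combine normalized features into a single
# number per record using questionnaire-derived weights.
def score_record(features, weights):
    """features and weights are dicts keyed by feature name;
    feature values are assumed pre-normalized to [0, 1]."""
    total_weight = sum(weights.values())
    return sum(features[name] * w for name, w in weights.items()) / total_weight

records = {
    "record_a": {"crime_rate": 0.2, "median_income": 0.8},
    "record_b": {"crime_rate": 0.6, "median_income": 0.5},
}
weights = {"crime_rate": 0.7, "median_income": 0.3}  # placeholder weights

# Rank-ordered list, highest score first
ranked = sorted(records, key=lambda r: score_record(records[r], weights), reverse=True)
```

The real weighting scheme would come from the questionnaire design; this only shows the shape of the computation.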
What we need from you:
- Write the code to retrieve data from the source via API call for each record
- Adjust the data (i.e., clean, impute missing values, and transform)
- Load the data into BigQuery
This pipeline will be run once per year; however, each run will need to pull a lot of data (5-15 features for more than 600 records).
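The three steps above could be sketched roughly as follows. The API URL, response shape, and BigQuery table ID are placeholders; the real source is not specified in this brief, and the load step assumes the `google-cloud-bigquery` client library with GCP credentials already configured.

```python
import json
import statistics
from urllib import request

API_URL = "https://api.example.com/records"  # placeholder endpoint

def extract(url):
    """Pull raw records from the source API (assumes a JSON array response)."""
    with request.urlopen(url) as resp:
        return json.loads(resp.read())

def transform(rows, feature):
    """Clean and impute one feature: keep numeric values,
    fill missing/malformed ones with the median of the rest."""
    values = [r[feature] for r in rows if isinstance(r.get(feature), (int, float))]
    fill = statistics.median(values) if values else None
    out = []
    for r in rows:
        row = dict(r)
        v = row.get(feature)
        row[feature] = v if isinstance(v, (int, float)) else fill
        out.append(row)
    return out

def load(rows, table_id):
    """Load cleaned rows into BigQuery; table_id is 'project.dataset.table'."""
    from google.cloud import bigquery
    client = bigquery.Client()
    job = client.load_table_from_json(rows, table_id)
    job.result()  # block until the load job finishes

# Demo of the transform step on a small in-memory sample:
sample = [{"id": 1, "pop": 100}, {"id": 2, "pop": None}, {"id": 3, "pop": 300}]
cleaned = transform(sample, "pop")
```

The extract and load functions are only defined here, not invoked, since they depend on a real endpoint and GCP project; an annual run could be scheduled with Cloud Scheduler triggering a Cloud Function or Cloud Run job.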
This project is to build a pipeline for a single source. Based on the quality delivered, additional pipelines for other sources may be contracted.
The pipeline needs to be written in Python and must be deployed on GCP.