Hey,
I'm a Big Data and Hadoop developer. Below are my areas of expertise and past project details:
EDL – Common Component
Developed a framework that provides many utilities which can be used to carry out project-specific needs. For example, an AWS S3 to HDFS copy can be done using one such utility without writing a single line of code afterwards. Deployed many such utilities as part of setting up the backbone of EDL, which was used as a platform to host multiple projects.
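As an illustration, here is a minimal sketch of how such a config-driven S3-to-HDFS copy utility could work; the JSON keys and file names are purely illustrative placeholders, not the actual framework's:

import json
import subprocess

def s3_to_hdfs_copy(config_path):
    # Read the job definition from a JSON config (hypothetical keys below)
    with open(config_path) as f:
        cfg = json.load(f)
    # distcp can read s3a:// sources directly when AWS credentials are configured
    cmd = [
        "hadoop", "distcp",
        "s3a://{}/{}".format(cfg["s3_bucket"], cfg["s3_prefix"]),
        cfg["hdfs_target_dir"],
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    s3_to_hdfs_copy("copy_job.json")

The whole copy is then driven by editing the JSON file, which is what lets project teams reuse the utility without writing any code.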
Technologies/Frameworks/Languages used
1. Hadoop Ecosystem – Hive, Impala, Apache Kafka, Spark, Python
2. Logging – Log4j, Logstash, Kibana, ELK
3. Configuration files – JSON
Team Size: 6
EDL – Real World Data
Large data files containing patient data were received from different vendors and loaded into the data lake. Multiple transformations and business rules were applied as part of an automated process developed for this. All the parameters/properties were highly configurable. Provided functionalities such as creating a copy of the data, modifying it as required, and sharing it among colleagues.
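A rough sketch of how one configurable load-and-transform step could be driven from a properties file; every key, path, and script name below is a placeholder, and it assumes a Hive CLI based pipeline rather than the project's actual components:

import json
import subprocess

def load_and_transform(props_path):
    with open(props_path) as f:
        props = json.load(f)

    # 1. Land the raw vendor file in HDFS
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", props["local_file"], props["raw_dir"]],
        check=True,
    )

    # 2. Apply the business rules via a parameterized Hive script
    hive_cmd = ["hive"]
    for key, value in props["hivevars"].items():
        hive_cmd += ["--hivevar", "{}={}".format(key, value)]
    hive_cmd += ["-f", props["transform_script"]]
    subprocess.run(hive_cmd, check=True)

if __name__ == "__main__":
    load_and_transform("vendor_load.json")

Keeping every table name, date range, and rule threshold in the properties file is what made the pipeline highly configurable.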
Technologies/Frameworks/Languages used
1. Hadoop Ecosystem – Hive, Impala, MapReduce, DistCp, s3cmd, Oozie, CDH, HDFS, Hue, Kerberos, Python, Java, AWS, Redshift
2. Logging – Log4j, Logstash, Kibana, ELK
Team Size: 6
EDL – Process Development
Automated jobs running daily would Sqoop the data from Oracle source tables and dump it into EDL. These jobs would pick up only the incremental data, which then undergoes complex joins as per the required business logic.
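A hedged sketch of what one daily incremental pull might look like when wrapped in Python; the connection string, table, column, and directory names are placeholders, and the real jobs had their own scheduling and credential handling:

import subprocess

def incremental_sqoop_import(last_value):
    # Import only rows changed since last_value, merging updates on the key column
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//oracle-host:1521/ORCL",
        "--username", "edl_user",
        "--password-file", "/user/edl/.oracle.pw",
        "--table", "SALES.ORDERS",
        "--incremental", "lastmodified",
        "--check-column", "LAST_UPDATED",
        "--last-value", last_value,
        "--target-dir", "/edl/raw/sales_orders",
        "--merge-key", "ORDER_ID",
        "-m", "4",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    incremental_sqoop_import("2019-01-01 00:00:00")

The complex joins would then run downstream on the imported increment, for example as Hive queries encoding the business logic.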