The Situation: We import 20 distinct data sources (JSON via FTP, scrapers, API calls, mobile apps, bulk CSV data, user web inputs, et al) into our relational database, on various schedules. This is for a single US county. Each data source has its own ETL data flow implemented with varying levels of cron, celery, custom Python and Django’s ORM.
The data is used for our nascent machine learning initiative, internal business insights and customer acquisition, plus an external, multi-tenant SaaS frontend.
This is all working just fine, for now…
Impetus for Change: We are expanding to five markets instead of one. It will be a data nightmare if I don’t make some sense of this situation.
- Sane management of ETL data flows (priority)
- Stick with Python as much as possible
- Retain and analyze raw/cleaned data in perpetuity
- Maximize scalability
- Minimize low-level complexity
- I don’t do Java much. At all.
- I’m not a particularly skilled sysadmin.
- We don’t have $100k to build this. It’s just me.
Proposed Solution: This is where I need feedback.
Hadoop, Kafka, Airflow, Avro
- Bulk data (csv, etc) goes directly to HDFS, cleaned up, posted to Kafka topic as Avro (potentially using Confluence Schema Register).
- Streaming-type data posts directly to Kafka (scrapers? mobile app data? user clicks?).
- Streaming-type Kafka topics export to HDFS for batch processing later.
- Airflow manages all the ETL data flows.
- Data warehousing via Airflow to PostgreSQL from Kafka.
What are your thoughts on my thoughts??