I need help reasoning about what technologies/processes to use in the next version of my little company’s digital infrastructure. For a look at what I’m proposing, jump to the bottom.

The Situation: We import 20 distinct sources (JSON via FTP, scrapers, API calls, mobile, bulk CSV data, user web inputs, etc.) into our relational database on various schedules. This is for a single US county. Each data source has its own ETL data flow, implemented with varying mixes of cron, Celery, custom Python, and Django’s ORM.

The data is used for our nascent machine learning initiative, internal business insights and customer acquisition, plus an external, multi-tenant SaaS frontend.

This is all working just fine, for now…

Impetus: We are expanding to five markets instead of one. It will be a data nightmare if I don’t make some sense of this situation.
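To make the scale of the problem concrete, here is a rough sketch of describing every flow as data instead of as a one-off script, so that adding a market only means adding one entry. The source names, market names, and cron schedules are all made up for illustration; only the "20 sources, five markets" shape comes from the situation above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Source:
    name: str      # hypothetical source identifier
    kind: str      # "bulk" or "stream"
    schedule: str  # cron expression; empty for always-on streams


# Hypothetical examples; the real list would have about 20 entries.
SOURCES = [
    Source("parcel_csv", "bulk", "0 3 * * *"),
    Source("listing_scraper", "stream", ""),
    Source("vendor_api", "bulk", "*/30 * * * *"),
]

# Placeholder market names (assumption: one entry per county/market).
MARKETS = ["market_a", "market_b", "market_c", "market_d", "market_e"]


def flow_id(source: Source, market: str) -> str:
    """Build a stable identifier for one (market, source) flow."""
    return f"{market}.{source.name}"


def build_flows(sources, markets):
    """Return every (source, market) combination keyed by its flow id."""
    return {flow_id(s, m): (s, m) for s, m in product(sources, markets)}


flows = build_flows(SOURCES, MARKETS)
print(len(flows))  # 3 sources x 5 markets = 15 flows
```

With the full list of 20 sources this would enumerate 100 flows, which is the number any scheduler would have to track.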


Goals:

  1. Sane management of ETL data flows (priority)
  2. Stick with Python as much as possible
  3. Retain and analyze raw/cleaned data in perpetuity
  4. Maximize scalability
  5. Minimize low-level complexity
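For the goal of retaining raw and cleaned data in perpetuity, one possible convention is an append-only directory layout partitioned by stage, market, source, and date. This is only a sketch: the `/data` root, the `key=value` partition naming, and the example names and date are assumptions, not something already in place.

```python
from datetime import date
from pathlib import PurePosixPath

STAGES = ("raw", "cleaned")


def landing_path(stage: str, market: str, source: str, day: date,
                 root: str = "/data") -> PurePosixPath:
    """Return the partition directory for one day's data from one source."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return (
        PurePosixPath(root)
        / stage
        / f"market={market}"
        / f"source={source}"
        / f"dt={day.isoformat()}"
    )


# Hypothetical market/source names and an arbitrary date.
p = landing_path("raw", "market_a", "parcel_csv", date(2018, 2, 25))
print(p)  # /data/raw/market=market_a/source=parcel_csv/dt=2018-02-25
```

Keeping raw and cleaned data under separate roots means a cleaning bug can be fixed and the cleaned partition rebuilt without touching the original files.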


Constraints:

  1. I don’t do Java much. At all.
  2. I’m not a particularly skilled sysadmin.
  3. We don’t have $100k to build this. It’s just me.

Proposed Solution: This is where I need feedback.

Hadoop, Kafka, Airflow, Avro

  • Bulk data (CSV, etc.) goes directly to , is cleaned up, and is posted to a Kafka topic as Avro (potentially using Confluent Schema Registry).
  • Streaming-type data posts directly to Kafka (scrapers? mobile data? user clicks?).
  • Streaming-type Kafka topics export to HDFS for batch processing later.
  • Airflow manages all the ETL data flows.
  • Data warehousing via Airflow to PostgreSQL from Kafka.
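The flow in the bullets above can be sketched as a small dependency graph. This is not Airflow code: it uses the standard library's `graphlib` as a stand-in for the scheduler, and plain in-memory lists and dicts as stand-ins for the Kafka topic, HDFS, and PostgreSQL. The task names and sample rows are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract(ctx):
    # Stand-in for reading a bulk CSV drop; rows are made up.
    ctx["raw"] = [{"id": " 1 ", "price": "100"}, {"id": "2", "price": ""}]


def clean(ctx):
    # Drop rows with no price and coerce types.
    ctx["clean"] = [
        {"id": int(r["id"].strip()), "price": float(r["price"])}
        for r in ctx["raw"]
        if r["price"].strip()
    ]


def publish(ctx):
    # Stand-in for posting cleaned records to a Kafka topic.
    ctx["topic"] = list(ctx["clean"])


def archive(ctx):
    # Stand-in for exporting the topic to HDFS for later batch work.
    ctx["hdfs"] = list(ctx["topic"])


def load_warehouse(ctx):
    # Stand-in for loading the topic into PostgreSQL, keyed by id.
    ctx["warehouse"] = {r["id"]: r for r in ctx["topic"]}


TASKS = {f.__name__: f for f in (extract, clean, publish, archive, load_warehouse)}

# Each task maps to the set of tasks that must finish before it.
DEPENDS_ON = {
    "clean": {"extract"},
    "publish": {"clean"},
    "archive": {"publish"},
    "load_warehouse": {"publish"},
}


def run(ctx):
    """Run every task in an order that respects DEPENDS_ON."""
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](ctx)
    return order


ctx = {}
order = run(ctx)
print(order[:3])  # ['extract', 'clean', 'publish']
```

The point of the sketch is that archiving and warehousing both hang off the same published topic, so either can be rerun or replaced without touching ingestion.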

What are your thoughts on my thoughts??

Source: https://www.reddit.com/r/bigdata/comments/800zu5/preemptive__choices_kafka_hdfs/

