Big Data with Big Speed
Systems-level data technology is evolving rapidly. Unfortunately, this means the best platform for data science and big data is different every year. Honestly, this is the main reason I hate systems engineering. Many data scientists work in R and hand large-scale experiments off to data engineers. With the architecture changing so much, there doesn't appear to be much reason for an application-level software engineer to learn these systems just for the sake of learning them. Although Hadoop is still great for processing unstructured data, other tools such as Spark are maturing for processing structured data. Spark has the additional advantage of working mostly in memory, which often results in much faster queries.
Hadoop, Spark, and related technologies all share a similar goal: making it pleasant to work with large amounts of data. Their strengths are their scalability (petabyte-scale data), their resilience to hardware failures, and their ease of use with many different types of data. They only seem to be getting better, too: Apache Spark advertises up to a 100x performance gain over Hadoop MapReduce jobs when running in memory. Perhaps the only downside is that each technology is specialized for certain use cases. There isn't a jack-of-all-trades yet. For example, Amazon Redshift is meant for large amounts of structured data, whereas MapReduce-based platforms (e.g. Hadoop, Amazon EMR) are vastly superior with unstructured data. Of course, with the speed at which technology evolves in general, this could change any day.
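To make the "MapReduce suits unstructured data" point a bit more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern in plain Python. It is not Hadoop or EMR code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample log lines are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (word, 1) pairs for one line of raw, schema-less text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, roughly what a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input: free-form log lines with no fixed schema.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(logs))
```

Nothing here needs to know the shape of the data ahead of time, which is roughly why the pattern works well on unstructured input. A real framework would also distribute the map and reduce steps across many machines and recover from node failures, which this sketch leaves out entirely.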