What is Big Data/Hadoop?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions
According to IBM, 80% of data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.
Big Data Architect
Hadoop/HBase, Cassandra etc that can provide the technical direction towards re-architecting Bloomberg's Data Model and Access Platform. The vision is to architect a reliable and scalable data platform, provide standard interfaces to query and support analytics for our big security related data sets that is transparent, efficient and easy to access as possible by our varied applications.
Design, architect and build a data platform over Big Data Technologies
Leading innovation by exploring, investigating, recommending, benchmarking and implementing data centric technologies for the platform
Being the technical architect and point person for the data platform
Skills & Requirements
Abilities and Characteristics
Have a passion for Big Data technologies and a flexible, creative approach to problem solving.
Excellent problem solving and programming skills; proven technical leadership and communication skills
Have extensive experience with data implementations, data storage and distribution
Have made active contributions to open source projects like Apache Hadoop or Cassandra
Have a solid track record of building large scale systems utilizing Big Data Technologies
Excellent understanding of Big Data Analytics platforms and ETL in the context of Big Data
2+ years of hands-on experience with the Hadoop stack (MapReduce Programming Paradigm, HBase, Pig, Hive, Sqoop) and/or key-value store technologies such as Cassandra
2+ years of hands-on experience with administration, configuration management, monitoring, debugging, benchmarking and performance tuning of Hadoop/Cassandra
5+ years hands-on experience with open source software platforms and languages (e.g. Java, Linux, Apache, Perl/Python/PHP)
Previous experience with high-scale or distributed RDBMS (Teradata, Netezza, Greenplum, Aster Data, Vertica) a plus.
Knowledge of cloud computing infrastructure (e.g. Amazon Web Services EC2, Elastic MapReduce) and considerations for scalable, distributed systems a plus
What does Hadoop solve?
Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that that makes it suitable for data mining and subsequent analysis.
Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop enables a computing solution that is:
Scalable– New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
Cost effective– Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible– Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
Fault tolerant– When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.