Overview – This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.
Target Audience – Architects, software developers, analysts and data scientists who need to apply data science and machine learning on Hadoop.
Prerequisites – Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.
Course Objectives
Recognize use cases for data science on Hadoop
Describe the Hadoop and YARN architecture
Describe the differences between supervised and unsupervised learning
Use Mahout to run a machine learning algorithm on Hadoop
Describe the data science life cycle
Use Pig to transform and prepare data on Hadoop
Write a Python script
Describe options for running Python code on a Hadoop cluster
Write a Pig User-Defined Function in Python
Use Pig streaming on Hadoop with a Python script
Use machine learning algorithms
Describe use cases for Natural Language Processing (NLP)