Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills - Advanced Analytics with Spark [2015, PDF/ePub, ENG]

Omen ®

Post 21-Apr-2016 21:00


Advanced Analytics with Spark
Patterns for Learning from Data at Scale
Year of publication: 2015
Authors: Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
Publisher: O'Reilly Media
ISBN: 978-1-4919-1276-8, 978-1-4919-1270-6
Language: English
Format: PDF/ePub
Quality: Publisher's layout or text (eBook)
Interactive table of contents: Yes
Number of pages: 276
Description: In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.
You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.


Chapter 1. Analyzing Big Data
The Challenges of Data Science
Introducing Apache Spark
About This Book
Chapter 2. Introduction to Data Analysis with Scala and Spark
Scala for Data Scientists
The Spark Programming Model
Record Linkage
Getting Started: The Spark Shell and SparkContext
Bringing Data from the Cluster to the Client
Shipping Code from the Client to the Cluster
Structuring Data with Tuples and Case Classes
Creating Histograms
Summary Statistics for Continuous Variables
Creating Reusable Code for Computing Summary Statistics
Simple Variable Selection and Scoring
Where to Go from Here
Chapter 3. Recommending Music and the Audioscrobbler Data Set
Data Set
The Alternating Least Squares Recommender Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
Chapter 4. Predicting Forest Cover with Decision Trees
Fast Forward to Regression
Vectors and Features
Training Examples
Decision Trees and Forests
Covtype Data Set
Preparing the Data
A First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Decision Forests
Making Predictions
Where to Go from Here
Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
Anomaly Detection
K-means Clustering
Network Intrusion
KDD Cup 1999 Data Set
A First Take on Clustering
Choosing k
Visualization in R
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
The Term-Document Matrix
Getting the Data
Parsing and Preparing the Data
Computing the TF-IDFs
Singular Value Decomposition
Finding Important Concepts
Querying and Scoring with the Low-Dimensional Representation
Term-Term Relevance
Document-Document Relevance
Term-Document Relevance
Multiple-Term Queries
Where to Go from Here
Chapter 7. Analyzing Co-occurrence Networks with GraphX
The MEDLINE Citation Index: A Network Analysis
Getting the Data
Parsing XML Documents with Scala’s XML Library
Analyzing the MeSH Major Topics and Their Co-occurrences
Constructing a Co-occurrence Network with GraphX
Understanding the Structure of Networks
Filtering Out Noisy Edges
Small-World Networks
Where to Go from Here
Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
Getting the Data
Working with Temporal and Geospatial Data in Spark
Temporal Data with JodaTime and NScalaTime
Geospatial Data with the Esri Geometry API and Spray
Preparing the New York City Taxi Trip Data
Sessionization in Spark
Where to Go from Here
Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
Methods for Calculating VaR
Our Model
Getting the Data
Determining the Factor Weights
Running the Trials
Visualizing the Distribution of Returns
Evaluating Our Results
Where to Go from Here
Chapter 10. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Ingesting Genomics Data with the ADAM CLI
Predicting Transcription Factor Binding Sites from ENCODE Data
Querying Genotypes from the 1000 Genomes Project
Where to Go from Here
Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder
Overview of PySpark
Overview and Installation of the Thunder Library
Loading Data with Thunder
Categorizing Neuron Types with Thunder
Where to Go from Here
Appendix A. Deeper into Spark
Spark and the Data Scientist’s Workflow
File Formats
Spark Subprojects
Appendix B. Upcoming MLlib Pipelines API
Beyond Mere Modeling
The Pipelines API
Text Classification Example Walkthrough