Home > Computing and Information Technology > Business applications > Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood
46%
Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood

Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood

          
5
4
3
2
1

In Stock


Premium quality
Premium quality
Bookswagon upholds the quality by delivering untarnished books. Quality, services and satisfaction are everything for us!
Easy Return
Easy return
Not satisfied with this product! Keep it in original condition and packaging to avail easy return policy.
Certified product
Certified product
First impression is the last impression! Address the book’s certification page, ISBN, publisher’s name, copyright page and print quality.
Secure Checkout
Secure checkout
Security at its finest! Login, browse, purchase and pay, every step is safe and secured.
Money back guarantee
Money-back guarantee:
It’s all about customers! For any kind of bad experience with the product, get your actual amount back after returning the product.
On time delivery
On-time delivery
At your doorstep on time! Get this book delivered without any delay.
Add to Wishlist
X

About the Book

PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance. The authors discuss foundational components of large-scale data systems and walk readers through the major software design decisions that define performance, application type, and usability. You???ll learn how to recognize problems in your applications resulting in performance and distributed operation issues, diagnose them, and effectively eliminate them by relying on the bedrock big data principles explained within. Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system. Ideal for data scientists, data architects, dev-ops engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to: Identify the foundations of large-scale, distributed data processing systems Make major software design decisions that optimize performance Diagnose performance problems and distributed operation issues Understand state-of-the-art research in big data Explain and use the major big data frameworks and understand what underpins them Use big data analytics in the real world to solve practical problems

Table of Contents:
Introduction xxvii Chapter 1 Data Intensive Applications 1 Anatomy of a Data-Intensive Application 1 A Histogram Example 2 Program 2 Process Management 3 Communication 4 Execution 5 Data Structures 6 Putting It Together 6 Application 6 Resource Management 6 Messaging 7 Data Structures 7 Tasks and Execution 8 Fault Tolerance 8 Remote Execution 8 Parallel Applications 9 Serial Applications 9 Lloyd’s K-Means Algorithm 9 Parallelizing Algorithms 11 Decomposition 11 Task Assignment 12 Orchestration 12 Mapping 13 K-Means Algorithm 13 Parallel and Distributed Computing 15 Memory Abstractions 16 Shared Memory 16 Distributed Memory 18 Hybrid (Shared + Distributed) Memory 20 Partitioned Global Address Space Memory 21 Application Classes and Frameworks 22 Parallel Interaction Patterns 22 Pleasingly Parallel 23 Dataflow 23 Iterative 23 Irregular 23 Data Abstractions 24 Data-Intensive Frameworks 24 Components 24 Workflows 25 An Example 25 What Makes It Difficult? 26 Developing Applications 27 Concurrency 27 Data Partitioning 28 Debugging 28 Diverse Environments 28 Computer Networks 29 Synchronization 29 Thread Synchronization 29 Data Synchronization 30 Ordering of Events 31 Faults 31 Consensus 31 Summary 32 References 32 Chapter 2 Data and Storage 35 Storage Systems 35 Storage for Distributed Systems 36 Direct-Attached Storage 37 Storage Area Network 37 Network-Attached Storage 38 DAS or SAN or NAS? 38 Storage Abstractions 39 Block Storage 39 File Systems 40 Object Storage 41 Data Formats 41 XML 42 JSON 43 CSV 44 Apache Parquet 45 Apache Avro 47 Avro Data Definitions (Schema) 48 Code Generation 49 Without Code Generation 49 Avro File 49 Schema Evolution 49 Protocol Buffers, Flat Buffers, and Thrift 50 Data Replication 51 Synchronous and Asynchronous Replication 52 Single-Leader and Multileader Replication 52 Data Locality 53 Disadvantages of Replication 54 Data Partitioning 54 Vertical Partitioning 55 Horizontal Partitioning (Sharding) 55 Hybrid Partitioning 56 Considerations for Partitioning 57 NoSQL Databases 58 Data Models 58 Key-Value Databases 58 Document Databases 59 Wide Column Databases 59 Graph Databases 59 CAP Theorem 60 Message Queuing 61 Message Processing Guarantees 63 Durability of Messages 64 Acknowledgments 64 Storage First Brokers and Transient Brokers 65 Summary 66 References 66 Chapter 3 Computing Resources 69 A Demonstration 71 Computer Clusters 72 Anatomy of a Computer Cluster 73 Data Analytics in Clusters 74 Dedicated Clusters 76 Classic Parallel Systems 76 Big Data Systems 77 Shared Clusters 79 OpenMPI on a Slurm Cluster 79 Spark on a Yarn Cluster 80 Distributed Application Life Cycle 80 Life Cycle Steps 80 Step 1: Preparation of the Job Package 81 Step 2: Resource Acquisition 81 Step 3: Distributing the Application (Job) Artifacts 81 Step 4: Bootstrapping the Distributed Environment 82 Step 5: Monitoring 82 Step 6: Termination 83 Computing Resources 83 Data Centers 83 Physical Machines 85 Network 85 Virtual Machines 87 Containers 87 Processor, Random Access Memory, and Cache 88 Cache 89 Multiple Processors in a Computer 90 Nonuniform Memory Access 90 Uniform Memory Access 91 Hard Disk 92 GPUs 92 Mapping Resources to Applications 92 Cluster Resource Managers 93 Kubernetes 94 Kubernetes Architecture 94 Kubernetes Application Concepts 96 Data-Intensive Applications on Kubernetes 96 Slurm 98 Yarn 99 Job Scheduling 99 Scheduling Policy 101 Objective Functions 101 Throughput and Latency 101 Priorities 102 Lowering Distance Among the Processes 102 Data Locality 102 Completion Deadline 102 Algorithms 103 First in First Out 103 Gang Scheduling 103 List Scheduling 103 Backfill Scheduling 104 Summary 104 References 104 Chapter 4 Data Structures 107 Virtual Memory 108 Paging and TLB 109 Cache 111 The Need for Data Structures 112 Cache and Memory Layout 112 Memory Fragmentation 114 Data Transfer 115 Data Transfer Between Frameworks 115 Cross-Language Data Transfer 115 Object and Text Data 116 Serialization 116 Vectors and Matrices 117 1D Vectors 118 Matrices 118 Row-Major and Column-Major Formats 119 N-Dimensional Arrays/Tensors 122 NumPy 123 Memory Representation 125 K-means with NumPy 126 Sparse Matrices 127 Table 128 Table Formats 129 Column Data Format 129 Row Data Format 130 Apache Arrow 130 Arrow Data Format 131 Primitive Types 131 Variable-Length Data 132 Arrow Serialization 133 Arrow Example 133 Pandas DataFrame 134 Column vs. Row Tables 136 Summary 136 References 136 Chapter 5 Programming Models 139 Introduction 139 Parallel Programming Models 140 Parallel Process Interaction 140 Problem Decomposition 140 Data Structures 140 Data Structures and Operations 141 Data Types 141 Local Operations 143 Distributed Operations 143 Array 144 Tensor 145 Indexing 145 Slicing 146 Broadcasting 146 Table 146 Graph Data 148 Message Passing Model 150 Model 151 Message Passing Frameworks 151 Message Passing Interface 151 Bulk Synchronous Parallel 153 K-Means 154 Distributed Data Model 157 Eager Model 157 Dataflow Model 158 Data Frames, Datasets, and Tables 159 Input and Output 160 Task Graphs (Dataflow Graphs) 160 Model 161 User Program to Task Graph 161 Tasks and Functions 162 Source Task 162 Compute Task 163 Implicit vs. Explicit Parallel Models 163 Remote Execution 163 Components 164 Batch Dataflow 165 Data Abstractions 165 Table Abstraction 165 Matrix/Tensors 165 Functions 166 Source 166 Compute 167 Sink 168 An Example 168 Caching State 169 Evaluation Strategy 170 Lazy Evaluation 171 Eager Evaluation 171 Iterative Computations 172 DOALL Parallel 172 DOACROSS Parallel 172 Pipeline Parallel 173 Task Graph Models for Iterative Computations 173 K-Means Algorithm 174 Streaming Dataflow 176 Data Abstractions 177 Streams 177 Distributed Operations 178 Streaming Functions 178 Sources 178 Compute 179 Sink 179 An Example 179 Windowing 180 Windowing Strategies 181 Operations on Windows 182 Handling Late Events 182 SQL 182 Queries 183 Summary 184 References 184 Chapter 6 Messaging 187 Network Services 188 TCP/IP 188 RDMA 189 Messaging for Data Analytics 189 Anatomy of a Message 190 Data Packing 190 Protocol 191 Message Types 192 Control Messages 192 External Data Sources 192 Data Transfer Messages 192 Distributed Operations 194 How Are They Used? 194 Task Graph 194 Parallel Processes 195 Anatomy of a Distributed Operation 198 Data Abstractions 198 Distributed Operation API 198 Streaming and Batch Operations 199 Streaming Operations 199 Batch Operations 199 Distributed Operations on Arrays 200 Broadcast 200 Reduce and AllReduce 201 Gather and AllGather 202 Scatter 203 AllToAll 204 Optimized Operations 204 Broadcast 205 Reduce 206 AllReduce 206 Gather and AllGather Collective Algorithms 208 Scatter and AllToAll Collective Algorithms 208 Distributed Operations on Tables 209 Shuffle 209 Partitioning Data 211 Handling Large Data 212 Fetch-Based Algorithm (Asynchronous Algorithm) 213 Distributed Synchronization Algorithm 214 GroupBy 214 Aggregate 215 Join 216 Join Algorithms 219 Distributed Joins 221 Performance of Joins 223 More Operations 223 Advanced Topics 224 Data Packing 224 Memory Considerations 224 Message Coalescing 224 Compression 225 Stragglers 225 Nonblocking vs. Blocking Operations 225 Blocking Operations 226 Nonblocking Operations 226 Summary 227 References 227 Chapter 7 Parallel Tasks 229 CPUs 229 Cache 229 False Sharing 230 Vectorization 231 Threads and Processes 234 Concurrency and Parallelism 234 Context Switches and Scheduling 234 Mutual Exclusion 235 User-Level Threads 236 Process Affinity 236 NUMA-Aware Programming 237 Accelerators 237 Task Execution 238 Scheduling 240 Static Scheduling 240 Dynamic Scheduling 240 Loosely Synchronous and Asynchronous Execution 241 Loosely Synchronous Parallel System 242 Asynchronous Parallel System (Fully Distributed) 243 Actor Model 244 Actor 244 Asynchronous Messages 244 Actor Frameworks 245 Execution Models 245 Process Model 246 Thread Model 246 Remote Execution 246 Tasks for Data Analytics 248 SPMD and MPMD Execution 248 Batch Tasks 249 Data Partitions 249 Operations 251 Task Graph Scheduling 253 Threads, CPU Cores, and Partitions 254 Data Locality 255 Execution 257 Streaming Execution 257 State 257 Immutable Data 258 State in Driver 258 Distributed State 259 Streaming Tasks 259 Streams and Data Partitioning 260 Partitions 260 Operations 261 Scheduling 262 Uniform Resources 263 Resource-Aware Scheduling 264 Execution 264 Dynamic Scaling 264 Back Pressure (Flow Control) 265 Rate-Based Flow Control 266 Credit-Based Flow Control 266 State 267 Summary 268 References 268 Chapter 8 Case Studies 271 Apache Hadoop 271 Programming Model 272 Architecture 274 Cluster Resource Management 275 Apache Spark 275 Programming Model 275 RDD API 276 SQL, DataFrames, and DataSets 277 Architecture 278 Resource Managers 278 Task Schedulers 279 Executors 279 Communication Operations 280 Apache Spark Streaming 280 Apache Storm 282 Programming Model 282 Architecture 284 Cluster Resource Managers 285 Communication Operations 286 Kafka Streams 286 Programming Model 286 Architecture 287 PyTorch 288 Programming Model 288 Execution 292 Cylon 295 Programming Model 296 Architecture 296 Execution 297 Communication Operations 298 Rapids cuDF 298 Programming Model 298 Architecture 299 Summary 300 References 300 Chapter 9 Fault Tolerance 303 Dependable Systems and Failures 303 Fault Tolerance is Not Free 304 Dependable Systems 305 Failures 306 Process Failures 306 Network Failures 307 Node Failures 307 Byzantine Faults 307 Failure Models 308 Failure Detection 308 Recovering from Faults 309 Recovery Methods 310 Stateless Programs 310 Batch Systems 311 Streaming Systems 311 Processing Guarantees 311 Role of Cluster Resource Managers 312 Checkpointing 313 State 313 Consistent Global State 313 Uncoordinated Checkpointing 314 Coordinated Checkpointing 315 Chandy-Lamport Algorithm 315 Batch Systems 316 When to Checkpoint? 317 Snapshot Data 318 Streaming Systems 319 Case Study: Apache Storm 319 Message Tracking 320 Failure Recovery 321 Case Study: Apache Flink 321 Checkpointing 322 Failure Recovery 324 Batch Systems 324 Iterative Programs 324 Case Study: Apache Spark 325 RDD Recomputing 326 Checkpointing 326 Recovery from Failures 327 Summary 327 References 327 Chapter 10 Performance and Productivity 329 Performance Metrics 329 System Performance Metrics 330 Parallel Performance Metrics 330 Speedup 330 Strong Scaling 331 Weak Scaling 332 Parallel Efficiency 332 Amdahl’s Law 333 Gustafson’s Law 334 Throughput 334 Latency 335 Benchmarks 336 LINPACK Benchmark 336 NAS Parallel Benchmark 336 BigDataBench 336 TPC Benchmarks 337 HiBench 337 Performance Factors 337 Memory 337 Execution 338 Distributed Operators 338 Disk I/O 339 Garbage Collection 339 Finding Issues 342 Serial Programs 342 Profiling 342 Scaling 343 Strong Scaling 343 Weak Scaling 344 Debugging Distributed Applications 344 Programming Languages 345 C/C++ 346 Java 346 Memory Management 347 Data Structures 348 Interfacing with Python 348 Python 350 C/C++ Code integration 350 Productivity 351 Choice of Frameworks 351 Operating Environment 353 CPUs and GPUs 353 Public Clouds 355 Future of Data-Intensive Applications 358 Summary 358 References 359 Index 361


Best Sellers


Product Details
  • ISBN-13: 9781119713029
  • Publisher: John Wiley & Sons Inc
  • Publisher Imprint: John Wiley & Sons Inc
  • Height: 231 mm
  • No of Pages: 416
  • Spine Width: 25 mm
  • Weight: 698 gr
  • ISBN-10: 1119713021
  • Publisher Date: 15 Nov 2021
  • Binding: Paperback
  • Language: English
  • Returnable: N
  • Sub Title: Large Scale Data Analytics under the Hood
  • Width: 188 mm


Similar Products

How would you rate your experience shopping for books on Bookswagon?

Add Photo
Add Photo

Customer Reviews

REVIEWS           
Click Here To Be The First to Review this Product
Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood
John Wiley & Sons Inc -
Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood
Writing guidlines
We want to publish your review, so please:
  • keep your review on the product. Review's that defame author's character will be rejected.
  • Keep your review focused on the product.
  • Avoid writing about customer service. contact us instead if you have issue requiring immediate attention.
  • Refrain from mentioning competitors or the specific price you paid for the product.
  • Do not include any personally identifiable information, such as full names.

Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood

Required fields are marked with *

Review Title*
Review
    Add Photo Add up to 6 photos
    Would you recommend this product to a friend?
    Tag this Book
    Read more
    Does your review contain spoilers?
    What type of reader best describes you?
    I agree to the terms & conditions
    You may receive emails regarding this submission. Any emails will include the ability to opt-out of future communications.

    CUSTOMER RATINGS AND REVIEWS AND QUESTIONS AND ANSWERS TERMS OF USE

    These Terms of Use govern your conduct associated with the Customer Ratings and Reviews and/or Questions and Answers service offered by Bookswagon (the "CRR Service").


    By submitting any content to Bookswagon, you guarantee that:
    • You are the sole author and owner of the intellectual property rights in the content;
    • All "moral rights" that you may have in such content have been voluntarily waived by you;
    • All content that you post is accurate;
    • You are at least 13 years old;
    • Use of the content you supply does not violate these Terms of Use and will not cause injury to any person or entity.
    You further agree that you may not submit any content:
    • That is known by you to be false, inaccurate or misleading;
    • That infringes any third party's copyright, patent, trademark, trade secret or other proprietary rights or rights of publicity or privacy;
    • That violates any law, statute, ordinance or regulation (including, but not limited to, those governing, consumer protection, unfair competition, anti-discrimination or false advertising);
    • That is, or may reasonably be considered to be, defamatory, libelous, hateful, racially or religiously biased or offensive, unlawfully threatening or unlawfully harassing to any individual, partnership or corporation;
    • For which you were compensated or granted any consideration by any unapproved third party;
    • That includes any information that references other websites, addresses, email addresses, contact information or phone numbers;
    • That contains any computer viruses, worms or other potentially damaging computer programs or files.
    You agree to indemnify and hold Bookswagon (and its officers, directors, agents, subsidiaries, joint ventures, employees and third-party service providers, including but not limited to Bazaarvoice, Inc.), harmless from all claims, demands, and damages (actual and consequential) of every kind and nature, known and unknown including reasonable attorneys' fees, arising out of a breach of your representations and warranties set forth above, or your violation of any law or the rights of a third party.


    For any content that you submit, you grant Bookswagon a perpetual, irrevocable, royalty-free, transferable right and license to use, copy, modify, delete in its entirety, adapt, publish, translate, create derivative works from and/or sell, transfer, and/or distribute such content and/or incorporate such content into any form, medium or technology throughout the world without compensation to you. Additionally,  Bookswagon may transfer or share any personal information that you submit with its third-party service providers, including but not limited to Bazaarvoice, Inc. in accordance with  Privacy Policy


    All content that you submit may be used at Bookswagon's sole discretion. Bookswagon reserves the right to change, condense, withhold publication, remove or delete any content on Bookswagon's website that Bookswagon deems, in its sole discretion, to violate the content guidelines or any other provision of these Terms of Use.  Bookswagon does not guarantee that you will have any recourse through Bookswagon to edit or delete any content you have submitted. Ratings and written comments are generally posted within two to four business days. However, Bookswagon reserves the right to remove or to refuse to post any submission to the extent authorized by law. You acknowledge that you, not Bookswagon, are responsible for the contents of your submission. None of the content that you submit shall be subject to any obligation of confidence on the part of Bookswagon, its agents, subsidiaries, affiliates, partners or third party service providers (including but not limited to Bazaarvoice, Inc.)and their respective directors, officers and employees.

    Accept

    New Arrivals


    Inspired by your browsing history


    Your review has been submitted!

    You've already reviewed this product!
    ASK VIDYA