Modern Day Data Science:

Tools to help you and your organization “keep up with the Joneses!”

Patrick Anastasio
11 min read · Feb 12, 2022

In today’s article, let’s talk about some general tools that can improve workflow efficiency, boost processing power, and cut your time and infrastructure requirements. Specifically, we’ll discuss Distributed Processing, Pipeline Management, Containerization, and the Goliath resource of Cloud Computing.

Distributed Processing (DP):

Divide and conquer

Not gonna lie, I feel like my unit is top of the line. But even running a simple K-Nearest Neighbors algorithm takes more time than I’d like to admit, and don’t even get me started on doing a Grid Search. Doing complex modeling on Big Data sets on individual units takes large amounts of processing power and memory. This can be expensive, not just for an individual doing independent projects, but also for institutions or companies working on commercial projects. This is where distributed processing saves the day. In a nutshell, DP combines the processing power of several computing units, organized into a “cluster” and connected over a network, either directly peer-to-peer or via a client-server setup (the latter being more common and more efficient).

Individual units, or “nodes”, donate processing cores to the project. A distributed system of multiple processors can accomplish project tasks in a fraction of the time it would take a single unit to complete them. This is accomplished through a process called “parallelization.” The project, or problem to be solved, is broken down into several smaller tasks, the caveat being that all tasks must be the same, i.e. follow the same set of instructions. In other words, you are taking a problem and breaking it down into smaller units of the same problem, all to be solved the same way. These smaller tasks are then completed simultaneously on each node, and the results are compiled into one complete answer to the problem. But this raises the question of how to make the data set easier to work with… Enter MapReduce, stage left!
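To make the idea concrete, here is a minimal, single-machine sketch of parallelization using Python’s built-in multiprocessing module; the chunk size, worker count, and toy task are arbitrary choices for illustration, not part of any particular DP framework.

```python
from multiprocessing import Pool

def task(chunk):
    # The same instructions are applied to every chunk of the problem.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Break the problem into smaller pieces of the same shape...
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    # ...solve them simultaneously on separate worker processes...
    with Pool(processes=4) as pool:
        partial_results = pool.map(task, chunks)
    # ...then compile the partial results into one complete answer.
    print(sum(partial_results))
```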

MapReduce

Back before data science was cool, some gurus devised a framework to work with Big Data more efficiently: Hadoop, maintained by Apache and built around the Hadoop Distributed File System (HDFS). This system was centered on a revolutionary process called MapReduce (originally described by Google). The data set is split into smaller fragments, which are distributed amongst the nodes as separate tasks; each task “maps” its fragment into key:value pairs, and the pairs are then “reduced” on their keys into a combined result. Genius!
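The classic illustration is a word count. Here is a toy sketch of the map and reduce phases in plain Python, with no Hadoop involved; it only shows the shape of the idea, not how HDFS actually splits and distributes the fragments.

```python
from collections import defaultdict

documents = ["big data is big", "data science is cool"]

# Map: emit a (key, value) pair for every word in every fragment.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: combine the values for each key into a single result.
word_counts = {key: sum(values) for key, values in grouped.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, ...}
```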

Over time, however, as Big Data kept getting bigger, Hadoop was unable to stay agile and keep up. The problem was that the data was stored on the node hard drives rather than in the far more nimbly accessible RAM. Big Data was growing faster than the hardware could be developed to handle it. This slowed the process down, which spells disaster for any data science project where efficient time and resource management is paramount. So Apache got to work on something else… Spark!

SPARK vs HADOOP

The quintessential difference is that Spark is able to keep the data in RAM, so it can be accessed almost instantaneously. Spark’s core abstraction is not a file system but a distributed data structure called the RDD (Resilient Distributed Dataset). One of the main features is the implementation of a Directed Acyclic Graph (DAG). DAGs are a complex topic, and a full explanation is beyond the scope of this article. I’ll just say that the DAG is how Spark keeps track of and schedules the complex operations of an application, which allows for fault tolerance, lineage tracking, and more efficient assignment of tasks (much more efficient than Hadoop).

The Directed Acyclic Graph (DAG) allows Spark to be:

  • Resilient — fault tolerant; if one node goes down, its data and tasks can be recomputed from a previous state using the lineage.
  • Immutable — each transformation creates a new RDD, building up a “lineage”
  • Lazy — operations are not performed until an action is called, and can be organized to make more efficient use of computation and memory load

Besides the RAM storage, a real difference-maker is the “Lazy” component of Spark. As the reference puts it, “transformations on RDDs are lazily evaluated, meaning that Spark will not compute RDDs until an action is called. Spark keeps track of the lineage graph of transformations, which is used to compute each RDD on demand and to recover lost data.” [1] So, unlike Hadoop, new RDDs and operations can be defined without having to be immediately computed.
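Here is a minimal PySpark sketch of that laziness, assuming a local pyspark installation; the numbers are arbitrary. The two transformations only build up the lineage graph, and nothing is computed until the action at the end is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")

rdd = sc.parallelize(range(1, 1_000_001))      # base RDD
squares = rdd.map(lambda x: x * x)             # transformation: lazy, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # another transformation: still lazy
total = evens.reduce(lambda a, b: a + b)       # action: Spark now executes the whole lineage
print(total)

sc.stop()
```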

NOTE: Spark was built on the RDD abstraction; however, Apache being Apache, they are always rapidly evolving, and Spark now offers higher-level abstractions such as the DataFrame and Dataset APIs for more efficient use of Spark in certain use cases.
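For comparison, a small sketch of the DataFrame API, again assuming a local pyspark installation; the table contents and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# DataFrame operations are also lazy; show() is the action that triggers execution.
df.filter(df.age > 30).select("name").show()

spark.stop()
```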

Streaming Data and Pipeline Management

Work smarter, not harder

Machine Learning is a process that can be described as “extremely repetitive.” Pre-processing data, cleaning, standardizing, normalizing, encoding, feature engineering, splitting, fitting, transforming, evaluating, rinse, repeat, and so on, algorithm after algorithm, model after model. Then consider the increasing velocity at which data is being accumulated and the need to process large incoming data streams and update models repeatedly. This can make manual processes untenable. Building pipelines can streamline this process and make for a nice, efficient workflow… which will make you more productive… and maybe put you in line for that promotion you’ve been wanting. But I digress…
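Here is a short scikit-learn sketch of the idea: the pipeline chains standardization and a K-Nearest Neighbors classifier so the whole preprocess-fit-evaluate sequence (and even the Grid Search) runs as a single, repeatable object. The dataset and hyperparameter grid are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain preprocessing and modeling into one repeatable object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Grid search over the whole pipeline, cross-validating every step together.
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```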

The ability to process streaming data is extremely important for organizations to be able to react quickly to market conditions, anticipate and meet customer needs, and predict and prevent emergency events before they happen.

Let’s take a look at Kafka, the industry-leading platform for pipeline and streaming data management. Surprise, it’s another Apache project!

Kafka bills itself as “an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”

“More than 80% of all Fortune 100 companies trust, and use Kafka.”

So what is event streaming? It is the process of capturing data in real-time from sources like databases, sensors, mobile devices, cloud services, and software applications in the form of “event streams.” Kafka then stores these event streams and provides tools for manipulating, processing, and reacting in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. “Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time… It can be deployed on bare-metal hardware, virtual machines, and containers [more on these below] in on-premise as well as cloud environments.” [2]

Basically, Kafka provides a streamlined process for taking in streaming data, storing it according to your needs, and running processing pipelines on those streams that deliver the key outputs your organization’s business requires.
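As a rough sketch of what that looks like in code, here is a tiny producer/consumer pair using the kafka-python client; the broker address, topic name, and payload are assumptions, and a real deployment would add serialization, consumer groups, and error handling.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish an event to a topic (broker address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"sensor_id": 7, "temp_c": 21.4}')
producer.flush()

# Consumer: read events from the same topic as they arrive.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # react to each event in real time
    break
```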

Containerization is the new Virtualization

Quarantine your applications 🦠

You may be familiar with virtualization, or virtual machines (VMs)… or maybe not. A VM is like a computer within a computer. Via software, it mimics hardware, an environment, or a different operating system, allowing the user to run programs that they otherwise could not on the native system, or “host.” This is great for little Johnny who wants to be able to use Windows on his MacBook Pro, but VMs have drawbacks when used in commercial applications where many VMs may be needed at once. Each VM partitions off part of the CPU and RAM, so running multiple VMs quickly depletes the processing power of the host machine.

Containerization is better suited for commercial use cases. This is a form of virtualization where different applications, or several instances of the same application, can run in isolated spaces called containers, sharing the same OS and the same processing resources, with no need to set up separate VMs. Containers increase resource use per server and may reduce the number of systems needed, in turn reducing costs to your organization. A container is its own computing environment containing everything the application needs to run right there in the container: dependencies, libraries, configuration files. The container is “quarantined,” if you will, away from the host OS, but still has access to the underlying resources needed to operate. In other words, containers all share the same OS kernel.

Think of a containerized application as the top layer of a multi-tier cake:

— At the bottom, there’s the hardware of the infrastructure in question, including its CPU(s), disk storage, and network interfaces.

— Above that is the host OS and its kernel — the latter serves as a bridge between the software of the OS and the hardware of the underlying system.

— The container engine and its minimal guest OS, which are particular to the containerization technology being used, sit atop the host OS.

— At the very top are the binaries and libraries (bins/libs) for each application and the apps themselves, running in their isolated user spaces (containers).

https://www.citrix.com/solutions/app-delivery-and-security/what-is-containerization.html

The two leading engines of this technology are Docker and Kubernetes.

Docker is portable. Applications packaged with Docker run the same way on any other node that is running Docker. It also comes with image version control built in!

Think of containers as the packaging for microservices that separate the content from its environment — the underlying operating system and infrastructure. Just like shipping containers revolutionized the transportation industry, Docker containers disrupted software. A standard Docker container can run anywhere, on a personal computer (for example, PC, Mac, Linux), in the cloud, on local servers, and even on edge devices.

Container technology is very powerful as small teams can develop and package their application on laptops and then deploy it anywhere into staging or production environments without having to worry about dependencies, configurations, OS, hardware, and so on. The time and effort saved with testing and deployment are a game-changer for DevOps.

https://www.dynatrace.com/news/blog/kubernetes-vs-docker/
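As a hedged sketch of how you might drive this from code, here is a snippet using the Docker SDK for Python; it assumes a local Docker daemon is running, and the image and command are placeholders.

```python
import docker

# Connect to the local Docker daemon (assumes Docker is installed and running).
client = docker.from_env()

# Pull a public image and run a command inside an isolated container.
output = client.containers.run(
    "python:3.11-slim",
    ["python", "-c", "print(2 + 2)"],
    remove=True,
)
print(output)  # b'4\n'

# List the containers currently running on this host.
for container in client.containers.list():
    print(container.name, container.image.tags)
```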

Kubernetes is a container orchestration platform and is the de facto standard because of its greater flexibility and capacity to scale.

Container deployment. In the simplest terms, this means to retrieve a container image from the repository and deploy it on a node. However, an orchestration platform does much more than this: it enables automatic re-creation of failed containers, rolling deployments to avoid downtime for the end-users, as well as managing the entire container lifecycle.

Scaling. This is one of the most important tasks an orchestration platform performs. The “scheduler” determines the placement of new containers so compute resources are used most efficiently. Containers can be replicated or deleted on the fly to meet varying end-user traffic.

Networking. The containerized services need to find and talk to each other in a secure manner, which isn’t a trivial task given the dynamic nature of containers. In addition, some services, like the front-end, need to be exposed to end-users, and a load balancer is required to distribute traffic across multiple nodes.

Observability. An orchestration platform needs to expose data about its internal states and activities in the form of logs, events, metrics, or transaction traces. This is essential for operators to understand the health and behavior of the container infrastructure as well as the applications running in it.

Security. Security is a growing area of concern for managing containers. An orchestration platform has various mechanisms built in to prevent vulnerabilities such as secure container deployment pipelines, encrypted network traffic, secret stores and more. However, these mechanisms alone are not sufficient, but require a comprehensive DevSecOps approach.

https://www.dynatrace.com/news/blog/kubernetes-vs-docker/
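As a rough illustration of the orchestration features above (scaling and observability in particular), here is a sketch using the official Kubernetes Python client; the deployment name, namespace, and replica count are hypothetical, and it assumes a working kubeconfig pointed at a cluster.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes access to a cluster).
config.load_kube_config()

apps = client.AppsV1Api()

# Scale a hypothetical deployment to 5 replicas; the orchestrator's scheduler
# decides which nodes the new containers land on.
apps.patch_namespaced_deployment_scale(
    name="web-frontend",
    namespace="default",
    body={"spec": {"replicas": 5}},
)

# Observability: list the pods the cluster is currently running.
for pod in client.CoreV1Api().list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```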

A major difference between Docker and Kubernetes is that Docker runs on a single node, whereas Kubernetes is designed to run across a cluster. Another difference between Kubernetes and Docker is that Docker can be used without Kubernetes, whereas Kubernetes, being an orchestration and deployment platform, needs a container runtime in order to orchestrate.

Docker and Kubernetes: Better together

Simply put, the Docker suite and Kubernetes are technologies with different scopes. You can use Docker without Kubernetes and vice versa; however, they work well together.

From the perspective of a software development cycle, Docker’s home turf is development. This includes configuring, building, and distributing containers using CI/CD pipelines and DockerHub as an image registry. On the other hand, Kubernetes shines in operations, allowing you to use your existing Docker containers while tackling the complexities of deployment, networking, scaling, and monitoring.

Although Docker Swarm is an alternative to Kubernetes, Kubernetes is the better choice when it comes to orchestrating large distributed applications with hundreds of connected microservices, including databases, secrets, and external dependencies.

https://www.dynatrace.com/news/blog/kubernetes-vs-docker/

Cloud Computing

Your computer sucks!

What is cloud computing?

First, let’s just acknowledge the elephant in the room: there is no cloud… you knew that, right? A cloud is just a fancy way of referring to the supercomputers and massive server farms that you use over the internet instead of needing your own. According to AWS, the crowned king of the industry… “Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider…” [3]

It seems every organization is using a cloud computing service these days, and it’s probably AWS, although there are some competitors. Cloud computing allows an organization to cut hardware and processing costs, not to mention utility costs, in favor of the pay-as-you-go model.

There are 3 main types of cloud computing applications:

  • Infrastructure as a Service (IaaS)

This is the part where you access the aforementioned supercomputers and server farms for data warehousing, networking capability, and enormous processing power. This is basically computing power and storage… for rent (a short boto3 sketch follows this list).

  • Platform as a Service (PaaS)

This allows you to focus on the deployment and management of applications rather than the heavy lifting involved in running them. Your applications are stored, managed, and deployed directly from the cloud. These are the tools and software that developers need to build applications… for rent.

  • Software as a Service (SaaS)

SaaS is the delivery of applications as a service. Some of you may remember back in the day when you used to buy programs as physical disks. SaaS is the practice of licensing the use of an application to users, i.e. pay-to-play. You cannot own the programs anymore… but you can rent them!
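To make the “for rent” idea concrete, here is a small sketch using AWS’s boto3 SDK; the bucket name, file path, and account contents are placeholders, and it assumes AWS credentials are already configured.

```python
import boto3

# S3: object storage for rent (bucket name and file are placeholders).
s3 = boto3.client("s3")
s3.upload_file("local_dataset.csv", "my-example-bucket", "raw/dataset.csv")

# EC2: compute for rent -- list the virtual servers currently running in your account.
ec2 = boto3.client("ec2")
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"], instance["State"]["Name"])
```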

The Heavyweights:

AWS | Microsoft Azure | Google Cloud


A great article comparing and contrasting these titans can be found here.
