All’s Fair in Love and Distributed Storage

Intro

There has been a growing shift in the way applications use and consume storage. I don’t want to give a computer history lesson here, but suffice it to say that since the mainstream adoption of virtualization in the early 2000s, the demands on traditional storage arrays have changed. I/O has gone from single, monolithic streams to hundreds of smaller but disparate requests hitting the same array. Things had to change.

Distributed computing once lived mainly in university labs or in the startups of the late 90s, like Yahoo and Google. For storage, going distributed meant spreading workloads across several controllers instead of having everything hit a single controller. The disks connected to each storage controller all participate in the same pool of storage. This paradigm allows many more workloads to hit what is essentially the same pool of disk, with each workload getting an even share of disk and controller resources.

The one thing that has lagged behind these changes is the set of tools and techniques for managing distributed storage. As systems and storage administrators, our day-to-day tools weren’t geared for distributed systems. This series of blog posts looks to address just that: how to hack your system admin tools to work in a distributed systems era!

I do work at Cohesity, Inc., and all of these posts will work with the Cohesity DataPlatform.

I hope you find these posts helpful!

Rsync – A Love / Hate relationship.

If you need to move a vast quantity of data between two Linux/*NIX systems, rsync can be your best friend. But making it go fast can be mind-crushingly painful, and it gets harder still when you try to parallelize streams across several storage controllers. To go fast in a distributed system you need to accomplish two things: parallelize as much as possible, and push as many disk operations as you can. Thankfully, in Linux, we have a few different ways to accomplish this.

GNU Parallel

If you are OK with installing open source software on your servers, take a look at GNU Parallel. This simple utility can make life a lot easier. Below is a quick sample of how to run n instances of rsync with parallel:

rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable $src $dest > /tmp/transaction.log
cat /tmp/transaction.log | parallel --will-cite -j $threads rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} $dest >> /tmp/results.log

The first rsync line does a dry run to build the list of transfers rsync needs to make; the second line reuses that output to drive the real rsync commands. The -j option sets the number of jobs parallel runs at once. Simple as that!
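One thing to watch: a log written with -v and --stats also carries rsync's header and summary lines, and parallel will hand those to rsync as if they were paths. Here is a minimal sketch of one alternative, building the list with find so it holds only file paths. This swaps in find for the dry run, so it is not the method above; the demo tree, src_dir, dest_dir, and the threads value are all made up for illustration, and it assumes file names without newlines.

```shell
#!/bin/sh
# Throwaway demo tree (hypothetical paths); point src_dir and dest_dir
# at your real directories in practice.
demo_root=$(mktemp -d)
src_dir="$demo_root/src"
dest_dir="$demo_root/dest"
mkdir -p "$src_dir/sub" "$dest_dir"
echo one > "$src_dir/a.txt"
echo two > "$src_dir/sub/b.txt"
threads=2

# Build the list with find: regular files only, paths relative to src_dir.
list="$demo_root/transaction.log"
( cd "$src_dir" && find . -type f | sort > "$list" )

# Run the transfer only when both tools are available on this machine.
if command -v parallel >/dev/null 2>&1 && command -v rsync >/dev/null 2>&1; then
  ( cd "$src_dir" &&
    parallel --will-cite -j "$threads" rsync -a --relative {} "$dest_dir/" < "$list" ) ||
    echo "transfer step failed" >&2
fi

file_count=$(wc -l < "$list" | tr -d ' ')
echo "listed $file_count files"
```

Because the paths in the list are relative to src_dir, the transfer has to run from inside that directory for --relative to recreate the same layout under the destination.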

Xargs, Or: How to Use What’s On the Truck

If you don’t like installing other software, or are under stricter change management for your servers, you can accomplish the same task using xargs to execute the rsync processes in parallel. Here’s an example of the xargs way of doing it. Simply put, we are searching for files in a directory and sending them to rsync:

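# -I already runs one rsync per file, so -n1 is dropped; --relative recreates each file's directory path under $DESTDIR
find . ! -type d -print0 | xargs -0 -P "$THREADS" -I% rsync -az --relative % "$DESTDIR"/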

The -P option sets the number of rsync processes that run at once, and then you just need to pass a destination directory at the end.
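Before pointing a pipeline like this at real data, it can help to preview what it would run. The sketch below puts echo in front of rsync so nothing is copied and each planned command is written to a log instead. The demo tree, the /mnt/view destination, and the preview.log name are assumptions for illustration, and the command uses rsync's --relative flag to carry each file's directory path over to the destination.

```shell
#!/bin/sh
# Throwaway demo tree (hypothetical); nothing here touches real data.
work=$(mktemp -d)
mkdir -p "$work/src/sub"
touch "$work/src/a" "$work/src/b" "$work/src/sub/c"

THREADS=2
DESTDIR=/mnt/view            # hypothetical destination, never written to
preview="$work/preview.log"

# Same find | xargs shape as above, but echo prints each rsync command
# instead of running it. -print0 / -0 keep odd file names intact.
( cd "$work/src" &&
  find . ! -type d -print0 |
  xargs -0 -P "$THREADS" -I% echo rsync -az --relative % "$DESTDIR/" ) > "$preview"

planned=$(wc -l < "$preview" | tr -d ' ')
echo "$planned rsync commands planned"
```

Once the preview looks right, removing the echo turns it back into the real transfer.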

We can now parallelize the transfer in two different ways at the process level, but we still need to spread the work across all the storage controllers. For this trick, we can use a pair of nested for loops, which works with either xargs or parallel. The example script takes two configuration files: the first contains a list of SOURCE directories and the second contains a list of DESTINATION directories. In our example we are using a single Cohesity View, mounted on our server once through each of our available VIPs:

export srcs=$1
export dests=$2
export threads=5
export loop_src=`cat $srcs`
export loop_dest=`cat $dests`
for src in $loop_src
do
  for dest in $loop_dest
  do
    echo "about to rsync $src to $dest"
    # Place your method of paralleling rsync here
    echo "completed the rsync of $src to $dest...moving on to next"
    # drop the destination we just used; reload the list once it runs out
    loop_dest=`echo $loop_dest | awk '{for (i=2; i<=NF; i++) print $i}'`
    [ -z "$loop_dest" ] && loop_dest=`cat $dests`
    echo "new list of dirs is $loop_dest"
    break
  done
done

This is a pretty standard for loop construction, except we had to come up with a clever way to loop back around the destination directories. You can see this in our re-initialization of the loop_dest variable.
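If the nested loop with a break feels hard to follow, the same round-robin idea can be sketched with a single loop that walks the shell's positional parameters. This is just one possible shape, not the script's own logic; the source and destination paths are hypothetical, and it only prints the pairing instead of running rsync.

```shell
#!/bin/sh
# Hypothetical inputs; in the script these come from the two config files.
sources="/data/a /data/b /data/c /data/d /data/e"
dest_list="/mnt/vip1 /mnt/vip2"

plan=""
set -- $dest_list            # load the destinations into $1, $2, ...
for src in $sources; do
  # Once every destination has been used, start again from the first.
  [ $# -eq 0 ] && set -- $dest_list
  dest=$1
  shift                      # the next source gets the next destination
  echo "would rsync $src to $dest"
  plan="$plan$src=$dest "
done
```

With five sources and two mount points, the sources alternate between the two, so no single VIP takes all the traffic. Note that set -- replaces the script's own arguments, so save $1 and $2 first if you still need them.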

And there you have it! A completely parallelized and fully distributed wrapper for rsync! For those following along, here’s the entire script utilizing the GNU Parallel method:

#!/bin/bash
#
# To use this script please have parallel installed
# In Ubuntu just run: sudo apt-get install parallel
# Once installed, create two text files:
# sources.txt should contain a line-separated list of the files you wish to rsync
# destinations.txt should contain a line-separated list of the Cohesity mount points
# Make a mount point for every VIP in your cluster and map them all to the same View
# Then just run this script: p_rsync sources.txt destinations.txt

export srcs=$1
export dests=$2
export threads=5
export loop_src=`cat $srcs`
export loop_dest=`cat $dests`

# check for Parallel
echo "checking for Parallel to be installed"
program="parallel"
if ! command -v "$program" >/dev/null 2>&1 ; then
  echo "$program is not installed"
  exit 1
fi

echo "parsing the SOURCE and DESTINATION strings to start the RSYNC in parallel"
# iterate over a list of Source directories and pass them off to Destinations one by one...
for src in $loop_src
do
  for dest in $loop_dest
  do
    echo "about to rsync $src to $dest"
    # each source gets its own transaction log so background jobs don't overwrite each other's list
    log=`mktemp /tmp/transaction.XXXXXX`
    rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable $src $dest > $log
    cat $log | parallel --will-cite -j $threads rsync -avzm --relative --stats --safe-links --ignore-existing --human-readable {} $dest >> /tmp/results.log &
    echo "started the rsync of $src to $dest...moving on to next"
    # drop the destination we just used; reload the list once it runs out
    loop_dest=`echo $loop_dest | awk '{for (i=2; i<=NF; i++) print $i}'`
    [ -z "$loop_dest" ] && loop_dest=`cat $dests`
    echo "new list of dirs is $loop_dest"
    break
  done
done
# wait for the background rsync jobs to finish
wait

Written By

Greg Statton

Office of the CTO - Data & AI
