An interactive course to teach Data Science with Bash shell
Ahmed Shamsul Arefin
Jul 08, 2017
The Problem
Bash may not be the best way to handle all kinds of data, but there often comes a time when you are given a pure Bash environment, such as the one found on common Linux-based supercomputers, and you just want an early result or view of the data before diving into real programming with Python, R, SQL, SPSS and so on. Expertise in these data-intensive languages comes at the price of spending a lot of time on them. In contrast, Bash scripting is simple, easy to learn and well suited to mining textual data. I couldn’t find many books (except the ones listed below) that discuss Bash in the context of data science!
Solution — Write up the ideas!
I wanted to create super beginner-friendly reading material that would help people who are not very familiar with Bash/Linux but are willing to use its power.
I then worked, mainly in my spare time, on a few tutorials demonstrating four practical flat-file data mining projects, each with a different objective: university ranking data, Facebook data, Australian crime statistics data and Shakespeare-era plays and poems data (all data were collected from the public domain). I would say it was fun to see the potential of command line tools!
Coming from a background in Linux-based supercomputing and parallel data mining, I wanted to ensure that my target audience could get going with data on the Linux shell!
I organized my writing in such a way that a reader who hasn’t used Bash before can skip the projects and go to the tutorials part first, then come back to the projects after getting some basics on the Bash shell. The tutorial section introduces Bash scripting, regular expressions, AWK, sed, grep and so on.
Publishing!
I did not want to go through the traditional publishing process, as that would take a long time (I have previous experience with noted journals and publications). I also did not want to publish a print book, because I knew I was going to improve the book later. Therefore, I approached leanpub.com.
Leanpub is the combination of two things: a powerful book writing platform, and an online storefront where you can buy books.
It took me some time to write and format the book in Markdown, but I would say it was a really awesome experience. In the past I have used Word + EndNote, LaTeX + BibTeX, InDesign and so on, but I believe the choice of Markdown was just perfect for this project.
The book, Learn to Analyze Text Data in Bash Shell and Linux, became available in several formats as soon as I pressed the Publish button! But I did not stop there. I found something more interesting, keep reading…
Making an interactive course!
There are many options for creating an online course, but I wanted to produce content that would be interesting to both watch and run! Udemy.com was an option, and I did publish a version of the book there as Learn to Analyze Data!, but I wanted a course that would have both video lectures and playable, runnable code. I then came across the Educative.io platform, which is specifically designed for Computer Science courses and was a perfect match for my needs!
Without any trouble, I converted my Markdown ebook into an Educative.io course. The best part of the Educative.io platform is that it allowed me to insert video tutorials, code playgrounds and images (screenshots and animated images made with their own ‘draw’ tools), all in one lesson!
The course went live following a careful review by Educative’s team. I am thankful that the team quickly uploaded the course data onto their server and got the course up and running. It is available at educative.io/learntoanalyzedata
Final words!
Many practical data mining tasks follow the same flow: specific data resources are imported into flat text files, and Bash then runs different programs on those files to filter, clean and optimise them (grep, sort, sed, and so on) and to extract preliminary views of the data (cut, csvlook, view, cat, head, etc.). One part of data mining involves taking unstructured data and transforming it into structured data (awk, shell). A scripting language like Bash can be very useful for that transformation.
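As a minimal sketch of this kind of flow, the script below builds a tiny flat file and takes a first look at it with head, grep, cut, sed, awk and sort. The file name, columns and values are made-up sample data for illustration only; they are not taken from the course projects.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: file name, columns and values are invented.
workdir=$(mktemp -d)
cat > "$workdir/universities.csv" <<'EOF'
rank,name,country,score
1,Alpha University,USA,98.5
2,Beta Institute,UK,97.1
3,Gamma College,USA,95.0
4,Delta University,Australia,93.4
5,Epsilon School,UK,91.2
EOF

# Preliminary view: the header line plus the first two records.
head -n 3 "$workdir/universities.csv"

# Filter and extract: keep the USA rows, then cut out the name column.
usa_names=$(grep ',USA,' "$workdir/universities.csv" | cut -d',' -f2)

# Clean and summarise: drop the header with sed, count rows per country
# with awk, and sort the result alphabetically.
country_counts=$(sed '1d' "$workdir/universities.csv" \
  | awk -F',' '{ n[$3]++ } END { for (c in n) print c, n[c] }' \
  | LC_ALL=C sort)

# Rank: sort numerically on the score column (descending), keep the top name.
top_scorer=$(sed '1d' "$workdir/universities.csv" \
  | LC_ALL=C sort -t',' -k4,4nr \
  | head -n 1 \
  | cut -d',' -f2)

echo "$usa_names"
echo "$country_counts"
echo "$top_scorer"

# Remove the temporary working directory.
rm -rf "$workdir"
```

Each step is a small pipeline, so any stage can be swapped or inspected on its own, which is what makes the shell handy for a quick early look at a data set.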
Almost everyone can benefit from learning to use Bash for data mining, particularly students who want to learn Bash and the command line to improve their career prospects, researchers who want to add Bash and other command line tools to their bag of tricks, and scientists who want to learn to explore and analyse the data that their lab generates.
Therefore, I believe you would be interested in learning the Bash shell, a must-have skill for everyone! This thinking helped me to start my own learning project: https://www.learntoanalyzedata.com
References:
- Learn to Analyze Text Data in Bash Shell and Linux, Ahmed Arefin
- Data Science at the Command Line, Jeroen Janssens
- Adventures in data science with Bash, Robert Aboukhalil
Originally published at: https://medium.com/learn-to-analyze-data/