The Language of Data

Data science is an umbrella term that can cover visualization, statistical inference, predictive modeling, machine learning, deep learning, artificial intelligence, data engineering, insert buzzword of the day here… (the list goes on). Regardless of which niche of the data science universe you’re trying to fit into, it’s likely you’ll rely on a few key languages: R and Python (and SQL when you have to). Other languages can be used for data analysis and machine learning, but R and Python are the most popular from what I’ve seen. I will discuss both of them below and weigh their strengths and weaknesses.

R

R was the first language I got heavily involved with, but only after starting grad school. In undergrad I had toyed around with some basic HTML5 and CSS while working on my Biochemistry degree. I didn’t see any potential for overlap between these languages and my analytic and scientific interests, so I lost motivation. Then, when I got to grad school, I had a few chances to learn R (along with some other skills like Linux/Bash for HPC and Python). My first few brushes with R didn’t stick, probably because I didn’t think coding was for me. I thought you had to be a special kind of person to understand coding, but as grad school forced me to troubleshoot more and more, I lost my fear of struggling with problems and gave R an honest shot. I was also required to take a statistics class at this point, and that class used R.

After finally diving into R, I wondered why I had waited so long. My preconceived notion of a coder was someone who knew all the commands and functions by memory and steadily plugged away. I found the reality was more about being willing to Google your questions and use stackoverflow.com to fill in the gaps. The language became intuitive very quickly, as I imagine it would for most people if they gave themselves a chance. After getting the basics down, I became curious about buzzwords I was hearing from more experienced students and researchers, such as “machine learning”. It sounded very foreign and, again, out of my grasp. But I quickly found that there is a plethora of resources for machine learning in R as well. From that point on, I just continued to follow my curiosity and learn from internet resources and further courses in data science and R programming.

The strengths of R include its intuitive object-oriented style, the RStudio interface, R Markdown (Rmd) files, which can easily be rendered into various output formats, and my personal favorite, Shiny apps. One weakness is that R is not as favored as Python for web integration (outside of Shiny apps). I think the R language has continually improved in ease of use, deployment options, and open source packages that open up more analysis options.

Python

Recently, Python has surpassed R in popularity. However, R is still heavily used in scientific research as well as in business. I think R has a good community for answering questions, but Python’s community is larger and still growing. That community may make Python a good first language for some: if you have a question, it’s likely someone has had it before you and already found the answer, and luckily that answer is still on the internet. Python also has a wide variety of deep learning and machine learning packages to choose from. While most data science jobs will accept R or Python (or both), I think Python gives you the flexibility to pursue more job opportunities outside of data science.

My first experience with Python came a summer ago when I entered a “DREAM challenge”. These challenges use machine learning to answer biological questions. I figured that since it was the summer and I had some spare time, it would be well spent learning Python. While it is a little different from R, I felt I picked it up fairly quickly, and I think most people could too. What I really enjoyed was the number of tutorials I could find for different machine learning methods, which I could then apply to the challenge with a little tweaking. Another HUGE benefit of Python is being able to use Google Colab, which gives free GPU access that is very helpful for speeding up model training. While R isn’t what Google Colab is intended for, you can still access GPUs for R in Kaggle.

Conclusion

Both languages are useful, but I’m biased towards R because of Shiny apps and my comfort with it. Python offers more career flexibility, but if data science is your goal, both are good choices. The resources for Python may be a little more extensive than those for R, but R is consistently innovating and making it easier to learn, build, and deploy machine learning models.
