R Programming (JHU Coursera, Course 2)
The second course in the data science specialization, “R Programming” is an introductory course teaching users the basics of R. While I did think it went over the basics well, the assignment difficulty was a bit too much for true beginners to R. In result, I decided to post all my code on my github.
I did this course with a bit of a twist. Instead of working purely with data.frames, I decided to use the highly popular data.table package instead of just relying on data.frame which isn’t as widely used as data.table for large datasets.
Week 1 Highlights: Removing and Subsetting Data is a highly useful skill. The teaching of vectors, lists, matrices, and factors was superb. Wasn’t too difficult.
Week 2 Highlights: Lexical scoping as the reason why all objects must be stored in memory. Programming assignment is useful. As seen below, I have decided to do as much of the specialization in data.table syntax as possible since it is widely used in industry.
Week 2 Horrors: Lots of users have complained about the considerable increase in difficulty for this weeks project. That is partly why I decided to put all my work on my github.
Week 3 Highlights: Opportunity to Practice my data.table skills (or lack thereof) on the quiz below. It is good to see the use of common datasets used as they are easy to manipulate. The assignment wasn’t too bad.
Week 4 Highlights: The quiz wasn’t bad. Lectures on R Profiler. There was a good note on optimization being a priority when the code is designed, working, and understandable. Figuring out best hospital by state by ordering, groupby etc was fun. The assignment was a bit much in my opinion considering it is a beginner specialization. The thumbnail image is what I generated in the early portion of this assignment.
Overall, this was a highly useful course for industry as it teaches people to take data files and manipulate them. It did however expect too much too fast for beginners. Please let me know if you have any questions by leaving a comment!
Also, please see my Course 3 Getting and Cleaning Data Review!
In this lecture, I just want to get everyone on board with writing
functions, because functions play a critical
role in any R programming and you
tend to write a lot of them when you're writing doing a
lot of data analysis or doing a lot of kind of statistical analysis.
And so I just want to make sure that everyone can kind of get started writing
functions and and particularly for those who
are less familiar with programming languages in general.
So this is just going to be about writing your first function.
It's kind of like the hello world so to speak of R.
So the first thing you're going to want to do is you
going to want to write the function in a text file, all right.
It's possible to write functions on the command
line in R, but it usually no preferrable.
So usually you're going to want to put your functions, in a separate file,
separate from any interactive stuff that you're doing in the command line.
In the future you'll want to put your functions in
an R package, which is a kind of a more
structured type of kind of, kind of environment with
documentation and everything, but we won't talk about that now.
Right now the first thing you're going to want to
do is put your functions in a text file.
Okay, so the first thing we're going to want to do is open up our studio.
So lets do that.
And so you can see here in R Studio there's some there's
some stuff going on here from a previous project that I'm working on.
So you, that may happen to you, and generally you
can either close it or you can just ignore it.
I wanted to create a new R script here, so
let's create a clean script here to put our code into.
So the first function I'm going to write is really simple
it's just going to take two numbers and add them together.
So this function obviously doesn't have a real point to it but it shows you how
to use the function syntax, how to specify the arguments and how to return the value.
So the function that adds two values I'm just going to call it add two.
And and so you get it you use the function directive to start it off.
Now it's going to take it's going to add two values so it has to take
two arguments so I'm just going call the two arguments x and y and then
I'm going to take the two arguments and add them together with the plus operator
alright x plus y and then I close out the function with the curly brace.
So you can see that I didn't have to
do anything special to return the value that that's the
sum of the two elements because the or any R
function, the, the function returns whatever the last expression was.
So here there's only really one expression.
So therefore its the last expression and, and it equals the sum of x and y.
So here I can, I can highlight this guy and run it
in the console, and you can see now I've got my function here.
I can say add two, and lets give it say three and five and hopefully I get eight.
Yes that's a good sign.
So it adds the two numbers together, and that's that.
So it's a very simple function, and and,
you've now written your first function in R.
S the next function that I want to
talk about is a little slightly more complicated.
It's going to take a vector of numbers, it's going to, it's going to return
the subset of the vector, that's, that's above the vector value of ten.
So any number that's bigger than ten, it's going to return those numbers for you.
Just because it gives you any number that's above ten.
snd, it's going to take a vector here, we'll call it x,
you don't have to call it x, I'm just calling it that.
And I like to open and close the curly braces right away, just
so you know where the beginning and the end of the function is.
If you happen to have a lot of code in, you know, in, in a single file.
So the first thing I'm going to want to do is I want to construct a logical
statement that figures out which elements of
this vector x are, are greater than ten.
So I'm going to assign an object.
I'll call it use because these are, these are the numbers that I'm going to use.
And I'll say x greater than ten.
So this'll return a logical vector, of trues and falses
to indicating which element of x is greater than ten.
Now of course, there's really nothing special about the number ten.
I just kind of made that up, and so you may want to created a
function that allows people to sub, to kind of extract the elements of a vector.
That are above an arbitrary other number, right?
And so, so it could be ten, it could
be 12, it could be five, it could be anything.
So maybe you'll want to allow the user to specify that number.
So I'll just call, I'll create a new function here.
So it doesn't have the ten encoded in it.
I'll use the function directive, and I'll have a
second arbitrary called n, which can be any number really.
[SOUND] So let's start it off, we'll get the curly braces in there,
and now I'll create a logical statement that x is greater than n.
And then I'll subset the vector x based on that logical statement.
So now if I can source this into R.
Oops, and I can run my function here.
So I'll just create a vector.
Let's say x is one through 20.
Oh I, you see, so I didn't specify the number n, so it's not going to know
what to cut it off at, so I need to specify the threshold, so let's do ah,12.
And you can see it returned all the numbers that are greater than 12.
So that's kind of as we expected, and so the function appears to be working well.
Now let's suppose that maybe there is something
special about the number ten, and maybe it's
something that people are going to be kind of be
doing very often and it's a very common number.
So you might you want to specify a default
argument so you might want to the default to
be ten, so remember when I ran the
function before and I didn't specify the number n.
It gave me an error or maybe you don't want
people to have to encounter that error, and so you'll specify
a default value n equals ten so people don't specify
the cutoff value n, it will just automatically default to ten.
So now I can run this in R and now if I
do above, which is x, you see I don't get the error anymore.
It automatically gives you all the numbers that are bigger than ten.
So it's kind of nice in R when you're writing functions
to be able to specify default values like this that make the
life of the user just a little bit easier, specially for very
common cases, where it's not important that the user specify an argument.
So those are some very simple functions, in R that can be used
to kind of process data or make do simple calculations, like adding two numbers.
The next function I want to talk about is, is just going to take
a matrix or a dataframe and calculate the mean of each column.
Right, so this is slightly more complicated
you, you have to take your argument and
then you have to loop through each column to calculate the mean of each one, right.
So this is going to involve using a for-loop
and, and so we'll talk about it here.
So let's call this function column mean, because that's what it does,
And so y is going to be a data frame or a matrix, and we're going to go
through the columns of this data frame or
matrix and calculate the mean of each column.
So the first thing I need to figure out is how
many columns does this thing have, and that can be easily done.
I'll call it n c for number of columns and we can use the n call function for that.
That will calculate the number of columns, and, and then I need to
initialize a vector that's going to that's going to store the means for each column.
The length of this vector has to equal the number of columns, right.
So I'll just call it means, and it'll be a numeric vector
equal to the length of the number, equal to the number of columns.
So this is just an empty vector.
It doesn't, it's going to have, it's going to
be initialized to, to be all zeros.
But we're going to fill it as we go through the column.
So now we want to for-loop through the columns.
And I'll say i is in and then I'll say one through nc.
So this creates a, an integer vector starts
a one and ends at the number of columns,
and then I'm going to for-loop through and for each
I, I'm just going to assign to my means vector.
The mean of x bracket I, right.
Oh sorry that's called y here now.
And that's it, and then so for I, I haven't returned
anything yet, so right now this function doesn't do anything particularly useful.
But what I want to do is return the vector
of means and so I'm just going to return that.
And that's, since that's the last expression
in the function that what will get returned.
Okay, so I, there are six, I think there are
six columns in this dataset, so it gave me six means.
Now you can see that the first two columns have NAs.
And that's because it, if the, if the vector has
an na in it, then you can't calculate the mean.
And so the one thing you might want to do, is, by default, is throw
out all of the missing values and
just calculate the mean amongst the observed values.
And so, you'll notice that a lot of functions have a feature where it's like,
where they, you can, you can choose whether you want to remove the nas or not.
But let me just add up an argument here, it's called [UNKNOWN] na.
And it will default to true, right.
And then I'll pass this argument to the mean function.
So the mean has an na.rm argument, and I'll pass at this value.
So the last thing you want to do any time you're writing a
function the most important thing of course is to save your file.
So right now this file is unsaved.
If you don't save it and R Studio crashes or something
happens you'll lose all your work and so you want to go
to the save I meant save as menu, and just save
your file as, you know, functions or whatever you want to call it.
So that should get you started, just writing
some simple functions in R, for your programming
assignment you'll have to write a few functions
that kind of go through and look at data.
But I just wanted to get you started writing your first
functions so that you know kind of how the directive, the function directive
works, how the arguments work, and you can play around a little
bit with with more complicated ideas as you work through the assignments.