manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
This lesson is focused on the first three wrangling verbs: select()
, filter()
, and mutate()
.
select()
is used to specify which columns to use as opposed to default to all of the columns
filter()
is used to subset data based on specified criteria
mutate()
is used to make new columns of data
This will be our starting dataframe used in each of the examples below.
Selecting Columns
Syntax: select()
Think of it as selecting check boxes for each column to use. Any selected, whether it’s just one column or whether they’re many columns, will be available once this select()
code is completed.
Scenario
Say we want this dataframe to the right where not all of the columns are needed, only the manufacturer, model, cty, and hwy columns. select()
allows us to achieve this.
manufacturer | model | cty | hwy |
---|---|---|---|
audi | a4 | 18 | 29 |
audi | a4 | 21 | 29 |
audi | a4 | 20 | 31 |
audi | a4 | 21 | 30 |
audi | a4 | 16 | 26 |
Code Solution
Filtering Rows
Syntax: filter()
Filtering is used in the scenarios like, “Only need students that are age 16 years old”, or “Only the the redwood trees in this forest that are taller than 30 ft”.
Scenario
Say we want this dataframe to the right where the cars are made by ford (manufacturer
column) and the amount of highway miles per gallon is greater than 20 (hwy
column). filter()
allows allows us to achieve this.
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
ford | mustang | 3.8 | 1999 | 6 | manual(m5) | r | 18 | 26 | r | subcompact |
ford | mustang | 3.8 | 1999 | 6 | auto(l4) | r | 18 | 25 | r | subcompact |
ford | mustang | 4.0 | 2008 | 6 | manual(m5) | r | 17 | 26 | r | subcompact |
ford | mustang | 4.0 | 2008 | 6 | auto(l5) | r | 16 | 24 | r | subcompact |
ford | mustang | 4.6 | 1999 | 8 | auto(l4) | r | 15 | 21 | r | subcompact |
ford | mustang | 4.6 | 1999 | 8 | manual(m5) | r | 15 | 22 | r | subcompact |
Code Solution
Basic Operators:
>
greater than
<
less than
==
equals
>=
greater than or equal to
<=
less than or equal to
Advace Operators
!=
is not equal to
%in% c()
is in
Making New Columns
Syntax: mutate()
Think of the mutate()
verb as starting with one column and change it to where you want it to become–mutating a column. You can make new columns not based on columns, based on one column, or based on multiple columns.
Scenario
Say we want this dataframe to the right (make sure to scroll to see the other columns) where there was an error in the data and the city miles per gallon needs 5 more miles than what it currently had (cty
column) and we need the a column for the engine displacement (displ
column) divided by the number of cylinders (cyl
column). In reality this second task is not a useful but we’re doing it for practice sake.
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | new_col |
---|---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 23 | 29 | p | compact | 0.4500000 |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 26 | 29 | p | compact | 0.4500000 |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 25 | 31 | p | compact | 0.5000000 |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 26 | 30 | p | compact | 0.5000000 |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 21 | 26 | p | compact | 0.4666667 |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 23 | 26 | p | compact | 0.4666667 |
Code Solution
Practice Exercises
New dataset in use for the practices exercises:
Question 1
Select the age, sex, and bmi columns
age | sex | bmi |
---|---|---|
19 | female | 27.900 |
18 | male | 33.770 |
28 | male | 33.000 |
33 | male | 22.705 |
32 | male | 28.880 |
31 | female | 25.740 |
Question 2
Select the charges
, region
, and age
columns.
charges | region | age |
---|---|---|
16884.924 | southwest | 19 |
1725.552 | southeast | 18 |
4449.462 | southeast | 28 |
21984.471 | northwest | 33 |
3866.855 | northwest | 32 |
3756.622 | southeast | 31 |
Question 3
Filter the data where sex
is only male and region
is only northeast.
age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|
37 | male | 29.830 | 2 | no | northeast | 6406.411 |
25 | male | 26.220 | 0 | no | northeast | 2721.321 |
23 | male | 23.845 | 0 | no | northeast | 2395.172 |
35 | male | 36.670 | 1 | yes | northeast | 39774.276 |
38 | male | 37.050 | 1 | no | northeast | 6079.672 |
43 | male | 27.360 | 3 | no | northeast | 8606.217 |
Question 4
Filter the data like this:
age
is younger than or equal to 40bmi
is greater than 21
age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
18 | male | 33.770 | 1 | no | southeast | 1725.552 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 |
Question 5
Make a new column called bmi_modified
that is the of the bmi - 12
age | sex | bmi | children | smoker | region | charges | bmi_modified |
---|---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 | 15.900 |
18 | male | 33.770 | 1 | no | southeast | 1725.552 | 21.770 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 | 21.000 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 | 10.705 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 | 16.880 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 | 13.740 |
Question 6
Make a new column called age_modified
that is the current age divided by 2
age | sex | bmi | children | smoker | region | charges | age_modified |
---|---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 | 9.5 |
18 | male | 33.770 | 1 | no | southeast | 1725.552 | 9.0 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 | 14.0 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 | 16.5 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 | 16.0 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 | 15.5 |
Question 7: Combining It All Together
- Only use the age, sex, and charges columns
- Make sure the age column is greater than 50 and there are only females
- Make a new column called real_charges which is the charges times 4
age | sex | charges | real_charges |
---|---|---|---|
60 | female | 28923.14 | 115692.55 |
62 | female | 27808.73 | 111234.90 |
56 | female | 11090.72 | 44362.87 |
52 | female | 10797.34 | 43189.34 |
60 | female | 13228.85 | 52915.39 |
59 | female | 14001.13 | 56004.54 |