version | |
---|---|
R | 4.1.2 |
You’ve already been exposed to a few examples of relational and boolean operations in earlier tutorials. A formal exploration of these techniques follow.
Relational operations play an important role in data manipulation. Anytime you subset a dataset based on one or more criterion, you are making use of a relational operation. The relational operators (also known as logical binary operators) include ==
, !=
, <
, <=
, >
and >=
. The output of a condition is a logical vector TRUE
or FALSE
.
Relational operator | Syntax | Example |
---|---|---|
Exact equality | == |
3 == 4 -> FALSE |
Exact inequality | != |
3 != 4 -> TRUE |
Less than | < |
3 < 4 -> TRUE |
Less than or equal | <= |
4 <= 4 -> TRUE |
Greater than | > |
3 > 4 -> FALSE |
Greater than or equal | >= |
4 >= 4 -> TRUE |
Boolean operations can be used to piece together multiple evaluations.
R has three boolean operators: The AND operator, &
; The NOT operator, !
; And the OR operator, |
.
The &
operator requires that the conditions on both sides of the boolean operator be satisfied. You would normally use this operator when addressing a question along the lines of “x
must be satisfied AND y
must be satisfied”.
The |
operator requires that at least one condition be met on either side of the boolean operator. You would normally use this operator when addressing a question along the lines of “x
must be satisfied OR y
must be satisfied”. Note that the output will also be TRUE if both conditions are met.
The !
operator is a negation operator. It will reverse the outcome of a condition. It can be interpreted as “I do NOT want x
to be true”. So if the outcome of an expression is TRUE
, preceding that expression with !
will reverse the outcome to FALSE
and vice-versa.
Boolean operator | Syntax | Example | Outcome |
---|---|---|---|
AND | & |
4 == 3 & 1 == 1 4 == 4 & 1 == 1 |
FALSE TRUE |
OR | | |
4 == 4 | 1 == 1 4 == 3 | 1 == 1 4 == 3 | 1 == 2 |
TRUE TRUE FALSE |
NOT | ! |
! (4 == 3) ! (4 == 4) |
TRUE FALSE |
The following table breaks down all possible Boolean outcomes where T
= TRUE
and F
= FALSE
:
Boolean operation | Outcome |
---|---|
T & T |
TRUE |
T & F |
FALSE |
F & F |
FALSE |
T | T |
TRUE |
T | F |
TRUE |
F | F |
FALSE |
! T |
FALSE |
! F |
TRUE |
If the input values to a boolean operation are numeric vectors and not logical vectors, the numeric values will be interpreted as FALSE
if zero and TRUE
otherwise. For example:
1 & 2
[1] TRUE
1 & 0
[1] FALSE
Note that the operation a == (3 | 4)
is not the same as (a == 3) | (a == 4)
. The former will return FALSE
whereas the latter will return TRUE
if a = 3
. This is because the Boolean operator evaluates both sides of its expression as separate logical outcomes (i.e. T
and F
values). In the latter case, the Boolean expression is asking “is a
equal to 3
OR is a
equal to 4
”. Since one of the conditions is true, the expression ends up evaluating TRUE | FALSE
which returns TRUE
(see above table).
<- 3
a <- 4
b == 3) | (a == 4) (a
[1] TRUE
In the former expression, the boolean operator |
is evaluating 3
OR 4
on its right-hand side. As mentioned in the previous section, logical values take on a value of 0
for FALSE and any non-zero value for TRUE, so when evaluating 3 | 4
, it’s really seeing TRUE | TRUE
which, according to the aforementioned table will output TRUE
.
3 | 4
[1] TRUE
So in the end, the expression a == (3 | 4)
is really evaluating the condition a == TRUE
which returns false (since 3 is not equal to the logical value TRUE
).
== (3 | 4) a
[1] FALSE
You may wonder how a relational operator such as ==
can compare two different data types? (Recall that a
is numeric and TRUE
is logical). R is not strict about mixing data types in many of its operations. It circumvents differences in data types by coercing all values to the highest common mode (see an earlier tutorial). Here, numeric
overrides logical
type thus coercing the TRUE
variable to a numeric
data type.
as.numeric(TRUE)
[1] 1
So a == (3 | 4)
reduces to a == TRUE
which is reduced to a == 1
.
The relational operators are used to compare single elements (i.e. one element at a time). If you want to compare two objects as a whole (e.g. multi-element vectors or data frames), use the identical()
function. For example:
<- c(1, 5, 6, 10)
a <- c(1, 5, 6)
b identical(a, a)
[1] TRUE
identical(a, b)
[1] FALSE
identical(mtcars, mtcars)
[1] TRUE
Notice that identical
returns a single logical vector, regardless the input object’s dimensions.
Note that the data structure must match as well as its element values. For example, if d
is a list and a
is an atomic vector, the output of identical
will be false even if the internal values match.
<- list( c(1, 5, 6, 10) )
d identical(a, d)
[1] FALSE
If we convert d
from a list to an atomic vector using the unlist
function (thus matching data structures), we get:
identical(a, unlist(d))
[1] TRUE
%in%
The match operator %in%
compares two sets of vectors and assesses if an element on the left-hand side of %in%
matches any of the elements on the right-hand side of the operator. For each element in the left-hand vector, R returns TRUE
if the value is present in any of the right-hand side elements or FALSE
if not.
For example, given the following vectors:
<- c( "a", "b", "cd", "fe")
v1 <- c( "b", "e") v2
find the elements in v1
that match any of the values in v2
.
%in% v2 v1
[1] FALSE TRUE FALSE FALSE
The function checks whether each element in v1
has a matching value in v2
. For example, element "a"
in v1
is compared to elements "b"
and "e"
in v2
. No matches are found and a FALSE
is returned. The next element in v1
, "b"
, is compared to both elements in v2
. This time, there is a match (v2
has an element "b"
) and TRUE
is returned. This process is repeated for all elements in v1
.
The logical vector output has the same length as the input vector v1
(four in this example).
If we swap the vector objects, we get a two element logical vector since we are now comparing each element in v2
to any matching elements in v1
.
%in% v1 v2
[1] TRUE FALSE
NA
When assessing if a value is equal to NA
the following evaluation may behave unexpectedly.
<- c (3, 67, 4, NA, 10)
a == NA a
[1] NA NA NA NA NA
The output is not a logical data type we would expect from an evaluation. Instead, you must make use of the is.na()
function:
is.na(a)
[1] FALSE FALSE FALSE TRUE FALSE
As another example, if we want to keep all rows in dataframe d
where z
= NA
, we would type:
<- data.frame(x = c(1,4,2,5,2,3,NA),
d y = c(3,2,5,3,8,1,1),
z = c(NA,NA,4,9,7,8,3))
is.na(d$z), ] d[
x y z
1 1 3 NA
2 4 2 NA
You can, of course, use the !
operator to reverse the evaluation and omit all rows where z
= NA
,
!is.na(d$z), ] d[
x y z
3 2 5 4
4 5 3 9
5 2 8 7
6 3 1 8
7 NA 1 3