Title: | Tools for Easily Combining and Cleaning Data Sets |
---|---|
Description: | Tools for combining and cleaning data sets, particularly with grouped and time series data. This includes functions for merging data while reporting duplicates, filling in columns with values of a column in another data frame, and creating continuous time data for interrupted time series. |
Authors: | Christopher Gandrud [aut, cre] |
Maintainer: | Christopher Gandrud <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.23 |
Built: | 2024-12-19 04:23:05 UTC |
Source: | https://github.com/christophergandrud/datacombine |
Reports the cases that remain after listwise deletion of missing values for time-series cross-sectional data.
CasesTable(data, GroupVar, TimeVar, Vars)
data | a data frame with the full sample. |
GroupVar | a character string specifying the variable in |
TimeVar | an optional character string specifying the variable in |
Vars | a character vector with variable names from |
If TimeVar is specified then a data frame is returned with three columns: one identifying the GroupVar and two others giving the first and last observation time of each unique value of GroupVar after listwise deletion of missing values.
If TimeVar is not specified, then a vector of the unique GroupVar values remaining after listwise deletion of missing values is returned.
    # Create dummy data
    ID <- rep(1:4, 4)
    time <- rep(2000:2003, 4)
    a <- rep(c(1:3, NA), 4)
    b <- rep(c(1, NA, 3:4), 4)
    Data <- data.frame(ID, time, a, b)

    # Find cases that have not been listwise deleted
    CasesTable(Data, GroupVar = 'ID')
    CasesTable(Data, GroupVar = 'ID', Vars = 'a')
    CasesTable(Data, GroupVar = 'ID', TimeVar = 'time', Vars = 'a')
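To make the idea of "first and last observation after listwise deletion" concrete, here is a minimal base R sketch. It is not the package's implementation: the helper name cases_sketch and its argument names are hypothetical, and the logic is only an assumption based on the description above.

```r
# Hypothetical helper (not part of DataCombine): listwise-delete on `vars`,
# then report each group's first and last remaining time point.
cases_sketch <- function(data, group, time, vars) {
  kept <- data[complete.cases(data[, vars, drop = FALSE]), , drop = FALSE]
  first <- tapply(kept[[time]], kept[[group]], min)
  last <- tapply(kept[[time]], kept[[group]], max)
  data.frame(group = names(first),
             first = as.vector(first),
             last = as.vector(last),
             stringsAsFactors = FALSE)
}

ID <- rep(1:4, 4)
time <- rep(2000:2003, 4)
a <- rep(c(1:3, NA), 4)
Data <- data.frame(ID, time, a)

# ID 4 only ever has a missing `a`, so it drops out entirely
out <- cases_sketch(Data, group = "ID", time = "time", vars = "a")
print(out)
```

In this toy data every ID is observed in a single year, so the first and last columns coincide.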
Calculate changes (absolute, percent, and proportion) from a specified lag, including within groups
change(data, Var, GroupVar, TimeVar, NewVar, slideBy = -1, type = "percent", ...)
data | a data frame object. |
Var | a character string naming the variable you would like to find the percentage change for. |
GroupVar | a character string naming the variable grouping the units within which the percentage change will be found (e.g. countries in a time series). If |
TimeVar | optional character string naming the time variable. If specified then the data is ordered by Var-TimeVar before finding the change. |
NewVar | a character string specifying the name for the new variable to place the percentage change in. |
slideBy | numeric value specifying how many rows (time units) to make the percentage change comparison for. Positive values shift the data up (lead the data). |
type | character string set at |
... | arguments passed to |
Finds the absolute, percentage, or proportion change over a given time period either within groups of data or the whole data frame. Important: the data must be in time order and, if groups are used, group-time order.
a data frame
    # Create fake data frame
    A <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
    B <- c(1:10)
    Data <- data.frame(A, B)

    # Find proportion change from two periods before
    Out <- change(Data, Var = 'B',
                  type = 'proportion',
                  NewVar = 'PercentChange',
                  slideBy = -2)
    Out
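The arithmetic behind the three change types can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: change_sketch is a hypothetical helper, it takes a positive lag rather than a negative slideBy, and it assumes the change is measured relative to the lagged value.

```r
# Hypothetical helper: change of x relative to its value `lag` rows earlier.
change_sketch <- function(x, lag = 1,
                          type = c("percent", "proportion", "absolute")) {
  type <- match.arg(type)
  n <- length(x)
  stopifnot(lag >= 1, lag < n)
  lagged <- c(rep(NA, lag), x[seq_len(n - lag)])  # shift values down by `lag`
  delta <- x - lagged
  switch(type,
         absolute = delta,
         proportion = delta / lagged,
         percent = 100 * delta / lagged)
}

B <- 1:10
res <- change_sketch(B, lag = 2, type = "proportion")
res
```

The first `lag` entries are NA because there is no earlier value to compare against.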
CountSpell is a function that returns a variable counting the spell number for an observation. Works with grouped data.
CountSpell(data, TimeVar, SpellVar, GroupVar, NewVar, SpellValue)
data | a data frame object. |
TimeVar | a character string naming the time variable. |
SpellVar | a character string naming the variable with information on when each spell starts. |
GroupVar | a character string naming the variable grouping the units experiencing the spells. |
NewVar | a character string naming the new variable to place the spell counts in. |
SpellValue | a value indicating when a unit is in a spell. Must match the class of the |
    # Create fake data
    ID <- sort(rep(seq(1:4), 5))
    Time <- rep(1:20)
    Dummy <- c(1, sample(c(0, 1), size = 19, replace = TRUE))
    Data <- data.frame(ID, Time, Dummy)

    # Find spell for whole data frame
    DataSpell1 <- CountSpell(Data, TimeVar = 'Time', SpellVar = 'Dummy',
                             SpellValue = 1)
    head(DataSpell1)

    # Find spell for each ID group
    DataSpell2 <- CountSpell(Data, TimeVar = 'Time', SpellVar = 'Dummy',
                             GroupVar = 'ID', SpellValue = 1)
    head(DataSpell2)
dMerge merges two data frames and reports/drops/keeps only duplicates.
dMerge(data1, data2, by, Var, dropDups = TRUE, dupsOut = FALSE, fromLast = FALSE, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), incomparables = NULL)
data1 | a data frame. The first data frame to merge. |
data2 | a data frame. The second data frame to merge. |
by | specifications of the columns used for merging. |
Var | deprecated. |
dropDups | logical. Whether or not to drop duplicated rows based on |
dupsOut | logical. If |
fromLast | logical indicating if duplication should be considered from the reverse side. Only relevant if |
all | logical; all = L is shorthand for |
all.x | logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have |
all.y | logical; analogous to |
sort | logical. Should the result be sorted on the by columns? |
suffixes | a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in by etc). |
incomparables | values which cannot be matched. See |
DropNA drops rows from a data frame when they have missing (NA) values on a given variable(s).
DropNA(data, Var, message = TRUE)
data | a data frame object. |
Var | a character vector naming the variables you would like to have only non-missing ( |
message | logical. Whether or not to give you a message about the number of rows that are dropped. |
Partially based on Stack Overflow answer written by donshikin: http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame
    # Create data frame
    a <- c(1:4, NA)
    b <- c(1, NA, 3:5)
    ABData <- data.frame(a, b)

    # Remove missing values from column a
    ASubData <- DropNA(ABData, Var = "a", message = FALSE)

    # Remove missing values in columns a and b
    ABSubData <- DropNA(ABData, Var = c("a", "b"))

    # Remove missing values in all columns of ABData
    AllSubData <- DropNA(ABData)
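For comparison, the same row filtering can be sketched with base R's complete.cases. This is only an illustration of the underlying idea, not a claim about how DropNA is written; the object names a_only and both are made up for the example.

```r
a <- c(1:4, NA)
b <- c(1, NA, 3:5)
ABData <- data.frame(a, b)

# Keep rows with no NA in column a only
a_only <- ABData[complete.cases(ABData[, "a", drop = FALSE]), ]

# Keep rows with no NA in either a or b
both <- ABData[complete.cases(ABData[, c("a", "b")]), ]

nrow(a_only)
nrow(both)
```

Using drop = FALSE keeps the single-column selection a data frame, so complete.cases treats it row by row in both calls.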
Fills in missing (NA) values with the previous non-missing value
FillDown(data, Var)
data | a data frame. Optional as you can simply specify a vector with |
Var | the variable in |
data frame
    # Create fake data
    id <- c('Algeria', NA, NA, NA, 'Mexico', NA, NA)
    score <- rnorm(7)
    Data <- data.frame(id, score)

    # FillDown id
    DataOut <- FillDown(Data, 'id')

    ## Not run:
    # Use group_by and mutate from dplyr to FillDown grouped data, e.g.:
    Example <- Example %>%
      group_by(grouping) %>%
      mutate(NewFilled = FillDown(Var = VarToFill))
    ## End(Not run)
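Carrying the last non-missing value forward can be sketched in a few lines of base R. The helper fill_down_sketch is hypothetical and is not the package's implementation; in particular, leaving leading NAs untouched is an assumption of this sketch.

```r
# Hypothetical helper: replace each NA with the most recent non-missing value.
fill_down_sketch <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the latest non-missing value so far
  idx[idx == 0] <- NA        # leading NAs have nothing to carry forward
  x[!is.na(x)][idx]
}

id <- c('Algeria', NA, NA, NA, 'Mexico', NA, NA)
filled <- fill_down_sketch(id)
filled
```

The cumulative sum acts as a running index into the vector of observed values, which avoids an explicit loop.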
FillIn uses values of a variable from one data set to fill in missing values in another.
FillIn(D1, D2, Var1, Var2, KeyVar = c("iso2c", "year"), allow.cartesian = FALSE, KeepD2Vars = FALSE)
D1 | the data frame with the variable you would like to fill in. |
D2 | the data frame with the variable you would like to use to fill in |
Var1 | a character string of the name of the variable in |
Var2 | an optional character string of variable name in |
KeyVar | a character vector of variable names that are shared by |
allow.cartesian | logical. See the |
KeepD2Vars | logical, indicating whether or not to keep the variables from D2 in the output data frame. The default is FALSE. |
    # Create data set with missing values
    naDF <- data.frame(a = sample(c(1,2), 100, replace = TRUE),
                       b = sample(c(3,4), 100, replace = TRUE),
                       fNA = sample(c(100, 200, 300, 400, NA), 100, replace = TRUE))

    # Create full data set
    fillDF <- data.frame(a = c(1, 2, 1, 2),
                         b = c(3, 3, 4, 4),
                         j = c(5, 5, 5, 5),
                         fFull = c(100, 200, 300, 400))

    # Fill in missing f's from naDF with values from fillDF
    FilledInData <- FillIn(naDF, fillDF, Var1 = "fNA", Var2 = "fFull",
                           KeyVar = c("a", "b"))
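The fill-in step can be pictured as a keyed merge followed by a conditional replacement. The base R sketch below illustrates that idea only; it is not the package's implementation, and the small deterministic data frames are made up so the result is easy to check by eye.

```r
# Small deterministic stand-ins for the data sets in the example above
naDF <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                   fNA = c(100, NA, NA, 400))
fillDF <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     fFull = c(100, 200, 300, 400))

# Bring the filler variable alongside the variable with gaps
m <- merge(naDF, fillDF, by = c("a", "b"), all.x = TRUE)

# Use the filler value only where the original is missing
m$fNA <- ifelse(is.na(m$fNA), m$fFull, m$fNA)
m$fFull <- NULL
m
```

Non-missing values in fNA are left as they were; only the gaps take values from fFull.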
Find duplicated values in a data frame and subset it to either include or not include them.
FindDups(data, Vars, NotDups = FALSE, test = FALSE, ...)
data | a data frame to select the duplicated values from. |
Vars | character vector of variables in |
NotDups | logical. If |
test | logical. If |
... | arguments to pass to |
a data frame, unless test = TRUE and there are duplicates.
    Data <- data.frame(ID = c(1, 1, 2, 2), Value = c(1, 2, 3, 4))

    FindDups(Data, Vars = 'ID')
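As a point of reference, base R's duplicated flags every row whose key has already appeared. The sketch below shows that behaviour on the same data; whether FindDups also returns the first occurrence of each duplicated key is not stated here, so treat the split as an assumption.

```r
Data <- data.frame(ID = c(1, 1, 2, 2), Value = c(1, 2, 3, 4))

# TRUE for rows whose ID was already seen earlier in the data frame
is_dup <- duplicated(Data[, "ID", drop = FALSE])

dups <- Data[is_dup, ]       # later occurrences only
not_dups <- Data[!is_dup, ]  # first occurrence of each ID
dups
```

Passing fromLast = TRUE to duplicated would instead flag the earlier occurrences.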
FindReplace allows you to find and replace multiple character string patterns in a data frame's column.
FindReplace(data, Var, replaceData, from = "from", to = "to", exact = TRUE, vector = FALSE)
data | data frame with the column in which you would like to replace string patterns. |
Var | character string naming the column in which you would like to replace string patterns. The column must be of class |
replaceData | a data frame with at least two columns. One contains the patterns to replace and the other contains their replacement. Note: the pattern and its replacement must be in the same row. |
from | character string naming the column with the patterns you would like to replace. |
to | character string naming the column with the pattern replacements. |
exact | logical. Indicates whether to only replace exact pattern matches ( |
vector | logical. If |
    # Create original data
    ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE",
                               "Hamburg, DE", "Oslo, NO"),
                         b = c(8, 0.1, 3, 2, 1))

    # Create replacements data frame
    Replaces <- data.frame(from = c("UK", "DE"), to = c("England", "Germany"))

    # Replace patterns and return full data frame
    ABNewDF <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                           from = "from", to = "to", exact = FALSE)

    # Replace patterns and return the Var as a vector
    ABNewVector <- FindReplace(data = ABData, Var = "a", replaceData = Replaces,
                               from = "from", to = "to", vector = TRUE)
Subset a data frame if a specified pattern is found in a character string
grepl.sub(data, pattern, Var, keep.found = TRUE, ...)
data | data frame. |
pattern | character vector containing a regular expression to be matched in the given character vector. |
Var | character vector of the variables that the pattern should be found in. |
keep.found | logical. Whether or not to keep observations where the pattern is found ( |
... | arguments to pass to |
    # Create data frame
    ABData <- data.frame(a = c("London, UK", "Oxford, UK", "Berlin, DE",
                               "Hamburg, DE", "Oslo, NO"),
                         b = c(8, 0.1, 3, 2, 1))

    # Keep only data from Germany (DE)
    ABGermany <- grepl.sub(data = ABData, pattern = "DE", Var = "a")
Inserts a new row into a data frame
InsertRow(data, NewRow, RowNum = NULL)
data | a data frame to insert the new row into. |
NewRow | a vector whose length is the same as the number of columns in |
RowNum | numeric indicating which row to insert the new row as. If not specified then the new row is added to the end using a vanilla |
The function largely implements: http://stackoverflow.com/a/11562428
    # Create dummy data
    A <- B <- C <- D <- sample(1:20, size = 20, replace = TRUE)
    Data <- data.frame(A, B, C, D)

    # Create new row
    New <- rep(1000, 4)

    # Insert into 4th row
    Data <- InsertRow(Data, NewRow = New, RowNum = 4)
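Inserting a row at a given position amounts to splitting the data frame and binding the pieces back together around the new row. The base R sketch below shows one way to do that; it is an illustration and not necessarily how InsertRow is written, and the object names are made up.

```r
Data <- data.frame(A = 1:5, B = 6:10)
New <- c(1000, 1000)
RowNum <- 4

# Turn the vector into a one-row data frame with matching column names
new_row <- as.data.frame(setNames(as.list(New), names(Data)))

# Rows before the insertion point, the new row, then the remaining rows
Out <- rbind(Data[seq_len(RowNum - 1), ],
             new_row,
             Data[RowNum:nrow(Data), ])
rownames(Out) <- NULL
Out
```

Resetting the row names keeps them sequential after the pieces are recombined.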
MoveFront moves variables to the front of a data frame.
MoveFront(data, Var, exact = TRUE, ignore.case = NULL, fixed = NULL)
data | a data frame object containing the variable you want to move. |
Var | a character vector naming the variables you would like to move to the front of the data frame. The order of the variables should match the order you want them to have in the data frame, i.e. the first variable in the vector will be the first variable in the data frame. |
exact | logical. If |
ignore.case | logical. If |
fixed | logical. If |
Based primarily on a Stack Overflow answer written by rcs: http://stackoverflow.com/questions/3369959/moving-columns-within-a-data-frame-without-retyping.
    # Create fake data
    A <- B <- C <- 1:50
    OldOrder <- data.frame(A, B, C)

    # Move C to front
    NewOrder1 <- MoveFront(OldOrder, "C")
    names(NewOrder1)

    # Move B and A to the front
    NewOrder2 <- MoveFront(OldOrder, c("B", "A"))
    names(NewOrder2)

    ## Non-exact matching (example from Felix Hass)
    # Create fake data
    df <- data.frame(dummy = c(1,0), Name = c("Angola", "Chad"),
                     DyadName = c("Government of Angola - UNITA",
                                  "Government of Chad - FNT"),
                     Year = c("2002", "1992"))

    df <- MoveFront(df, c("Name", "Year"), exact = FALSE)
    names(df)

    df <- MoveFront(df, c("Name", "Year"), exact = TRUE)
    names(df)
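With exact names, the reordering can be sketched in base R with setdiff. This shows the general idea only and says nothing about the non-exact matching options; the vector name front is made up for the example.

```r
OldOrder <- data.frame(A = 1:3, B = 4:6, C = 7:9)

# Columns to move to the front, in the order they should appear
front <- c("C", "A")

# Front columns first, then everything else in its original order
NewOrder <- OldOrder[, c(front, setdiff(names(OldOrder), front))]
names(NewOrder)
```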
Create new variable(s) indicating if there are missing values in other variable(s)
NaVar(data, Var, Stub = "Miss_", reverse = FALSE, message = TRUE)
data | a data frame object. |
Var | a character vector naming the variable(s) within which you would like to identify missing values. |
Stub | a character string indicating the stub you would like to append to the new variables' name(s). |
reverse | logical. If |
message | logical. Whether or not to give you a message about the names of the new variables that are created. |
    # Create data frame
    a <- c(1, 2, 3, 4, NA)
    b <- c( 1, NA, 3, 4, 5)
    ABData <- data.frame(a, b)

    # Create variables indicating missing values in columns a and b
    ABData1 <- NaVar(ABData, Var = c('a', 'b'))

    # Create variable indicating missing values in column a with reversed dummy
    ABData2 <- NaVar(ABData, Var = 'a', reverse = TRUE, message = FALSE)
Calculate the percentage change from a specified lag, including within groups
PercChange(data, Var, GroupVar, NewVar, slideBy = -1, type = "percent", ...)
data | a data frame object. |
Var | a character string naming the variable you would like to find the percentage change for. |
GroupVar | a character string naming the variable grouping the units within which the percentage change will be found (e.g. countries in a time series). If |
NewVar | a character string specifying the name for the new variable to place the percentage change in. |
slideBy | numeric value specifying how many rows (time units) to make the change comparison for. Positive values shift the data up (lead the data). |
type | character string set at either |
... | arguments passed to |
Finds the percentage or proportion change over a given time period either within groups of data or the whole data frame. Important: the data must be in time order and, if groups are used, group-time order.
a data frame
    # Create fake data frame
    A <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
    B <- c(1:10)
    Data <- data.frame(A, B)

    # Find proportion change from two periods before
    Out <- PercChange(Data, Var = 'B',
                      type = 'proportion',
                      NewVar = 'PercentChange',
                      slideBy = -2)
    Out
Extract a single column from a data frame (including tbl_df) and return it as a vector.
pull(data, Var)
data | a data frame to extract the column from. |
Var | a character string identifying the column to extract as a vector. |
a vector
Modified from Tommy O'Dell: http://stackoverflow.com/a/24730843/1705044
rmExcept removes all objects from a workspace except those specified by the user.
rmExcept(keepers, envir = globalenv(), message = TRUE)
keepers | a character vector of the names of the objects you would like to keep in your workspace. |
envir | the |
message | logical, whether or not to return a message informing the user of which objects were removed. |
    # Create objects
    A <- 1; B <- 2; C <- 3

    # Remove all objects except for A
    rmExcept("A")

    # Show workspace
    ls()
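The same effect can be sketched with base R's rm, ls, and setdiff. To avoid clearing a real workspace, this illustration works in a scratch environment; the environment name e is made up, and the sketch is not the package's implementation.

```r
# Work in a scratch environment rather than the global workspace
e <- new.env()
assign("A", 1, envir = e)
assign("B", 2, envir = e)
assign("C", 3, envir = e)

keepers <- "A"

# Remove everything in `e` whose name is not listed in `keepers`
rm(list = setdiff(ls(envir = e), keepers), envir = e)

ls(envir = e)
```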
The function shifts a vector up or down to create lag or lead variables. Note: your data needs to be sorted by date. The date should be ascending (i.e. increasing as it moves down the rows).
shift(VarVect, shiftBy, reminder = TRUE)
VarVect | a vector you would like to shift (create lag or lead). |
shiftBy | numeric value specifying how many rows (time units) to shift the data by. Negative values shift the data down (lag the data). Positive values shift the data up (lead the data). |
reminder | logical. Whether or not to remind you to order your data by the |
shift is a function for creating lag and lead variables, including for time-series cross-sectional data.
a vector
Largely based on TszKin Julian's shift function: http://ctszkin.com/2012/03/11/generating-a-laglead-variables/.
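The sign convention described above (negative values lag, positive values lead) can be sketched in base R. The helper shift_sketch is hypothetical and written only to illustrate the convention; it is not the package's shift function.

```r
# Hypothetical helper: lag (by < 0) or lead (by > 0) a vector, padding with NA.
shift_sketch <- function(x, by) {
  n <- length(x)
  stopifnot(abs(by) < n)
  if (by < 0) {
    c(rep(NA, -by), x[seq_len(n + by)])   # values move down: lag
  } else if (by > 0) {
    c(x[(by + 1):n], rep(NA, by))         # values move up: lead
  } else {
    x
  }
}

lagged <- shift_sketch(1:5, -1)  # NA 1 2 3 4
led <- shift_sketch(1:5, 2)      # 3 4 5 NA NA
```

The result always has the same length as the input, so it can be stored as a new column next to the original.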
Internal function for slideMA
shiftMA(x, shiftBy, Abs, reminder)
x | vector |
shiftBy | numeric |
Abs | numeric |
reminder | logical |
The function slides a column up or down to create lag or lead variables. If GroupVar is specified it will slide Var for each group. This is important for time-series cross-section data. The slid data is placed in a new variable in the original data frame.
Note: your data needs to be sorted by date. The date should be ascending (i.e. increasing as it moves down the rows). Also, the time difference between rows should be constant, e.g. days, months, years, and without missing values.
slide(data, Var, TimeVar, GroupVar, NewVar, slideBy = -1, keepInvalid = FALSE, reminder = TRUE)
data | a data frame object. |
Var | a character string naming the variable you would like to slide (create lag or lead). |
TimeVar | optional character string naming the time variable. If specified then the data is ordered by Var-TimeVar before sliding. |
GroupVar | a character string naming the variable grouping the units within which |
NewVar | a character string specifying the name for the new variable to place the slid data in. |
slideBy | numeric value specifying how many rows (time units) to shift the data by. Negative values slide the data down (lag the data). Positive values shift the data up (lead the data). |
keepInvalid | logical. Whether or not to keep observations for groups for which no valid lag/lead can be created due to an insufficient number of time period observations. If |
reminder | logical. Whether or not to remind you to order your data by the |
slide is a function for creating lag and lead variables, including for time-series cross-sectional data.
a data frame
Partially based on TszKin Julian's shift function: http://ctszkin.com/2012/03/11/generating-a-laglead-variables/
    # Create dummy data
    A <- B <- C <- sample(1:20, size = 20, replace = TRUE)
    ID <- sort(rep(seq(1:4), 5))
    Data <- data.frame(ID, A, B, C)

    # Lead the variable by two time units
    DataSlid1 <- slide(Data, Var = 'A', NewVar = 'ALead', slideBy = 2)

    # Lag the variable one time unit by ID group
    DataSlid2 <- slide(data = Data, Var = 'B', GroupVar = 'ID',
                       NewVar = 'BLag', slideBy = -1)

    # Lag the variable two time units by ID group, with invalid lags
    Data <- Data[1:16, ]
    DataSlid3 <- slide(data = Data, Var = 'B', GroupVar = 'ID',
                       NewVar = 'BLag', slideBy = -2, keepInvalid = TRUE)
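Why grouping matters for lags can be seen in a small base R sketch using ave, which applies a function within each group and returns the results in the original row order. This is an illustration with made-up data, not the package's implementation.

```r
ID <- rep(1:2, each = 3)
B <- c(10, 20, 30, 40, 50, 60)
Data <- data.frame(ID, B)

# Lag B by one row within each ID group; the first row of every group
# becomes NA instead of borrowing a value from the previous group
Data$BLag <- ave(Data$B, Data$ID,
                 FUN = function(v) c(NA, v[-length(v)]))
Data
```

Without the grouping, row 4 would wrongly receive the value 30 from the last row of group 1.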
Create a moving average for a period before or after each time point for a given variable
slideMA(data, Var, GroupVar, periodBound = -3, offset = 1, NewVar, reminder = TRUE, ...)
data | a data frame object. |
Var | a character string naming the variable you would like to create the lag/lead moving averages from. |
GroupVar | a character string naming the variable grouping the units within which |
periodBound | integer. The time point for the outer bound of the time period over which to create the moving averages. The default is -3. |
offset | integer. How many time increments away from a given time point to begin the moving average time period. The default is 1. |
NewVar | a character string specifying the name for the new variable to place the slid data in. |
reminder | logical. Whether or not to remind you to order your data by the |
... | arguments to pass through. |
slideMA is designed to give you more control over the window for creating the moving average. Think of the periodBound and offset arguments working together. If, for example, periodBound = -3 and offset = 1, then the variable of interest will be lagged by 2 and a moving average window of three time increments around the lagged variable is found.
a data frame
    # Create dummy data
    A <- B <- C <- sample(1:20, size = 20, replace = TRUE)
    ID <- sort(rep(seq(1:4), 5))
    Data <- data.frame(ID, A, B, C)

    # Moving average over a lead window (periodBound = 3)
    DataSlidMA1 <- slideMA(Data, Var = 'A', NewVar = 'ALead_MA',
                           periodBound = 3)

    # Moving average over a lagged window by ID group
    # (periodBound = -3, offset = 2)
    DataSlidMA2 <- slideMA(data = Data, Var = 'B', GroupVar = 'ID',
                           NewVar = 'BLag_MA', periodBound = -3, offset = 2)
Spread a dummy variable (1's and 0's) over a specified time period and for specified groups
SpreadDummy(data, Var, GroupVar, NewVar, spreadBy = -2, reminder = TRUE)
data | a data frame object. |
Var | a character string naming the numeric dummy variable with values 0 and 1 that you would like to spread. Can be either spread as a lag or lead. |
GroupVar | a character string naming the variable grouping the units within which |
NewVar | a character string specifying the name for the new variable to place the spread dummy data in. |
spreadBy | numeric value specifying how many rows (time units) to spread the data over. Negative values spread the data down (lag the data). Positive values spread the data up (lead the data). |
reminder | logical. Whether or not to remind you to order your data by the |
    # Create dummy data
    ID <- sort(rep(seq(1:4), 5))
    NotVar <- rep(1:5, 4)
    Dummy <- sample(c(0, 1), size = 20, replace = TRUE)
    Data <- data.frame(ID, NotVar, Dummy)

    # Spread
    DataSpread1 <- SpreadDummy(data = Data, Var = 'Dummy',
                               spreadBy = 2, reminder = FALSE)

    DataSpread2 <- SpreadDummy(data = Data, Var = 'Dummy', GroupVar = 'ID',
                               spreadBy = -2)
StartEnd finds the starting and ending time points of a spell, including for time-series cross-sectional data. Note: your data needs to be sorted by date. The date should be ascending (i.e. increasing as it moves down the rows).
StartEnd(data, SpellVar, GroupVar, SpellValue, OnlyStart = FALSE, ...)
data | a data frame object. |
SpellVar | a character string naming the variable with information on when each spell starts. |
GroupVar | a character string naming the variable grouping the units experiencing the spells. If |
SpellValue | a value indicating when a unit is in a spell. If |
OnlyStart | logical for whether or not to only add a new |
... | Arguments to pass to |
a data frame. If OnlyStart = FALSE then two new variables are returned:
Spell_Start: The start time period of a given spell.
Spell_End: The end time period of a given spell.
If OnlyStart = TRUE then only Spell_Start is added. This variable includes 1's both for the start of a new spell and for the start of a 'gap spell', i.e. a spell after Spell_End.
    # Create fake data
    ID <- sort(rep(seq(1:4), 5))
    Time <- rep(1:5, 4)
    Dummy <- c(1, sample(c(0, 1), size = 19, replace = TRUE))
    Data <- data.frame(ID, Time, Dummy)

    # Find start/end of spells denoted by Dummy = 1
    DataSpell <- StartEnd(Data, SpellVar = 'Dummy', GroupVar = 'ID',
                          TimeVar = 'Time', SpellValue = 1)

    head(DataSpell)
Expands a data set so that it includes an observation for each time point in a sequence. Works with grouped data.
TimeExpand(data, GroupVar, TimeVar, begin, end, by = 1)
data | a data frame. |
GroupVar | the variable in |
TimeVar | the variable in |
begin | numeric of length 1. Specifies beginning time point. Only relevant if |
end | numeric of length 1. Specifies ending time point. Only relevant if |
by | numeric or character string specifying the steps in the |
    Data <- data.frame(country = c("Cambodia", "Cambodia", "Japan", "Japan"),
                       year = c(1990, 2001, 1994, 2012))

    ExpandedData <- TimeExpand(Data, GroupVar = 'country', TimeVar = 'year')
Creates a continuous Unit-Time-Dummy data frame from a data frame with Unit-Start-End times
TimeFill(data, GroupVar, StartVar, EndVar, NewVar = "TimeFilled", NewTimeVar = "Time", KeepStartStop = FALSE)
data | a data frame with a Group, Start, and End variables. |
GroupVar | a character string naming the variable grouping the units within which the new dummy variable will be found. |
StartVar | a character string indicating the variable with the starting times of some series. |
EndVar | a character string indicating the variable with the ending times of some series. |
NewVar | a character string specifying the name of the new dummy variable for the series. The default is "TimeFilled". |
NewTimeVar | a character string specifying the name of the new time variable. The default is "Time". |
KeepStartStop | logical indicating whether or not to keep the |
Returns a data frame with at least three columns, with the GroupVar, NewTimeVar, and a new dummy variable with the name specified by NewVar. This variable is 1 for every time increment between and including StartVar and EndVar. It is 0 otherwise.
    # Create fake data
    country <- c('Panama', 'Korea', 'Korea', 'Germany', 'Finland')
    start <- c(1995, 1980, 2004, 2000, 2012)
    end <- c(1995, 2001, 2010, 2002, 2014)
    Data <- data.frame(country, start, end)

    # TimeFill
    FilledData <- TimeFill(Data, GroupVar = 'country',
                           StartVar = 'start', EndVar = 'end')

    # Show selection from TimeFill-ed data
    FilledData[90:100, ]
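The shape of the output described above (one row per unit and time point, with a 0/1 dummy) can be sketched in base R. This is an illustration under assumptions: it takes the time range to run from the earliest start to the latest end across all units, which the text here does not confirm, and the object names spells and grid are made up.

```r
country <- c('Panama', 'Korea', 'Korea')
start <- c(1995, 1980, 2004)
end <- c(1995, 1982, 2005)
spells <- data.frame(country, start, end, stringsAsFactors = FALSE)

# Every unit-year from the earliest start to the latest end (assumed range)
grid <- expand.grid(country = unique(spells$country),
                    Time = min(spells$start):max(spells$end),
                    stringsAsFactors = FALSE)

# 1 if the unit-year falls inside any of that unit's spells, 0 otherwise
grid$TimeFilled <- mapply(function(g, t) {
  s <- spells[spells$country == g, ]
  as.integer(any(t >= s$start & t <= s$end))
}, grid$country, grid$Time, USE.NAMES = FALSE)

grid <- grid[order(grid$country, grid$Time), ]
head(grid)
```

A unit with several spells, such as Korea here, gets a 1 in each of them and a 0 in the gap between.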
VarDrop drops one or more variables from a data frame.
VarDrop(data, Var)
data | a data frame. |
Var | character vector containing the names of the variables to drop. |
    # Create dummy data
    a <- c(1, 2, 3, 4, NA)
    b <- c( 1, NA, 3, 4, 5)
    c <- c(1:5)
    d <- c(1:5)
    ABCData <- data.frame(a, b, c, d)

    # Drop b and c
    DroppedData <- VarDrop(ABCData, c('b', 'c'))