R/NMcheckData.R
NMcheckData.Rd
Check data in various ways for compatibility with Nonmem. Some findings are reported even though they would not make Nonmem fail, because they are typical dataset issues.
NMcheckData(
data,
file,
covs,
covs.occ,
cols.num,
col.id = "ID",
col.time = "TIME",
col.dv = "DV",
col.mdv = "MDV",
col.cmt = "CMT",
col.amt = "AMT",
col.flagn,
col.row,
col.usubjid,
cols.dup,
type.data = "est",
cols.disable,
na.strings,
return.summary = FALSE,
quiet = FALSE,
as.fun
)
The data to check. data.frame, data.table, tibble, anything that can be converted to data.table.
As an alternative to checking a data object, you can use file to specify a control stream to check. This can either be a (working or non-working) input control stream or an output control stream. In this case, NMcheckData checks column names in data against the control stream (see NMcheckColnames), reads the data as Nonmem would do, and does the same checks on the data as NMcheckData would do using the data argument. col.flagn is ignored in this case; instead, ACCEPT/IGNORE statements in the control stream are applied. The file argument is useful for debugging a Nonmem model.
Columns that contain subject-level covariates. They are expected to be non-missing, numeric and not varying within subjects.
A list specifying columns that contain subject:occasion-level covariates. They are expected to be non-missing, numeric and not varying within combinations of subject and occasion. covs.occ=list(PERIOD=c("FED")) means that FED is the covariate, while PERIOD indicates the occasion.
Columns that are expected to be present, numeric and non-NA. If a character vector is given, the columns are expected to be used in all rows. If a column is only used for a subset of rows, use a list and name the elements by subsetting strings. See examples.
The name of the column that holds the subject identifier. Default is "ID".
The name of the column holding actual time.
The name of the column holding the dependent variable. For now, only one column can be specified, and MDV is assumed to match this column. Default is DV.
The name of the column holding the binary indicator of the dependent variable missing. Default is MDV.
The name(s) of the compartment column(s). These will be checked to be positive integers for all rows. They are also used in checks for row duplicates.
The name of the dose amount column.
Optionally, the name of the column holding numeric exclusion flags. Default value is FLAG and can be configured using NMdataConf. Even though FLAG is the default value, no finding will be returned if the column is missing unless explicitly defined as col.flagn="FLAG". This is because this way of using exclusion flags is only one of many ways you could choose to handle exclusions. Disable completely by using col.flagn=FALSE.
A column with a unique value for each row. Using such a column is recommended if possible. The default ("ROW") can be modified using NMdataConf.
Optional unique subject identifier. It is recommended to keep a unique subject identifier (typically a character string including an abbreviated study name and the subject id) from the clinical datasets in the analysis set. If you supply the name of the column holding this identifier, NMcheckData will check that it is non-missing, that it is unique within values of col.id (i.e. that the analysis subject ID's are unique across actual subjects), and that col.id is unique within the unique subject ID (a violation of the latter is less likely).
Additional column names to consider in search of duplicate events. col.id, col.cmt, col.evid, and col.time are always considered if found in data, and cols.dup is added to this list if provided.
"est" for estimation data (default), and "sim" for simulation data. Differences are that col.row is not expected for simulation data, and subjects will be checked to have EVID==0 rows for estimation data and EVID==2 rows for simulation data.
Columns to not check. This is particularly useful when checking data sets that do not include e.g. CMT, EVID, and others. To skip checking specific columns, provide their names like cols.disable=c("CMT","EVID").
Strings to be accepted when trying to convert characters to numerics. This will typically be a string that represents missing values. Default is ".". Notice, actual NA, i.e. not a string, is allowed independently of na.strings. See ?NMisNumeric.
If TRUE (not default), the table summary that is printed if quiet=FALSE is returned as well. In that case, a list is returned, and the findings are in an element called findings.
Keep quiet? Default is not to.
The default is to return data as a data.frame. Pass a function (say tibble::as_tibble) in as.fun to convert to something else. If data.tables are wanted, use as.fun="data.table". The default can be configured using NMdataConf.
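The expectations on covs and covs.occ described above can be illustrated with a small base R sketch. The helper is_constant_within and the toy data are hypothetical and not part of NMdata. The sketch only shows what "not varying within subjects" and "not varying within combinations of subject and occasion" mean; it is not how NMcheckData implements the check.

```r
## Hypothetical helper, not part of NMdata.
## TRUE if the covariate is non-missing and takes exactly one value
## within every combination of the grouping columns in `by`.
is_constant_within <- function(data, cov, by) {
  groups <- interaction(data[by], drop = TRUE)
  n.values <- tapply(data[[cov]], groups, function(x) length(unique(x)))
  any.na <- tapply(data[[cov]], groups, function(x) any(is.na(x)))
  all(n.values == 1 & !any.na)
}

## Toy data: WEIGHT is a subject-level covariate, FED changes by PERIOD
dat <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2),
  PERIOD = c(1, 1, 2, 2, 1, 1, 2, 2),
  WEIGHT = c(70, 70, 70, 70, 82, 82, 82, 82),
  FED    = c(0, 0, 1, 1, 1, 1, 0, 0)
)

## WEIGHT would qualify for covs
is_constant_within(dat, cov = "WEIGHT", by = "ID")
## FED varies within subject, so it would not qualify for covs
is_constant_within(dat, cov = "FED", by = "ID")
## but it is constant within subject and occasion, as in
## covs.occ=list(PERIOD=c("FED"))
is_constant_within(dat, cov = "FED", by = c("ID", "PERIOD"))
```

In this toy data, WEIGHT would be passed in covs, while FED would be passed as covs.occ=list(PERIOD="FED").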
A table with findings
The following checks are performed. The term "numeric" does not refer to a numeric representation in R, but compatibility with Nonmem. The character string "2" is in this sense a valid numeric, "id2" is not.
Column names must be unique and not contain special characters
If an exclusion flag is used (for ACCEPT/IGNORE in Nonmem), elements must be non-missing and integers. Notice, if an exclusion flag is found, the rest of the checks are performed on rows where that flag equals 0 (zero) only.
If a unique row identifier is found, it has to be non-missing, increasing integers.
col.time (TIME), EVID, col.id (ID), col.cmt (CMT), and col.mdv (MDV): If present, elements must be non-missing and numeric.
col.time (TIME) must be non-negative
EVID must be in {0,1,2,3,4}.
CMT must be a positive integer. However, it can be missing or zero for EVID==3.
MDV must be the binary (1/0) representation of is.na(DV) for observation records (EVID==0).
AMT must be 0 or NA for EVID 0 and 2
AMT must be positive for EVID 1 and 4
DV must be numeric
DV must be missing for EVID in {1,4}.
If found, RATE must be a numeric, equaling -2 or non-negative for dosing events.
If found, SS must be a numeric, equaling 0 or 1 for dosing records.
If found, ADDL must be a non-negative integer for dosing records. II must be present.
If found, II must be a non-negative integer for dosing records. ADDL must be present.
ID must be positive, and values cannot be disjoint: all records for each ID must follow each other. This is technically not a requirement in Nonmem, but disjoint IDs are most often an error. Use a second ID column if you deliberately want to soften this check.
TIME cannot be decreasing within ID, unless EVID in {3,4}.
All ID's must have doses (EVID in {1,4})
All ID's must have observations (EVID==0)
ID's should not have leading zeros since these will be lost when Nonmem reads, then writes the data.
If a unique row identifier is used, this must be non-missing, increasing, integer
Character values must not contain commas (they will mess up writing/reading csv)
Columns specified in covs argument must be non-missing, numeric and not varying within subjects.
Columns specified in covs.occ must be non-missing, numeric and not varying within combinations of subject and occasion.
Columns specified in cols.num must be present, numeric and non-NA.
If a unique subject identifier column (col.usubjid) is provided, col.id must be unique within values of col.usubjid and vice versa.
Events should not be duplicated. For all rows, the combination of col.id, col.cmt, col.evid, col.time plus the optional columns specified in cols.dup must be unique. In other words, if a subject (col.id) has, say, two observations (col.evid) in the same compartment (col.cmt) at the same time (col.time), this is considered a duplicate. The exception is if there is a reset event (col.evid is 3 or 4) in between the two rows. cols.dup can be used to add columns to this analysis. This is useful for different assays run on the same compartment (say a DVID column) or maybe stacked datasets. If col.cmt is of length>1, this search is repeated for each cmt column.
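Two of the checks above can be sketched in base R: "numeric" in the Nonmem sense, and the search for duplicated events. The helpers is_nm_numeric and find_dup_events are hypothetical and not part of NMdata. They simplify on purpose: as.numeric is only an approximation of what Nonmem can read, and reset events (EVID 3 or 4) between two rows are not taken into account.

```r
## Hypothetical helpers, not part of NMdata.

## TRUE where a value can be read as a number. Strings listed in
## na.strings and actual NA are accepted as missing values.
is_nm_numeric <- function(x, na.strings = ".") {
  x.chr <- trimws(as.character(x))
  ok.missing <- is.na(x) | x.chr %in% na.strings
  ok.number <- !is.na(suppressWarnings(as.numeric(x.chr)))
  ok.missing | ok.number
}

## "2" is a valid numeric in this sense, "id2" is not
is_nm_numeric(c("2", "id2", ".", NA))

## TRUE for rows whose combination of key columns occurs more than once.
## Simplification: reset events between the rows are ignored.
find_dup_events <- function(data, cols = c("ID", "CMT", "EVID", "TIME")) {
  cols <- intersect(cols, names(data))
  key <- data[, cols, drop = FALSE]
  duplicated(key) | duplicated(key, fromLast = TRUE)
}

## Toy data: two assays (DVID 1 and 2) sampled at the same time
dat <- data.frame(
  ID   = c(1, 1, 1, 1),
  TIME = c(0, 1, 1, 2),
  EVID = c(1, 0, 0, 0),
  CMT  = c(1, 2, 2, 2),
  DVID = c(NA, 1, 2, 1)
)

## rows 2 and 3 share ID, CMT, EVID and TIME
find_dup_events(dat)
## adding DVID to the key, similar in spirit to cols.dup="DVID"
find_dup_events(dat, cols = c("ID", "CMT", "EVID", "TIME", "DVID"))
```

The second call shows why an extra key column is useful: once DVID is part of the key, the two assay rows at TIME 1 are no longer flagged.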
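if (FALSE) { # \dontrun{
library(NMdata)
library(data.table)
dat <- readRDS(system.file("examples/data/xgxr2.rds", package="NMdata"))
## make sure dat is a data.table so that := works below
setDT(dat)
NMcheckData(dat)
## add an LLOQ column that is populated for samples (EVID==0) only
dat[EVID==0,LLOQ:=3.5]
## expecting LLOQ only for samples
NMcheckData(dat,cols.num=list(c("STUDY"),"EVID==0"=c("LLOQ")))
} # }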
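As a further sketch, the covariate and subject identifier arguments can be tried on a small hand-made dataset. The toy data and its column names (USUBJID, PERIOD, WEIGHT, FED) are assumptions made for illustration. The argument names are the ones documented above. The call only runs if NMdata is installed, and no particular findings are claimed here.

```r
## Toy dataset (hypothetical): two subjects, two periods, one dose and
## one sample per period
dat <- data.frame(
  ROW     = 1:8,
  USUBJID = rep(c("STUDY1-001", "STUDY1-002"), each = 4),
  ID      = rep(1:2, each = 4),
  PERIOD  = rep(c(1, 1, 2, 2), times = 2),
  TIME    = rep(c(0, 1, 24, 25), times = 2),
  EVID    = rep(c(1, 0, 1, 0), times = 2),
  CMT     = rep(c(1, 2, 1, 2), times = 2),
  AMT     = rep(c(100, 0, 100, 0), times = 2),
  DV      = rep(c(NA, 4.2, NA, 3.9), times = 2),
  MDV     = rep(c(1, 0, 1, 0), times = 2),
  WEIGHT  = rep(c(70, 82), each = 4),
  FED     = c(0, 0, 1, 1, 1, 1, 0, 0)
)

if (requireNamespace("NMdata", quietly = TRUE)) {
  ## WEIGHT is checked as a subject-level covariate, FED as a covariate
  ## that may vary between occasions defined by PERIOD
  findings <- NMdata::NMcheckData(
    dat,
    covs = "WEIGHT",
    covs.occ = list(PERIOD = "FED"),
    col.usubjid = "USUBJID"
  )
}
```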