Check data for Nonmem compatibility or check control stream for data compatibility

Check data in various ways for compatibility with Nonmem. Some findings are reported even though they would not make Nonmem fail, because they are typical dataset issues.

NMcheckData(
  data,
  file,
  covs,
  covs.occ,
  cols.num,
  col.id = "ID",
  col.time = "TIME",
  col.dv = "DV",
  col.mdv = "MDV",
  col.cmt = "CMT",
  col.amt = "AMT",
  col.flagn,
  col.row,
  col.usubjid,
  cols.dup,
  type.data = "est",
  cols.disable,
  na.strings,
  return.summary = FALSE,
  quiet = FALSE,
  as.fun
)

Arguments

data: The data to check. data.frame, data.table, tibble, anything that can be converted to data.table.
file: As an alternative to checking a data object, use file to specify a control stream to check. This can be either a (working or non-working) input control stream or an output control stream. In this case, NMcheckData checks column names in the data against the control stream (see NMcheckColnames), reads the data as Nonmem would, and performs the same checks on the data as it would when using the data argument. col.flagn is ignored in this case; instead, ACCEPT/IGNORE statements in the control stream are applied. The file argument is useful for debugging a Nonmem model.
covs: Columns that contain subject-level covariates. They are expected to be non-missing, numeric and not varying within subjects.
covs.occ: A list specifying columns that contain subject:occasion-level covariates. They are expected to be non-missing, numeric and not varying within combinations of subject and occasion. covs.occ=list(PERIOD=c("FED")) means that FED is the covariate, while PERIOD indicates the occasion.
cols.num: Columns that are expected to be present, numeric and non-NA. If a character vector is given, the columns are expected to be used in all rows. If a column is only used for a subset of rows, use a list and name the elements by subsetting strings. See examples.
col.id: The name of the column that holds the subject identifier. Default is "ID".
col.time: The name of the column holding actual time.
col.dv: The name of the column holding the dependent variable. For now, only one column can be specified, and MDV is assumed to match this column. Default is DV.
col.mdv: The name of the column holding the binary indicator of the dependent variable missing. Default is MDV.
col.cmt: The name(s) of the compartment column(s). These will be checked to be positive integers for all rows. They are also used in checks for row duplicates.
col.amt: The name of the dose amount column.
col.flagn: Optionally, the name of the column holding numeric exclusion flags. Default value is FLAG and can be configured using NMdataConf. Even though FLAG is the default value, no finding will be returned if the column is missing unless explicitly defined as col.flagn="FLAG". This is because this way of using exclusion flags is only one of many ways you could choose to handle exclusions. Disable completely by using col.flagn=FALSE.
col.row: A column with a unique value for each row. Using such a column is recommended if possible. Default ("ROW") can be modified using NMdataConf.
col.usubjid: Optional unique subject identifier. It is recommended to keep a unique subject identifier (typically a character string including an abbreviated study name and the subject id) from the clinical datasets in the analysis set. If you supply the name of the column holding this identifier, NMcheckData will check that it is non-missing, that it is unique within values of col.id (i.e. that the analysis subject IDs are unique across actual subjects), and that col.id is unique within the unique subject ID (a violation of the latter is less likely).
cols.dup: Additional column names to consider in search of duplicate events. col.id, col.cmt, col.evid, and col.time are always considered if found in data, and cols.dup is added to this list if provided.
type.data: "est" for estimation data (default), and "sim" for simulation data. Differences are that col.row is not expected for simulation data, and subjects will be checked to have EVID==0 rows for estimation data and EVID==2 rows for simulation data.
cols.disable: Columns not to check. This is particularly useful when checking data sets that do not include e.g. `CMT`, `EVID`, and others. To skip checking specific columns, provide their names like `cols.disable=c("CMT","EVID")`.
na.strings: Strings to be accepted when trying to convert characters to numerics. This will typically be a string that represents missing values. Default is ".". Notice, actual NA, i.e. not a string, is allowed independently of na.strings. See ?NMisNumeric.
return.summary: If TRUE (not default), the table summary that is printed if quiet=FALSE is returned as well. In that case, a list is returned, and the findings are in an element called findings.
quiet: Keep quiet? Default is not to.
as.fun: The default is to return data as a data.frame. Pass a function (say tibble::as_tibble) in as.fun to convert to something else. If data.tables are wanted, use as.fun="data.table". The default can be configured using NMdataConf.

Value

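A table with findings. If return.summary=TRUE, a list is returned instead, with the findings in an element called findings.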

Details

The following checks are performed. The term "numeric" does not refer to a numeric representation in R, but compatibility with Nonmem. The character string "2" is in this sense a valid numeric, "id2" is not.

Column names must be unique and not contain special characters
If an exclusion flag is used (for ACCEPT/IGNORE in Nonmem), elements must be non-missing and integers. Notice, if an exclusion flag is found, the rest of the checks are performed on rows where that flag equals 0 (zero) only.
If a unique row identifier is found, it has to be non-missing, increasing integers.
col.time (TIME), EVID, col.id (ID), col.cmt (CMT), and col.mdv (MDV): If present, elements must be non-missing and numeric.
col.time (TIME) must be non-negative
EVID must be in {0,1,2,3,4}.
CMT must be positive integers. They can, however, be missing or zero for EVID==3.
MDV must be the binary (1/0) representation of is.na(DV) for observation records (EVID==0).
AMT must be 0 or NA for EVID 0 and 2
AMT must be positive for EVID 1 and 4
DV must be numeric
DV must be missing for EVID in {1,4}.
If found, RATE must be a numeric, equaling -2 or non-negative for dosing events.
If found, SS must be a numeric, equaling 0 or 1 for dosing records.
If found, ADDL must be a non-negative integer for dosing records. II must be present.
If found, II must be a non-negative integer for dosing records. ADDL must be present.
ID must be positive, and values cannot be disjoint (all records for each ID must follow each other). This is technically not a requirement in Nonmem, but disjoint IDs are most often an error. Use a second ID column if you deliberately want to soften this check.
TIME cannot be decreasing within ID, unless EVID in {3,4}.
All IDs must have doses (EVID in {1,4})
All IDs must have observations (EVID==0)
IDs should not have leading zeros since these will be lost when Nonmem reads, then writes the data.
If a unique row identifier is used, this must be non-missing, increasing, integer
Character values must not contain commas (they will mess up writing/reading csv)
Columns specified in covs argument must be non-missing, numeric and not varying within subjects.
Columns specified in covs.occ must be non-missing, numeric and not varying within combinations of subject and occasion.
Columns specified in cols.num must be present, numeric and non-NA.
If a unique subject identifier column (col.usubjid) is provided, `col.id` must be unique within values of col.usubjid and vice versa.
Events should not be duplicated. For all rows, the combination of col.id, col.cmt, col.evid, col.time plus the optional columns specified in cols.dup must be unique. In other words, if a subject (col.id) has, say, two observations (col.evid) in the same compartment (col.cmt) at the same time (col.time), this is considered a duplicate. The exception is if there is a reset event (col.evid is 3 or 4) in between the two rows. cols.dup can be used to add columns to this analysis. This is useful for different assays run on the same compartment (say a DVID column) or maybe stacked datasets. If col.cmt is of length>1, this search is repeated for each cmt column.
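Three of the checks above can be sketched in base R to make their logic concrete. This is only an illustration under simplifying assumptions, not the code NMcheckData runs: the helper names is_nm_numeric and ids_disjoint and the toy data are made up for this sketch, and the real numeric test is documented in ?NMisNumeric.

```r
## Hypothetical helper (not part of NMdata): TRUE where a value can be
## read as a number. Actual NA and the strings in na.strings are accepted.
## Rough approximation only; see ?NMisNumeric for the real rules.
is_nm_numeric <- function(x, na.strings = ".") {
  x <- trimws(as.character(x))
  is_missing <- is.na(x) | x %in% na.strings
  converted <- suppressWarnings(as.numeric(x))
  is_missing | !is.na(converted)
}

## "2" is numeric in this sense, "id2" is not
is_nm_numeric(c("2", "id2", ".", NA))

## Hypothetical helper: TRUE if the records of any ID are not contiguous.
## rle() collapses runs, so a repeated run value means an ID reappears.
ids_disjoint <- function(id) {
  anyDuplicated(rle(as.character(id))$values) > 0
}

ids_disjoint(c(1, 1, 2, 1))
ids_disjoint(c(1, 1, 2, 2))

## Duplicate events: same ID, CMT, EVID and TIME with no reset in between.
## Toy data, made up for illustration.
dat <- data.frame(
  ID   = c(1, 1, 1, 1, 1),
  TIME = c(0, 1, 1, 0, 1),
  EVID = c(1, 0, 0, 4, 0),
  CMT  = c(1, 2, 2, 1, 2)
)

## Count reset events (EVID 3 or 4) within each subject. Adding this counter
## to the key keeps rows separated by a reset from being flagged.
dat$reset <- ave(as.integer(dat$EVID %in% c(3, 4)), dat$ID, FUN = cumsum)

key <- dat[, c("ID", "CMT", "EVID", "TIME", "reset")]
dat$dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dat
```

In the toy data, rows 2 and 3 share subject, compartment, event type and time, so both are flagged. Row 5 has the same values but follows a reset (EVID 4), so it is not. To let further columns separate events, as cols.dup does, they would be added to key.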

Examples

if (FALSE) { # \dontrun{
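library(NMdata)
library(data.table)
dat <- readRDS(system.file("examples/data/xgxr2.rds", package="NMdata"))
## make sure dat is a data.table so := can be used below
setDT(dat)
NMcheckData(dat)
dat[EVID==0,LLOQ:=3.5]
## expecting STUDY in all rows and LLOQ only for samples
NMcheckData(dat,cols.num=list(c("STUDY"),"EVID==0"=c("LLOQ")))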
} # }
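
A further sketch shows how covs, covs.occ and return.summary might be combined. The data set is made up for illustration, and the column names WT, PERIOD and FED are assumptions of this example, not columns NMcheckData requires. Subject 2 has a weight that changes between rows, which goes against what is expected of a column listed in covs. The call is wrapped in requireNamespace() so the snippet is skipped when NMdata is not installed.

```r
## Toy data, made up for illustration.
dat <- data.frame(
  ROW    = 1:6,
  ID     = c(1, 1, 1, 2, 2, 2),
  TIME   = c(0, 1, 2, 0, 1, 2),
  EVID   = c(1, 0, 0, 1, 0, 0),
  CMT    = c(1, 2, 2, 1, 2, 2),
  AMT    = c(100, 0, 0, 100, 0, 0),
  DV     = c(NA, 5.1, 3.2, NA, 4.8, 2.9),
  MDV    = c(1, 0, 0, 1, 0, 0),
  WT     = c(70, 70, 70, 82, 82, 85),  # varies within subject 2
  PERIOD = c(1, 1, 1, 1, 1, 1),
  FED    = c(0, 0, 0, 1, 1, 1)
)

if (requireNamespace("NMdata", quietly = TRUE)) {
  ## WT is a subject-level covariate; FED is constant within ID and PERIOD.
  res <- NMdata::NMcheckData(
    dat,
    covs = "WT",
    covs.occ = list(PERIOD = "FED"),
    return.summary = TRUE,
    quiet = TRUE
  )
  ## With return.summary=TRUE a list is returned; the findings table
  ## is in the element called findings.
  res$findings
}
```

With quiet=TRUE nothing is printed by the check itself, so inspecting res$findings is the way to see what was reported.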