---
title: "Introduction to xform_function"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to xform_function}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Introduction

This vignette provides examples of how to use the `xform_function` transformation to create new data features for PMML models.

Given a `xform_wrap` object and a transformation expression, `xform_function` calculates data for a new feature and creates a new `xform_wrap` object. When PMML is produced with `pmml::pmml()`, the transformation is inserted into the `LocalTransformations` node as a `DerivedField`.

Multiple data fields and functions can be combined to produce a new feature.

The code below uses `knitr::kable()` to make tables more readable.

```{r, echo = FALSE, warning = FALSE, message = FALSE, results = "hide"}
library(pmml)
library(knitr)
```

## Single numeric input

Using the `iris` dataset as an example, let's construct a new feature by transforming one variable.

Load the dataset and show the first few lines:

```{r}
data(iris)
kable(head(iris, 3))
```

Create the `iris_box` object with `xform_wrap`:

```{r}
iris_box <- xform_wrap(iris)
```

`iris_box` contains the data and transform information that will be used to produce PMML later. The original data is in `iris_box$data`. Any new features created with a transformation are added as columns to this data frame.

```{r}
kable(head(iris_box$data, 3))
```

Transform and field information is in `iris_box$field_data`. The `field_data` data frame contains information on every field in the dataset, as well as every transform used. The `xform_function` column contains expressions used in the `xform_function` transform.

```{r}
kable(iris_box$field_data)
```

Now add a new feature, `Sepal.Length.Sqrt`, using `xform_function`:

```{r}
iris_box <- xform_function(iris_box,
  orig_field_name = "Sepal.Length",
  new_field_name = "Sepal.Length.Sqrt",
  expression = "sqrt(Sepal.Length)"
)
```

The new feature is calculated and added as a column to the `iris_box$data` data frame:

```{r}
kable(head(iris_box$data, 3))
```

`iris_box$field_data` now contains a new row with the transformation expression:

```{r}
kable(iris_box$field_data[6, c(1:3, 14)])
```

Construct a linear model for `Petal.Width` using this new feature, and convert it to PMML:

```{r}
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
```

Since the model predicts `Petal.Width` using a variable based on `Sepal.Length`, the PMML will contain these two fields in the `DataDictionary` and `MiningSchema`:

```{r}
fit_pmml[[2]] # Data Dictionary node
fit_pmml[[3]][[1]] # Mining Schema node
```

The `LocalTransformations` node contains `Sepal.Length.Sqrt` as a derived field:

```{r}
fit_pmml[[3]][[3]]
```

## Single categorical input

`xform_function` can also operate on categorical data. In this example, let's create a numeric feature that equals 1 when `Species` is `setosa`, and 0 otherwise:

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,
  orig_field_name = "Species",
  new_field_name = "Species.Setosa",
  expression = "if (Species == 'setosa') {1} else {0}"
)

kable(head(iris_box$data, 3))
```

Create a linear model and check the `LocalTransformations` node:

```{r}
fit <- lm(Petal.Width ~ Species.Setosa, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
fit_pmml[[3]][[3]]
```

## Multiple input fields

Several fields can be combined to create new features.
Let's make a new field from the ratio of sepal and petal lengths:

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,
  orig_field_name = "Sepal.Length,Petal.Length",
  new_field_name = "Length.Ratio",
  expression = "Sepal.Length / Petal.Length"
)
```

As before, the new field is added as a column to the `iris_box$data` data frame:

```{r}
kable(head(iris_box$data, 3))
```

Fit a linear model using this new feature, and convert it to PMML:

```{r}
fit <- lm(Petal.Width ~ Length.Ratio, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
```

The PMML will contain `Sepal.Length` and `Petal.Length` in the `DataDictionary` and `MiningSchema`:

```{r}
fit_pmml[[2]] # Data Dictionary node
fit_pmml[[3]][[1]] # Mining Schema node
```

The `LocalTransformations` node contains `Length.Ratio` as a derived field:

```{r}
fit_pmml[[3]][[3]]
```

## Using a previously derived feature

It is possible to pass a feature derived with `xform_function` to another `xform_function` call. To do this, the second call to `xform_function` must list the original data field names (instead of the derived field) in the `orig_field_name` argument.

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,
  orig_field_name = "Sepal.Length,Petal.Length",
  new_field_name = "Length.Ratio",
  expression = "Sepal.Length / Petal.Length"
)

iris_box <- xform_function(iris_box,
  orig_field_name = "Sepal.Length,Petal.Length,Sepal.Width",
  new_field_name = "Length.R.Times.S.Width",
  expression = "Length.Ratio * Sepal.Width"
)

kable(iris_box$field_data[6:7, c(1:3, 14)])
```

```{r}
fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
```

The PMML will contain `Sepal.Length`, `Petal.Length`, and `Sepal.Width` in the `DataDictionary` and `MiningSchema`:

```{r}
fit_pmml[[2]] # Data Dictionary node
fit_pmml[[3]][[1]] # Mining Schema node
```

The `LocalTransformations` node contains `Length.Ratio` and `Length.R.Times.S.Width` as derived fields:

```{r}
fit_pmml[[3]][[3]]
```

## Factor output

The resulting field can be numeric or factor. Note that factors are exported with `dataType = "string"` and `optype = "categorical"` in PMML.

The following code creates a factor with 3 levels from `Sepal.Length`:

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(
  wrap_object = iris_box,
  orig_field_name = "Sepal.Length",
  new_field_name = "SL_factor",
  new_field_data_type = "factor",
  expression = "if(Sepal.Length<5.1) {'level_A'} else if (Sepal.Length>6.6) {'level_B'} else {'level_C'}"
)

kable(head(iris_box$data, 3))
```

The feature can then be used to create a model as usual:

```{r}
fit <- lm(Petal.Width ~ SL_factor, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
```

## PMML functions supported by `xform_function`

The following R functions and operators are directly supported by `xform_function`.
Their PMML equivalents are listed in the second column:

```{r, echo = FALSE}
R <- c(
  "+", "-", "/", "*", "^", "<", "<=", ">", ">=", "&&", "&", "|", "||",
  "==", "!=", "!", "ceiling", "prod", "log"
)
PMML <- c(
  "+", "-", "/", "*", "pow", "lessThan", "lessOrEqual", "greaterThan",
  "greaterOrEqual", "and", "and", "or", "or", "equal", "notEqual", "not",
  "ceil", "product", "ln"
)

funcs_df <- data.frame(R, PMML)

knitr::kable(funcs_df)
```

For these functions, no extra code is required for translation.

The R function `prod` can be used as long as only numeric arguments are specified. Although `prod` accepts an `na.rm` argument in R, specifying it in an `xform_function` expression will not produce PMML equivalent to the R expression.

Similarly, the R function `log` can be used directly as long as the second argument (the base) is not specified.

## PMML functions not supported by `xform_function`

Some built-in functions defined in PMML have no R counterpart that `xform_function` can translate directly as described above. If such a function is used in an expression, R throws an error when it tries to calculate the new feature, because the function does not exist in the R environment.

It is still possible to make `xform_function` work, but the PMML function must be defined in the R environment first.

Let's use `isIn`, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. A detailed specification for this function is available on [this DMG page](http://dmg.org/pmml/v4-4-1/BuiltinFunctions.html#boolean5).

One way to implement this in R is by using `%in%`, with the list of values being represented by `...`:

```{r}
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

isIn(1, 2, 1, 4)
```

This function can now be passed to `xform_function`.
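
As noted later in this vignette, each transformation operates on one data row at a time, so `isIn` only ever receives a single value as its first argument. As a quick sanity check outside of `pmml`, the sketch below applies the same definition to a few values with base R's `vapply`. The `species` vector is made up for illustration, and the per-value loop is only an assumption about how row-wise evaluation can be mimicked, not a description of the package internals.

```{r}
# Same definition as above, repeated so that this chunk runs on its own.
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

# Illustrative values only; not taken from the iris data frame.
species <- c("setosa", "virginica", "versicolor")

# Call isIn() once per value, mimicking row-by-row evaluation.
in_set <- vapply(species, function(s) isIn(s, "setosa", "versicolor"),
  logical(1),
  USE.NAMES = FALSE
)
in_set
```
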
The following code creates a feature that indicates whether `Species` is either `setosa` or `versicolor`:

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,
  orig_field_name = "Species",
  new_field_name = "Species.Setosa.or.Versicolor",
  expression = "isIn(Species,'setosa','versicolor')"
)
```

The `data` data frame now contains the new feature:

```{r}
kable(head(iris_box$data, 3))
```

Create a linear model and view the corresponding PMML for the function:

```{r}
fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
fit_pmml[[3]][[3]]
```

## PMML function not supported by `xform_function` - another example

As another example, let's use R's `mean` function to create a new feature. PMML has a built-in `avg`, so we will define an R function with this name.

```{r}
avg <- function(...) {
  dots <- c(...)
  return(mean(dots))
}
```

Now use this function to take the average of several features and combine it with another field:

```{r}
iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,
  orig_field_name = "Sepal.Length,Petal.Length,Sepal.Width",
  new_field_name = "Length.Average.Ratio",
  expression = "avg(Sepal.Length,Petal.Length)/Sepal.Width"
)
```

The `data` data frame now contains the new feature:

```{r}
kable(head(iris_box$data, 3))
```

Create a simple linear model and view the corresponding PMML for the function:

```{r}
fit <- lm(Petal.Width ~ Length.Average.Ratio, data = iris_box$data)
fit_pmml <- pmml(fit, transform = iris_box)
fit_pmml[[3]][[3]]
```

In the PMML, `avg` will be recognized as a valid function.

## PMML for arbitrary functions

The function `function_to_pmml` (part of the `pmml` package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values. As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R.
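
To make "valid R expression" concrete, the sketch below uses base R's `parse()` to test whether a string parses. `is_valid_r` is a hypothetical helper written for this illustration and is not part of `pmml`; treating a successful parse as a proxy for what `function_to_pmml` accepts is an assumption.

```{r}
# Hypothetical helper (not part of pmml): TRUE if `text` parses as R code.
is_valid_r <- function(text) {
  tryCatch(
    {
      parse(text = text)
      TRUE
    },
    error = function(e) FALSE
  )
}

# Undefined function names do not prevent parsing.
is_valid_r("foo(bar(x * y))")

# An unbalanced parenthesis does.
is_valid_r("foo(bar(x * y)")
```

Parsing only checks syntax; nothing is evaluated, which is why `foo` and `bar` do not need to exist.
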
Variables in the expression passed to `xform_function` are always assumed to be field names, and are not substituted. That is, even if `x` has a value in the R environment, the resulting expression will still use `x`.

```{r}
function_to_pmml("1 + 2")

x <- 3
function_to_pmml("foo(bar(x * y))")
```

## More notes on functions

There are several limitations to parsing expressions in `xform_function`.

Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in `xform_function`.

An expression such as `foo(x)` is treated as a function `foo` with argument `x`. Consequently, passing in an R vector `c(1,2,3)` will produce PMML where `c` is a function and `1,2,3` are the arguments:

```{r}
function_to_pmml("c(1,2,3)")
```

We can also see what happens when passing an `na.rm` argument to `prod`, as mentioned above:

```{r}
function_to_pmml("prod(1,2,na.rm=FALSE)") # produces incorrect PMML
function_to_pmml("prod(1,2)") # produces correct PMML
```

Additionally, passing in a vector to `prod` produces incorrect PMML:

```{r}
prod(c(1, 2, 3))
function_to_pmml("prod(c(1,2,3))")
```

## More examples of functions

The following are additional examples of PMML produced from R expressions.

Extra parentheses:

```{r}
function_to_pmml("pmmlT(((1+2))*(x))")
```

If-else expressions:

```{r}
function_to_pmml("if(a<2) {x+3} else if (a>4) {4} else {5}")
```

## References

- [DMG PMML 4.4 specification](http://dmg.org/pmml/v4-4-1/GeneralStructure.html)