Note: This article is reproduced from stackoverflow.com.
matrix r sparse-matrix

sparse.model.matrix creating inconsistent output

Published on 2020-04-10 16:06:39

I have an xgboost model on two different servers - a test server and a production server. Each server has exactly the same data and exactly the same code, but when I apply the same model to the same data in each environment I get a slightly different result. We need the results to be identical.

I've found that the sparse matrix object that the following line returns is different on each server:

mm <- sparse.model.matrix(y ~ ., data = df.new)[,-1]

The mm on the test server has @i and @x of length 182, whereas the mm on the production server has @i and @x of length 184. Again, I've compared the df.new from both servers and they are identical.

I've tried downgrading the Matrix package on the production server so that the versions match, but it's still producing different results. The only idea I have left is to match the versions of every package.

Does anyone have any suggestions for what might be happening? Unfortunately I can't share the data, but if it helps, it's 227 variables of mixed types (775 when converted to sparse model matrix). A lot of the variables are mostly 0.

I don't know if it makes a difference or not, but the test server is Windows and the production server is Linux.

Ben Bolker 2020-02-03 21:45

You're getting bitten by the conjunction of two problems:

(1) Floating-point computations are inherently sensitive to small differences (platform, compiler, compiler settings ...).

(2) Ordered factors in R use orthogonal polynomial contrasts (see ?contr.poly or Venables and Ripley, Modern Applied Statistics with S), which involve floating-point computation.

> dd <- data.frame(x=ordered(0:2))
> Matrix::sparse.model.matrix(~x,dd)
3 x 3 sparse Matrix of class "dgCMatrix"
  (Intercept)           x.L        x.Q
1           1 -7.071068e-01  0.4082483
2           1 -7.850462e-17 -0.8164966
3           1  7.071068e-01  0.4082483

You can see that one of the entries here is close to but not exactly equal to zero. So far I haven't actually been able to come up with an example that displays a difference between the two platforms I have handy (Ubuntu Linux and macOS), but this is almost surely the source of your problem: the nearly-zero entry is computed as exactly zero on one platform but not the other.
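To check whether this is what is happening on your servers, you could list the stored entries that are suspiciously close to zero and see which columns they belong to. The following is only a sketch on the toy data: the cutoff `tol = 1e-12` and the helper names `col_of_entry`, `tiny`, and `report` are my own choices for illustration, and the report may well be empty on a platform where the entry comes out as exactly zero.

```r
library(Matrix)

# Toy data: one ordered factor with three levels
dd <- data.frame(x = ordered(0:2))
mm <- sparse.model.matrix(~ x, dd)

# A dgCMatrix stores its entries column by column; diff(mm@p) gives the
# number of stored entries per column, so this recovers the column index
# of every element of mm@x.
col_of_entry <- rep(seq_len(ncol(mm)), times = diff(mm@p))

# Stored entries whose magnitude is below an assumed cutoff
tol  <- 1e-12
tiny <- abs(mm@x) < tol

report <- data.frame(
  row    = mm@i[tiny] + 1L,                     # @i is zero-based
  column = colnames(mm)[col_of_entry[tiny]],
  value  = mm@x[tiny]
)
print(report)
```

Running the same thing on the real `mm` on both servers and comparing the two reports should show whether the extra two entries on the production server sit in columns that come from ordered factors.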

There is probably no perfect solution to this problem, but zapsmall() would convert small entries to zero, and drop0 would convert them from explicit to implicit (structural) zero entries, so drop0(zapsmall(mm)) might work ...
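A minimal sketch of that clean-up on the toy data follows. One thing to be aware of is that zapsmall() rounds every entry (to getOption("digits") significant digits relative to the largest entry), not just the tiny ones, so the remaining values change slightly as well. If that matters, the `tol` argument of drop0() might be an alternative, since it only drops entries at or below the threshold and leaves the rest untouched. The threshold 1e-12 is an assumption of mine, not something derived from your data.

```r
library(Matrix)

dd <- data.frame(x = ordered(0:2))
mm <- sparse.model.matrix(~ x, dd)

# Option 1: round small entries to zero, then remove the explicit zeros
mm_clean <- drop0(zapsmall(mm))

# Option 2 (assumed threshold): drop entries with |value| <= 1e-12
# without rounding the remaining values
mm_tol <- drop0(mm, tol = 1e-12)

# Number of stored entries before and after
length(mm@x)
length(mm_clean@x)
length(mm_tol@x)

# Option 2 keeps the surviving values bit-for-bit; option 1 may not
all(mm_tol@x %in% mm@x)
```

Whichever variant you use, it has to be applied identically on both servers (and at training time as well as prediction time) for the matrices to have a chance of matching.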