我不想重塑它,因为我有很多数据,所以像循环这样的东西会自动转换它输入-数据帧1
Item LC ToLC
8T4121 MW92 WK14
8T4121 WK14 RM11
8T4121 WK14 RS11
8T4121 RS11 OY01
AB7651 MW92 RS11
AB7651 RS11 OY01
我想做一个循环,我可以得到像这样的数据框2的输出
Item LC1 LC2 LC3 LC4
8T4121 MW92 WK14 RM11
8T4121 MW92 WK14 RS11 OY01
AB7651 MW92 RS11 OY01
我已经尝试过这样的事情:
bodlane <- lctolc
colnames(bodlane) <- c("Item","Entry","From")
bodlane$To <- lctolc$To[match(bodlane$From, lctolc$From)]
colnames(bodlane) <- c("Item","Entry","Parent","From")
bodlane$To <- lctolc$To[match(bodlane$From, lctolc$From)]
colnames(bodlane) <- c("Item","Entry","Parent","Parent1","From")
bodlane$To <- lctolc$To[match(bodlane$From, lctolc$From)]
colnames(bodlane) <- c("Item","LC","ToLC","Parent1","From","To")
我相信,可以通过与data.tableigraph
中的“递归”自联接类似的方式解决此问题,但无需计算。
The difficulty here is that there are separate graphs for each Item
. My approach is to split the data frame into a list of graphs. There might be more concise solutions which use the type
vertex attribute.
However, the code below creates the expected result:
library(igraph)
library(data.table)
library(magrittr)
lapply(
lapply(split(lctolc, lctolc$Item), function(x) graph.data.frame(x[, 2:3])),
function(x) lapply(
V(x)[degree(x, mode = "in") == 0],
function(s) all_simple_paths(x, from = s,
to = V(x)[degree(x, mode = "out") == 0]) %>%
lapply(
function(y) as.data.table(t(names(y))) %>% setnames(paste0("LC", seq_along(.)))
) %>%
rbindlist(fill = TRUE)
) %>% rbindlist(fill = TRUE)
) %>% rbindlist(fill = TRUE, idcol = "Item")
Item LC1 LC2 LC3 LC4 1: 8T4121 MN12 AB12 BC34 <NA> 2: 8T4121 MW92 WK14 RS11 OY01 3: 8T4121 MW92 WK14 RM11 <NA> 4: AB7651 MW92 RS11 OY01 <NA>
The igraph
package is a good choice for questions like this.
However, we need to treat the graph of each Item
separately. This is achieved by splitting the data.frame and creating a list of graphs by
lg <- lapply(split(lctolc, lctolc$Item), function(x) graph.data.frame(x[, 2:3]))
which returns
lg
$`8T4121` IGRAPH 8eb2bcc DN-- 8 6 -- + attr: name (v/c) + edges from 8eb2bcc (vertex names): [1] AB12->BC34 MN12->AB12 MW92->WK14 WK14->RM11 WK14->RS11 RS11->OY01 $AB7651 IGRAPH 7cd75e7 DN-- 3 2 -- + attr: name (v/c) + edges from 7cd75e7 (vertex names): [1] MW92->RS11 RS11->OY01
or, visualised by two separate plots.
lapply(seq_along(lg), function(i) plot(lg[[i]], main = names(lg)[i]))
Now, the function all_simple_paths()
lists simple paths from one source vertex to another vertex or vertices where a path is simple if the vertices are visited once at most. To use the function we need to determine the start nodes and all end nodes. This is achieved by
V(x)[degree(x, mode = "in") == 0] # start nodes
V(x)[degree(x, mode = "out") == 0] # end nodes
The degree()
function returns the number of in-coming or out-going edges, resp.
For our example dataset we get
lapply(lg, function(x) V(x)[degree(x, mode = "in") == 0]) # start nodes
$`8T4121` + 2/8 vertices, named, from 8eb2bcc: [1] MN12 MW92 $AB7651 + 1/3 vertex, named, from 7cd75e7: [1] MW92
lapply(lg, function(x) V(x)[degree(x, mode = "out") == 0]) # end nodes
$`8T4121` + 3/8 vertices, named, from 8eb2bcc: [1] BC34 RM11 OY01 $AB7651 + 1/3 vertex, named, from 7cd75e7: [1] OY01
Now, we loop through all start nodes of each graph and determine all simple paths. The result is a list, again. For each list item, the node names are extracted and reshaped to a data.table in wide format. The columns are renamed to LC1
, LC2
, etc.
In each step, we get a list of data.tables which are combined by rbindlist()
. The fill
parameter is required as the number of columns may vary. The final call to rbindlist() uses the idcol
parameter to mark the rows which are associated with Item
.
The sample dataset has been amended to include the cases from OP's comments here and here.
library(data.table)
lctolc <- fread("
Item LC ToLC
8T4121 AB12 BC34
8T4121 MN12 AB12
8T4121 MW92 WK14
8T4121 WK14 RM11
8T4121 WK14 RS11
8T4121 RS11 OY01
AB7651 MW92 RS11
AB7651 RS11 OY01",
data.table = FALSE)
在此代码中没有涉及一个极端情况。假设对于项目8T4121,存在来自LC“ AB12” ToLC“ BC34”的另一种关系,该关系与该项目的任何其他LC没有关系,则该关系也应出现在输出中,但现在不存在。
另外,如果我们在第一行中具有AB12与BC34之间的关系,在第二行中将MN12与AB12具有一种关系,则该关系应该作为MN12与AB12至BC34之间的关系出现,但仅作为AB12与BC34之间的关系出现。仅考虑第一行而不考虑第二行的关系
@AnshulSaravgi,的确,我错过了仅接受单个节点的
from
参数all_simple_paths()
。因此,我需要将调用包装在另一个lapply()
循环中。请查看固定代码版本。我遇到以下错误:rbindlist(。,fill = TRUE,idcol =“ Item”)中的错误:尝试在SET_STRING_ELT中设置索引50611/50611。您能帮我这个忙吗?
@ stackoverflow.com/users/3817004/uwe任何想法如何解决大型数据集