温馨提示:本文翻译自stackoverflow.com,查看原文请点击:其他 - Talend : Transform JSON lines to columns, extracting column names from JSON
talend

其他 - 其他-Talend:将JSON行转换为列,从JSON中提取列名称

发布于 2020-07-02 14:08:00

我有一个JSON rest响应,其结构类似于这样:

{
    "data" : [
        {
            "fields" : [
                { "label" : "John", "value" : "John" },
                { "label" : "Smith", "value" : "/person/4315" },
                { "label" : "43", "value" : "43" },
                { "label" : "London", "value" : "/city/54" }
            ]
        },
        {
            "fields" : [
                { "label" : "Albert", "value" : "Albert" },
                { "label" : "Einstein", "value" : "/person/154" },
                { "label" : "141", "value" : "141" },
                { "label" : "Princeton", "value" : "/city/9541" }
            ]
        }
    ],
    "columns" : ["firstname", "lastname", "age", "city"]
}

我正在寻找一种方法来将此数据转换为像

| first_name_label  | firstname_value | lastname_label | lastname_value | age_label | age_value | city_label | city_value |
---------------------------------------------------------------------------------------------------------------------------
|      John         |   John          | Smith          |  /person/4315  |  43       |    43     |    London  |  /city/54  |
|      Albert       |   Albert        | Einstein       |  /person/154   |  141      |    141    |  Princeton | /city/9541 |

当然,列数及其名称可能会更改,因此在运行时之前我不知道该架构。我可能可以编写Java来处理此问题,但我想知道是否有更标准的方法。

我是Talend的新手,所以我花了数小时进行尝试,但是由于我的尝试可能是完全错误的,因此在此不再赘述。

谢谢你的帮助。

查看更多

提问者
Emmanuel BRUNO
被浏览
9
2020-06-20 17:12

这是我汇总的完全动态的解决方案。

在此处输入图片说明

首先,您需要阅读json以获取列列表。这是tExtractJSONFields_2的样子:

在此处输入图片说明

然后,将列及其位置存储在tHashOutput中(您需要在File> Project properties> Designer> Palette settings中取消隐藏它)。在tMap_2中,您可以使用以下序列获取列的位置:

Numeric.sequence("s", 1, 1) 

该子作业的输出为:

|=-------+--------=|
|position|column   |
|=-------+--------=|
|1       |firstname|
|2       |lastname |
|3       |age      |
|4       |city     |
'--------+---------'

第二步是再次读取json,以解析fields属性。 在此处输入图片说明 像在步骤1中一样,您需要向每个字段添加相对于列的位置。这是我用来获取序列的表达式:

(Numeric.sequence("s1", 0, 1) % ((Integer)globalMap.get("tHashOutput_1_NB_LINE"))) + 1 

Note that I'm using a different sequence name, because sequences keep their value throughout the job. I'm using the number of columns from tHashOutput_1 in order to keep things dynamic.
Here's the output from this subjob:

|=-------+---------+---------------=|
|position|label    |value           |
|=-------+---------+---------------=|
|1       |John     |John            |
|2       |Smith    |/person/4315    |
|3       |43       |43              |
|4       |London   |/city/54        |
|1       |Albert   |Albert          |
|2       |Einstein |/person/154     |
|3       |141      |141             |
|4       |Princeton|/city/9541      |
'--------+---------+----------------'

In the last subjob, you need to join the fields data with the columns, using the column position we stored with either one. 在此处输入图片说明

In tSplitRow_1 I generate 2 rows for each incoming row. Each row is a key value pair. The first row is <columnName>_label (like firstname_label, lastname_label) its value being the label from the fields. The 2nd row's key is <columnName>_value, and its value is the value from the fields.

在此处输入图片说明

Once again, we need to add a position to our data in tMap_4, using this expression:

(Numeric.sequence("s2", 0, 1) / ((Integer)globalMap.get("tHashOutput_1_NB_LINE") * 2)) + 1

Note that since we have twice as many rows coming out of tSplitRow, I multiply the number of columns by 2.
This will attribute the same ID for the data that needs to be on the same row in the output file. The output of this tMap will be like:

|=-+---------------+-----------=|
|id|col_label      |col_value   |
|=-+---------------+-----------=|
|1 |firstname_label|John        |
|1 |firstname_value|John        |
|1 |lastname_label |Smith       |
|1 |lastname_value |/person/4315|
|1 |age_label      |43          |
|1 |age_value      |43          |
|1 |city_label     |London      |
|1 |city_value     |/city/54    |
|2 |firstname_label|Albert      |
|2 |firstname_value|Albert      |
|2 |lastname_label |Einstein    |
|2 |lastname_value |/person/154 |
|2 |age_label      |141         |
|2 |age_value      |141         |
|2 |city_label     |Princeton   |
|2 |city_value     |/city/9541  |
'--+---------------+------------'

This leads us to the last component tPivotToColumnsDelimited which will pivot our rows to columns using the unique ID.

在此处输入图片说明

And the final result is a csv file like:

id;firstname_label;firstname_value;lastname_label;lastname_value;age_label;age_value;city_label;city_value
1;John;John;Smith;/person/4315;43;43;London;/city/54
2;Albert;Albert;Einstein;/person/154;141;141;Princeton;/city/9541

Note that you end up with an extraneous column at the beginning which is the row id which can be easily removed by reading the file and removing it.
I tried adding a new column along with the corresponding fields in the input json, and it works as expected.