This article is reproduced from stackoverflow.com.
talend

Talend : Transform JSON lines to columns, extracting column names from JSON

Published on 2020-06-27 06:31:18

I have a JSON REST response with a structure somewhat like this one:

{
    "data" : [
        {
            "fields" : [
                { "label" : "John", "value" : "John" },
                { "label" : "Smith", "value" : "/person/4315" },
                { "label" : "43", "value" : "43" },
                { "label" : "London", "value" : "/city/54" }
            ]
        },
        {
            "fields" : [
                { "label" : "Albert", "value" : "Albert" },
                { "label" : "Einstein", "value" : "/person/154" },
                { "label" : "141", "value" : "141" },
                { "label" : "Princeton", "value" : "/city/9541" }
            ]
        }
    ],
    "columns" : ["firstname", "lastname", "age", "city"]
}

I'm looking for a way to transform this data to rows like

| firstname_label   | firstname_value | lastname_label | lastname_value | age_label | age_value | city_label | city_value |
---------------------------------------------------------------------------------------------------------------------------
|      John         |   John          | Smith          |  /person/4315  |  43       |    43     |    London  |  /city/54  |
|      Albert       |   Albert        | Einstein       |  /person/154   |  141      |    141    |  Princeton | /city/9541 |

Of course, the number of columns and their names may change, so I don't know the schema before runtime. I could probably write Java to handle this, but I'd like to know if there's a more standard way.

I'm new to Talend, so I spent hours trying; since my attempts were probably totally wrong, I won't describe them here.

Thanks for your help.

Questioner: Emmanuel BRUNO
Viewed: 29
2020-06-20 17:12

Here's a completely dynamic solution I put together.

[screenshot: overall job design]

First, you need to read the JSON in order to get the column list. Here's what tExtractJSONFields_2 looks like:

[screenshot: tExtractJSONFields_2 configuration]

Then you store the columns and their positions in a tHashOutput (you need to unhide it in File > Project properties > Designer > Palette settings). In tMap_2, you get the position of the column using a sequence:

Numeric.sequence("s", 1, 1) 

The output of this subjob is:

|=-------+--------=|
|position|column   |
|=-------+--------=|
|1       |firstname|
|2       |lastname |
|3       |age      |
|4       |city     |
'--------+---------'
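To make the position logic concrete, here is a plain-Java sketch of how Talend's Numeric.sequence behaves (a named counter that returns its start value on the first call, then increments by the step on every later call). The class and method names are mine, not Talend's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnPositions {
    // Named counters, mimicking Talend's Numeric.sequence routine.
    private static final Map<String, Integer> counters = new HashMap<>();

    static int sequence(String name, int start, int step) {
        // First call for a given name returns start; each later call adds step.
        Integer current = counters.get(name);
        int next = (current == null) ? start : current + step;
        counters.put(name, next);
        return next;
    }

    public static void main(String[] args) {
        String[] columns = {"firstname", "lastname", "age", "city"};
        for (String column : columns) {
            // Numeric.sequence("s", 1, 1) in tMap_2 yields 1, 2, 3, 4 here.
            System.out.println(sequence("s", 1, 1) + " | " + column);
        }
    }
}
```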

The 2nd step is to read the JSON again, in order to parse the fields property.

[screenshot: second subjob, parsing the fields property]

Like in step 1, you need to add a position to each field, relative to the columns. Here's the expression I used to get the sequence:

(Numeric.sequence("s1", 0, 1) % ((Integer)globalMap.get("tHashOutput_1_NB_LINE"))) + 1 

Note that I'm using a different sequence name, because sequences keep their value throughout the job. I'm using the number of columns from tHashOutput_1 in order to keep things dynamic.
Here's the output from this subjob:

|=-------+---------+---------------=|
|position|label    |value           |
|=-------+---------+---------------=|
|1       |John     |John            |
|2       |Smith    |/person/4315    |
|3       |43       |43              |
|4       |London   |/city/54        |
|1       |Albert   |Albert          |
|2       |Einstein |/person/154     |
|3       |141      |141             |
|4       |Princeton|/city/9541      |
'--------+---------+----------------'
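The modulo in that expression is what wraps the running sequence back to 1 after each record's last field. A minimal sketch, with a hard-coded column count of 4 standing in for tHashOutput_1_NB_LINE:

```java
public class FieldPositions {
    // (Numeric.sequence("s1", 0, 1) % nbColumns) + 1
    static int positionFor(int seq, int nbColumns) {
        return (seq % nbColumns) + 1;
    }

    public static void main(String[] args) {
        int nbColumns = 4; // stands in for tHashOutput_1_NB_LINE
        for (int seq = 0; seq < 8; seq++) {
            System.out.print(positionFor(seq, nbColumns) + " ");
        }
        System.out.println(); // prints: 1 2 3 4 1 2 3 4
    }
}
```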

In the last subjob, you need to join the fields data with the columns, using the column position we stored with each of them.

[screenshot: final subjob, joining fields and columns]

In tSplitRow_1 I generate 2 rows for each incoming row. Each row is a key/value pair. The first row's key is <columnName>_label (like firstname_label, lastname_label), and its value is the label from the fields. The 2nd row's key is <columnName>_value, and its value is the value from the fields.
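What tSplitRow_1 does to one joined row can be sketched like this (the method and variable names are mine, not Talend's):

```java
public class SplitRowDemo {
    // One incoming row carries a column name plus the field's label and value;
    // tSplitRow_1 turns it into two key/value rows.
    static String[][] split(String column, String label, String value) {
        return new String[][] {
            { column + "_label", label },
            { column + "_value", value }
        };
    }

    public static void main(String[] args) {
        for (String[] row : split("lastname", "Smith", "/person/4315")) {
            System.out.println(row[0] + " = " + row[1]);
        }
        // lastname_label = Smith
        // lastname_value = /person/4315
    }
}
```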

[screenshot: tSplitRow_1 configuration]

Once again, we need to add a position to our data in tMap_4, using this expression:

(Numeric.sequence("s2", 0, 1) / ((Integer)globalMap.get("tHashOutput_1_NB_LINE") * 2)) + 1

Note that since we have twice as many rows coming out of tSplitRow, I multiply the number of columns by 2.
This will attribute the same ID for the data that needs to be on the same row in the output file. The output of this tMap will be like:

|=-+---------------+-----------=|
|id|col_label      |col_value   |
|=-+---------------+-----------=|
|1 |firstname_label|John        |
|1 |firstname_value|John        |
|1 |lastname_label |Smith       |
|1 |lastname_value |/person/4315|
|1 |age_label      |43          |
|1 |age_value      |43          |
|1 |city_label     |London      |
|1 |city_value     |/city/54    |
|2 |firstname_label|Albert      |
|2 |firstname_value|Albert      |
|2 |lastname_label |Einstein    |
|2 |lastname_value |/person/154 |
|2 |age_label      |141         |
|2 |age_value      |141         |
|2 |city_label     |Princeton   |
|2 |city_value     |/city/9541  |
'--+---------------+------------'
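Here, integer division is what groups all 8 label/value rows of one record under the same id. A sketch of that expression, again with 4 columns assumed:

```java
public class RowIds {
    // (Numeric.sequence("s2", 0, 1) / (nbColumns * 2)) + 1, integer division
    static int idFor(int seq, int nbColumns) {
        return (seq / (nbColumns * 2)) + 1;
    }

    public static void main(String[] args) {
        int nbColumns = 4; // stands in for tHashOutput_1_NB_LINE
        for (int seq = 0; seq < 16; seq++) {
            System.out.print(idFor(seq, nbColumns) + " ");
        }
        System.out.println(); // eight 1s followed by eight 2s
    }
}
```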

This leads us to the last component, tPivotToColumnsDelimited, which pivots our rows to columns using the unique ID.

[screenshot: tPivotToColumnsDelimited configuration]

And the final result is a csv file like:

id;firstname_label;firstname_value;lastname_label;lastname_value;age_label;age_value;city_label;city_value
1;John;John;Smith;/person/4315;43;43;London;/city/54
2;Albert;Albert;Einstein;/person/154;141;141;Princeton;/city/9541
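The pivot itself amounts to grouping values by id in arrival order and joining them with the delimiter. A plain-Java sketch of that idea, a stand-in for tPivotToColumnsDelimited rather than its real code (shown on a subset of the rows above):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class PivotDemo {
    // Group the value column by id, preserving arrival order, and emit one
    // semicolon-delimited line per id, with the id first.
    static List<String> pivot(String[][] rows) {
        Map<String, StringJoiner> byId = new LinkedHashMap<>();
        for (String[] row : rows) {
            byId.computeIfAbsent(row[0], id -> new StringJoiner(";").add(id))
                .add(row[2]);
        }
        List<String> lines = new ArrayList<>();
        for (StringJoiner joiner : byId.values()) {
            lines.add(joiner.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1", "firstname_label", "John"},   {"1", "firstname_value", "John"},
            {"1", "lastname_label", "Smith"},   {"1", "lastname_value", "/person/4315"},
            {"2", "firstname_label", "Albert"}, {"2", "firstname_value", "Albert"},
            {"2", "lastname_label", "Einstein"},{"2", "lastname_value", "/person/154"}
        };
        for (String line : pivot(rows)) {
            System.out.println(line);
        }
        // 1;John;John;Smith;/person/4315
        // 2;Albert;Albert;Einstein;/person/154
    }
}
```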

Note that you end up with an extraneous column at the beginning: the row id. It can easily be removed by re-reading the file and dropping that column.
I tried adding a new column along with the corresponding fields in the input json, and it works as expected.