Tags: dictionary, json, python

Tokenising words in a dictionary Python

Published on 2020-03-27 10:25:54

So I have a JSON file whose data I import into Python.

Each message has an agentId field and an agentText field in the JSON.

Sample json:

{
"messages": 
[
    {"agentId": "1", "agentText": "I Love Python"},
    {"agentId": "2", "agentText": "but cant seem to get my head around it"},
    {"agentId": "3", "agentText": "what are the alternatives?"}
]
}

I'm trying to create a dictionary of key/value pairs from the agentId and agentText fields by doing the following:

When I do this, the key/value pairs work fine:

import json

with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)

for message in data['messages']:
    agentIdandText = {message['agentId']: [message['agentText']]}
    print(agentIdandText)

and the output I get this:

{'1': ['I Love Python']}
{'2': ["but cant seem to get my head around it"]}
{'3': ['what are the alternatives?']}

but as soon as I try to tokenise the words (below), I start hitting errors:

from nltk.tokenize import TweetTokenizer
varToken = TweetTokenizer()

import json

with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)

for message in data['messages']:
    agentIdandText = {message['agentId']: varToken.tokenize([message['agentText']])}
    print(agentIdandText)

Partial error message (edited in from comments):

return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding)) 
TypeError: expected string or bytes-like object
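The TypeError arises because TweetTokenizer.tokenize expects a single string, while [message['agentText']] wraps the string in a one-element list. The traceback points at a regex substitution inside NLTK, and a plain re.sub call reproduces the same failure (a minimal sketch, no NLTK required):

```python
import re

# re.sub expects a string (or bytes) as its target; passing a list
# triggers the same TypeError seen in the NLTK traceback.
try:
    re.sub(r"\s+", " ", ["I Love Python"])
except TypeError as e:
    print(e)  # message starts with: expected string or bytes-like object
```

Dropping the square brackets, so the raw string reaches the tokenizer, is what the answer below does.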

So what I'm expecting is this:

{
'1': ['I', 'Love', 'Python'],
'2': ['but', 'cant', 'seem', 'to', 'get', 'my', 'head', 'around', 'it'],
'3': ['what', 'are', 'the', 'alternatives?']
}

How can I achieve this?

Questioner: dragonfury2
Viewed: 57
Kenstars 2019-07-04 18:21

Does this change solve your problem? I think you have to pass a string to the tokenize function.

from nltk.tokenize import TweetTokenizer
varToken = TweetTokenizer()
import json
with open('20190626-101200-text-messages2.json', 'r') as f:
    data = json.load(f)
output_data = {}
for message in data['messages']:
    agentIdandText = {message['agentId']: varToken.tokenize(message['agentText'])}
    #print(agentIdandText)
    output_data.update(agentIdandText)
print(output_data)

Edit: Added output_data variable to showcase all the keys in one dictionary.
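For what it's worth, the loop above can also be written as a dict comprehension. This is only a sketch: str.split stands in for varToken.tokenize so it runs without NLTK installed, and the sample JSON is inlined rather than read from the file.

```python
import json

# Sample data inlined from the question; in practice this would come
# from json.load(f) on the message file.
data = json.loads("""
{
  "messages": [
    {"agentId": "1", "agentText": "I Love Python"},
    {"agentId": "2", "agentText": "but cant seem to get my head around it"},
    {"agentId": "3", "agentText": "what are the alternatives?"}
  ]
}
""")

# Build the whole mapping in one pass. str.split is a stand-in
# tokenizer here; swap in varToken.tokenize(m["agentText"]) for real use.
output_data = {m["agentId"]: m["agentText"].split() for m in data["messages"]}
print(output_data["1"])  # ['I', 'Love', 'Python']
```

As with output_data.update(...) in the answer, a later message with a duplicate agentId overwrites the earlier one, so this only works cleanly when agentIds are unique.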