I have a Pandas Dataframe generated from a database, which has data with mixed encodings. For example:
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| ID | path | language | date | longest_sentence | shortest_sentence | number_words | readability_consensus |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 0 | data/Eng/Sagitarius.txt | Eng | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not... | 306 | 11th and 12th grade |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
| 31 | data/Nor/Høylandet.txt | Nor | 2015-07-22 | Høgskolen i Østfold er et eksempel... | Som skuespiller har jeg både... | 253 | 15th and 16th grade |
+----+-------------------------+----------+------------+------------------------------------------------+--------------------------------------------------------+--------------+-----------------------+
As seen there is a mix of English and Norwegian (encoded as ISO-8859-1 in the database I think). I need to get the contents of this Dataframe output as a Markdown table, but without getting problems with encoding. I followed this answer (from the question Generate Markdown tables?) and got the following:
import sys, sqlite3
db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, longest_sentence, shortest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()
rows = []
for index, row in df.iterrows():
items = (row['date'],
row['path'],
row['language'],
row['shortest_sentence'],
row['longest_sentence'],
row['number_words'],
row['readability_consensus'])
rows.append(items)
headings = ['Date',
'Path',
'Language',
'Shortest Sentence',
'Longest Sentence since',
'Words',
'Grade level']
fields = [0, 1, 2, 3, 4, 5, 6]
align = [('^', '<'), ('^', '^'), ('^', '<'), ('^', '^'), ('^', '>'),
('^','^'), ('^','^')]
table(sys.stdout, rows, fields, headings, align)
However, this yields an UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 72: ordinal not in range(128)
error. How can I output the Dataframe as a Markdown table? That is, for the purpose of storing this code in a file for use in writing a Markdown document. I need the output to look like this:
| ID | path | language | date | longest_sentence | shortest_sentence | number_words | readability_consensus |
|----|-------------------------|----------|------------|------------------------------------------------|--------------------------------------------------------|--------------|-----------------------|
| 0 | data/Eng/Sagitarius.txt | Eng | 2015-09-17 | With administrative experience in the prepa... | I am able to relocate internationally on short not... | 306 | 11th and 12th grade |
| 31 | data/Nor/Høylandet.txt | Nor | 2015-07-22 | Høgskolen i Østfold er et eksempel... | Som skuespiller har jeg både... | 253 | 15th and 16th grade |
Right, so I've taken a leaf from a question suggested by Rohit (Python - Encoding string - Swedish Letters), extended his answer, and came up with the following:
# Enforce UTF-8 encoding
import sys
stdin, stdout = sys.stdin, sys.stdout
reload(sys)
sys.stdin, sys.stdout = stdin, stdout
sys.setdefaultencoding('UTF-8')
# SQLite3 database
import sqlite3
# Pandas: Data structures and data analysis tools
import pandas as pd
# Read database, attach as Pandas dataframe
db = sqlite3.connect("Applications.db")
df = pd.read_sql_query("SELECT path, language, date, shortest_sentence, longest_sentence, number_words, readability_consensus FROM applications ORDER BY date(date) DESC", db)
db.close()
df.columns = ['Path', 'Language', 'Date', 'Shortest Sentence', 'Longest Sentence', 'Words', 'Readability Consensus']
# Parse Dataframe and apply Markdown, then save as 'table.md'
cols = df.columns
df2 = pd.DataFrame([['---','---','---','---','---','---','---']], columns=cols)
df3 = pd.concat([df2, df])
df3.to_csv("table.md", sep="|", index=False)
An important precursor to this is that the shortest_sentence
and longest_sentence
columns do not contain unnecessary line breaks, as removed by applying .replace('\n', ' ').replace('\r', '')
to them before submitting into the SQLite database. It appears that the solution is not to enforce the language-specific encoding (ISO-8859-1
for Norwegian), but rather that UTF-8
is used instead of the default ASCII
.
I ran this through my IPython notebook (Python 2.7.10) and got a table like the following (fixed spacing for appearance here):
| Path | Language | Date | Shortest Sentence | Longest Sentence | Words | Readability Consensus |
|-------------------------|----------|------------|----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----------------------|
| data/Eng/Something1.txt | Eng | 2015-09-17 | I am able to relocate to London on short notice. | With my administrative experience in the preparation of the structure and content of seminars in various courses, and critiquing academic papers on various levels, I am confident that I can execute the work required as an editorial assistant. | 306 | 11th and 12th grade |
| data/Nor/NoeNorrønt.txt | Nor | 2015-09-17 | Jeg har grundig kjennskap til Microsoft Office og Adobe. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 205 | 18th and 19th grade |
| data/Nor/Ørret.txt.txt | Nor | 2015-09-17 | Jeg håper på positiv tilbakemelding, og møter naturligvis til intervju hvis det er ønskelig. | I løpet av studiene har jeg vært salgsmedarbeider for et større konsern, hvor jeg solgte forsikring til studentene og de faglige ansatte ved universitetet i Trønderlag, samt renholdsarbeider i et annet, hvor jeg i en periode var avdelingsansvarlig. | 160 | 18th and 19th grade |
Thus, a Markdown table without problems with encoding.
Calling
sys.setdefaultencoding('UTF-8')
is an ultra bad idea. It masks all kind of issues and it looks like your new code doesn't need it as you're not callingtable()
, which is full of implied encodings.