converting binary to utf-8 in python

Question

I have a binary string like this: 1101100110000110110110011000001011011000101001111101100010101000

and I want to convert it to UTF-8 text. How can I do this in Python?

What encoding is the binary string in? ASCII? Or you mean the bytes are a utf-8-encoded string and you want to get a unicode string in python? — Claudiu, Oct 8 '13 at 18:52
What do you mean with "convert it to utf-8"? Create the characters from the binary octets? — Paulo Bu, Oct 8 '13 at 18:53
the binary string is in utf-8 and yes, I want to get a unicode string in python. — Aidin.T, Oct 8 '13 at 18:55
I think we're not understanding precisely what sort of file you have. Could you run hd or od or a similar hex-dump utility and copy-paste the first few lines? — Robᵩ, Oct 8 '13 at 18:57
it's not a file. I just have a text in persian and I convert it to binary, now I want to convert it back to the text. — Aidin.T, Oct 8 '13 at 19:04

Igonato · Accepted Answer · 2013-10-08 20:20:22Z

up vote 8 down vote accepted

Cleaner version:

>>> test_string = '1101100110000110110110011000001011011000101001111101100010101000'
>>> print ('%x' % int(test_string, 2)).decode('hex').decode('utf-8')
نقاب

Inverse (from @Robᵩ's comment):

>>> '{:b}'.format(int(u'نقاب'.encode('utf-8').encode('hex'), 16))
'1101100110000110110110011000001011011000101001111101100010101000'
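
The snippets above are Python 2 only: the print statement and str.decode('hex') do not work in Python 3. A rough Python 3 sketch of the same integer-based round trip could look like the following. The names bits_to_text and text_to_bits are made up for illustration, and the sketch assumes the bit string length is a multiple of 8.

```python
def bits_to_text(bits, encoding="utf-8"):
    """Decode a string of '0'/'1' characters that holds encoded bytes."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    # Parse the whole string as one big integer, then turn that integer
    # back into len(bits) // 8 raw bytes (most significant byte first).
    raw = int(bits, 2).to_bytes(len(bits) // 8, "big")
    return raw.decode(encoding)


def text_to_bits(text, encoding="utf-8"):
    """Encode text and render the resulting bytes as a '0'/'1' string."""
    raw = text.encode(encoding)
    if not raw:
        return ""
    # Pad to 8 bits per byte so leading zero bits are not dropped.
    return format(int.from_bytes(raw, "big"), "0{}b".format(len(raw) * 8))


bits = "1101100110000110110110011000001011011000101001111101100010101000"
text = bits_to_text(bits)          # the four-letter Persian word نقاب
print(len(text))                   # 4
print(text_to_bits(text) == bits)  # True
```

Unlike a plain '{:b}', the padded format keeps leading zeros, so ASCII text such as "test" also round-trips.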

edited Oct 8 '13 at 20:20

answered Oct 8 '13 at 19:14

+1 for .decode('hex') – Robᵩ Oct 8 '13 at 19:16

but it doesn't work properly. it shows something else, not the first text I just converted to binary – Aidin.T Oct 8 '13 at 19:22

@Aidin.T try now. I added decode('utf-8') at the end. – Igonato Oct 8 '13 at 19:31

And the inverse would be: s=u'نقاب'; print '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) – Robᵩ Oct 8 '13 at 19:48

Note that s = "سلام" and s = u"سلام" give different results. The former fails, the latter works. But let's stop solving the new problem. @Aidin.T, if you have a problem with encoding, please open a new question. – Robᵩ Oct 8 '13 at 21:12

Paulo Bu · Answer 2 · 2013-10-08 19:06:25Z

up vote 3 down vote

Well, the idea I have is:

1. Split the string into octets.
2. Convert each octet to an integer with int, then to a byte with chr.
3. Join the bytes and decode the resulting UTF-8 string into Unicode.

This code works for me, but I'm not sure what it prints because I don't have UTF-8 in my console (Windows :P ).

s = '1101100110000110110110011000001011011000101001111101100010101000'
# Split into 8-bit octets, turn each one into a byte, then join them (Python 2).
octets = [s[i:i+8] for i in range(0, len(s), 8)]
u = "".join([chr(int(x, 2)) for x in octets])
d = u.decode('utf-8')

Hope this helps!
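
For readers on Python 3, the three steps of this answer might be sketched as below. This is only an illustration: decode_octets is a hypothetical name, and the length check is an added assumption so that a trailing partial octet is not silently misread.

```python
def decode_octets(bits, encoding="utf-8"):
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    # Step 1: split the bit string into 8-character octets.
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Step 2: parse each octet as a base-2 integer in the range 0..255.
    values = [int(octet, 2) for octet in octets]
    # Step 3: pack the integers into a bytes object and decode it.
    return bytes(values).decode(encoding)


s = "1101100110000110110110011000001011011000101001111101100010101000"
d = decode_octets(s)  # the four-letter Persian word نقاب
print(len(d))         # 4
```

In Python 3, bytes(values) takes the place of joining chr() results, because chr() there returns a Unicode character and not a raw byte.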

answered Oct 8 '13 at 19:06

i believe you want unichr – Joran Beasley Oct 8 '13 at 19:07

Hmmm, I'm somewhat suspicious of unichr. Because OP says his binary is already utf-8. utf-8 has variable character length, so I just used chr to join the raw bytes in a string and decode them later into Unicode. – Paulo Bu Oct 8 '13 at 19:09

@JoranBeasley - I disagree, assuming Python2. In that step he is collecting bytes, not characters. Only after he has the utf-8-encoded byte string does he want to convert. – Robᵩ Oct 8 '13 at 19:09

@Robᵩ: That's my point. Nice answer, love the split('........'). I think it is basically the same idea as mine. +1 – Paulo Bu Oct 8 '13 at 19:11

+1 - This is the same technique as mine (so obviously I approve), plus you explained yours. Questioner should move the check to this better answer. – Robᵩ Oct 8 '13 at 19:12

Robᵩ · Answer 3 · 2013-10-08 19:16:06Z

up vote 3 down vote

>>> import re
>>> s='1101100110000110110110011000001011011000101001111101100010101000'
>>> print (''.join([chr(int(x,2)) for x in re.split('(........)', s) if x ])).decode('utf-8')
نقاب
>>>

Or, the inverse:

>>> s=u'نقاب'
>>> ''.join(['{:08b}'.format(ord(x)) for x in s.encode('utf-8')])
'1101100110000110110110011000001011011000101001111101100010101000'
>>>
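
One detail worth spelling out for the inverse: a plain '{:b}' format drops leading zero bits, so any byte below 0x80 (all of ASCII) comes out shorter than 8 characters and the result can no longer be split back into octets. A Python 3 sketch of the padded version follows, with encode_octets as an illustrative name only.

```python
def encode_octets(text, encoding="utf-8"):
    # Iterating over a bytes object in Python 3 yields integers,
    # so ord() is not needed. '{:08b}' pads each byte to 8 bits.
    return "".join("{:08b}".format(byte) for byte in text.encode(encoding))


padded = encode_octets("test")
unpadded = "".join("{:b}".format(byte) for byte in "test".encode("utf-8"))

print(padded)                      # 01110100011001010111001101110100
print(len(padded), len(unpadded))  # 32 28
```

For the Persian example every UTF-8 byte has its top bit set, which is why both formats give the same output there.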

edited Oct 8 '13 at 19:16

answered Oct 8 '13 at 19:07

there is another question, how can I convert my text to binary by python? I mean the inverse form of my question – Aidin.T Oct 8 '13 at 19:10

Nacib Neme · Answer 4 · 2013-10-08 18:59:02Z

up vote 1 down vote

Use:

def bin2text(s):
    return "".join([chr(int(s[i:i+8], 2)) for i in xrange(0, len(s), 8)])

>>> print bin2text("01110100011001010111001101110100")
test

answered Oct 8 '13 at 18:59

for my text it returns this: '\xd9\x86\xd9\x82\xd8\xa7\xd8\xa8', how can I get it in the right way of showing? – Aidin.T Oct 8 '13 at 19:05

You want unichr(), not just chr(). docs.python.org/2/library/functions.html#unichr – Christian Ternus Oct 8 '13 at 19:05
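
On the two comments above: '\xd9\x86\xd9\x82\xd8\xa7\xd8\xa8' is the UTF-8 byte string itself, so in Python 2 what is still missing is a final .decode('utf-8'), as the other answers do. Mapping each octet through unichr() would treat every byte as its own character instead. A small Python 3 sketch of the difference follows; Python 3's chr() behaves like Python 2's unichr(), and the variable names are illustrative.

```python
bits = "1101100110000110110110011000001011011000101001111101100010101000"
values = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

# Treating every octet as its own code point (what unichr() would do)
# yields eight characters in the U+0000..U+00FF range, not the original text.
per_octet = "".join(chr(v) for v in values)

# Treating the octets as one UTF-8 byte sequence yields the four letters.
decoded = bytes(values).decode("utf-8")

print(len(per_octet), len(decoded))  # 8 4
print(per_octet == decoded)          # False
# The per-octet string can be repaired by going back through Latin-1.
print(per_octet.encode("latin-1").decode("utf-8") == decoded)  # True
```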
