I have a binary like this: 1101100110000110110110011000001011011000101001111101100010101000

and I want to convert it to UTF-8 text. How can I do this in Python?

    
What encoding is the binary string in? ASCII? Or you mean the bytes are a utf-8-encoded string and you want to get a unicode string in python? – Claudiu Oct 8 '13 at 18:52
    
What do you mean with "convert it to utf-8"? Create the characters from the binary octets? – Paulo Bu Oct 8 '13 at 18:53
1  
the binary string is in utf-8 and yes, I want to get a unicode string in python. – Aidin.T Oct 8 '13 at 18:55
    
I think we're not understanding precisely what sort of file you have. Could you run hd or od or a similar hex-dump utility and copy-paste the first few lines? – Robᵩ Oct 8 '13 at 18:57
    
It's not a file. I just have a text in Persian that I converted to binary, and now I want to convert it back to text. – Aidin.T Oct 8 '13 at 19:04
up vote 8 down vote accepted

Cleaner version:

>>> test_string = '1101100110000110110110011000001011011000101001111101100010101000'
>>> print ('%x' % int(test_string, 2)).decode('hex').decode('utf-8')
نقاب

Inverse (from @Robᵩ's comment):

>>> '{:b}'.format(int(u'نقاب'.encode('utf-8').encode('hex'), 16))
'1101100110000110110110011000001011011000101001111101100010101000'
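
The one-liners above depend on Python 2's 'hex' codec on str, which Python 3 does not offer. Here is a minimal sketch of the same integer round-trip, assuming Python 3. The names bits_to_text and text_to_bits are hypothetical, not from the answer. The byte count is derived from the bit count so that leading zero bits (as in plain ASCII text) are not lost.

```python
def bits_to_text(bits, encoding="utf-8"):
    """Decode a string of '0'/'1' characters (whole octets) into text."""
    if len(bits) % 8:
        raise ValueError("expected a multiple of 8 bits")
    if not bits:
        return ""
    # Take the byte count from the bit count so leading zero bits survive.
    raw = int(bits, 2).to_bytes(len(bits) // 8, byteorder="big")
    return raw.decode(encoding)


def text_to_bits(text, encoding="utf-8"):
    """Encode text and render the bytes as one zero-padded bit string."""
    raw = text.encode(encoding)
    if not raw:
        return ""
    width = len(raw) * 8
    return format(int.from_bytes(raw, byteorder="big"), "0{}b".format(width))


test_string = '1101100110000110110110011000001011011000101001111101100010101000'
decoded = bits_to_text(test_string)
round_trip = text_to_bits(decoded)
```

With the string from the question, decoded should hold the four-letter Persian word and round_trip should equal test_string.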
    
+1 for .decode('hex') – Robᵩ Oct 8 '13 at 19:16
    
But it doesn't work properly. It shows something else, not the original text I converted to binary. – Aidin.T Oct 8 '13 at 19:22
1  
@Aidin.T try now. I added decode('utf-8') at the end. – Igonato Oct 8 '13 at 19:31
2  
And the inverse would be: s=u'نقاب'; print '{:b}'.format(int(s.encode('utf-8').encode('hex'), 16)) – Robᵩ Oct 8 '13 at 19:48
1  
Note that s = "سلام" and s = u"سلام" give different results. The former fails, the latter works. But let's stop solving the new problem. @Aidin.T, if you have a problem with encoding, please open a new question. – Robᵩ Oct 8 '13 at 21:12

Well, the idea I have is:

1. Split the string into octets.
2. Convert each octet to an integer with int, then to a single byte with chr.
3. Join the bytes and decode the resulting UTF-8 string into Unicode.

This code works for me, but I'm not sure what it prints, because my console doesn't support UTF-8 (Windows :P).

s = '1101100110000110110110011000001011011000101001111101100010101000'
u = "".join([chr(int(x,2)) for x in [s[i:i+8] 
                           for i in range(0,len(s), 8)
                           ]
            ])
d = u.decode('utf-8')
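
For readers on Python 3, where str has no decode method, the same three steps might be sketched as follows. This is an illustration under that assumption; decode_octets is a hypothetical name.

```python
def decode_octets(bits, encoding="utf-8"):
    # Step 1: split the bit string into 8-character chunks (octets).
    octets = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    # Step 2: parse each octet as a base-2 integer and collect them as bytes.
    raw = bytes(int(octet, 2) for octet in octets)
    # Step 3: decode the UTF-8 byte string into text.
    return raw.decode(encoding)


s = '1101100110000110110110011000001011011000101001111101100010101000'
d = decode_octets(s)
```

The sketch assumes the input length is a multiple of 8. A trailing partial octet would be parsed as a small number instead of being rejected.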

Hope this helps!

    
i believe you want unichr – Joran Beasley Oct 8 '13 at 19:07
2  
Hmmm, I'm somewhat suspicious of unichr. Because OP says his binary is already utf-8. utf-8 has variable character length, so I just used chr to join the raw bytes in a string and decode them later into Unicode. – Paulo Bu Oct 8 '13 at 19:09
2  
@JoranBeasley - I disagree, assuming Python2. In that step he is collecting bytes, not characters. Only after he has the utf-8-encoded byte string does he want to convert. – Robᵩ Oct 8 '13 at 19:09
    
@Robᵩ: That's my point. Nice answer, love the split('........'). I think it is basically the same idea as mine. +1 – Paulo Bu Oct 8 '13 at 19:11
1  
+1 - This is the same technique as mine (so obviously I approve), plus you explained yours. Questioner should move the check to this better answer. – Robᵩ Oct 8 '13 at 19:12
>>> import re
>>> s='1101100110000110110110011000001011011000101001111101100010101000'
>>> print (''.join([chr(int(x,2)) for x in re.split('(........)', s) if x ])).decode('utf-8')
نقاب
>>> 

Or, the inverse:

>>> s=u'نقاب'
>>> ''.join(['{:08b}'.format(ord(x)) for x in s.encode('utf-8')])
'1101100110000110110110011000001011011000101001111101100010101000'
>>> 
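
One detail of the per-byte inverse is worth checking: every byte has to be rendered as exactly eight bits, otherwise ASCII characters (whose top bit is 0) produce shorter chunks and the string can no longer be split back into octets. Here is a small sketch, assuming Python 3, where iterating over bytes yields integers, so ord is not needed. text_to_octets is a hypothetical name.

```python
def text_to_octets(text, encoding="utf-8"):
    # Each byte is formatted as 8 zero-padded binary digits.
    return ''.join(format(byte, '08b') for byte in text.encode(encoding))


padded = text_to_octets('test')
# Without the zero padding, each ASCII byte loses its leading 0 bit.
unpadded = ''.join(format(byte, 'b') for byte in 'test'.encode('utf-8'))
```

Here padded is 32 characters long while unpadded is only 28, so only the padded form round-trips for ASCII input.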
    
There is another question: how can I convert my text to binary in Python? I mean the inverse of my question. – Aidin.T Oct 8 '13 at 19:10

Use:

def bin2text(s): return "".join([chr(int(s[i:i+8],2)) for i in xrange(0,len(s),8)])


>>> print bin2text("01110100011001010111001101110100")
test
    
For my text it returns this: '\xd9\x86\xd9\x82\xd8\xa7\xd8\xa8'. How can I get it to display correctly? – Aidin.T Oct 8 '13 at 19:05
2  
You want unichr(), not just chr(). docs.python.org/2/library/functions.html#unichr – Christian Ternus Oct 8 '13 at 19:05
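
The chr versus unichr question can be checked directly. Here is a sketch assuming Python 3, where chr behaves like Python 2's unichr. Mapping each octet to a code point gives eight Latin-1 characters, whereas treating the octets as UTF-8 bytes and decoding them gives the four Persian letters. The variable names are illustrative only.

```python
bits = '1101100110000110110110011000001011011000101001111101100010101000'
octets = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]

# One code point per octet: eight characters, not the original text.
as_code_points = ''.join(chr(value) for value in octets)

# The octets taken as UTF-8 bytes and decoded: four characters.
as_utf8 = bytes(octets).decode('utf-8')

# The per-octet string can still be repaired by going back through Latin-1.
repaired = as_code_points.encode('latin-1').decode('utf-8')
```

This suggests that for UTF-8 input the octets should be collected as bytes and decoded afterwards, which is what the output '\xd9\x86\xd9\x82\xd8\xa7\xd8\xa8' still needs: a final .decode('utf-8').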
