mypy

Debugging Python code using pdb: a crash course.

mypy | 27 October, 2013 23:16

pdb (the Python Debugger) is the debugger that ships with Python's standard library. If you have been debugging your Python code with print statements so far, it is worth investing in learning this tool, as it will save you time in the long run.

Let's consider the following Python code, which contains a bug:

def flatten(tree, base_list=[]):
  """Outputs elements of the tree as a list in DFS order."""
  for element in tree:
    if isinstance(element, list):
      base_list += flatten(element)
    else:
      base_list.append(element)
  return base_list
 
tree = [1, [2, [3, 4]], 5]
print flatten(tree)

Running this code produces the following output:

[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5]

which is obviously not what we expected. Now let's debug the problem using pdb:

> python -m pdb bug.py
> /tmp/bug.py(1)<module>()
-> def flatten(tree, base_list=[]):
(Pdb) s
> /tmp/bug.py(11)<module>()
-> tree = [1, [2, [3, 4]], 5]
(Pdb) s
> /tmp/bug.py(12)<module>()
-> print flatten(tree)
(Pdb) s
--Call--
> /tmp/bug.py(1)flatten()
-> def flatten(tree, base_list=[]):
(Pdb) s
> /tmp/bug.py(3)flatten()
-> for element in tree:
(Pdb) s
> /tmp/bug.py(4)flatten()
-> if isinstance(element, list):
(Pdb) p element
1 # so far so good, the first element is 1
(Pdb) s
> /tmp/bug.py(7)flatten()
-> base_list.append(element)
(Pdb) s
> /tmp/bug.py(3)flatten()
-> for element in tree:
(Pdb) s
> /tmp/bug.py(4)flatten()
-> if isinstance(element, list):
(Pdb) p element
[2, [3, 4]] # the second subtree is indeed [2, [3, 4]]
(Pdb) s
> /tmp/bug.py(5)flatten()
-> base_list += flatten(element)
(Pdb) s
--Call--
> /tmp/bug.py(1)flatten()
-> def flatten(tree, base_list=[]):
(Pdb) s
> /tmp/bug.py(3)flatten()
-> for element in tree:
(Pdb) s
> /tmp/bug.py(4)flatten()
-> if isinstance(element, list):
(Pdb) print element
2 # first element in the subtree is 2
(Pdb) s
> /tmp/bug.py(7)flatten()
-> base_list.append(element)
(Pdb) s
> /tmp/bug.py(3)flatten()
-> for element in tree:
(Pdb) print base_list
[1, 2] # 1 showed up among subtree elements, bug!

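This is a common bug: using a mutable object as a default argument value! The default list is created once, when the function is defined, and every call that omits base_list shares that same list.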
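To see the sharing in isolation, here is a minimal sketch. The function append_item is a hypothetical helper invented for this illustration; it is not part of the program above.

```python
def append_item(item, bucket=[]):  # hypothetical helper; the mutable default is deliberate
    """Append item to bucket and return bucket."""
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)

# Both calls returned the very same list object.
print(first is second)           # True
print(second)                    # [1, 2]

# The default value is stored on the function object itself.
print(append_item.__defaults__)  # ([1, 2],)
```

Printing the function's defaults from inside a pdb session is a quick way to confirm this kind of bug.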

Now we can correct our program:

def flatten(tree, base_list=None):
  """Outputs elements of the tree as a list in DFS order."""
  if base_list is None:
    base_list = []
  for element in tree:
    if isinstance(element, list):
      base_list += flatten(element)
    else:
      base_list.append(element)
  return base_list
 
tree = [1, [2, [3, 4]], 5]
print flatten(tree)

Running it produces the intended output:

[1, 2, 3, 4, 5]

Success! We debugged and fixed the problem.
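One detail of the None-default idiom deserves a note: the guard can test for None or for truthiness, and the two behave differently when the caller passes in an empty list of their own. The sketch below compares both; the names flatten_truthy and flatten_is_none are made up for this comparison, and the code uses print() so it also runs on Python 3.

```python
def flatten_truthy(tree, base_list=None):
    """Variant whose guard tests truthiness."""
    if not base_list:            # also true for a caller-supplied []
        base_list = []
    for element in tree:
        if isinstance(element, list):
            base_list += flatten_truthy(element)
        else:
            base_list.append(element)
    return base_list

def flatten_is_none(tree, base_list=None):
    """Variant whose guard tests for None only."""
    if base_list is None:        # a caller-supplied [] is kept
        base_list = []
    for element in tree:
        if isinstance(element, list):
            base_list += flatten_is_none(element)
        else:
            base_list.append(element)
    return base_list

out_a = []
flatten_truthy([1, [2, 3]], out_a)
out_b = []
flatten_is_none([1, [2, 3]], out_b)

print(out_a)  # [] (the caller's list was silently replaced)
print(out_b)  # [1, 2, 3]
```

Both variants return the same result; they differ only in whether a caller's empty accumulator is filled in place.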

Workaround for Python bug: 'ascii' codec can't encode character u'\xa0' in position 111: ordinal not in range(128)

mypy | 18 July, 2012 22:46

Have you ever needed to read a Unicode data file from Python?

If so, you know that it is harder than it sounds.

Even if you set your environment to use UTF-8 (e.g. export LANG=fr_FR.UTF8), Python as of 2.7.1 might still not pick this up and will try to read the file as ASCII, resulting in the all too common error: 'ascii' codec can't encode character u'\xa0' in position 111: ordinal not in range(128)

After lots of trial and error, I found a workaround that works. First of all, check whether you have this problem by executing:

import sys
sys.getdefaultencoding()

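If it comes back with 'ascii', then read on.

The default encoding needs to be changed. However, sys.setdefaultencoding is removed from the sys module during interpreter startup (by the site module), so it only becomes available again after the sys module is reloaded.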

Here is a complete solution:

import sys
reload(sys)  # Python 2 only: brings back sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding("utf8")
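A different technique, which leaves the interpreter-wide default alone, is to state the encoding explicitly wherever a file is opened. The sketch below uses io.open, which is available from Python 2.6 onwards and is the built-in open on Python 3. The file name sample.txt and its contents are assumptions made up for the demonstration.

```python
import io
import os
import tempfile

# Hypothetical sample file containing a non-breaking space (u'\xa0').
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\xe9\xa0au lait\n")

# Decode explicitly while reading; no reliance on the default encoding.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Encode explicitly when bytes are needed again.
data = text.encode("utf-8")
print(repr(text))
print(repr(data))
```

With this approach the text is Unicode inside the program and is converted to bytes only at the boundaries, so the 'ascii' codec is never involved.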

Python vs. Haskell vs. PHP - real world performance

mypy | 02 November, 2010 00:10

Recently I had to optimize a legacy PHP CGI application, which worked fine but was too slow for its purpose. The main bottleneck was in selecting matching lines from a file containing about 15000 lines. Not finding a way to optimize the PHP code, I decided to change the language.
The two candidates were Python and Haskell. Knowing that Haskell is the only language of the three that compiles to machine code, I expected it to be a clear winner, but I was in for a surprise...

Benchmark results:

PHP: 18ms (PHP 5.3.3)
Python: 4ms (Python 2.6)
Haskell: 38ms (compiled with ghc --make -dynamic -O2, ghc 6.12)

Here is the code of the FastCGI application I benchmarked in the three languages:

PHP code:

define('MAX_NAMES', 30);
$handle = fopen("names.dat", "r");
 
$name = strtoupper($_REQUEST['q']);
$res = array();
$size = 0;
 
while (($customer_name = fgets($handle, 4096)) !== false) {
    if (strpos($customer_name, $name) !== false) {
        $res[] = "'" . trim($customer_name) . "'";
        $size += 1;
        if ($size > MAX_NAMES + 1) {
            break;
        }
    }
}
 
fclose($handle);
print '['.implode(',',$res).']';

Python code:

import cgi
import sys, os
from flup.server.fcgi import WSGIServer
 
 
MAX_NAMES = 30
 
def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/html')])
    form = cgi.FieldStorage(fp=environ['wsgi.input'], environ=environ)
    name = form["q"].value.upper()
    output_names = []
    count = 0
 
 
    with open("names.dat") as f:
        for line in f:
            if name in line:
                output_names.append(line.strip())
                count += 1
                if count > MAX_NAMES:
                    break

    yield "['" + "','".join(output_names) + "']"
 
WSGIServer(app).run()

Haskell code:

import Control.Concurrent
import System.Posix.Process (getProcessID)
import Char
import Data.List
import Data.Maybe
 
import Network.FastCGI
 
bankSuggest :: CGI CGIResult
bankSuggest = do setHeader "Content-type" "text/plain"
                 q <- getInput "q"
                 let key = maybe "" (map Char.toUpper) q
                 let n = 30
                 fileContent <- liftIO $ readFile "names.dat"
                 let result = if key /= ""
                                    then take n $ filter (Data.List.isInfixOf key) $ lines fileContent
                                    else []
                 output $ "['" ++ intercalate "','" result ++ "']"
 
main = runFastCGIConcurrent' forkIO 1 bankSuggest

The parameter q for this benchmark was selected to produce fewer than 30 entries, which is the most common case for this application. As a result, lazy evaluation and early loop termination did not help performance.

To sum up, the Python version is 4.5 times faster than PHP and almost 10 times faster than Haskell, which is pretty amazing.
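The effect of the match count on early termination can be sketched in a few lines of Python. The list of names below is made-up data standing in for names.dat, and matching_lines is a hypothetical helper that records every line it examines.

```python
from itertools import islice

def matching_lines(lines, key, scanned):
    """Yield lines containing key, recording each examined line in scanned."""
    for line in lines:
        scanned.append(line)
        if key in line:
            yield line

# Hypothetical data standing in for names.dat.
names = ["ALPHA BANK", "BETA BANK", "GAMMA TRUST", "DELTA BANK", "OMEGA TRUST"]

# More matches than the limit: scanning stops as soon as the limit is hit.
scanned_many = []
first_two = list(islice(matching_lines(names, "BANK", scanned_many), 2))
print(len(scanned_many))  # 2

# Fewer matches than the limit: every line has to be examined.
scanned_few = []
trusts = list(islice(matching_lines(names, "TRUST", scanned_few), 30))
print(len(scanned_few))   # 5
```

When the number of matches stays below the limit, as in the benchmark, the whole file is scanned no matter how lazily the results are consumed.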

 

Follow up:

As Don Stewart pointed out, the String type is quite slow in Haskell, and a faster alternative is ByteString.
So I've rewritten the Haskell CGI using ByteString functions:

import Control.Concurrent
import System.Posix.Process (getProcessID)
import Char
import Data.Maybe
import Data.ByteString.Char8 as BS
 
import Network.FastCGI
 
bankSuggest :: CGI CGIResult
bankSuggest = do setHeader "Content-type" "text/plain"
                 q <- getInput "q"
                 let key = BS.pack (maybe "" (Prelude.map Char.toUpper) q)
                 let n = 30
                 fileContent <- liftIO $ BS.readFile "names.dat"
                 let result = if key /= ""
                                    then Prelude.take n $ Prelude.filter (BS.isInfixOf key) $ BS.lines fileContent
                                    else []
                 output $ "['" ++ BS.unpack (BS.intercalate "','" result) ++ "']"
 
main = runFastCGIConcurrent' forkIO 1 bankSuggest

New benchmark results:

PHP: 18ms (PHP 5.3.3)
Python: 4ms (Python 2.6)
Haskell using String: 38ms (compiled with ghc --make -dynamic -O2, ghc 6.12)
Haskell using ByteString: 8ms (compiled with ghc -XOverloadedStrings --make -dynamic -O2, ghc 6.12)

Indeed, the Haskell code using ByteString is almost 5 times faster than the Haskell code using String.
However, it is still twice as slow as Python!
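For reference, the speed ratios quoted in this post follow directly from the timings listed above. A small sketch of the arithmetic (the dictionary keys are labels chosen here for readability):

```python
# Timings in milliseconds, copied from the benchmark results in this post.
timings_ms = {
    "PHP": 18,
    "Python": 4,
    "Haskell (String)": 38,
    "Haskell (ByteString)": 8,
}

def ratio(slower, faster):
    """How many times longer `slower` takes than `faster`."""
    return timings_ms[slower] / float(timings_ms[faster])

print(ratio("PHP", "Python"))                                # 4.5
print(ratio("Haskell (String)", "Python"))                   # 9.5
print(ratio("Haskell (String)", "Haskell (ByteString)"))     # 4.75
print(ratio("Haskell (ByteString)", "Python"))               # 2.0
```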

Python 3000

mypy | 08 June, 2008 21:26

Python 3000 is coming. 

Guido van Rossum explains what Python developers should expect from Python 3000 in his blog article and keynote.

Python 3000 is not compatible with the current Python 2. Most changes are language clean-up and removal of deprecated features (classic classes, string exceptions). Overall, the language will become smaller, with fewer surprises and special cases. Guido urges developers not to change their APIs while porting and promises to support Python 2.6 for at least 5 years. He mentions that the 2to3 tool (a source-to-source translator) will make it easier to migrate from Python 2 to Python 3000.

The final release of Python 3.0 is expected in August 2008.
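As a rough illustration of the clean-up, here is a small sketch written in the style Python 3000 requires; it also runs on Python 2.6. The names ParseError, Node and parse are hypothetical and exist only for this example.

```python
class ParseError(Exception):
    """Exceptions must be classes; string exceptions are gone."""

class Node(object):
    """Explicit object base: new-style on Python 2, the only kind on Python 3."""

def parse(text):
    """Split text into words, rejecting empty input."""
    if not text:
        raise ParseError("empty input")
    return text.split()

try:
    parse("")
except ParseError as exc:   # the "except ParseError, exc" spelling is not accepted in Python 3
    message = str(exc)

print(message)              # print is a function, so the parentheses are required
print(parse("python 3000"))
```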

 

 