Table recognition

Hello everyone. From time to time I need to recognize a table in an image and extract the information from its cells. Could you tell me where to start digging? Can existing free packages (Tesseract, CuneiForm, ...) be adapted for this, and if not, how hard would it be to implement the recognition myself?

Upd: I started studying the OpenCV package, and it looks very much like it has the functionality I need. Python code example:

import numpy as np
import cv2

im = cv2.imread('test.jpg')
imgray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
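# Binarize: pixels brighter than 127 become 255, the rest become 0
ret, thresh = cv2.threshold(imgray, 127, 255, cv2.THRESH_BINARY)
# findContours returns 3 values in OpenCV 3.x but 2 in 2.x and 4.x; [-2:] works with all of them
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]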
Author: Oceinic, 2014-01-09

1 answer

In general, it can be done something like this; maybe it will be useful to someone ))

# Stan 2014-01-21
# -*- coding: utf-8 -*-

import cv2
import matplotlib.pyplot as plt
from matplotlib.contour import ContourSet
import matplotlib.cm as cm

im = cv2.imread('d:\\doc3.jpg')

imgray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
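ret, thresh = cv2.threshold(imgray, 127, 255, cv2.THRESH_BINARY)
# [-2:] keeps (contours, hierarchy) whether findContours returns 2 or 3 values
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]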

fig = plt.figure()
ax = fig.gca()

for i, (c, h) in enumerate(zip(contours, hierarchy[0])):
    area = cv2.contourArea(c)
    print(i, area, h)

    # c has shape (N, 1, 2): take the N (x, y) points and repeat the
    # first one at the end so that the outline is drawn closed
    outline = c[:, 0, :]
    closed = outline[list(range(len(outline))) + [0]]
    ContourSet(ax, [0], [[closed]], cmap=cm.cool)

# Image rows grow downward, so flip the y axis to match the picture
ax.invert_yaxis()

plt.show()
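
To get from contours to the contents of the cells, the rectangles still have to be put into reading order. Below is a rough sketch of that step, assuming the cells are already available as (x, y, w, h) tuples such as cv2.boundingRect returns. The function group_into_rows and its tolerance parameter are hypothetical names, and the sample boxes are invented.

```python
def group_into_rows(boxes, tolerance=10):
    """Group (x, y, w, h) boxes into rows, top to bottom, each row left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Same row if the top edge is close to the top edge of the row's first box
        if rows and abs(box[1] - rows[-1][0][1]) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


# Invented sample: four cells of a 2x2 table, slightly misaligned and unordered
sample = [(102, 62, 78, 38), (22, 22, 78, 38), (22, 63, 78, 38), (102, 21, 78, 38)]
table = group_into_rows(sample)
for row in table:
    print(row)
```

Each cell can then be cropped from the image with im[y:y+h, x:x+w] and handed to an OCR engine such as Tesseract. The tolerance of 10 pixels is a guess and depends on the scan resolution.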
Author: Stan, 2014-01-21 16:25:05