Text column identification on newspaper-like images. Java needed.
Task level: intermediate
Length and deadline: a few hours of work, to be delivered within a few days
Needed knowledge: Java, basic image processing, IO handling
Specification: as below
The general goal is to identify the positions of the columns in an image that contains text arranged in newspaper-like columns.
We NEED the following:
- a class, ideally with a single callable method, that does the work for us
- short and clear comments in the code
- short and clear notes on how to test your code
We DON'T NEED the following:
- A user interface.
- Detailed documentation.
You will need 4 test images which contain the following text column arrangements:
- Test image 1: One title, one subtitle, one text column. The title is not wider than the column.
- Test image 2: One title, one subtitle, two text columns. The title starts at the left edge of the first column and is wider than the first column, so it extends over the second column.
- Test image 3: Same as test image 2 but with "disturbing" features such as a thick black line above the whole text, a few handwritten words at the bottom, a bar code on the side that starts at the edge of the page but doesn't reach the text, and some "dirt" pixel groups between the two columns, each no bigger than 1-3 letters.
- Test image 4: Like two copies of test image 3 arranged vertically, where the second part is shifted sideways by half a column width relative to the first one, so that the columns of the two parts do not line up vertically.
We accept your code:
- if the class works properly on the test cases specified above. You need to provide a simple test class that calls your class four times, once for each of the four test cases, and writes the results to standard output in a simple but clear format.
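One possible shape for such a test class is sketched below. It is only an illustration: the names `ColumnTestRunner`, `runAll` and `formatColumn`, the file names `test1.png` to `test4.png`, and the idea of passing the detector in as a function that returns rows of {left, right, ratio} are all assumptions, not part of the specification. A stub detector stands in for the real class so that the sketch runs without image files.

```java
import java.io.File;
import java.util.Locale;
import java.util.function.Function;

/** Hypothetical test runner: calls a detector once per test image and prints the result. */
final class ColumnTestRunner {

    /** Formats one column as a single, easy-to-read output line. */
    static String formatColumn(int index, int left, int right, double ratio) {
        return String.format(Locale.ROOT,
                "  column %d: left=%d right=%d ratio=%.3f", index, left, right, ratio);
    }

    /** Runs the detector on every file; each detector row is assumed to be {left, right, ratio}. */
    static void runAll(String[] files, Function<File, double[][]> detector) {
        for (String name : files) {
            System.out.println(name);
            try {
                double[][] cols = detector.apply(new File(name));
                for (int i = 0; i < cols.length; i++) {
                    System.out.println(formatColumn(
                            i + 1, (int) cols[i][0], (int) cols[i][1], cols[i][2]));
                }
            } catch (RuntimeException e) {
                // Keep going so that one failing image does not hide the other results.
                System.out.println("  FAILED: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        // Stub detector with made-up values; replace it with a call to the real class.
        Function<File, double[][]> stub =
                f -> new double[][] {{0, 234, 0.12}, {235, 499, 0.10}};
        runAll(new String[] {"test1.png", "test2.png", "test3.png", "test4.png"}, stub);
    }
}
```

Printing one line per column keeps the output easy to compare by eye against the known layout of each test image.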
The needed class:
The goal is to find the likely vertical boundaries of text columns in an image that has newspaper-like columns, titles and, in some cases, sideways-shifted columns.
- Input: a B&W image file that contains newspaper-like text columns.
- Output: a series of data sets describing the supposed columns and their boundaries. The data set for one supposed column consists of:
  - the horizontal position of the supposed left edge of the column,
  - the horizontal position of the supposed right edge of the column,
  - the average B&W ratio of the column.
The left edge of the first column is always the left side of the original image, and the right edge of the last column is always the right side of the original image. If the right edge of column X is, for example, at horizontal position 234, then the left edge of column X+1 is at 235. The two are never at the same position and are always exactly 1 pixel apart.
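The edge rules above can be written down as a small data holder plus a consistency check. This is a minimal sketch, not a required design: the class name `ColumnInfo`, its field names and the helper `coversImage` are hypothetical, and the "B&W ratio" is assumed here to mean the share of black pixels between 0.0 and 1.0.

```java
import java.util.List;

/** One supposed column; names are illustrative and not prescribed by the task. */
final class ColumnInfo {
    final int left;          // horizontal position of the left edge, inclusive
    final int right;         // horizontal position of the right edge, inclusive
    final double blackRatio; // assumed meaning: average share of black pixels, 0.0 to 1.0

    ColumnInfo(int left, int right, double blackRatio) {
        this.left = left;
        this.right = right;
        this.blackRatio = blackRatio;
    }

    /**
     * Checks the edge rules: the first column starts at x = 0, the last one ends at
     * x = imageWidth - 1, and each column starts exactly 1 pixel after the previous one ends.
     */
    static boolean coversImage(List<ColumnInfo> columns, int imageWidth) {
        if (columns.isEmpty()) {
            return false;
        }
        int expectedLeft = 0;
        for (ColumnInfo c : columns) {
            if (c.left != expectedLeft || c.right < c.left) {
                return false;
            }
            expectedLeft = c.right + 1;
        }
        return expectedLeft == imageWidth;
    }
}
```

For a 500-pixel-wide image, the columns (0, 234) and (235, 499) pass this check, while (0, 234) and (234, 499) do not, because the two edges would share position 234.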
Calculate the ratio of black to white pixels on each vertical line, from left to right. Look for sudden, large changes in this ratio. We assume that the ratio changes suddenly and to a large extent at the edges of columns. Find the vertical lines that are likely to be left or right column edges, then prepare the output data set.
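The steps just described could be sketched roughly as follows. Everything here is an assumption made for illustration: the names `ColumnEdgeSketch`, `Segment`, `blackRatioPerLine` and `split`, the gray-level cut-off of 128 for "black", the use of the share of black pixels as the ratio, and the parameter `minJump`, whose value would have to be tuned on the test images. The sketch treats every large jump between two neighbouring lines as an edge, so blank gutters and margins come out as segments of their own with a ratio near zero.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Illustrative edge finder; a naive sketch, not a tuned implementation. */
final class ColumnEdgeSketch {

    /** One segment between two detected edges, with its average black ratio. */
    static final class Segment {
        final int left;
        final int right;
        final double ratio;

        Segment(int left, int right, double ratio) {
            this.left = left;
            this.right = right;
            this.ratio = ratio;
        }
    }

    /** Share of black pixels on each vertical line: 0.0 is all white, 1.0 is all black. */
    static double[] blackRatioPerLine(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        double[] ratio = new double[w];
        for (int x = 0; x < w; x++) {
            int black = 0;
            for (int y = 0; y < h; y++) {
                int rgb = img.getRGB(x, y);
                int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < 128) { // assumed cut-off between black and white
                    black++;
                }
            }
            ratio[x] = (double) black / h;
        }
        return ratio;
    }

    /** Cuts [0, width - 1] wherever the ratio jumps by at least minJump between neighbours. */
    static List<Segment> split(double[] ratio, double minJump) {
        List<Segment> segments = new ArrayList<>();
        if (ratio.length == 0) {
            return segments;
        }
        int left = 0;
        double sum = ratio[0];
        for (int x = 1; x < ratio.length; x++) {
            if (Math.abs(ratio[x] - ratio[x - 1]) >= minJump) {
                // Line x - 1 closes the current segment; line x opens the next one.
                segments.add(new Segment(left, x - 1, sum / (x - left)));
                left = x;
                sum = 0;
            }
            sum += ratio[x];
        }
        segments.add(new Segment(left, ratio.length - 1, sum / (ratio.length - left)));
        return segments;
    }
}
```

On real scans the per-line ratio inside a text column is noisy, so smoothing the ratio over a few neighbouring lines before looking for jumps may be needed. Titles that extend over a gutter also add some black to the gutter lines, which is one more reason to tune `minJump` on the test images.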
The class should have proper extension handling.
Your code has to be optimized. Try to find an efficient way to read the vertical lines and process them quickly. Proposing a coding solution for this is part of your task.
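One candidate for fast line access, offered only as a suggestion, is to avoid reading the image one pixel at a time. The sketch below reads a whole row per call with the bulk form of `BufferedImage.getRGB` and adds up the black pixels of all vertical lines in a single pass over the image. The class name `FastLineCounts`, the method name `blackCountPerLine` and the cut-off of 128 are assumptions, and whether this is fast enough would have to be measured on the real images.

```java
import java.awt.image.BufferedImage;

/** Illustrative single-pass counter; names and threshold are assumptions. */
final class FastLineCounts {

    /** Counts the black pixels on every vertical line, reading the image row by row. */
    static int[] blackCountPerLine(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[] counts = new int[w];
        int[] row = new int[w]; // reused buffer holding one row of packed ARGB values
        for (int y = 0; y < h; y++) {
            // One bulk call per row instead of one call per pixel.
            img.getRGB(0, y, w, 1, row, 0, w);
            for (int x = 0; x < w; x++) {
                int rgb = row[x];
                int gray = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < 128) { // assumed cut-off between black and white
                    counts[x]++;
                }
            }
        }
        return counts;
    }
}
```

Dividing each count by the image height gives the per-line black ratio. Walking the rows in order also keeps memory access sequential, and the only extra memory needed is one row buffer and one counter per vertical line.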