Discussion:
tricky text manipulation
n***@gmail.com
2012-09-20 18:51:28 UTC
My use of this utility would be to delete the inevitable
repeated/redundant 'packing' that we get when we
fetch http-textS and append them to accumulatorFile.

Each URL-fetch accumulates its 3 parts:
u : the URL
C: the contents
|: the one-line-separator: "<><><>".

A sequence of 4 triplets as described above can be shown as:
uA|vB|wC|xD|
where
u,v,w,x are the 1-line URLs
A,B,C,D are the multiline text-blocks [fetched by lynx/links]
and | are the one-line-separators "<><><>".

But now, the contents A,B,C,D will contain common,
repeated/redundant text, which usually is near the
page-beginning, but must be handled anywhere in the page.

So then, with the garbage represented as 8, it looks like:
u8A|v8B|wC8|x8D|

The algorithm that I see is:

the human starts reading/editing accumulatorFile,
and notices the/some repeated/redundant garbage,
which he pastes out to FileH;
and then the program does:
scan accumulatorFile, and delete all copies of the
text-block H, except the first copy.

That shouldn't be difficult, but the following refinement
is also required:
for matching purposes, ignore all "["d{d}"]" and
spaces and tabs. So the following 2 lines should match:
the [4] cat sat on the mat
the [27] cat sat on the mat
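
A minimal Emacs Lisp illustration of that matching rule (a sketch of my
own, not an existing command): loosen the marked text into a regexp in
which any bracketed number and any run of spaces/tabs may vary.

;; Both example lines match this one loosened regexp:
;; runs of spaces/tabs become [ \t]+ and the bracketed number becomes \[[0-9]+\].
(let ((loose "the[ \t]+\\[[0-9]+\\][ \t]+cat[ \t]+sat[ \t]+on[ \t]+the[ \t]+mat"))
  (list (string-match-p loose "the [4] cat sat on the mat")     ; => 0, i.e. a match
        (string-match-p loose "the [27] cat sat on the mat")))  ; => 0, i.e. a match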

== TIA.

PS. by just pasting the next round of URLs to a file,
for their pages to be appended to your <book>,
you can collect and MANAGE some of the good
education available.
n***@gmail.com
2012-09-23 16:34:35 UTC
]I think you forgot to ask a question.

Indeed, although you correctly deduced that it WAS a question
and not an announcement.

The Question is: how would one implement this in lisp/emacs,
and/or are there some pointers to previous work on related tasks.
--------------------
Post by n***@gmail.com
My use of this utility would be to delete the inevitable
repeated/redundant 'packing' that we get when we
fetch http-textS and append them to accumulatorFile.
u : the URL
C: the contents
|: the one-line-separator: "<><><>".
uA|vB|wC|xD|
where
u,v,w,x are the 1-line URLs
A,B,C,D are the multiline text-blocks [fetched by lynx/links]
and | are the one-line-separators "<><><>".
But now, the contents: A,B,C,D will contain common,
repeated/redundant text, which usually is near the
page-beginning, but must be handled anywhere in the page.
u8A|v8B|wC8|x8D|
the human starts reading/editing accumulatorFile,
and notices the/some repeated/redundant/garbage,
which he pastes out to FileH;
scan accumulatorFile, and delete all copies of the
text-block H, except the first copy.
That shouldn't be difficult, but the following refinement
for matching purposes, ignore all "["d{d}"]" and
the [4] cat sat on the mat
the [27] cat sat on the mat
== TIA.
PS. by just pasting the next round of URLs to a file,
for their pages to be appended to your <book>,
you can collect and MANAGE some of the good
education available.
]I think you forgot to ask a question.

]YAWIA, hk

]ps. Your algorithm includes a human copy-pasting to a file?

Yes, did you write that or did I?
Kaz Kylheku
2012-09-23 18:08:50 UTC
Post by n***@gmail.com
]I think you forgot to ask a question.
Indeed, although you correctly deduced that it WAS a question
and not an announcement.
The Question is: how would one implement this in lisp/emacs,
and/or are there some pointers to previous work on related tasks.
Implement what? You don't have a specification. Where are your
representative inputs, and what does the output look like?

Is this your question:

"How would we implement a tool which finds multi-line pattern matches in a text
file, and removes the second and subsequent matches?"

But do you really need that tool, or do you need a concrete problem solved:
doing a specific text-scraping job on some specific web pages?

It sounds like you're assuming that the problem at hand must be solved with
this approach, but maybe it's not the best approach. You should let expert
eyes look at the input data and the desired output, and design the solution
from first principles.

I happen not to think that writing patterns to remove garbage is the best
approach. It is probably best to look for positive patterns which delimit what
you want to retain. The garbage is then, implicitly, everything you don't
retain.

Anyway how you would write the tool you're looking for is by making a program
which reads the pattern file, builds a pattern-matching object from it, and
then scans the target file, looking for places (starting lines) where the
pattern matches, keeping track of the match number so that the second and
subsequent matches can be removed.
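
A rough Emacs Lisp sketch along those lines (the names are illustrative,
and the "pattern-matching object" here is simply the marked block loosened
into a regexp so that bracketed numbers and whitespace runs may vary, per
the refinement asked for above):

(defun my-loose-regexp (block)
  "Return a regexp matching BLOCK literally, except that bracketed
numbers like [4] and runs of spaces/tabs are allowed to differ."
  (let ((pos 0) (parts nil))
    (while (string-match "\\[[0-9]+\\]\\|[ \t]+" block pos)
      ;; literal text up to the variable part, quoted
      (push (regexp-quote (substring block pos (match-beginning 0))) parts)
      ;; the variable part itself: any bracketed number, or any whitespace run
      (push (if (string-prefix-p "[" (match-string 0 block))
                "\\[[0-9]+\\]"
              "[ \t]+")
            parts)
      (setq pos (match-end 0)))
    (push (regexp-quote (substring block pos)) parts)
    (apply #'concat (nreverse parts))))

(defun my-delete-later-copies (pattern-file target-file)
  "Delete every copy but the first of the block in PATTERN-FILE from TARGET-FILE."
  (let ((regexp (my-loose-regexp (with-temp-buffer
                                   (insert-file-contents pattern-file)
                                   (buffer-string))))
        (count 0))
    (with-temp-buffer
      (insert-file-contents target-file)
      (goto-char (point-min))
      (while (re-search-forward regexp nil t)
        (setq count (1+ count))
        (when (> count 1)               ; keep only the first occurrence
          (replace-match "" t t)))
      (write-region (point-min) (point-max) target-file))))

;; e.g. (my-delete-later-copies "FileH" "accumulatorFile")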

I designed an entire text-scraping programming language which does complex
multi-line pattern matching over entire documents. Faced with your
text-scraping task, I would just use that language. That still requires
programming: looking at the data, considering the output, forming a
text-extraction strategy, and expressing it in the language.
n***@gmail.com
2012-09-23 22:30:10 UTC
Post by Kaz Kylheku
Post by n***@gmail.com
]I think you forgot to ask a question.
Indeed, although you correctly deduced that it WAS a question
and not an announcement.
The Question is: how would one implement this in lisp/emacs,
and/or are there some pointers to previous work on related tasks.
Implement what? You don't have a specification. Where are your
representative inputs, and what does the output look like?
You've snipped my spec. And your description below proves that you
read and understood it.
Post by Kaz Kylheku
"How would we implement a tool which finds multi-line pattern matches in a text
file, and removes the second and subsequent matches?"
Yes.
Post by Kaz Kylheku
doing a specific text scraping job on some specific web pages.
It sounds like you're assuming that the problem at hand must be solved with
this approach, but maybe it's not the best approach. You should let expert
eyes look at the input data and the desired output, and design the solution
from first principles.
The origin of the data is described below.
Post by Kaz Kylheku
I happen not to think that writing patterns to remove garbage is the best
approach. It is probably best to look for positive patterns which delimit what
you want to retain. The garbage is then, implicitly, everything you don't
retain.
No. It is to be removed because it is REPEATED/redundant.
The good stuff is unique. Computers do repetition.
It's a horse&rider method: the human inspects the text and says
<delete all further copies [or very similar] of this marked block>.
Post by Kaz Kylheku
Anyway how you would write the tool you're looking for is by making a program
which reads the pattern file, builds a pattern-matching object from it, and
then scans the target file, looking for places (starting lines) where the
pattern matches, keeping track of the match number so that the second and
subsequent matches can be removed.
I designed an entire text-scraping programming language which does complex
multi-line pattern matching over entire documents. Faced with your
text-scraping task, I would just use that language. That still requires
programming: looking at the data, considering the output, forming a
text-extraction strategy, and expressing it in the language.
==========
I favour the incremental/successive-refinement approach:
I'd start reading InFile, and when I see repetitive garbage,
I'd mark it, and send it to the DeleteFile. Then say:
InFile minus Repeated:DeleteFile -> OutFile.

Or a sequence of DeleteFile1, DeleteFile2...DeleteFileN
could successively filter the garbage.
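
A hypothetical driver for that pipeline, leaning on the
my-delete-later-copies sketch from earlier in the thread (so this is not
self-contained on its own):

(defun my-infile-minus-deletes (in-file out-file delete-files)
  "Copy IN-FILE to OUT-FILE, then strip all-but-first copies of each
block listed in DELETE-FILES (a list of file names), in order."
  (copy-file in-file out-file t)          ; t: overwrite an existing OutFile
  (dolist (delete-file delete-files)
    (my-delete-later-copies delete-file out-file)))

;; e.g. (my-infile-minus-deletes "InFile" "OutFile"
;;                               '("DeleteFile1" "DeleteFile2" "DeleteFileN"))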
==================

This is typically how InFile is derived:

FetchAppend URLContents InFile

Where `FetchAppend` uses lynx/links to repeatedly fetch
the URL contents -- the URLs being one-per-line in File:URLContents --
and append them to InFile,
but with a header, being the URL,
and a single-line tail-record-separator, being:
"<><><><><><>".

So if you fetch 12 'pages' from the same <URL-family>,
InFile is the appended [book-like] version of the 'web-site'.
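
For concreteness, such a FetchAppend could be sketched in Emacs Lisp
roughly as follows, assuming lynx is installed and that its -dump option
prints the rendered page text (the function name is mine):

(defun my-fetch-append (url-list-file in-file)
  "For each URL (one per line) in URL-LIST-FILE, append to IN-FILE:
the URL as a header line, the page text as dumped by lynx, and a
one-line tail-record-separator."
  (dolist (url (with-temp-buffer
                 (insert-file-contents url-list-file)
                 (split-string (buffer-string) "\n" t)))
    (with-temp-buffer
      (insert url "\n")                            ; header: the URL itself
      (call-process "lynx" nil t nil "-dump" url)  ; the rendered page text
      (insert "\n<><><><><><>\n")                  ; tail-record-separator
      (append-to-file (point-min) (point-max) in-file))))

;; e.g. (my-fetch-append "URLContents" "InFile")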

2 refinements [at least] are required:
* some of the InFiles [accumulated over years] have different-length
tail-record-separators: "<><><>"
* often one page will have e.g.:
" Table of Contents [13]"
and another page will have:
" Table of Contents [17]"
which must be seen to match.

Thanks.
Kaz Kylheku
2012-09-24 08:43:10 UTC
Post by n***@gmail.com
Post by Kaz Kylheku
Post by n***@gmail.com
]I think you forgot to ask a question.
Indeed, although you correctly deduced that it WAS a question
and not an announcement.
The Question is: how would one implement this in lisp/emacs,
and/or are there some pointers to previous work on related tasks.
Implement what? You don't have a specification. Where are your
representative inputs, and what does the output look like?
You've snipped my spec. And your description below proves that you
read and understood it.
Of course.
Post by n***@gmail.com
Post by Kaz Kylheku
"How would we implement a tool which finds multi-line pattern matches in a text
file, and removes the second and subsequent matches?"
Yes.
Post by Kaz Kylheku
doing a specific text scraping job on some specific web pages.
It sounds like you're assuming that the problem at hand must be solved with
this approach, but maybe it's not the best approach. You should let expert
eyes look at the input data and the desired output, and design the solution
from first principles.
The origin of the data is described below.
Post by Kaz Kylheku
I happen not to think that writing patterns to remove garbage is the best
approach. It is probably best to look for positive patterns which delimit what
you want to retain. The garbage is then, implicitly, everything you don't
retain.
No. It is to be removed because it is REPEATED/redundant.
The good stuff is unique. Computers do repetition.
That's where you are incorrect in seeing only one way.

For instance, when Lisp reads a list, and makes a data structure out of it, the
parentheses are removed. That doesn't mean that Lisp looks for things like )(
and removes them.
Post by n***@gmail.com
It's a horse&rider method: the human inspects the text and says
<delete all further copies [or very similar] of this marked block>.
Or it could be: "look for nuggets of useful information starting after
these familiar lines of junk, until those ones."

You don't necessarily even need a full match for the junk, just enough
of the beginning or end of it to frame the data.
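
As a hedged Emacs Lisp sketch of that framing idea (the two marker regexps
are placeholders to be filled in from the real pages; the name is mine):

(defun my-extract-framed (start-junk-re end-junk-re)
  "Collect, from the current buffer, the stretches of text that begin
where a match of START-JUNK-RE ends and stop where the next match of
END-JUNK-RE begins.  Everything else is treated as junk and dropped."
  (let ((nuggets nil))
    (goto-char (point-min))
    (while (re-search-forward start-junk-re nil t)
      (let* ((beg (match-end 0))              ; content starts after the leading junk
             (end (if (re-search-forward end-junk-re nil t)
                      (match-beginning 0)     ; ... and stops at the trailing junk
                    (point-max))))
        (push (buffer-substring-no-properties beg end) nuggets)))
    (mapconcat #'identity (nreverse nuggets) "\n")))

The markers need only be enough of the familiar junk to be unambiguous.
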
Post by n***@gmail.com
I'd start reading InFile, and when I see repetitive garbage,
InFile minus Repeated:DeleteFile -> OutFile.
Great: you know how the problem should be solved, only you don't quite know how
to write the code, and you won't listen to ideas from people who do know.

Moreover, you won't share your actual test vectors and expected output, and you
somehow want your hand held into producing a working solution (which has to be
your way).

I smell that what you're after is to be someone's programming puppet,
who then takes 100% of the credit for having done it "himself".
n***@gmail.com
2012-09-25 07:56:15 UTC
-- snip --
Post by n***@gmail.com
Post by Kaz Kylheku
I happen not to think that writing patterns to remove garbage is the best
approach. It is probably best to look for positive patterns which delimit what
you want to retain. The garbage is then, implicitly, everything you don't
retain.
No. It is to be removed because it is REPEATED/redundant.
The good stuff is unique. Computers do repetition.
Post by Kaz Kylheku
That's where you are incorrect in seeing only one way.
For instance, when Lisp reads a list, and makes a data structure
out of it, the parentheses are removed. That doesn't mean that
Lisp looks for things like )( and removes them.
I don't understand how this relates to my above [already
admitted as too simplistic] statement. So I assume you TOO
didn't understand my [misdirected] statement.
As previously stated 'we move forward from where we are';
i.e. burden-of-history. So I was thinking of <saving the manual
labour of repeatedly editing delete-all-further-blocks-like-this>.
Which is an ideal repetitive, computerised task.
Post by n***@gmail.com
It's a horse&rider method: the human inspects the text and says
<delete all further copies [or very similar] of this marked block>.
Post by Kaz Kylheku
Or it could be: "look for nuggets of useful information starting after
these familiar lines of junk, until those ones."
I can't parse that.
The human has to recognise the nuggets. The computer can only do
repetition.
Post by Kaz Kylheku
You don't necessarily even need a full match for the junk, just enough
of the beginning or end of it to frame the data.
Yes, that would be an intended refinement.
Post by Kaz Kylheku
Moreover, you won't share your actual test vectors and
expected output,
Mathematics is not a spectator sport.
I believe you have the capability of working with abstract definitions.

BTW, the facility to <delete all further-such-text-stretches> is a
COMMON feature of editors. Except all of mine have TOO small
search/replace buffers. Some editors can even handle Regex.

So for my case, refinements which I can think of NOW, but should
be expandable [that's why it's best to be data-driven] are:
* ignore white-chars
* for "["<num>"]", allow <num> to be ANY decimal-string.

I expect this to be trivial for emacs, and probably ALREADY exists.

BTW, I'm discussing this on <awkNewsgroup> too, so I've
proved that <show us a data sample> doesn't work.

Only baby problems work with data samples.
Empirical methods are limited.
When your teacher told you 'two apples plus two apples
is four apples', did you say "SHOW ME the apples!" ?
Kaz Kylheku
2012-09-25 15:05:03 UTC
Post by n***@gmail.com
BTW, I'm discussing this on <awkNewsgroup> too, so I've
proved that <show us a data sample> doesn't work.
In comp.lang.awk all you've proved is that when you're asked for sample
data, you pull out the same bit about "show me the apples".

I do follow that group, and happen to know that you're trying to
scrape text from an online encyclopedia of philosophy.

(That tends to explain why you're impractical, yet brimming with opinions
about how those with skills ought to approach their craft.)
n***@gmail.com
2012-09-26 11:09:52 UTC
Post by Kaz Kylheku
Post by n***@gmail.com
BTW, I'm discussing this on <awkNewsgroup> too, so I've
proved that <show us a data sample> doesn't work.
In comp.lang.awk all you've proved is that when you're asked for sample
data, you pull out the same bit about "show me the apples".
I do follow that group, and happen to know that you're trying to
scrape text from an online encyclopedia of philosophy.
I vaguely remember THAT example.
Try to think rather of the <systems which remove annoying ads
from TV viewing>. If you had to supply a <sample/test-data>
before the designer could understand what was required .......
You can't discuss music theory with someone who can't read
scores, no matter how well they can busk.
Post by Kaz Kylheku
(That tends to explain why you're impractical, yet brimming with opinions
about how those with skills ought to approach their craft.)
Thanks for NO technical input.

PS. it looks as if I MUST go to emacs.
I really identify with the following goog-find, as an
example of continually refining one's tools:--
RE Builder "query replace this"

I use M-x re-builder a *lot*. It's an interactive regular expression
builder that really helps build proper regexps and is an excellent
learning tool.
However, in using it, I found myself repeating a particular
workflow:
- build regexp in re-builder
- copy it
- run query-replace-regexp
- paste regexp as the search
- type in the replacement
- press return and off you go
This was a lot of repeated work, so I wrote a function, which I bind
to a key in reb-mode-map (C-c M-%, since plain old M-% is
query-replace) which can be run in the re-builder buffer to
automatically search the target buffer (the one that re-builder
matches as you build) for the regexp you built, and replace it with
a string, which is the only prompted argument of the function.
That may all sound complicated, but it's not. The workflow becomes:
- build regexp in re-builder
- run reb-query-this-regexp (C-c M-%)
- type in the replacement
- press return and off you go
Here is the code: ...
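
The code itself is not quoted above; purely as an illustration, a command
in that spirit might look roughly like the sketch below. The name, the key
choice, and the use of re-builder internals (reb-read-regexp,
reb-target-buffer, reb-quit, under the default string/read syntax) are my
guesses, not the original author's code.

(require 're-builder)

(defun my-reb-query-replace-this-regexp (replacement)
  "From the *RE-Builder* buffer, query-replace the regexp being built
in the target buffer, prompting only for REPLACEMENT."
  (interactive "sReplace regexp with: ")
  (let ((regexp (reb-read-regexp))      ; the regexp as currently built
        (target reb-target-buffer))     ; the buffer re-builder is matching against
    (reb-quit)                          ; leave re-builder, restoring the windows
    (switch-to-buffer target)
    ;; replaces from point onward in the target buffer
    (query-replace-regexp regexp replacement)))

;; bound next to plain M-%, as the post above suggests:
(define-key reb-mode-map (kbd "C-c M-%") #'my-reb-query-replace-this-regexp)
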
------------
I reckon he can 'read the music' rather than
'play it and I'll see what it sounds like'.
------------------
]A lot of productivity comes out of being able to ignore irrelevant
] code, so these keys should become second-nature.
Yes: emacs makes sense to me, although I need a menu.
http://www.emacswiki.org/emacs/ReplaceRegexp
perhaps that does it?
Kaz Kylheku
2012-09-26 17:13:19 UTC
Post by n***@gmail.com
I found myself repeating a particular
- build regexp in re-builder
- copy it
- run query-replace-regexp
- paste regexp as the search
- type in the replacement
- press return and off you go
And, of course, you did all this without the sample data, just as you're
asking others to do. That might have been a source of difficulty.
n***@gmail.com
2012-09-24 05:24:06 UTC
Of the 2 editors that I commonly use:-
*mcedit can: deleteAll [replace with blank] FORWARD. But only single line/s.
*ETHO can: ReplaceAll TEXT-stretch [forward from current position]. But the
search buffer is too small. That's why I thought of using a file.

mcedit is labeled <can do Regex> also, although I've never tried it.

The 2 Regex matches which I currently can think of:
different lengths of "\n<><>\n" and "[<digits>]",
are probably not the only ones.
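
For what it's worth, as Emacs regexps those two might look something like
this (my guess at the intent; the first assumes the separator is always a
whole line of "<>" pairs):

;; a tail-record-separator line of any length: "<><>", "<><><>", ...
(defconst my-separator-regexp "^\\(<>\\)+$")
;; a bracketed decimal number of any length: [4], [27], [2012], ...
(defconst my-bracketed-number-regexp "\\[[0-9]+\\]")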

So it's looking like a job for a smart editor.
Which is why I expect emacs/lisp to do it.

== TIA
n***@gmail.com
2012-09-24 18:59:46 UTC
Post by n***@gmail.com
Post by Kaz Kylheku
Post by n***@gmail.com
]I think you forgot to ask a question.
Indeed, although you correctly deduced that it WAS a question
and not an announcement.
The Question is: how would one implement this in lisp/emacs,
and/or are there some pointers to previous work on related tasks.
Implement what? You don't have a specification. Where are your
representative inputs, and what does the output look like?
You've snipped my spec. And your description below proves that you
read and understood it.
Post by Kaz Kylheku
"How would we implement a tool which finds multi-line pattern matches in a text
file, and removes the second and subsequent matches?"
Yes.
Post by Kaz Kylheku
doing a specific text scraping job on some specific web pages.
It sounds like you're assuming that the problem at hand must be solved with
this approach, but maybe it's not the best approach. You should let expert
eyes look at the input data and the desired output, and design the solution
from first principles.
The origin of the data is described below.
Post by Kaz Kylheku
I happen not to think that writing patterns to remove garbage is the best
approach. It is probably best to look for positive patterns which delimit what
you want to retain. The garbage is then, implicitly, everything you don't
retain.
No. It is to be removed because it is REPEATED/redundant.
The good stuff is unique. Computers do repetition.
It's a horse&rider method: the human inspects the text and says
<delete all further copies [or very similar] of this marked block>.
----
My reasoning here is too simplistic. If I 'lift positive patterns' then
all the repeated garbage is separated in PARALLEL.
Often one arrives at a sub-optimal approach, because of 'evolution':
the partial approach was optimal for THAT stage, but for a fuller
stage, one needs rather to abandon the initial approach.
OTOH, usually it's the human interface that's most important.

And the 'hello-world' principle, where you've got a working
model from the start, rather than hoping for a perfect item to
pop out finally, has great advantages.

To extract the good-stuff, you have to KNOW that it's good.
And if you knew, you wouldn't need to eg. fetch the tutorial.
Also, with re-reading, material BECOMES redundant.
As knowledge advances the text CONDENSES, in ways that
are unknown at the start.

BTW, I'm learning this, while I'm writing.

Thanks.
Kaz Kylheku
2012-09-23 17:52:23 UTC
Post by n***@gmail.com
But now, the contents: A,B,C,D will contain common,
repeated/redundant text, which usually is near the
page-beginning, but must be handled anywhere in the page.
The problem with your problem description is that it is described
in terms of these abstract hypotheticals.

However, the details depend on the actual data.

You must provide some concrete instances of the actual catenated web pages,
rather than "uA|vB|wC|xD|".

Also show the exact output that you want from each input sample (i.e. each set
of catenated web pages).

Since these things are probably large, you should put them on some file hosting
site (perhaps as a compressed archive) and give a URL.

Without these reference input/output pairs, it is impossible to write, debug,
test, and refine a piece of software.

If you keep the exact input and output pairs to yourself (like you did
throughout the entire large thread in the other newsgroups where you posted
this) it will just be another big waste of time.
Post by n***@gmail.com
u8A|v8B|wC8|x8D|
the human starts reading/editing accumulatorFile,
and notices the/some repeated/redundant/garbage,
which he pastes out to FileH;
That, right off the bat, is not an algorithm. An algorithm must describe how
garbage is recognized and delimited. The above part is essentially
programming. The human prepares a specification of what is "redundant garbage".
Post by n***@gmail.com
scan accumulatorFile, and delete all copies of the
text-block H, except the first copy.
It's probably much more productive to look for what to *keep*, rather
than what to *delete*.

That is to say, take the catenated web pages and scrape them for interesting
content, skipping uninteresting content.
Post by n***@gmail.com
That shouldn't be difficult, but the following refinement
for matching purposes, ignore all "["d{d}"]" and
spaces and tabs.
What is this notation "[" d{d} "]" ? Is it BNF?

You can't just drop random notations in the middle of a sentence without
explaining what they are, and expect to be perfectly understood.
Post by n***@gmail.com
the [4] cat sat on the mat
the [27] cat sat on the mat
Ah, of course, d means digit? Oh, stupid me! Would it kill you to write
something like:

ignore all "[" d { d } "]" where this is EBNF notation, and the
grammar symbol d stands for a digit?