GREP help

Stephen Marsh

I am looking for advice and help with GREP find/replace. I want to delete all text and word spaces in a sentence, EXCEPT for one single key bit of text:

INPUT =
2A - 4x12! - some More tEXT and now the keyword (1) etc $
Z3 - 9-1* some random text and now the new_keyword {88} etc %

OUTPUT =
keyword
new_keyword

As you can see, there are multiple characters, numbers, caps and lowercase, $pecial characters etc. All I wish to do is find the keyword and the new_keyword, retain them both and then delete everything else.

Of course, the unwanted text is not consistent and neither is the keyword, so I am expecting to have to run multiple GREP search and replace commands to clean up the text. Obviously there are many lines of text where I wish to clean things up to only retain the two keywords in question.

Is it possible to search for and retain multiple keywords, while removing everything else?

Can anybody point me to the regular expression/s that are needed? I can handle removing word spaces if I have to, what I am really after is how to single out two consistent single keywords, retain them and remove everything else (except for the CR/hard return at the end of each line).

I guess another way of looking at this is to find and select a keyword, invert the selection and delete and then repeat again on a line by line basis, keeping the return at the end of the original paragraph.

I will be using GREP capable software such as InDesign CS5 and/or TextWrangler.


Stephen Marsh
 
It should be possible, but it is still not quite clear what you want to do, from what to what. If you need to do it in several stages, there is a great plug-in called Multi-Find/Change (Automatication | Multi-Find/Change (CS4/CS5.x/CS6)).

Look into forming a query in three parts: (part 1 = what is in front of the keyword)(part 2 = keyword|new_keyword)(part 3 = what you want to throw away at the end), and replace with $2, which is the second bracketed group.

I still have trouble explaining how to think GREP in a forum post. If you want to try to think it through together, please use my Skype (you should be able to find it from the forum).
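
A minimal sketch of the three-part query described above, written in Python rather than InDesign or TextWrangler syntax (an assumption for illustration only; in Python the second bracket is written \2 rather than $2). The keywords come from the example in the first post, the function name is hypothetical, and the longer alternative is listed first so that "new_keyword" is not cut down to "keyword".

```python
import re

# part 1 = text in front of the keyword, part 2 = the keyword,
# part 3 = everything after it up to the end of the line.
THREE_PARTS = re.compile(r"^(.*?)(new_keyword|keyword)(.*)$", re.MULTILINE)


def keep_second_bracket(text: str) -> str:
    """Replace each matching line with the contents of the second bracket."""
    return THREE_PARTS.sub(r"\2", text)


sample = (
    "2A - 4x12! - some More tEXT and now the keyword (1) etc $\n"
    "Z3 - 9-1* some random text and now the new_keyword {88} etc %"
)
print(keep_second_bracket(sample))
```

Lines that contain neither keyword are left untouched, so the number and order of lines does not change.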
 


Thanks Lukas, I just have problems thinking in GREP!

Imagine hundreds of separate lines of sentence text, each line ending in a hard paragraph return.

Each separate sentence line will have a single keyword that is to be retained in that line, deleting all other text on that line except for the return character.



Stephen Marsh
 
Is anything about the keyword static? A prefix or suffix, irregular spacing or a tab, or a set amount of text after it?
 
Hello Stephen,

In TextWrangler:

find = (.*[^0-9a-zA-Z.][^\n]*)(keyword1|keyword2)([^0-9a-zA-Z.][^\n]*.*)
replace = \2

Keep adding pipes to capture additional keywords rather than running multiple expressions. If this isn't quite what you are looking for let me know and I'll try to help you further. I'm not on here much so if I don't reply in a reasonable timeframe send a note to [email protected].
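
A rough sketch of the "keep adding pipes" idea, assuming Python is available for testing outside TextWrangler. The function names are hypothetical, and the find expression is a simplified, line-anchored variant rather than the exact one above. Keywords are sorted longest first so that a keyword contained in another (such as "keyword" inside "keyword2") does not win by accident.

```python
import re


def build_find_pattern(keywords):
    """Build one find expression with a pipe between each keyword."""
    # Longest keywords first, so "keyword2" is tried before "keyword".
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"^.*?(" + alternation + r").*$", re.MULTILINE)


def keep_keywords(text, keywords):
    """Reduce every line that contains a keyword to that keyword alone."""
    return build_find_pattern(keywords).sub(r"\1", text)
```

Because each line is matched and replaced on its own, the keywords come out in the same order as the source lines.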

Thanks,
Matt Louis
 
Stephen,

If you want to return two keywords rather than one or the other then try:

Find = (.*[^0-9a-zA-Z.][^\n]*)(keyword1)(.*[^0-9a-zA-Z.][^\n]*)(keyword2)([^0-9a-zA-Z.][^\n]*.*)
Replace = \2
\4

This expression will put keyword1 on line 1 and keyword2 on line 2.

- Matt Louis
 
Matt Louis - a very big thank you, you are a champion! This works exactly as I need it to, thanks. This will really help in clearing up a product list. I am finding that Regular Expressions (grep) and MS Excel are really good for clearing up text for use with database apps.

I did not think that it would matter, however it does... My description only used two example lines/items, sadly I have a product list of 171 lines/items where I need to find the keyword and remove everything else on the same line.

The two keywords are fixed, in that any particular line of the 171 will have one of the two specific keywords. However, the keywords do not appear in a consistent order, so I don't want the lines of keywords reordered from where they originally appeared.

keyword2
keyword
keyword2
keyword
keyword2
keyword2
keyword2
keyword
keyword
keyword2

The second regular expression example that you noted deletes the other 169 lines and only returns 2 lines, whereas I need all 171 lines reduced to the keywords in question. I know that 171 lines is not too many to do manually, however the next time that I have to do this there might be a lot more lines or more than two keywords, so automation should ideally take care of everything.


Stephen Marsh
 
Stephen,

I believe your challenge is beyond the scope of Regex. Setting keywords and sequences of keywords then mapping them back out line by line is possible but the odds of one expression working with numerous text files may be overly optimistic. I'm not sure regex was intended to process one line, then the next, and so on. I am uncertain because I am not a developer. I only know what I know out of necessity to configure Twist workflows.

Getting decent at this type of thing happens to be on my bucket list. I'll take a stab at figuring this out using sed or awk commands and if I run into some luck I'll share my findings with you.

- Matt Louis
 
Is it impossible? Are you saying you are looking for "keyword" that may or may not be followed by "2"? That shouldn't be a problem. If you don't want to post the actual keyword explicitly, you may PM me or Skype me and we can try to work it out.
 
Is this sort of automated text editing possible in some other application, such as Word or Excel - using a Macro?

Should I look into some sort of third-party macro tool such as Quick Keys?

Automation should let me do what I can do manually, more efficiently. I can't write scripts, and it has always bugged me that I can't record an action in InDesign. The following can be done manually, I just can't automate it:

1. Convert text selection to tables (1 row per line) [this is a hack to force a line by line approach to selecting text]
2. Find (keyword) - done (keyword is selected).
3. Cut
4. Select All
5. Delete
6. Paste

Repeat steps 2-6 as required.

or

1. Convert text selection to tables (1 row per line) [this is a hack to force a line by line approach to selecting text]
2. Find (keyword) - done (keyword is selected).
3. Left arrow key to deselect and place cursor at the start of the previous find selection
4. CMD SHIFT HOME key (selects all text to the left of the cursor) [this is why working in a table is necessary]
5. Delete
6. Find (Keyword) - done (keyword is selected)
7. Right arrow key to deselect and place cursor at the end of the previous find selection
8. CMD SHIFT END key (selects all text to the right of the cursor) [this is why working in a table is necessary]
9. Delete

Repeat steps 2-9 as required.
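
The manual steps above can be sketched without any regular expressions at all. This is only an illustration in Python (an assumption, since neither InDesign nor TextWrangler is scripted this way); the function names are hypothetical and the text is assumed to use "\n" line endings. If one keyword contains another, the longer one should be passed first.

```python
def keep_keyword_per_line(text, keyword):
    """Mimic find, cut, select all, delete, paste on every line."""
    cleaned = []
    for line in text.split("\n"):
        if keyword in line:
            # The keyword was found: the line becomes the keyword alone.
            cleaned.append(keyword)
        else:
            # No keyword on this line: leave it for a later pass.
            cleaned.append(line)
    return "\n".join(cleaned)


def keep_any_keyword(text, keywords):
    """Repeat the pass once per keyword, like repeating the manual steps."""
    for keyword in keywords:
        text = keep_keyword_per_line(text, keyword)
    return text
```

For example, keep_any_keyword(product_list, ["plainface", "window"]) would leave one keyword per line, in the original line order, where product_list is the pasted text.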


Stephen Marsh
 
Although InDesign scripting would be nice, I have now worked this out myself using other software.

I downloaded jEdit:
jEdit - Programmer's Text Editor - overview

In jEdit I recorded two macros that find and select a desired keyword, cut it, select everything on the line, delete it and then paste. I then set up a custom keyboard shortcut to apply the last-run macro, so it is then a case of holding down the shortcut keys until the end of the product list is reached. Then repeat with the second macro to fix up the other keyword.

I am sure that Word macros would probably be similar, however I found jEdit first.

Thanks for the assistance with Regular Expressions/GREP - I hope that this is helpful to others out there reading this thread. I wanted to keep this as "generic" as possible so that others could benefit, as I imagine that this sort of stuff is not an uncommon task.

EDIT: For reference, just in case anyone is interested - the jEdit macro text is as follows:

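// Search settings: plain text (not regex), case-sensitive, forward only, current buffer.
SearchAndReplace.setSearchString("keyword");
SearchAndReplace.setAutoWrapAround(false);
SearchAndReplace.setReverseSearch(false);
SearchAndReplace.setWholeWord(false);
SearchAndReplace.setIgnoreCase(false);
SearchAndReplace.setRegexp(false);
SearchAndReplace.setSearchFileSet(new CurrentBufferSet());
// Find and select the next occurrence of the keyword.
SearchAndReplace.find(view);
// Cut the selected keyword to the clipboard register ('$').
Registers.cut(textArea,'$');
// Select what is left on the current line and delete it.
textArea.selectLine();
textArea.backspace();
// Paste the keyword back onto the now-empty line.
Registers.paste(textArea,'$',false);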




Stephen Marsh
 
Could your request be restated as wanting to keep the last word in each paragraph, or is there more to it? I'm a little confused by "keyword" in your descriptions.

The last word can be selected by finding any consecutive word characters followed by a period positioned at the end of a paragraph. Are there non-word characters in the keywords? An alternative would be to look for any consecutive non-whitespace characters at the end of a paragraph.

Once the correct text can be selected, it can be isolated by thinking of (first part of paragraph)(second, selected part) and replacing with the second part.

example:
(^.*)(\s)(\S+)(\s*\.*\s*$)
This looks for any consecutive non-whitespace characters at the end of a paragraph, followed by zero or more spaces, zero or more periods and zero or more spaces. If the trailing end could also include quotation marks, the expression will match the last quote if it is surrounded by white space.

Replace with $3, the contents of the third bracket.

I'm trying to show you how to "think" GREP rather than serve the answer, so that you can come up with fine tuning as needed.
 
Actually, I was able to tweak that a little (it was counting the new line as white space, halving the number of lines):
(^.*)(\<\S+)$
This expression looks for the beginning of a word ( \< ) instead of a white space. So the first bracket is any number of characters at the beginning of a paragraph, followed by the beginning of a word and one or more non-space characters. A simple dollar sign is the end of the paragraph, but if you need optional punctuation and spaces stripped from the end, you may need the ending from my previous post. Note that the replace now deals with the second bracket, so $2 is the corrected replace string.
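
A hedged sketch of the same "keep the last run of non-space characters" idea in Python, for anyone who wants to test it outside InDesign. Python's re module has no \< anchor, so this variant relies on \S+ and on [ \t]* for trailing blanks, which also keeps the line break from being counted as white space. The names are hypothetical and "\n" line endings are assumed.

```python
import re

# Any characters from the start of the line, then the final run of
# non-space characters, then optional trailing spaces or tabs.
LAST_TOKEN = re.compile(r"^.*?(\S+)[ \t]*$", re.MULTILINE)


def keep_last_token(text: str) -> str:
    """Replace each non-empty line with its last run of non-space characters."""
    return LAST_TOKEN.sub(r"\1", text)
```

On a line such as "now the keyword (1) etc $" this keeps "$", so it only helps when the wanted text really is the last thing on the line.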
 
Thanks Lukas, I appreciate your repeated attempts to follow where I am coming from and where I wish to be.

The generic example I gave was:

2A - 4x12! - some More tEXT and now the keyword (1) etc $
Z3 - 9-1* some random text and now the new_keyword {88} etc %

I was just trying to show that there could be any amount of upper or lower case characters, numerals and/or punctuation marks preceding and following the "keyword" or the "target words" that I wished to retain. They are not the very last characters at the end of the line.

I have been reading the regular expressions examples in the TextWrangler help, however this is too much like scripting and programming for me, it fries my brain!

I have been working on isolating two repeating keywords from a product list. In this case, it was 171 different envelopes. The keywords in this case were "plainface" and "window". That being said, the client may turn around and say that they would prefer the keywords to be "strip seal", "self seal" and/or "lik'n'stik" - or perhaps that they would like "secretive" or "non-secretive" to be the keywords. I don't know yet which keywords they will finally use. The isolated keywords need to be in the original order as they appeared in the source file, so I can't have regular expressions or other methods re-ordering the list. This is a bit like variable data mapping, different columns of data, that all need to be in the same row order.

11B - 90x145 - strip seal plainface secretive
12 & 3/4 - 92x165 - self seal plainface non-secretive
12 & 3/4 - 92x165 - self seal window (1) secretive
100x230 - self seal plainface secretive
100x230 - self seal window (1) secretive
DL - 110x220 - lik'n'stik plainface non-secretive
DL - 110x220 - lik'n'stik plainface gold
DL - 110x220 - lik'n'stik plainface secretive
DL - 110x220 - lik'n'stik window (1) secretive
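
One way to picture the "columns of data in the same row order" idea is the Python sketch below. It is only an illustration under assumptions: the column names and the grouping of keywords into columns are hypothetical, and the final keyword list is still undecided. Each source line yields exactly one tab-separated row, so the order never changes, and a missing keyword (such as "gold" in place of a secretive/non-secretive value) simply leaves an empty cell.

```python
import re

# Hypothetical column definitions built from the keywords named above.
COLUMNS = {
    "seal": ["strip seal", "self seal", "lik'n'stik"],
    "face": ["plainface", "window"],
    "privacy": ["non-secretive", "secretive"],
}


def first_keyword(line, keywords):
    """Return the leftmost keyword from the list that occurs in the line, else ''."""
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    match = re.search(pattern, line)
    return match.group(0) if match else ""


def to_rows(lines):
    """One tab-separated row per source line, in the original order."""
    return [
        "\t".join(first_keyword(line, words) for words in COLUMNS.values())
        for line in lines
    ]
```

The resulting rows can be pasted into a spreadsheet, where each keyword group lands in its own column.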


Not knowing regular expressions/grep - I perhaps mistakenly consider them to be the "holy grail" of performing automated text search/replace...and perhaps they are not always the best option and I have been trying to use a screwdriver as a hammer! In the end, I found recording a macro in jEdit to be much easier, similar to recording an action in Photoshop, which I am very comfortable with.

For D.I.Y. automation to be accessible to the masses, visual tools are needed to create "code" (regular expressions, javascript or whatever the particular language is). If Adobe can't do this, I would think that a third-party software developer could make money with a visual or wizard driven tool to demystify regular expressions, scripting etc.

In addition to spreadsheet formulas, an online client-side text tool that has helped me with other parts of this project is:

http://textmechanic.com/


Stephen Marsh
 
Sometimes designing or automating means spending a lot of time understanding the customer and their workflow: what is the data they are interested in, and how do I find it in the database? The road is not always straight. Looking at your list, it seems like something that is seeded from a database. Many of these things can be handled with a combination of JS and GREP if you need complex matching logic. GREP does work well, and the way they handle it with the pop-ups is great, but it doesn't help in learning how to think. David Blatner has a great GREP series on Lynda.com, and once you start seeing GREP patterns, your invested time soon pays off.

If you want to do it with a macro, that's fine too.
Having the data in a tab-delimited list with the keyword in a particular position does help, since it means you can look for the word after the first "-" and up to the next tab. This is what needs to be done if the keywords are random words and you don't know whether each is a single word or a phrase. The customer probably has the list in a database. On some jobs I have done, I first add some other delimiter (a random unused character) to mark a spot, do a search and replace, and follow it with a second search and replace. The tricky part is that the computer does not understand that "self" needs to be "self seal" for it to be a keyword. We could search for two words after the hyphen if you want it generic, but then you would have a problem with "lik'n'stik", which is one word.
In GREP you could look for all words after "-" and before "plain" or "window", but I don't know if there are any more exceptions. (These patterns will be a problem irrespective of whether you use a script or GREP.)
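
A small sketch of the "all words after the hyphen and before plain or window" idea, in Python for illustration. The pattern and the function name are assumptions, not a tested InDesign query, and it uses the full word "plainface" from the sample list. It relies on the sample lines having a hyphen followed by a space before the seal type.

```python
import re

# Text after a hyphen-plus-space and before "plainface" or "window".
# [^-]+? refuses to cross another hyphen, so only the field directly
# in front of the face keyword is captured.
BEFORE_FACE = re.compile(r"- ([^-]+?) (?:plainface|window)")


def seal_type(line: str) -> str:
    """Return the words between the last hyphen and the face keyword, else ''."""
    match = BEFORE_FACE.search(line)
    return match.group(1) if match else ""
```

This handles "strip seal", "self seal" and "lik'n'stik" alike because it never counts words, but any line that breaks the hyphen-then-keyword layout will return an empty string.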

I worked on a catalogue where there were different certifications (environment, low-fat, etc.), and each needed a symbol to be inserted as an anchored object. This is when I found Multi-Find/Change. There is also the built-in find/change-by-list script (FindChangeByList), but it only allows all processing to happen in one go, and JS syntax is slightly different from find/change syntax.

I hope there will be a better/easier interface for GREP in the future. Use the Adobe feature request form to push progress in the direction you want.
 