
[99% SOLVED] How to match Cyrillic characters with a regular expression

Posted: Sat Apr 06, 2019 9:14 am
by Debugger
Text Editor:
I am trying to find Russian sentences that may contain one or more dots, dashes, digits, commas, and so on. It is very important that each match stays within a single line!


[ЁёА-Яа-я„«—]

example:
Что такое стиль. Настольная книга для писательницы
Что такое стиль. Настольная книга для писательницы - 2
Что такое стиль. Настольная — писательницы -
3. Что такое стиль. Настольная книга для писательницы

Re: How to match Cyrillic characters with a regular expression

Posted: Sun Apr 07, 2019 11:30 am
by therube
(Of course I'm not following, but...)


Separate your letters out first.

regex:[ЁёА-Яа-я] (or regex:[ЁёА-я], I think)

> В цепях древней тайны.mp3
> Славься, Русь!.mp3

Then add your punctuation.
Will that work?

regex:[ЁёА-Яа-я] regex:[,„«—]+

> Славься, Русь!.mp3
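On the "I think" above: a quick way to check whether the shorter class [ЁёА-я] covers the same letters as [ЁёА-Яа-я]. This is only a sketch in Python's built-in re module, not in the search tool's own regex: syntax, and the variable names are made up for illustration.

```python
import re

explicit = re.compile(r"[ЁёА-Яа-я]")
compact = re.compile(r"[ЁёА-я]")

# Compare both classes over the Cyrillic and Cyrillic Supplement blocks.
same = all(
    bool(explicit.fullmatch(chr(cp))) == bool(compact.fullmatch(chr(cp)))
    for cp in range(0x0400, 0x0530)
)
print(same)  # True: А-Я (U+0410-U+042F) and а-я (U+0430-U+044F) are adjacent

# Both file names from the post contain at least one such letter.
print(bool(compact.search("В цепях древней тайны.mp3")))
print(bool(compact.search("Славься, Русь!.mp3")))
```

Ё and ё still have to be listed separately, because U+0401 and U+0451 lie outside the А-я range.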

Re: How to match Cyrillic characters with a regular expression

Posted: Sun Apr 07, 2019 1:59 pm
by Debugger
The regular expression is WRONG.
It matches individual characters anywhere instead of matching a whole line.
It finds text other than what it should. It should not match ordinary text just because one of the characters occurs somewhere in it; it should match only a single line that contains at least some Russian text.

The regex: operator is not needed.


Example:

Line1: Russian text, with or without other characters
Line2: Russian text
Line3: Polish text
(Separator)Line4:===
Line5: Russian text, with or without other characters
Line6: Russian text
Line7: Polish text
(Separator)Line8:===
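One possible reading of the layout above, sketched in Python: keep a line only if it contains a Cyrillic letter and no Latin letter. The Polish sample line, the function name and the two ranges are assumptions made for illustration (U+0400-U+04FF for Cyrillic, A-Z/a-z plus roughly U+00C0-U+024F for Latin letters with diacritics), not something taken from the post.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
# Basic Latin letters plus the accented ranges that cover Polish ą, ł, ż and similar.
LATIN = re.compile(r"[A-Za-z\u00C0-\u024F]")

def russian_only_lines(text):
    """Return lines that contain Cyrillic letters and no Latin letters."""
    return [
        line
        for line in text.splitlines()
        if CYRILLIC.search(line) and not LATIN.search(line)
    ]

sample = "\n".join([
    "Что такое стиль. Настольная книга для писательницы - 2",
    "Что такое стиль",
    "Co to jest styl",  # made-up Polish line
    "===",
])
print(russian_only_lines(sample))
```

With this sample, only the two Russian lines are kept; the Polish line and the separator are dropped.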

Re: How to match Cyrillic characters with a regular expression

Posted: Mon Apr 08, 2019 2:35 am
by void

Re: How to match Cyrillic characters with a regular expression

Posted: Mon Apr 08, 2019 6:37 am
by Debugger
void - Well, yes, but I cannot find anything on this subject: the regular expression must match a line only if it consists strictly of the defined (Russian) characters. The line must not contain mixed text such as English, Polish, German, or other characters of that kind.

.+[ЁёА-Яа-я.,„”"«—0-9)(]\n

Re: How to match Cyrillic characters with a regular expression

Posted: Mon Apr 08, 2019 7:57 am
by void
Requires PCRE in multiline mode:

^([\p{Cyrillic}]+[\-\.—0-9]+[\p{Cyrillic}\-\.—0-9]*|[\-\.—0-9]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.—0-9]*)$

This also requires at least one Cyrillic character, which I assume you want; otherwise it would match a long string of digits, dashes, or dots.

^ = match start of string (or line, in multiline mode)
[] = match character in a set
\p{Cyrillic} = match a Cyrillic character
\- = match a literal -
\. = match a literal .
+ = match previous element one or more times.
* = match previous element zero or more times.
$ = match end of string (or line, in multiline mode)
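For anyone who wants to try the pattern above outside PCRE, here is a rough Python sketch. Python's built-in re module does not support \p{Cyrillic}, so the range U+0400-U+04FF is substituted as an approximation (that is an assumption, the script covers more blocks than this one); the third-party regex package should accept \p{Cyrillic} directly. The sample lines and names are made up for illustration.

```python
import re

CYR = r"\u0400-\u04FF"   # stand-in for \p{Cyrillic}
OTHER = r"\-.—0-9"       # hyphen, dot, em dash, digits

pattern = re.compile(
    rf"^(?:[{CYR}]+[{OTHER}]+[{CYR}{OTHER}]*"
    rf"|[{OTHER}]+[{CYR}]+[{CYR}{OTHER}]*)$",
    re.MULTILINE,  # ^ and $ match at line boundaries
)

text = "\n".join([
    "Глава1.2",          # Cyrillic, then digits and a dot
    "3.Что",             # digit and dot, then Cyrillic
    "Что такое стиль.",  # contains spaces, which are not in either set
    "12345",             # no Cyrillic letter
    "Привет",            # no digit, dot or dash
])
print(pattern.findall(text))  # ['Глава1.2', '3.Что']
```

As written, the pattern needs at least one Cyrillic letter and at least one digit, dot or dash, and it allows no spaces, so whole sentences like the ones in the first post will not match until the space is added to the sets.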

Re: How to match Cyrillic characters with a regular expression

Posted: Mon Apr 08, 2019 9:06 am
by Debugger
Unfortunately, I do not use PCRE, but I switched to the Onigmo engine and it will work.

I have modified the regex:
^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*|[\-\.\,\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*)$

but the regex is still wrong.

Characters the text may include:
\p{Cyrillic}
!
!!
!!!
!!!!
?
??
???
… (unicode)
— (unicode)
-
.
..
...
,
0-9
(
)
„ (unicode)
” (unicode)
"
\s (space)
\
/
\x{200B} or really maybe .\x{200B}
*
#
@
&
:
;
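A possible way to avoid escaping every character by hand, sketched in Python rather than Onigmo: build the character class from the list above with re.escape, and use a lookahead to demand at least one Cyrillic letter. Everything here is an assumption for illustration: the names, the U+0400-U+04FF range standing in for \p{Cyrillic}, and a plain space in place of \s (\s also matches a newline, which may let a match run across lines). Note also that inside brackets \.x{200B} seems to read as a literal dot followed by the literal characters x, {, 2, 0, B and }, not as the zero-width space.

```python
import re

# Punctuation and symbols copied from the list in the post; the space stands in for \s.
ALLOWED = "!?…—-.,()„”\"\\/*#@&:; \u200b"
CYRILLIC = "\u0400-\u04ff"  # assumption: approximates \p{Cyrillic}

# re.escape takes care of the backslash, hyphen and other special characters.
allowed_class = CYRILLIC + "0-9" + re.escape(ALLOWED)

# The lookahead requires one Cyrillic letter; the class restricts every character.
LINE_RE = re.compile(rf"(?=.*[{CYRILLIC}])[{allowed_class}]+")

def cyrillic_lines(text):
    """Return lines built only from allowed characters, with at least one Cyrillic letter."""
    return [line for line in text.splitlines() if LINE_RE.fullmatch(line)]

sample = "\n".join([
    "Что такое стиль. Настольная книга для писательницы - 2",
    "3. Что такое стиль. Настольная книга для писательницы",
    "Polski tekst, 123",
    "Что такое style",
    "12345 ...",
    "===",
])
print(cyrillic_lines(sample))
```

With this sample, only the first two lines are returned. Checking each line separately with fullmatch keeps the "one line only" requirement without relying on multiline mode.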