Pseudocode is ready, but I don't know how to implement it. Splitting text into paragraphs

@DmitryLiebe · Registered: 17.03.2021

The code is below, with comments. I hope it is described in enough detail.
I need to write a program for the assignment "Splitting continuous text into paragraphs".

Python
# Continuous text that has to be split into meaningful paragraphs
text = 'Lorem ipsum dolor sit amet. Nam non diam porttitor. Phasellus non quam ultrices tempus mi. Sed vel ex ex. Nulla ut maximus justo. Vivamus luctus eget odio non pellentesque. Fusce vel diam sagittis.'
# Key words set in advance by the teacher (programmer). Each one marks the start of a paragraph.
list_of_key_words = ['Phasellus', 'Nulla']

# The student's answer. Here the student (user) places paragraph breaks (i.e. inserts \n) wherever they like.
student_answer = 'Lorem ipsum dolor sit amet. \nNam non diam porttitor. Phasellus non quam ultrices tempus mi. Sed vel ex ex. \nNulla ut maximus justo.\n Vivamus luctus eget odio non pellentesque. Fusce vel diam sagittis.'
# Empty list for the student's key words
list_of_student_words = []

def check():
    # Loop: if a word in student_answer follows \n, add it to list_of_student_words, e.g. with append()
    # For this example the result should be list_of_student_words = ['Nam', 'Nulla', 'Vivamus']
    # Compare list_of_key_words with list_of_student_words
    # If the student made a mistake, i.e. list_of_key_words and list_of_student_words differ, insert '!!!' at the place of the mistake.
    # In our example, !!! must be placed before the words Nam and Vivamus.
    # Then print the text to the console, with or without !!! depending on whether a mistake was made. End.
    pass  # placeholder: the steps above are still pseudocode

How do I write "Loop: if a word in student_answer follows \n, add it to list_of_student_words, e.g. with append()" using, for example, re.match()?
I would be very grateful for any hint.

@iSmokeJC · 23.03.2021, 19:45

Python
import re

list_of_key_words = ['Phasellus', 'Nulla']
student_answer = 'Lorem ipsum dolor sit amet. \nNam non diam porttitor. Phasellus non quam ultrices tempus mi. Sed ' \
                 'vel ex ex. \nNulla ut maximus justo.\n Vivamus luctus eget odio non pellentesque. Fusce vel diam ' \
                 'sagittis. '

# Collect the first word after every \n (optionally preceded by one whitespace character)
list_of_student_words = re.findall(r'\n\s?(\w+?)\W', student_answer)
for word in list_of_student_words:
    # A paragraph that does not start with a key word is a mistake
    if word not in list_of_key_words:
        # Note: re.sub marks every standalone occurrence of the word in the text
        student_answer = re.sub(rf'\b({word})\b', r'!!!\1', student_answer)
print(student_answer)

Bash
Lorem ipsum dolor sit amet. 
!!!Nam non diam porttitor. Phasellus non quam ultrices tempus mi. Sed vel ex ex. 
Nulla ut maximus justo.
 !!!Vivamus luctus eget odio non pellentesque. Fusce vel diam sagittis.

@DmitryLiebe · 24.03.2021, 09:47 **[OP]**

Thank you so much, iSmokeJC, everything works. You really helped me out.

@iSmokeJC · 24.03.2021, 09:55

DmitryLiebe, no problem, any time.
P.S. In the first regex you should probably change \s? to \W?, that is more correct.
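
A small sketch of what that swap changes, assuming a student's paragraph may open with a quotation mark instead of a space. The `sample` string is made up for illustration and is not from the thread:

```python
import re

# Hypothetical sample: one paragraph opens with a quote, one with a space.
sample = 'One. \n"Two" three. \n Four five.'

# \s? tolerates only a single whitespace character between \n and the word.
with_space = re.findall(r'\n\s?(\w+?)\W', sample)
# \W? tolerates any single non-word character there (space, quote, bracket).
with_nonword = re.findall(r'\n\W?(\w+?)\W', sample)

print(with_space)    # ['Four']
print(with_nonword)  # ['Two', 'Four']
```

With `\s?` the paragraph that starts with a quote is silently skipped, so it would never be checked against the key words.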

@DmitryLiebe · 01.04.2021, 20:18 **[OP]**

Hello again, iSmokeJC and dear programmers!
I have run into a bug, and I think you can help.

Here is the original continuous text.

Code
English has been established as having a simpler and often more logical structure. Gestri et al. [2011], for example, contend that English is by nature a more synthetic language. Burgess [2001], in his seminal work on the subject, disputed some of Gestri’s observations. Smith and Jones [2010] compared English and Spanish technical writing and found that English used about 30% less words to express the same concept. They confirmed previous research on the subject by concluding that English is inherently simpler and more concise. His findings were essentially the same as Smith’s and Jones’, but deviated in the percentages – 40% rather than 30%. Our work is a direct continuation of the work begun by Smith and Ughi, but with two essential differences

The key words are: list_of_key_words = ['Smith and Jones', 'Our work is']

I think the algorithm reads only a single word and cannot handle a phrase. What can be done in this case?
Here is the code I use for testing. You can see that it flags an error before Smith and Jones but not before Ourworkis.

Python
import re

# Key words set in advance by the teacher (programmer). Each one marks the start of a paragraph.
list_of_key_words = ['Smith and Jones', 'Ourworkis']

# The student's answer. Here the student (user) places paragraph breaks (i.e. inserts \n) wherever they like.
student_answer = 'English has been established as having a simpler and often more logical ' \
                 'structure. Gestri et al. [2011], for example, contend that English is by ' \
                 'nature a more synthetic language. \nBurgess [2001], in his seminal work on ' \
                 'the subject, disputed some of Gestri’s observations. \nSmith and Jones [2010]' \
                 ' compared English and Spanish technical writing and \nfound that English ' \
                 'used about 30% less words to express the same concept. They confirmed ' \
                 'previous research on the subject by concluding that English is inherently' \
                 ' simpler and more concise. His findings were essentially the same as Smith’s' \
                 ' and Jones’, but deviated in the percentages – 40% rather' \
                 ' than 30%. \nOurworkis a direct continuation of the work begun ' \
                 'by Smith and Ughi, but with two essential differences'

# List of the student's key words: the first word after each \n in student_answer
list_of_student_words = re.findall(r'\n\W?(\w+?)\W', student_answer)

for word in list_of_student_words:  # iterate over the student's key words
    if word not in list_of_key_words:  # the word is not among the teacher's key words
        # add !!! to the student's answer where the paragraph break is wrong
        student_answer = re.sub(rf'\b({word})\b', r'!!!\1', student_answer)
print(student_answer)

Bash
English has been established as having a simpler and often more logical structure. Gestri et al. [2011], for example, contend that English is by nature a more synthetic language. 
!!!Burgess [2001], in his seminal work on the subject, disputed some of Gestri’s observations. 
!!!Smith and Jones [2010] compared English and Spanish technical writing and 
!!!found that English used about 30% less words to express the same concept. They confirmed previous research on the subject by concluding that English is inherently simpler and more concise. His findings were essentially the same as !!!Smith’s and Jones’, but deviated in the percentages – 40% rather than 30%. 
Ourworkis a direct continuation of the work begun by !!!Smith and Ughi, but with two essential differences
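
The stray `!!!Smith’s` and `!!!Smith and Ughi` in the middle of the paragraphs above appear because `re.sub` replaces every match of the word, not only the one right after `\n`. A minimal sketch of one way to limit the marker to paragraph starts, assuming a lookbehind for `\n` is acceptable here; the `answer` string and the variable names are made up for illustration:

```python
import re

# Hypothetical short answer: "Smith" opens a paragraph and also occurs later.
answer = 'Intro. \nSmith wrote. Later Smith agreed.'
word = 'Smith'

# Pattern from the thread: marks every standalone occurrence of the word.
everywhere = re.sub(rf'\b({re.escape(word)})\b', r'!!!\1', answer)

# Lookbehind variant: marks the word only when it directly follows \n
# (with at most one whitespace character in between, which is kept).
only_start = re.sub(rf'(?<=\n)(\s?)({re.escape(word)})\b', r'\1!!!\2', answer)

print(repr(everywhere))  # 'Intro. \n!!!Smith wrote. Later !!!Smith agreed.'
print(repr(only_start))  # 'Intro. \n!!!Smith wrote. Later Smith agreed.'
```

This does not solve the phrase problem raised in the question; it only keeps the marker from landing in the middle of a paragraph.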

@iSmokeJC · 01.04.2021, 20:40

Quote from DmitryLiebe

the algorithm reads only a single word and cannot handle a phrase

Of course it can't. How would it guess whether the key is a word or a phrase? And if it is a phrase, how many words long?

Quote from DmitryLiebe

What can be done in this case?

Write a parser
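
One possible reading of "write a parser", as a rough sketch rather than the approach anyone in the thread actually used: split the answer on `\n` and test whether each paragraph starts with one of the key phrases, so the length of the phrase no longer matters. The function name `mark_wrong_breaks` and the shortened sample text are assumptions for illustration:

```python
def mark_wrong_breaks(student_answer, key_phrases, marker='!!!'):
    """Prefix every student paragraph that does not start with a key phrase."""
    first, *rest = student_answer.split('\n')
    checked = [first]  # the opening paragraph has no break in front of it
    for paragraph in rest:
        stripped = paragraph.lstrip()
        if stripped.startswith(tuple(key_phrases)):
            checked.append(paragraph)  # correct break, left untouched
        else:
            # keep any leading whitespace, then insert the marker
            indent = paragraph[:len(paragraph) - len(stripped)]
            checked.append(f'{indent}{marker}{stripped}')
    return '\n'.join(checked)


keys = ['Smith and Jones', 'Our work is']
answer = ('A b. \nBurgess did. \nSmith and Jones [2010] compared '
          '\nfound that. Smith agreed. \nOur work is new')
result = mark_wrong_breaks(answer, keys)
print(result)
```

Only paragraph starts are inspected, so a key phrase that occurs in the middle of a paragraph is never marked. Like the code in the thread, this sketch does not report breaks the student failed to place.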

@Arsegg · 01.04.2021, 21:03

Off topic:

Quote from iSmokeJC

Write a parser

I think this will take heavier artillery ))

@DmitryLiebe · 01.04.2021, 21:28 **[OP]**

Yes, that makes sense, thank you.

@DmitryLiebe · 02.04.2021, 08:51 **[OP]**

Hello again!

Here are the key words

Python
list_of_key_words = ['5)Smith', '9)Our', '10)(1)', '12)(2)', '13)To', '15)The', '17)Our', '19)A', '24)We']

Here is the text with paragraph breaks inserted. Some are placed correctly and some are not.

Bash
1)A review of the literature in this field clearly shows that the majority of authors believe that there is an inherent difference between English and Latinate languages: English has been established as having a simpler and often more logical structure. 2)Gestri et al. [2011], for example, contend that English is by nature a more synthetic language. 3)Burgess [2001], in his seminal work on the subject, disputed some of Gestri’s observations. 4)Specifically, Burgess called into question the latter’s GAS index, and eventually reformulated it into the SMOKEWARE index [2004]. 
5)Smith and Jones [2010] compared English and Spanish technical writing and found that English used about 30% less words to express the same concept. 6)They confirmed previous research on the subject by concluding that English is inherently simpler and more concise. 7)A similar study was made by Ughi [2014] who reported on an interesting statistical analysis of typical phrases in the two languages. 8)His findings were essentially the same as Smith’s and Jones’, but deviated in the percentages – 40% rather than 30%. 9)Our work is a direct continuation of the work begun by Smith and Ughi, but with two essential differences: 
10)(1) The works quoted above make the unwarranted assumption that English has always been a simple language. 
11)We attempt to prove otherwise. 
12)(2) We go a step further than previous works, in that we establish a correlation between the simplicity of a language and the ease of life in the nation where that language is spoken. 13)To prove these two points we developed a Verbosity Index, derived from 1000 recent scientific articles written in English, and the same number written in Italian (for full details see Sect. 4). 14)The Verbosity Index was computed on the basis of the difficulty in comprehending an article, primarily in terms of sentence length – the higher the VI, the more difficult the understanding. 15)The same process was then repeated for articles written 50 years ago. 16)The results show that the English of 50 years ago has a comparable Verbosity Index to current French, Italian and Spanish, but is much higher than current English.
 17)Our findings demonstrate that English has become increasingly less verbose over the last 20 years. 18)We believe that this trend can in part be attributed to such organizations as the Campaign for Plain English and Siegel and Gale (a company specialized in reducing the length and complexity of government documents). 19)A concerted effort has been made to make English simpler. 20)This has been done for two main reasons. 21)Firstly, to make the written language more accessible to a wider variety of people (but not primarily those whose first language is not English). 
22)Secondly, for economic reasons it makes much more sense to have a document of one page rather than three. 23)Not only is time saved in writing and reading the document, it also costs less to produce, and takes much less time to process, especially in the case of such documents as passport forms and tax declaration returns. 
24)We believe that our work has significant implications for all those countries whose citizens are habitually buried in bureaucratic procedures and forms.

Here is what the algorithm produces. It puts !!! before '5)Smith', even though that break is correct. As a result, the algorithm simply inserts !!! before every paragraph, correct or not. How can this be fixed?

Bash
1)A review of the literature in this field clearly shows that the majority of authors believe that there is an inherent difference between English and Latinate languages: English has been established as having a simpler and often more logical structure. 2)Gestri et al. [2011], for example, contend that English is by nature a more synthetic language. 3)Burgess [2001], in his seminal work on the subject, disputed some of Gestri’s observations. 4)Specifically, Burgess called into question the latter’s GAS index, and eventually reformulated it into the SMOKEWARE index [2004]. 
!!!5)Smith and Jones [2010] compared English and Spanish technical writing and found that English used about 30% less words to express the same concept. 6)They confirmed previous research on the subject by concluding that English is inherently simpler and more concise. 7)A similar study was made by Ughi [2014] who reported on an interesting statistical analysis of typical phrases in the two languages. 8)His findings were essentially the same as Smith’s and Jones’, but deviated in the percentages – 40% rather than 30%. 9)Our work is a direct continuation of the work begun by Smith and Ughi, but with two essential differences: 
!!!10)(1) The works quoted above make the unwarranted assumption that English has always been a simple language. 
!!!11)We attempt to prove otherwise. 
!!!12)(2) We go a step further than previous works, in that we establish a correlation between the simplicity of a language and the ease of life in the nation where that language is spoken. 13)To prove these two points we developed a Verbosity Index, derived from 1000 recent scientific articles written in English, and the same number written in Italian (for full details see Sect. 4). 14)The Verbosity Index was computed on the basis of the difficulty in comprehending an article, primarily in terms of sentence length – the higher the VI, the more difficult the understanding. 15)The same process was then repeated for articles written 50 years ago. 16)The results show that the English of 50 years ago has a comparable Verbosity Index to current French, Italian and Spanish, but is much higher than current English.
 !!!17)Our findings demonstrate that English has become increasingly less verbose over the last 20 years. 18)We believe that this trend can in part be attributed to such organizations as the Campaign for Plain English and Siegel and Gale (a company specialized in reducing the length and complexity of government documents). 19)A concerted effort has been made to make English simpler. 20)This has been done for two main reasons. 21)Firstly, to make the written language more accessible to a wider variety of people (but not primarily those whose first language is not English). 
!!!22)Secondly, for economic reasons it makes much more sense to have a document of one page rather than three. 23)Not only is time saved in writing and reading the document, it also costs less to produce, and takes much less time to process, especially in the case of such documents as passport forms and tax declaration returns. 
!!!24)We believe that our work has significant implications for all those countries whose citizens are habitually buried in bureaucratic procedures and forms.

@iSmokeJC · 02.04.2021, 09:40

Make the regex capture parentheses as well. Right now it captures only letters, digits, and the underscore.

Added 1 minute later
r'\n\s?([)(\w]+?)\W'

@DmitryLiebe · 02.04.2021, 10:32 **[OP]**

Thank you, iSmokeJC. But it still doesn't work. It still puts !!! everywhere there is a paragraph break.
I tried it without 5) before Smith, and in that case it works. Could you give me a hint, please?

@iSmokeJC · 02.04.2021, 10:34

r'\n\s?([\)\(\w]+?)\W'

@DmitryLiebe · 02.04.2021, 10:46 **[OP]**

Somehow it still doesn't work
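
A likely reason the patterns suggested so far still fail, shown as a small sketch: the capture group is lazy and is followed by `\W`, and `)` is itself a non-word character, so the match stops at the first parenthesis and captures only the number. The `sample` string is a shortened, made-up stand-in for the numbered text:

```python
import re

# Hypothetical shortened sample with numbered paragraph labels.
sample = 'intro. \n5)Smith and Jones. \n 17)Our findings'

# Lazy group followed by \W: stops as soon as ")" can serve as the \W.
stops_at_paren = re.findall(r'\n\s?([)(\w]+?)\W', sample)
# Lazy group followed by \s: has to run on until the first whitespace.
stops_at_space = re.findall(r'\n\W?([()\w]+?)\s', sample)

print(stops_at_paren)  # ['5', '17']
print(stops_at_space)  # ['5)Smith', '17)Our']
```

Since '5' and '17' are not in list_of_key_words, every paragraph gets flagged, which matches the behaviour described above.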

@iSmokeJC · 02.04.2021, 11:19

DmitryLiebe, this is starting to look like bolting on crutches mid-jump

Python
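import re

# Capture the label after each \n up to the first whitespace, parentheses included
list_of_student_words = re.findall(r'\n\W?([()\w]+?)\s', student_answer)
for word in list_of_student_words:
    if word not in list_of_key_words:
        # Escape parentheses so they are matched literally in the pattern below
        word = re.sub(r'([)(])', r'\\\1', word)
        student_answer = re.sub(rf'\b({word})\b', r'!!!\1', student_answer)
print(student_answer)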

Added 31 seconds later
Because parentheses are, after all, metacharacters
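
To illustrate the point about metacharacters: a label such as `11)We` breaks the pattern when it is pasted in unescaped. The sketch below swaps in the standard-library `re.escape` for the manual substitution, which for these labels should give the same result; the one-line sample text is made up:

```python
import re

word = '11)We'  # a paragraph label from the numbered text

# Pasted as-is into a group, the stray ")" makes the pattern invalid.
try:
    re.compile(rf'\b({word})\b')
    compiled = True
except re.error:
    compiled = False

# re.escape puts a backslash in front of regex metacharacters.
escaped = re.escape(word)
marked = re.sub(rf'\b({escaped})\b', r'!!!\1',
                'Intro. \n11)We attempt to prove otherwise.')

print(compiled)      # False
print(escaped)       # 11\)We
print(repr(marked))  # 'Intro. \n!!!11)We attempt to prove otherwise.'
```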

@DmitryLiebe · 02.04.2021, 12:34 **[OP]**

Thank you for not leaving my question unanswered. I wish you all the best.
