Problem combining files into one project. Most likely some additional code (a so-called Main) is needed

@vasyabylba · Registered: 15.04.2019


I need help combining these files into one project. Most likely some additional code (a so-called Main) is needed to make the idea work. The goal: determine whether two texts were written by the same author.

Python
import numpy as np
import similarity_measures as sm
from numba import jit
 
 
def get_score(x,y,imposters,sim):
    '''Return the fraction of 100 random feature subsets on which no z in imposters is more similar to x than y is'''
    # No @jit here: the expensive loop already lives in the jitted sim function
    n_trials = 100
    half = round(len(x) / 2)
    score = 0
    for _ in range(n_trials):
        # Draw a random half of the feature indices without replacement
        ran_el = np.sort(np.random.choice(len(x), half, replace = False))
        c_x = x[ran_el]
        c_y = y[ran_el]
        sim_xy = sim(c_x,c_y)

        for yi in imposters:
            if sim(c_x, yi[ran_el]) > sim_xy:
                break        # An imposter beats y on this feature subset
        else:
            score += 1       # No imposter was more similar than y

    return score / n_trials
   
 
def imposters(y, universe, m = 125, n = 25):
    '''Return n imposters. Compute the m most similar files in universe and randomly select n of them'''
    minmax_vec = np.array([sm.cminmax(t,y) for t in universe])
    pot_imp_ind = minmax_vec.argsort()[:-(m+1):-1] # Indices of the m most similar entries
    n = min(n, len(pot_imp_ind))                   # The universe may hold fewer than n texts
    chosen = np.random.choice(pot_imp_ind, n, replace = False)   # Draw without replacement so that no imposter repeats
    chosen = chosen[np.argsort(minmax_vec[chosen])[::-1]]        # Most similar first, hoping for a quicker break in get_score
    return [universe[k] for k in chosen]
    
 
def blog_same_author(x,y,text_corpus, threshold, nr_imposters = 25):
    '''Return true if x and y are by the same author according to the algorithm in the paper'''
    imposters_y = imposters(y, text_corpus.Y, n = nr_imposters)
    score_xy = get_score(x,y,imposters_y, sm.cminmax)
    
    imposters_x = imposters(x, text_corpus.X, n = nr_imposters)
    score_yx = get_score(y,x,imposters_x, sm.cminmax)
    
    if (score_xy + score_yx ) /2.0 > threshold:
        print("true")
        return True
    else:
        print("false")
        return False
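
The scoring idea in `get_score` can be tried on toy data without numpy or numba. The following is a minimal sketch, not the poster's code: `impostor_score`, the plain-Python `minmax` and the toy vectors are illustrative names and values, and the vectors are assumed to be non-negative and not all zero.

```python
import random


def minmax(x, y):
    """Min-max similarity of two non-negative vectors (assumed not both all-zero)."""
    return sum(map(min, x, y)) / sum(map(max, x, y))


def impostor_score(x, y, imposters, rng, trials=100):
    """Fraction of random half-size feature subsets on which no imposter beats y."""
    half = round(len(x) / 2)
    wins = 0
    for _ in range(trials):
        idx = rng.sample(range(len(x)), half)    # random feature subset

        def sub(v):
            return [v[i] for i in idx]

        sim_xy = minmax(sub(x), sub(y))
        # y "wins" the trial if no imposter is strictly more similar to x
        if all(minmax(sub(x), sub(z)) <= sim_xy for z in imposters):
            wins += 1
    return wins / trials


rng = random.Random(0)                 # fixed seed only for repeatability
x = [1, 2, 3, 4]
score_same = impostor_score(x, [1, 2, 3, 4], [[4, 4, 1, 1], [2, 2, 2, 2]], rng)
score_other = impostor_score(x, [4, 3, 2, 1], [[1, 2, 3, 5]], rng)
```

In the first call y equals x, so no imposter can be more similar; in the second call the single imposter is closer to x than y on every two-feature subset, so the score drops to zero.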

Python
import re
import string
import numpy as np
import pandas as pd
import os
import random
import xml.etree.ElementTree as ET
 
 
 
def get_n_words(text, number_of_words, name = None, report = False):
    '''Extract the first and the last n words from a string.
    long tells whether the text is long enough to yield two non-overlapping subsets of length number_of_words'''
    long = True
    
    words = text.split(' ')
    if len(words) < number_of_words*2:   # Doubled so that the first and last parts share no words
        if report:
            print('WARNING: ' + str(name) + ' has fewer than ' + str(number_of_words*2) + ' words! ' + str(len(words)))
        long = False
        
    # The first n words:
    first_words = words[:number_of_words]
    start = ' '.join(first_words)
    # The last n words:
    last_words = words[-number_of_words:]
    end = ' '.join(last_words)
    return (start, end, long)
   
 
 
#### Test
# Short test texts
 
text1 = 'Sprengstoffexperten des und Landeskriminalamtes haben den Sprengsatz untersucht. Sie gehen davon aus, dass die Ladung aus den Inhaltsstoffen sogenannter Polenböller gebaut und offenbar per Funk gezündet wurde. Die Beamten stufen den Vorfall als besorgniserregend ein, da man eine derartige Sprengkraft bei vergleichbaren Fällen noch nicht gesehen habe. Eine Spezialeinheit der Polizei zur Aufklärung extremistisch orientierter Straftaten ermittelt. Neben der Spur nach Berlin geht sie auch der These nach, dass womöglich eine bisher unbekannte Gruppe in dem Haus, das zum Abriss vorgesehen ist, einen Sprengversuch unternommen hat.'
text2 = 'Und das, obwohl die und Sache mit der Krim eigentlich klar ist Die russische Annexion der ukrainischen Halbinsel im Jahr 2014 war völkerrechtswidrig. Das sagt die Bundesregierung, das sagt die EU das sagt sogar die Linkspartei, schon 2014 festgehalten per Parteitagsbeschluss. Und trotzdem wollen manche Linke am liebsten nicht darüber sprechen.'
text3 = 'US-Präsident Donald Trump schließt eine militärische Reaktion auf die Krise in Venezuela nicht aus. Es gebe mehrere Möglichkeiten, "darunter eine militärische Option, falls nötig", sagte Trump am Freitag in New Jersey. Konkrete Pläne für ein militärisches Eingreifen in Venezuela gibt es aber offenbar nicht. Ein Pentagon-Sprecher erklärte, zum jetzigen Zeitpunkt gebe es keine entsprechenden Anweisungen aus dem Weißen Haus.'
text4 = 'Als am Freitagmorgen vergangener und Woche die Eilmeldungen zum überraschenden Fraktionswechsel der niedersächsischen Grünen-Abgeordneten Elke Twesten über die Nachrichtenagenturen liefen, wusste die Bundeskanzlerin längst Bescheid. Angela Merkel (CDU) hat einem Medienbericht zufolge vorab von der Wechsel der niedersächsischen Landtagsabgeordneten von den Grünen zur CDU erfahren. Das gehe aus einem Schreiben von Kanzleramtsstaatsminister Helge Braun an die Geschäftsführerin der SPD-Bundestagsfraktion, Christine Lambrecht hervor, berichteten die Zeitungen des Redaktionsnetzwerks Deutschland (RND). Demnach informierte der niedersächsische CDU-Landesvorsitzende Bernd Althusmann die Kanzlerin am Vortag des Wechsels telefonisch.'
 
vector_with_text = [text1, text2, text3, text4]
 
 
b1 = 'This is a 12385 sample'
b2 = 'this is     ano>>ther $example!!!'
vector_with_text = [b1, b2]
 
 
 
# Takes a path, opens all txt files in that directory and returns a vector with the first n
# words and a vector with the last n words of every text, plus a vector with the author names,
# given that all files are named author_title.txt
def open_text(path, number_of_words):
    vector_with_text = []
    vector_text_end = []
    authors = []
    for t in os.listdir(path):
        # Skip files that are not txt files:
        if not t.endswith('.txt'):
            continue
            
        aut = re.match(r'(\w*)_', t).group(1)
        authors.append(aut)
            
        name = os.path.join(path, t)
        with open(name, 'r') as file:   # Closes the file even if reading fails
            text = file.read()
        
        words = get_n_words(text, number_of_words, t)
        vector_with_text.append(words[0])
        vector_text_end.append(words[1])
 
    return vector_with_text, vector_text_end, authors
    
 
def open_xml(path, max_of_files = 5000, n_of_words = 500):
    ''' Takes a path and opens at most max_of_files xml files in that directory. From each file it extracts the
    first and last n_of_words words and gives the author ID, if the filename is formatted as authorid.[...].xml
    Lines that start with '<' (the XML tags) are skipped instead of being parsed'''
    text_start = []
    text_end = []
    author_id = []
    
    number_files = 0
    for f_name in os.listdir(path):
        # Just xml-Files:
        #if f_name[-4:] != '.xml':
        if not f_name.endswith('xml'):
            continue        
        
        if number_files >= max_of_files:
            break
                       
        file_path = path + '/' + f_name
        
        try:
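            lines = []
            with open(file_path, 'r') as file:
                for line in file:
                    if line.startswith('<'):   # Skip the XML tag lines
                        continue
                    stripped = line.strip()
                    if stripped:               # Ignore blank lines
                        lines.append(stripped)
            text = ' '.join(lines)   # Join with spaces so that words at line breaks do not merge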
                
            start, end, long =  get_n_words(text, n_of_words, f_name, False)    # Gives start, end, long? (i.e. boolean if the file has fewer than n words)
 
            # If the document has too few words, skip this one
            if not long:
                continue
 
            aut = re.match(r'(\d+).', f_name).group(1)
            author_id.append(aut)
 
            text_start.append(start)
            text_end.append(end)
 
            number_files += 1
            #print(f_name)
            
        except UnicodeDecodeError:
            #print('Encoding Error: ', f_name)
            pass
        except ValueError:
            print('Parse: file: ', f_name)
        
    return text_start, text_end, author_id
 
 
 
 
 
    
    
    
def old_open_xml(path, max_of_files = 5000, n_of_words = 500, remove = True):
    text_start = []
    text_end = []
    author_id = []
    
    number_files = 0
    for f_name in os.listdir(path):
        # Just xml-Files:
        #if f_name[-4:] != '.xml':
        if not f_name.endswith('xml'):
            continue        
        
        if number_files >= max_of_files:
            break
                       
        file = path + '/' + f_name
        
        # Remove &-Symbols (otherwise the xml Parser throws an error)
        if remove == True:
            f = open(file, 'r')
            text = f.read()
            f.close()
            text = text.translate({ord('&'): None})
            f_o = open(file, 'w')
            f_o.write(text)
            f_o.close()
                       
        with open(file, 'r') as xml_file:   # Roundabout way to get UTF-8 decoding; it turned out not to be enough
            tree = ET.parse(xml_file)
#      tree = ET.parse(file)
        root = tree.getroot()
        
        text = ''
        for p in root.findall('post'): # Find all <post> entries
            text += p.text.strip()    # Build one long string without any newline characters and white spaces
        
        start, end, long =  get_n_words(text, n_of_words, f_name, False)    # Gives start, end, long? (i.e. boolean if the file has fewer than n words)
        
        # If the document has to few words, skip this one
        if long == False:
            continue
            
        aut = re.match(r'(\d+).', f_name).group(1)
        author_id.append(aut)
            
        text_start.append(start)
        text_end.append(end)
        
        number_files += 1
        
    return text_start, text_end, author_id
 
 
 
# Take a text and get all the n-grams with their freq as a dict
from collections import Counter

def get_freq(text, n = 4):
    '''Take a string and get all the n-grams with their freq as a dict'''
    text1 = text.lower()
    text1 = text1.translate({ord(char): None for char in string.punctuation + '0123456789'}) # Remove punctuation and digits
    words = text1.split()     # Split on any run of whitespace, so no empty words appear
    
    # Count all n-grams; words of at most n characters count as a single gram
    freq = Counter()
    for w in words:
        if len(w) < (n+1):
            freq[w] += 1
        else:
            for k in range(len(w)-n + 1):
                freq[w[k:k+n]] += 1
    return dict(freq)
 
def build_matrix(vector_with_text, k_highest_freq = 100000):
    ''' Take an array of texts and return a matrix (as a numpy array) with the tf-idf values for all n-grams in the texts. Each row represents a text'''

    # Rows are the texts, columns are the distinct n-grams. Each entry is the
    # tf-idf value of that n-gram with respect to that text.

    # IMPORTANT: total_text must keep the same order as vector_with_text
    total_text = [get_freq(text) for text in vector_with_text] # One frequency dict per text

    # total_freq maps every n-gram that occurs in any text to its overall count
    total_freq = {}
    for cfreq in total_text:
        for gram, count in cfreq.items():
            total_freq[gram] = total_freq.get(gram, 0) + count

    # Keep the n-grams with the k highest frequencies over all texts.
    # If there are more n-grams than k_highest_freq, delete the least frequent ones (ties are kept)
    if len(total_freq) > k_highest_freq:
        lowest_freq = sorted(total_freq.values(), reverse = True)[k_highest_freq]
        for k in list(total_freq.keys()):
            if total_freq[k] < lowest_freq:
                del total_freq[k]

    # data_m holds the absolute frequency of every n-gram (columns) in every text (rows)
    grams = list(total_freq.keys())
    data_m = np.zeros((len(vector_with_text), len(grams)))
    for row, freq_t in enumerate(total_text):
        for col, gram in enumerate(grams):
            data_m[row, col] = freq_t.get(gram, 0.0)   # 0 if the n-gram does not occur in this text

    # Inverse document frequency of every n-gram
    idf_m = np.log(data_m.shape[0] / (data_m != 0).sum(axis = 0))

    # Term frequency: normalise every row by its total count
    data_m = data_m / np.sum(data_m, axis = 1)[:,None]

    # TF-IDF
    data_m = np.multiply(data_m, idf_m)

    return data_m
 
 
def pair_vec(n):
    '''The index corresponds to the start index, the entry to the text in end.
    A pair is from the same author if index == entry.
    ATTENTION: By chance it can happen that slightly more than 50% are from the same author, but never less
    '''

    x = list(range(n))
    #random.seed(0)
    rand_ind = random.sample(range(n), round(n / 2))   # Positions whose end-part gets reassigned
    rand_match = rand_ind.copy()
    random.shuffle(rand_match)
    
    # Give every chosen position the end-part of another (possibly the same) chosen position
    for src, dst in zip(rand_ind, rand_match):
        x[src] = dst
 
    return x
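
To see what the feature step produces, here is a small dependency-free sketch of the same two stages: per-word character n-gram counts, then tf-idf weights. It is a restatement under assumptions, not the module above: `char_ngrams`, `tf_idf` and the two toy sentences are made up for illustration, and plain lists stand in for the numpy matrix.

```python
import math
import string
from collections import Counter


def char_ngrams(text, n=4):
    """Count character n-grams per word; words of at most n characters count as one gram."""
    table = {ord(ch): None for ch in string.punctuation + string.digits}
    words = text.lower().translate(table).split()
    grams = Counter()
    for w in words:
        if len(w) <= n:
            grams[w] += 1
        else:
            for k in range(len(w) - n + 1):
                grams[w[k:k + n]] += 1
    return dict(grams)


def tf_idf(count_dicts):
    """Return (vocabulary, rows) with rows[i][j] = tf * log(N / df)."""
    vocab = sorted(set().union(*count_dicts))
    n_docs = len(count_dicts)
    # Document frequency: in how many texts does each gram occur?
    df = {g: sum(g in counts for counts in count_dicts) for g in vocab}
    rows = []
    for counts in count_dicts:
        total = sum(counts.values())
        rows.append([counts.get(g, 0) / total * math.log(n_docs / df[g])
                     for g in vocab])
    return vocab, rows


docs = ["The weather, the WEATHER!", "The end"]
counts = [char_ngrams(d) for d in docs]
vocab, rows = tf_idf(counts)
```

Note that a gram occurring in every text, such as `the` here, gets idf log(1) = 0 and therefore carries no weight in any row.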

Python
import data_prep as dp
import numpy as np
 
class corpus:
    def __init__(self, start = '', end = '', authors = ''):
        self.start, self.end, self.authors = start, end, authors
 
    def build_by_xml(self, path, number_of_words=500, number_of_files = 500):
        self.start, self.end, self.authors = dp.open_xml(path, max_of_files = number_of_files, n_of_words = number_of_words)
        
    def build_pairs(self):
        self.pair = dp.pair_vec(len(self.start))
        self.end = np.take(self.end, self.pair)
        
    def build_matrix(self):
        data = dp.build_matrix(np.concatenate((self.start, self.end)))
        self.X = data[:len(self.start)]
        self.Y = data[len(self.end):]
        
    def get_length(self):
        return len(self.authors)
    
    def same_author(self, k):
        return self.pair[k] == k
    
    def get_pair(self, k,num = True, text = False):
        if text and num:
            return [self.X[k], self.Y[k], self.start[k], self.end[k]]
        if text:
            return [self.start[k], self.end[k]]
        else:
            return [self.X[k], self.Y[k]]
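
How `build_pairs` and `same_author` fit together may be easier to see on a toy case. This is a sketch under assumptions: `make_pairs` is a hypothetical stand-in for `dp.pair_vec` that takes an explicit random generator, and the seed is fixed only so the run is repeatable.

```python
import random


def make_pairs(n, rng):
    """Return a list p where p[k] is the index of the end-part paired with start-part k."""
    pairs = list(range(n))                       # start: every text paired with itself
    chosen = rng.sample(range(n), round(n / 2))  # positions that may get a new partner
    partners = chosen.copy()
    rng.shuffle(partners)
    for src, dst in zip(chosen, partners):
        pairs[src] = dst
    return pairs


rng = random.Random(0)
pairs = make_pairs(10, rng)
# Ground truth used for evaluation: pair k is "same author" iff it was left in place
same_author = [pairs[k] == k for k in range(10)]
```

Because only half of the positions are shuffled among themselves, at least half of the pairs always remain same-author pairs, which is what the docstring of `pair_vec` warns about.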

Python
from numba import jit
import numpy as np
 
def minmax(x,y):
    return np.sum(np.minimum(x,y)) / np.sum(np.maximum(x,y))
 
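@jit
def cminmax(x,y):
    # Loop version of minmax: sum of element-wise minima over sum of element-wise maxima
    nom = 0.0
    denom = 0.0   # Float from the start so both accumulators have the same type
    for k in range(len(x)):
        if x[k] > y[k]:
            nom += y[k]
            denom += x[k]
        else:
            nom += x[k]
            denom += y[k]  
    return nom / denom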
 
def cos(x,y):
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
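
A worked example of the two measures on a three-element vector, written without numpy so the arithmetic is visible. This is only an illustrative sketch; the vectors are made up and the functions mirror `minmax` and `cos` above for plain lists.

```python
import math


def minmax(x, y):
    """Sum of element-wise minima divided by the sum of element-wise maxima."""
    return sum(map(min, x, y)) / sum(map(max, x, y))


def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]
# minima: 1 + 1 + 0 = 2, maxima: 2 + 2 + 1 = 5, so minmax = 2 / 5
mm = minmax(x, y)
# dot product: 2 + 2 + 0 = 4, norms: sqrt(5) and sqrt(6), so cosine = 4 / sqrt(30)
cs = cosine(x, y)
```

Both measures give 1 for identical vectors, but they rank partial overlaps differently, so swapping the `sim` argument of `get_score` can change the verdict.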

@dondublon · 16.04.2019, 11:32

Don't you know how to import modules?
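
To spell the hint out: each listing is a separate .py file, and a small entry script only has to import them. Below is a minimal sketch of such a script, not the original author's code. The names `data_prep.py` and `similarity_measures.py` follow from the imports in the listings; `corpus.py` and `imposters.py` are assumed file names for the `corpus` class and for `blog_same_author`, and `run`, `accuracy` and the defaults (`threshold=0.1`, `n_files=500`, `n_words=500`) are placeholders that would have to be tuned.

```python
# main.py: hypothetical entry point. The module names `corpus` and
# `imposters`, the threshold and the sizes are assumptions.

def accuracy(predictions, truths):
    """Share of pairs for which the prediction equals the ground truth."""
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


def run(path, threshold=0.1, n_files=500, n_words=500):
    """Build the corpus from the XML files in `path` and score every pair."""
    # Imported here so that a missing file shows up as a clear ImportError
    # when run() is called; all .py files must sit next to main.py.
    from corpus import corpus                 # assumed file name: corpus.py
    from imposters import blog_same_author    # assumed file name: imposters.py

    c = corpus()
    c.build_by_xml(path, number_of_words=n_words, number_of_files=n_files)
    c.build_pairs()    # reassign about half of the end-parts to other texts
    c.build_matrix()   # fills c.X and c.Y with the tf-idf rows

    predictions, truths = [], []
    for k in range(c.get_length()):
        x, y = c.get_pair(k)                  # tf-idf rows of start and end part
        predictions.append(blog_same_author(x, y, c, threshold))
        truths.append(c.same_author(k))
    return accuracy(predictions, truths)
```

With all files in one directory, adding a call such as `print(run('path/to/xml/folder'))` at the bottom of main.py (the path is a placeholder) and starting it with `python main.py` should run the whole pipeline and print the share of correctly judged pairs.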
