Mirror an entire website and save the links to a txt file

Is it possible to use wget's mirror mode to collect all the links of an entire website and save them to a txt file?

If so, how is it done? If not, are there other ways to do it?

EDIT:

I tried running this:

wget -r --spider example.com 

And got this output:

    Spider mode enabled. Check if remote file exists.
    --2015-10-03 21:11:54--  http://example.com/
    Resolving example.com... 93.184.216.34, 2606:2800:220:1:248:1893:25c8:1946
    Connecting to example.com|93.184.216.34|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1270 (1.2K) [text/html]
    Remote file exists and could contain links to other resources -- retrieving.

    --2015-10-03 21:11:54--  http://example.com/
    Reusing existing connection to example.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 1270 (1.2K) [text/html]
    Saving to: 'example.com/index.html'

    100%[=====================================================================================================>] 1,270 --.-K/s in 0s

    2015-10-03 21:11:54 (93.2 MB/s) - 'example.com/index.html' saved [1270/1270]

    Removing example.com/index.html.

    Found no broken links.

    FINISHED --2015-10-03 21:11:54--
    Total wall clock time: 0.3s
    Downloaded: 1 files, 1.2K in 0s (93.2 MB/s)

(Yes, I also tried this on other websites with more internal links.)

Yes, using wget's --spider option. A command like:

 wget -r --spider example.com 

will follow links to a depth of 5 (wget's default recursion depth). You can then capture the output in a file, perhaps cleaning it up along the way. Something like:

 wget -r --spider example.com 2>&1 | grep "http://" | cut -f 4 -d " " >> weblinks.txt 

will put just the links into weblinks.txt (if your version of wget produces slightly different output, you may need to adjust this command a little).
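If the grep/cut pipeline proves too brittle, the same cleanup can be done in a few lines of Python. This is only a sketch: it assumes the log looks like the output shown in the question, where each request line has the form `--DATE TIME--  URL`, and the name `urls_from_wget_log` is made up for illustration. It also drops repeated URLs, which the pipeline above does not.

```python
import re

# Request lines in the wget log look like:
#   --2015-10-03 21:11:54--  http://example.com/
# (assumed format; other wget versions or locales may differ)
REQUEST_LINE = re.compile(
    r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)"
)


def urls_from_wget_log(log_text):
    """Return the URLs wget requested, in first-seen order, without repeats."""
    seen = set()
    urls = []
    for line in log_text.splitlines():
        match = REQUEST_LINE.match(line.strip())
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            urls.append(match.group(1))
    return urls
```

One way to use it is to have wget write its log to a file with the -o option, read that file, and write the returned list to weblinks.txt one URL per line.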

Or fetch the page yourself in Python and pull the links out with a regular expression.

For example:

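    import re
    import urllib.request  # Python 3; the original used Python 2's urllib.urlopen


    def do_page(url):
        # Download the page and decode it to text.
        with urllib.request.urlopen(url) as f:
            html = f.read().decode('utf-8', errors='replace')
        # Match single-quoted links under this URL that end in .html.
        pattern = r"'{}[^']*\.html'".format(re.escape(url))
        return re.findall(pattern, html)


    if __name__ == '__main__':
        hits = []
        url = 'http://thehackernews.com/'
        hits.extend(do_page(url))
        with open('links.txt', 'w') as f1:
            for hit in hits:
                # One link per line.
                f1.write(hit + '\n')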

Output:

    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/adblock-extension.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/data-breach-hacking.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/howto-Freeze-Credit-Report.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/experian-tmobile-hack.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/p/authors.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/09/digital-india-facebook.html'
    'http://thehackernews.com/2015/09/digital-india-facebook.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/10/buy-google-domain.html'
    'http://thehackernews.com/2015/09/winrar-vulnerability.html'
    'http://thehackernews.com/2015/09/winrar-vulnerability.html'
    'http://thehackernews.com/2015/09/chip-mini-computer.html'
    'http://thehackernews.com/2015/09/chip-mini-computer.html'
    'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
    'http://thehackernews.com/2015/09/edward-snowden-twitter.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/10/android-stagefright-vulnerability.html'
    'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
    'http://thehackernews.com/2015/09/quantum-teleportation-data.html'
    'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
    'http://thehackernews.com/2015/09/iOS-lockscreen-hack.html'
    'http://thehackernews.com/2015/09/xor-ddos-attack.html'
    'http://thehackernews.com/2015/09/xor-ddos-attack.html'
    'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
    'http://thehackernews.com/2015/09/truecrypt-encryption-software.html'
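The output above covers a single page and repeats many links, while the question asks about a whole site. A possible extension, offered only as a rough sketch, is to parse each page with the standard library's HTML parser, resolve relative links, skip repeats, and follow same-host links breadth-first. The names `LinkCollector`, `page_links`, `fetch`, and `crawl` are invented for this example, and `max_depth=5` merely mirrors wget's default; robots.txt, rate limiting, and non-HTML responses are not handled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def page_links(base_url, html):
    """Return absolute http(s) links found in html, without repeats."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in links:
            links.append(absolute)
    return links


def fetch(url):
    """Download a page as text (hypothetical helper, no retries)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_depth=5):
    """Breadth-first walk of same-host pages; returns every link seen."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    ordered = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = fetch_page(url)
        except OSError:
            # Network and HTTP errors from urlopen derive from OSError.
            continue
        for link in page_links(url, html):
            if link in seen:
                continue
            seen.add(link)
            ordered.append(link)
            # Record external links, but only follow same-host ones.
            if urlparse(link).netloc == host:
                queue.append((link, depth + 1))
    return ordered
```

Calling `crawl('http://thehackernews.com/')` and writing each returned URL on its own line would then produce a links.txt without the repeats seen above. Passing the fetch function as a parameter is a deliberate choice so the crawl logic can be tried against canned pages before pointing it at a real site.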