
Lesson summary: Visualization of information in text documents (Grade 8, lesson 25, textbook by L. Bosova). The lesson consolidates students' understanding of how to create lists, tables and graphic objects, and of how these can be used in text documents.

Planned educational results:
subject  - the ability to use tools for structuring and visualizing textual information;
meta-subject  - broad skills in using information and communication technologies to create text documents; skills in the rational use of available tools;
personal  - understanding of the social and cultural role that the skills of creating text documents play in the life of a modern person.

Learning tasks addressed:
  1) consolidating ideas about how to create lists and how they can be used in text documents;
  2) consolidating ideas about how to create tables and how they can be used in text documents;
  3) consolidating ideas about how graphic objects can be used in text documents.

Key concepts learned in the lesson:
  - numbered lists;
  - bulleted lists;
  - multilevel lists;
  - table;
  - graphic images.

ICT tools used in the lesson:
  - personal computer (PC) of the teacher, multimedia projector, screen;
  - students' PCs.

E-learning resources
  - Presentation "Visualization of information in text documents."

How the lesson content is presented

1. Organizational moment (1 minute)
  Greeting the students, announcing the topic and objectives of the lesson.

2. Review (5 minutes)
  1) checking the studied material using questions 4-9 to §4.3;
  2) visual check of the homework in the workbook (RT): Nos. 188-189;
  3) discussion of the homework tasks that caused difficulties.

3. Learning new material (20 minutes)
  The new material is presented accompanied by the presentation "Visualization of information in text documents."

1 slide  - name of the presentation;

2 slide  - keywords;
  - numbered lists
  - bulleted lists
  - multilevel lists
  - table
  - graphic images

3 slide  - visualization (diagram with examples);
Visualization  - presentation of information in a visual form. Textual information is visualized when it is organized into lists, tables and diagrams and supplied with illustrations (photographs, diagrams, drawings).

4 slide  - lists (diagram);
  All kinds of enumerations in documents are formatted as lists.
  List items are treated as paragraphs formatted to a single pattern.
  The items of a numbered list are marked with consecutive numbers, which may be Arabic or Roman numerals. Items can also be numbered with letters, Russian or Latin.
  A numbered list is usually used when the order of the items is important. Such lists are especially common in descriptions of a sequence of actions.
  You regularly create numbered lists when you fill in the lesson schedule for each school day in your diary.
  The items of a bulleted list are marked with bullets (markers). The user can choose any character of the computer alphabet, or even a small graphic image, as the bullet. In your textbook, the keywords at the beginning of each section are formatted as a bulleted list.
  A bulleted list is used when the order of its items does not matter. For example, the subjects you study in grade 8 can be written as a bulleted list.
  By structure, lists are divided into single-level and multilevel lists.
  A list whose item is itself a list is called multilevel. For example, the table of contents of your computer science textbook is a multilevel (three-level) list.
  Lists are created in a word processor with a menu bar command or with the buttons on the formatting toolbar.

5 slide  - tables (schemes);
  To describe a number of objects that share the same set of properties, tables made up of columns and rows are used most often. You are familiar with the lesson schedule presented as a table; timetables for buses, planes and trains, and much else, are also presented in tabular form.
  Information presented in a table is clear, compact and easy to survey.
  The following table layout rules must be observed:
  1. The table heading should give an idea of the information the table contains.
  2. Column and row headings should be short and free of unnecessary words and, where possible, of abbreviations.
  3. Units of measurement should be indicated in the table. If they are the same for the whole table, they are given in the table heading (either in brackets or after the name, separated by a comma). If the units differ, they are given in the heading of the corresponding row or column.
  4. All cells of the table should preferably be filled. Where necessary, the following conventions are used:
  ? - the data is unknown;
  x - the data is not possible;
  ↓ - the data should be taken from the cell above.
  Table cells can contain text, numbers and images.
  A table can be created with the corresponding menu item or toolbar button by specifying the required number of columns and rows; in some word processors a table can also be “drawn”.

View and discuss the animation "Working with tables."

6 slide  - graphic images (schemes);
  Modern word processors allow you to include various graphic images in documents.
  Ready-made graphic images can be edited by changing their sizes, primary colors, brightness and contrast, turning, overlapping each other, etc.
  You can visualize the numerical information contained in the table using diagrams, the creation tools of which are also included in word processors.
  The most powerful word processors allow you to build different types of graphic schemes that provide visualization of text information.

7 slide  - the most important thing.
  Textual information is visualized if it is organized into lists, tables and diagrams and supplied with illustrations (photographs, drawings, diagrams).
  All kinds of enumerations in documents are formatted as lists. By formatting, lists are divided into numbered and bulleted lists. A numbered list is usually used when the order of the items matters; a bulleted list, when the order of the items is not important.
  By structure, lists are divided into single-level and multilevel lists.
  To describe a number of objects that share the same set of properties, tables consisting of columns and rows are used most often.
  Modern word processors make it possible to insert, process and create graphic objects.

Questions and Tasks
8 slide  - questions and tasks.
  Questions 1-8 to paragraph 4.4.

4. Practical part (15 minutes)
  In the practical part of the lesson, students complete assignments 4.18-4.21 from assignments for practical work to chapter 4.
  If time permits, task 4.17 can be completed.

5. Summing up the lesson. Setting homework. Grading (4 minutes)
9 slide  - reference summary;
10 slide  - homework.

Homework.
  §4.4, questions and tasks 1-8 to the paragraph.
Additional task:  prepare a message about infographics and several tools for creating infographics.

All material for the lesson is in the archive.

Archive includes:
  - compendium
  - presentation "Visualization of information in text documents",
  - animation "Work with tables",
  - animation "Work with graphics",
  - blank for practical work “Mouse.jpg”.


   REPEAT No. 1. Determine which group of operations (editing or formatting) each of the following actions belongs to:
     - replacing one character with another;
     - inserting a missing word;
     - changing the font;
     - removing a fragment of text;
     - justifying text (alignment to width);
     - automatic spell check;
     - changing line spacing;
     - resizing page margins;
     - deleting an erroneous character;
     - search and replace;
     - moving text fragments.

   REPEAT No. 2. Determine which group (character properties or paragraph properties) each of the following properties belongs to:
     - font;
     - alignment;
     - spacing after;
     - first-line indent;
     - font style;
     - color;
     - line spacing;
     - left indent;
     - spacing before;
     - font size;
     - right indent.

   REPEAT No. 3. In which of the following sentences are the spaces between words and punctuation marks placed correctly? Where are the mistakes?
     1) From your native land - die, do not go.
     2) Speech is not to weave bast shoes.
     3) Where he was born, there he fit.
     4) Saying is funny, hiding is a sin.

   REPEAT No. 4. Select the parameters that are set when configuring the page parameters:
     - orientation;
     - style;
     - font size;
     - paper size;
     - page numbers;
     - margins;
     - line spacing;
     - indentation;
     - paragraph alignment;
     - font style.

   [Diagram: Visualization of information, with four examples: a list (the subjects studied in grade 8), a table (students' marks by subject), a diagram (a chart) and an illustration.]

   LISTS
   All kinds of enumerations in documents are formatted as lists. List items are treated as paragraphs formatted to a single pattern.
   Numbered list (example: a lesson schedule):
     1. Russian language
     2. Algebra
     3. OBZh
     4. Social Studies
     5. Biology
     6. Technology
   Bulleted list (example: subjects studied in grade 7):
     - Russian language
     - Literature
     - Algebra
     - Geometry
     - Physics

   By structure, lists are divided into single-level and multilevel lists. A list whose item is itself a list is called multilevel. Example:
     Chapter 1. Information and information processes
       § 1.1. Information and its properties
         1.1.1. Information and signal
         1.1.2. Types of information
         1.1.3. Properties of information
       § 1.2. Information processes
         1.2.1. The concept of information process
         1.2.2. Collection of information

   LIST CREATION TOOLS
   Bulleted and numbered lists can be created quickly with the buttons on the toolbar.

   CHANGING BULLETED LISTS: To change the appearance of the bullet, use the Change button. The Change Bulleted List window appears, offering additional bullets. Clicking the Marker button opens the Symbol dialog box, in which any symbol can be selected as the list bullet.

   CHANGING NUMBERED LISTS: To create your own variant of a numbered list, click the Change button. The Modify Numbered List window appears. The Number format field specifies the text before and after the item number, for example ")". The Numbering field sets the numbering style. The Start with ... field specifies the number (or letter) from which the list should begin. To change the font of the item numbers, use the Font button.

   PRESENTING THE LIST OF COMPUTER DEVICES AS A MULTILEVEL LIST WITH FOUR LEVELS OF NESTING.
   Initial text, one item per line: Devices of a modern computer; Processor; Memory; RAM; Long-term memory; Hard magnetic disk; Floppy disk; Flash memory; Optical discs; CD; DVD; Input devices; Keyboard; Mouse; Scanner; Graphic tablet; Digital camera; Microphone; Joystick; Output devices; Monitor; LCD monitor; Cathode ray tube monitor; Printer; Dot matrix printer; Inkjet printer; Laser printer.

   Give the first line ("Devices of a modern computer") a heading style, for example Heading 1.

   Convert the remaining lines to a multilevel list. To do this, select all the remaining lines and give the command Format-List. In the List dialog box, go to the Multilevel tab and select the list type there.

   The list will take the following form: 1. Processor; 2. Memory; 3. RAM; 4. Long-term memory; 5. Hard magnetic disk; 6. Floppy disk; 7. Flash memory; 8. Optical discs; 9. CD; 10. DVD; 11. Input devices; 12. Keyboard; 13. Mouse; 14. Scanner; 15. Graphic tablet; 16. Digital camera; 17. Microphone; 18. Joystick; 19. Output devices; 20. Monitor; 21. LCD monitor; 22. Cathode ray tube monitor; 23. Printer; 24. Dot matrix printer; 25. Inkjet printer; 26. Laser printer.

   Select items 3 - 10 and lower their level. To do this, use the Increase Indent button.

   The list will take the following form: 1. Processor; 2. Memory; 2.1. RAM; 2.2. Long-term memory; 2.3. Hard magnetic disk; 2.4. Floppy disk; 2.5. Flash memory; 2.6. Optical discs; 2.7. CD; 2.8. DVD; 3. Input devices; 4. Keyboard; 5. Mouse; 6. Scanner; 7. Graphic tablet; 8. Digital camera; 9. Microphone; 10. Joystick.

   Select items 2.3 - 2.8 and lower their level. To do this, use the Increase Indent button.

   The list will take the following form: 1. Processor; 2. Memory; 2.1. RAM; 2.2. Long-term memory; 2.2.1. Hard magnetic disk; 2.2.2. Floppy disk; 2.2.3. Flash memory; 2.2.4. Optical discs; 2.2.5. CD; 2.2.6. DVD; 3. Input devices; 4. Keyboard; 5. Mouse; 6. Scanner; 7. Graphic tablet; 8. Digital camera; 9. Microphone; 10. Joystick.

   Select items 2.2.5 - 2.2.6 and lower their level. The list will take the following form:
     1. Processor
     2. Memory
       2.1. RAM
       2.2. Long-term memory
         2.2.1. Hard magnetic disk
         2.2.2. Floppy disk
         2.2.3. Flash memory
         2.2.4. Optical discs
           2.2.4.1. CD
           2.2.4.2. DVD
     3. Input devices
     4. Keyboard
     ...

   Repeat the same operations for the other items in the list:
     3. Input devices
       3.1. Keyboard
       3.2. Mouse
       3.3. Scanner
       3.4. Graphic tablet
       3.5. Digital camera
       3.6. Microphone
       3.7. Joystick
     4. Output devices
       4.1. Monitor
         4.1.1. LCD monitor
         4.1.2. Cathode ray tube monitor
       4.2. Printer
         4.2.1. Dot matrix printer
         4.2.2. Inkjet printer
         4.2.3. Laser printer
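
The numbering that the word processor assigns at each nesting level can be reproduced with a short sketch. It is only an illustration: the nested (title, children) structure and the function name number_items are assumptions made for this example, not part of the lesson materials.

```python
def number_items(items, prefix=""):
    """Return (number, title) pairs for a nested list of (title, children) tuples."""
    numbered = []
    for position, (title, children) in enumerate(items, start=1):
        number = f"{prefix}{position}."
        numbered.append((number, title))
        # Children inherit the parent's number as a prefix, e.g. "2." -> "2.1."
        numbered.extend(number_items(children, prefix=number))
    return numbered


# Part of the device list from the slides, written as a nested structure.
devices = [
    ("Processor", []),
    ("Memory", [
        ("RAM", []),
        ("Long-term memory", [
            ("Hard magnetic disk", []),
            ("Floppy disk", []),
            ("Flash memory", []),
            ("Optical discs", [("CD", []), ("DVD", [])]),
        ]),
    ]),
]

numbered_devices = number_items(devices)
for number, title in numbered_devices:
    indent = "  " * (number.count(".") - 1)  # one indent step per nesting level
    print(f"{indent}{number} {title}")
```

Lowering an item's level with Increase Indent corresponds to moving it into the children of the item above it; the numbers then follow from the structure.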

There are very few practical tutorials in the Russian-language part of the Internet on analysing text messages in Russian, and even fewer that come with example code. I therefore decided to gather the material in one place and walk through an example of clustering, since clustering does not require preparing training data.

  Most of the libraries used are already included in the Anaconda 3 distribution, so I recommend using it. Missing modules and libraries can be installed in the usual way with pip install followed by the package name.
We import the following libraries:

    import numpy as np
    import pandas as pd
    import nltk
    import re
    import os
    import codecs
    from sklearn import feature_extraction
    import mpld3
    import matplotlib.pyplot as plt
    import matplotlib as mpl

  Any data will do for the analysis. The task that caught my eye at the time was the search query statistics of the State expenditure project. They needed to split the data into three groups: private, state and commercial organizations. I didn't want to invent anything extraordinary, so I decided to check how clustering would behave in this case (looking ahead: not very well). Alternatively, you can download data from a VK public page:

    import io
    import re
    import time
    import vk

    # Pass your access token to the session.
    session = vk.Session(access_token="")
    # URL for obtaining an access_token; replace tvoi_id with the id of your VK application:
    # https://oauth.vk.com/authorize?client_id=tvoi_id&scope=friends,pages,groups,offline&redirect_uri=https://oauth.vk.com/blank.html&display=page&v=5.21&response_type=token
    api = vk.API(session)
    poss = []
    id_pab = -59229916  # ids of public pages start with a minus, user ids have no minus
    info = api.wall.get(owner_id=id_pab, offset=0, count=1)
    # The index was lost in the source; the first element is assumed to hold the post count.
    kolvo = (info[0] // 100) + 1
    shag = 100
    sdvig = 0
    h = 0
    # The loop header is garbled in the source ("while h 70):"); this is a reconstruction.
    while h < kolvo:
        if h > 70:
            print(h)  # optional, only to see roughly when the process will finish
        pubpost = api.wall.get(owner_id=id_pab, offset=sdvig, count=100)
        i = 1
        while i < len(pubpost):
            b = pubpost[i]["text"]
            poss.append(b)
            i = i + 1
        h = h + 1
        sdvig = sdvig + shag
        time.sleep(1)
    len(poss)

    with io.open("public.txt", "w", encoding="utf-8", errors="ignore") as file:
        for line in poss:
            file.write("%s\n" % line)

    titles = open("public.txt", encoding="utf-8", errors="ignore").read().split("\n")
    print(str(len(titles)) + " posts read")

    posti = []
    # Remove HTML tags, punctuation marks and digits.
    for line in titles:
        chis = re.sub(r"(\<(/?[^>]+)>)", " ", line)
        chis = re.sub("[^а-яА-Я ]", " ", chis)
        posti.append(chis)

  I will use the search query data to show how poorly short texts cluster. Beforehand I stripped special characters and punctuation marks from the text and expanded the abbreviations (for example, ИП, i.e. individual entrepreneur). The result was a text file with one search query per line.
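
The cleaning step described here is not shown in the article, so the following is only a sketch of what it might look like. The function name clean_query, the abbreviation table and the sample query are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table; the article does not list the real replacements.
ABBREVIATIONS = {"ип": "индивидуальный предприниматель"}


def clean_query(query):
    """Lower-case a query, keep only Cyrillic letters, expand known abbreviations."""
    text = query.lower()
    # Digits, Latin letters and punctuation become spaces.
    text = re.sub(r"[^а-яё ]", " ", text)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


cleaned = clean_query("ИП Иванов, закупки-2016!")
print(cleaned)
```

Splitting on whitespace after the substitution also collapses the runs of spaces that the removed characters leave behind.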

We read the data into an array and proceed to normalization, that is, reducing each word to its base form. This can be done in several ways: with the Porter stemmer, the MyStem stemmer or PyMorphy2. A word of warning: MyStem works through a wrapper, so it is very slow. We will settle on the Porter stemmer, although nothing prevents you from using the others or combining them (for example, running PyMorphy2 first and then the Porter stemmer).

    titles = open("material4.csv", "r", encoding="utf-8", errors="ignore").read().split("\n")
    print(str(len(titles)) + " requests read")

    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer("russian")

    def token_and_stem(text):
        # The tokenizing expression is lost in the source; nltk.word_tokenize is an assumption.
        tokens = nltk.word_tokenize(text)
        filtered_tokens = []
        for token in tokens:
            if re.search("[а-яА-Я]", token):  # keep only tokens that contain Cyrillic letters
                filtered_tokens.append(token)
        stems = [stemmer.stem(t) for t in filtered_tokens]
        return stems

    def token_only(text):
        tokens = nltk.word_tokenize(text)  # assumption, as above
        filtered_tokens = []
        for token in tokens:
            if re.search("[а-яА-Я]", token):
                filtered_tokens.append(token)
        return filtered_tokens

    # Create dictionaries (arrays) from the resulting stems.
    # The rest of this block is truncated in the source; only its last token, "allwords_tokenized)", survives.

Pymorphy2

    import pymorphy2
    morph = pymorphy2.MorphAnalyzer()
    G = []
    for i in titles:
        h = i.split(" ")
        s = ""
        for k in h:
            # parse() returns a list of candidate analyses; take the first one
            p = morph.parse(k)[0].normal_form
            s += " "
            s += p
        G.append(s)

    pymof = open("pymof_pod.txt", "w", encoding="utf-8", errors="ignore")
    pymofcsv = open("pymofcsv_pod.csv", "w", encoding="utf-8", errors="ignore")
    for item in G:
        pymof.write("%s\n" % item)
        pymofcsv.write("%s\n" % item)
    pymof.close()
    pymofcsv.close()


pymystem3

The analyzer executable files for the current operating system will be automatically downloaded and installed the first time you use the library.

    from pymystem3 import Mystem
    m = Mystem()
    A = []
    for i in titles:
        lemmas = m.lemmatize(i)
        A.append(lemmas)

    # This array can be saved to a file ("pickled").
    import pickle
    with open("mystem.pkl", "wb") as handle:
        pickle.dump(A, handle)


  Create a TF-IDF weight matrix. We will treat each search query as a document (the same is done when analysing Twitter posts, where each tweet is a document). We take TfidfVectorizer from the sklearn package and the stop words from the nltk package (they first have to be downloaded via nltk.download()). The parameters can be adjusted as you see fit, from the upper and lower frequency bounds to the n-gram range (here, up to 3).

    stopwords = nltk.corpus.stopwords.words("russian")
    # You can extend the list of stop words. The words below were translated in the
    # source and are restored to Russian here, so the exact originals are an assumption.
    stopwords.extend(["что", "это", "так", "вот", "быть", "как", "с", "к", "на"])

    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    n_featur = 200000
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000,
                                       min_df=0.01, stop_words=stopwords,
                                       use_idf=True, tokenizer=token_and_stem,
                                       ngram_range=(1, 3))
    # The original timed this call with the IPython %time magic.
    tfidf_matrix = tfidf_vectorizer.fit_transform(titles)
    print(tfidf_matrix.shape)

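
To make the weighting itself concrete, here is a small hand-rolled sketch of textbook TF-IDF on three made-up queries. It is not what TfidfVectorizer computes exactly (scikit-learn smooths the IDF and L2-normalizes each row by default); the function name tf_idf and the sample queries are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Textbook TF-IDF: tf = count / len(doc), idf = ln(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


queries = ["школа расходы", "больница расходы", "школа ремонт"]
weights = tf_idf(queries)
for query, row in zip(queries, weights):
    print(query, row)
```

A term that occurs in only one query ("больница") gets a higher weight than one shared by several queries ("расходы"), which is the property the clustering step relies on.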
  We then apply various clustering methods to the resulting matrix:

    num_clusters = 5

    # K-means method - KMeans
    from sklearn.cluster import KMeans
    km = KMeans(n_clusters=num_clusters)
    idx = km.fit(tfidf_matrix)  # the original timed the fit with the IPython %time magic
    clusters = km.labels_.tolist()
    print(clusters)
    print(km.labels_)

    # MiniBatchKMeans
    from sklearn.cluster import MiniBatchKMeans
    mbk = MiniBatchKMeans(init="random", n_clusters=num_clusters)  # init: "k-means++", "random" or an ndarray
    mbk.fit(tfidf_matrix)
    miniclusters = mbk.labels_.tolist()
    print(mbk.labels_)

    # DBSCAN
    from sklearn.cluster import DBSCAN
    db = DBSCAN(eps=0.3, min_samples=10).fit(tfidf_matrix)
    labels = db.labels_
    labels.shape
    print(labels)

    # Agglomerative clustering
    from sklearn.cluster import AgglomerativeClustering
    # The original passed affinity="euclidean"; newer scikit-learn releases call this
    # parameter metric. Euclidean is the default, so it is omitted here.
    # Other options to try in turn: cosine, l1, l2, manhattan.
    agglo1 = AgglomerativeClustering(n_clusters=num_clusters)
    answer = agglo1.fit_predict(tfidf_matrix.toarray())
    answer.shape

  The results can be collected into a dataframe to count the number of queries that fall into each cluster.

    # k-means
    clusterkm = km.labels_.tolist()
    # minikmeans
    clustermbk = mbk.labels_.tolist()
    # dbscan
    clusters3 = labels
    # agglo
    # clusters4 = answer.tolist()

    # The index arguments are lost in the source; the cluster labels are assumed.
    frame = pd.DataFrame(titles, index=clusterkm)
    # k-means
    out = {"title": titles, "cluster": clusterkm}
    frame1 = pd.DataFrame(out, index=clusterkm, columns=["title", "cluster"])
    # mini
    out = {"title": titles, "cluster": clustermbk}
    frame_minik = pd.DataFrame(out, index=clustermbk, columns=["title", "cluster"])

    frame1["cluster"].value_counts()
    frame_minik["cluster"].value_counts()

With so many queries, tables are not very convenient to read, and something more interactive would help. We will therefore plot the positions of the queries relative to each other.
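
The counting step can be illustrated without any fitted model. The sketch below uses the standard library instead of pandas, and the queries and labels are made up; it only shows the kind of summary that value_counts() produces.

```python
from collections import Counter

# Made-up queries and cluster labels, standing in for titles and km.labels_.
sample_titles = ["школа 15", "лицей 2", "больница 1", "ооо ромашка", "поликлиника 3"]
sample_labels = [0, 0, 1, 2, 1]

# Number of queries per cluster.
cluster_sizes = Counter(sample_labels)

# The queries that ended up in each cluster.
members = {}
for title, label in zip(sample_titles, sample_labels):
    members.setdefault(label, []).append(title)

for label, size in cluster_sizes.most_common():
    print(label, size, members[label])
```

Looking at a few members of each cluster next to its size is usually the quickest way to judge whether the grouping makes sense.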

First we need to compute the distances between the vectors. We will use the cosine distance. Articles on the subject suggest subtracting the cosine similarity from one, so that the values are non-negative and lie in the range from 0 to 1, and we will do the same:

    from sklearn.metrics.pairwise import cosine_similarity
    dist = 1 - cosine_similarity(tfidf_matrix)
    dist.shape

  Since the graphs will be two- and three-dimensional while the original distance matrix is n-dimensional, we have to use a dimensionality reduction algorithm. There are many to choose from (MDS, PCA, t-SNE), but we will settle on Incremental PCA. The choice was a practical one: I tried MDS and PCA, but ran out of RAM (8 gigabytes), and once the swap file came into use the computer could only be rebooted.
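
As a rough check of what this distance means, the same quantity can be computed by hand with NumPy. This is a sketch on three toy vectors, not a replacement for cosine_similarity; the function name cosine_distance_matrix is an assumption. Because TF-IDF weights are non-negative, the similarity lies between 0 and 1, and so does the distance.

```python
import numpy as np


def cosine_distance_matrix(rows):
    """Return 1 - cosine similarity for every pair of row vectors."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / norms          # assumes there are no all-zero rows
    similarity = unit @ unit.T   # dot products of unit vectors
    return 1.0 - similarity


vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
distances = cosine_distance_matrix(vectors)
print(np.round(distances, 3))
```

Vectors pointing in the same direction get distance 0 regardless of their length, and vectors with no terms in common get distance 1.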

Incremental PCA (IPCA) is used as a replacement for principal component analysis (PCA) when the data set to be decomposed is too large to fit in RAM. IPCA builds a low-rank approximation of the input using an amount of memory that is independent of the number of input samples.

    # Principal component analysis - PCA (2D)
    from sklearn.decomposition import IncrementalPCA
    icpa = IncrementalPCA(n_components=2, batch_size=16)
    icpa.fit(dist)  # the original timed these calls with the IPython %time magic
    demo2 = icpa.transform(dist)
    xs, ys = demo2[:, 0], demo2[:, 1]

    # PCA 3D (note: this overwrites xs and ys from the 2D step)
    icpa = IncrementalPCA(n_components=3, batch_size=16)
    icpa.fit(dist)
    ddd = icpa.transform(dist)
    xs, ys, zs = ddd[:, 0], ddd[:, 1], ddd[:, 2]

    # You can take a quick look at the result right away
    # from mpl_toolkits.mplot3d import Axes3D
    # fig = plt.figure()
    # ax = fig.add_subplot(111, projection="3d")
    # ax.scatter(xs, ys, zs)
    # ax.set_xlabel("X")
    # ax.set_ylabel("Y")
    # ax.set_zlabel("Z")
    # plt.show()

  Now we move on to the visualization itself:
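
The memory argument is easier to see when the batches are fed in explicitly. The following sketch uses partial_fit on synthetic data in place of the distance matrix; the array sizes and the chunk size of 20 rows are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))  # synthetic stand-in for the distance matrix

ipca = IncrementalPCA(n_components=2)
# Feed the data in chunks of 20 rows; only one chunk is processed at a time.
for start in range(0, data.shape[0], 20):
    ipca.partial_fit(data[start:start + 20])

coords = ipca.transform(data)
print(coords.shape)
```

Calling fit with batch_size, as in the article, does the same chunking internally; partial_fit is the variant to reach for when the chunks are read from disk one at a time.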

    from matplotlib import rc
    # Enable Russian characters on the chart.
    font = {"family": "Verdana"}  # , "weight": "normal"}
    rc("font", **font)

    # Colors for the clusters can be generated.
    import random
    def generate_colors(n):
        color_list = []
        for c in range(0, n):
            r = lambda: random.randint(0, 255)
            color_list.append("#%02X%02X%02X" % (r(), r(), r()))
        return color_list

    # Set the colors.
    cluster_colors = {0: "#ff0000", 1: "#ff0066", 2: "#ff0099", 3: "#ff00cc", 4: "#ff00ff"}
    # Name the clusters; because of the randomness let them be just 0-4.
    cluster_names = {0: "0", 1: "1", 2: "2", 3: "3", 4: "4"}
    # %matplotlib inline

    # Data frame with the coordinates (from PCA), the cluster numbers and the queries.
    df = pd.DataFrame(dict(x=xs, y=ys, label=clusterkm, title=titles))
    # Group by cluster.
    groups = df.groupby("label")

    fig, ax = plt.subplots(figsize=(72, 36))  # figsize is a matter of taste

    for name, group in groups:
        ax.plot(group.x, group.y, marker="o", linestyle="", ms=12,
                label=cluster_names[name], color=cluster_colors[name], mec="none")
        ax.set_aspect("auto")
        ax.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
        ax.tick_params(axis="y", which="both", left=False, labelleft=False)

    ax.legend(numpoints=1)  # show only one point in the legend

    # Add the search query as a label at each x, y position.
    # for i in range(len(df)):
    #     ax.text(df.loc[i, "x"], df.loc[i, "y"], df.loc[i, "title"], size=6)

    # Show the graph.
    plt.show()
    plt.close()

  If you uncomment the line with the addition of names, it will look something like this:

Example with 10 clusters


  Not quite what I expected. We will use mpld3 to turn the figure into an interactive graph.

    # Assumption: groups_mbk is not defined in the source; it is taken here to be
    # the MiniBatchKMeans counterpart of groups.
    groups_mbk = pd.DataFrame(dict(x=xs, y=ys, label=clustermbk, title=titles)).groupby("label")

    # Plot
    fig, ax = plt.subplots(figsize=(25, 27))
    ax.margins(0.03)
    for name, group in groups_mbk:
        points = ax.plot(group.x, group.y, marker="o", linestyle="", ms=12,  # ms=18
                         label=cluster_names[name], mec="none", color=cluster_colors[name])
        ax.set_aspect("auto")
        # The label expression is lost in the source; the query texts are assumed.
        labels = list(group.title)
        tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels, voffset=10, hoffset=10)  # css=css
        mpld3.plugins.connect(fig, tooltip)  # , TopToolbar()
    ax.axes.get_xaxis().set_ticks([])
    ax.axes.get_yaxis().set_ticks([])
    # ax.axes.get_xaxis().set_visible(False)
    # ax.axes.get_yaxis().set_visible(False)
    ax.set_title("Mini K-Means", size=20)  # groups_mbk
    ax.legend(numpoints=1)
    mpld3.disable_notebook()
    # mpld3.display()
    mpld3.save_html(fig, "mbk.html")
    mpld3.show()
    # mpld3.save_json(fig, "vivod.json")
    # mpld3.fig_to_html(fig)

    # Demo scatter plot with tooltips. N is not defined in the source and must be set first.
    fig, ax = plt.subplots(figsize=(51, 25))
    scatter = ax.scatter(np.random.normal(size=N), np.random.normal(size=N),
                         c=np.random.random(size=N), s=1000 * np.random.random(size=N),
                         alpha=0.3, cmap=plt.cm.jet)
    ax.grid(color="white", linestyle="solid")
    ax.set_title("Clusters", size=20)
    labels = ["point {0}".format(i + 1) for i in range(N)]
    tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
    mpld3.plugins.connect(fig, tooltip)
    mpld3.show()

    # The same interactive plot for K-means.
    fig, ax = plt.subplots(figsize=(72, 36))
    for name, group in groups:
        points = ax.plot(group.x, group.y, marker="o", linestyle="", ms=18,
                         label=cluster_names[name], mec="none", color=cluster_colors[name])
        ax.set_aspect("auto")
        labels = list(group.title)  # assumption, as above
        tooltip = mpld3.plugins.PointLabelTooltip(points[0], labels=labels)
        mpld3.plugins.connect(fig, tooltip)
    ax.set_title("K-means", size=20)
    mpld3.display()

  Now, when you hover over any point in the graph, the text pops up with the corresponding search query. An example of a finished html file can be found here: Mini K-Means

If you want a 3D plot with zoom, there is the Plotly service, which has a library for Python.

Plotly 3D

    # For example, a plain 3D graph of the obtained values.
    import plotly
    plotly.__version__
    # In plotly 4 and later this module lives in the separate chart_studio package.
    import plotly.plotly as py
    import plotly.graph_objs as go

    trace1 = go.Scatter3d(
        x=xs, y=ys, z=zs,
        mode="markers",
        marker=dict(
            size=12,
            line=dict(color="rgba(217, 217, 217, 0.14)", width=0.5),
            opacity=0.8,
        ),
    )
    data = [trace1]
    layout = go.Layout(margin=dict(l=0, r=0, b=0, t=0))
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename="cluster-3d-plot")


  Results can be seen here: Example

The final step is to perform hierarchical (agglomerative) clustering with the Ward method in order to build a dendrogram.

    from scipy.cluster.hierarchy import ward, dendrogram
    linkage_matrix = ward(dist)
    fig, ax = plt.subplots(figsize=(15, 20))
    ax = dendrogram(linkage_matrix, orientation="right", labels=titles)

    plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
    plt.tight_layout()

    # Save the picture.
    plt.savefig("ward_clusters2.png", dpi=200)

Conclusions
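
A small self-contained sketch may help with reading the linkage matrix that the dendrogram is drawn from. The six points below are synthetic, and cutting the tree with fcluster is an addition for illustration that the article does not use. Note also that SciPy treats a two-dimensional array passed to ward as a set of observation vectors, so in the sketch the points are passed directly.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Two well-separated synthetic groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Each row of the linkage matrix records one merge:
# the two clusters joined, the distance between them, and the new cluster size.
linkage_matrix = ward(points)

# Cut the tree so that at most two flat clusters remain.
flat_labels = fcluster(linkage_matrix, t=2, criterion="maxclust")
print(linkage_matrix.shape, flat_labels)
```

For n observations the linkage matrix always has n - 1 rows, one per merge, which is why the dendrogram for a large query set becomes very tall.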

Unfortunately, natural language processing still has many unresolved problems, and not every data set can be split into meaningful groups easily. But I hope that this guide will increase interest in the topic and provide a basis for further experiments.

 

